Non-monotonic behavior in CPU parallelism with joblib
See https://code.itp.ac.cn/Osgood/crystalformer/-/blob/joblib/speedtests/README.md#futher-optimization
The code is at https://code.itp.ac.cn/Osgood/crystalformer/-/blob/joblib/simple_loss.py#L63
What could be causing this, and how might it be fixed?