multiple parallel jobs on one machine
I have some users who are running into a really odd problem. We have a 600+ core cluster, and parallel MPI jobs run fine on it most of the time. But if a user submits two jobs that have some of their slots on the same node, the jobs end up behaving very strangely. As a more detailed example, let's say a user submits two 10-core jobs, A and B, and they get divided up across the nodes like this:
node1 - 5 cores for A
node2 - 5 cores for A and 5 cores for B
node3 - 5 cores for B
If the jobs land like this, the runs start misbehaving. If, instead, the two jobs don't share any nodes at all, they both run fine. Any ideas what may be happening? This is using the bundled HP-MPI, by the way.
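In case it helps anyone reproduce this, here is a rough sketch of how the overlap could be spotted from a list of allocations. It is only an illustration: the `allocations` layout and the `shared_nodes` helper are my own made-up names, not anything produced by the scheduler or by HP-MPI, and the numbers are just the example above.

```python
from collections import defaultdict


def shared_nodes(allocations):
    """Return the nodes that host slots from more than one job.

    allocations: hypothetical mapping of job name -> {node name: core count}.
    Returns a mapping of node name -> {job name: core count} for shared nodes only.
    """
    jobs_on_node = defaultdict(dict)
    for job, nodes in allocations.items():
        for node, cores in nodes.items():
            jobs_on_node[node][job] = cores
    # Keep only the nodes where at least two jobs have slots.
    return {node: jobs for node, jobs in jobs_on_node.items() if len(jobs) > 1}


# The example from this post: two 10-core jobs split over three nodes.
allocations = {
    "A": {"node1": 5, "node2": 5},
    "B": {"node2": 5, "node3": 5},
}

print(shared_nodes(allocations))
```

For the layout above this reports node2 as the only node carrying both jobs, which is the case where the runs go wrong.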