Inconsistent parallel job running times
Hello all!
I keep posting on this forum because I find it really useful. I have recently run into some issues with parallel jobs. I am running potentialFoam and simpleFoam on several cluster nodes, and I am seeing very different running times depending on which nodes are selected. The run time can be multiplied by 5, or the job can even get stuck on the cluster, depending on the nodes selected! I am running openfoam-2.3.1 with mpirun-1.6.5 over InfiniBand.

Before I give you more information: has anyone seen this kind of problem? I would also like to know whether there is a tool or an OpenFOAM utility to output the amount of data transferred between the processors. I know Fluent has something to report the parallel data transfer. I have tried setting the Pstream debug switches to 1 in OpenFOAM, but the output is so low-level that it is impossible to draw any conclusions from it... |
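As an aside on the measurement question above: if per-message sizes can be captured from an MPI profiling layer (tools such as mpiP or IPM can produce this, unlike the Pstream debug switches), a small script can aggregate the traffic per rank pair. This is only a sketch; the trace format below is hypothetical and would need adapting to whatever tool actually produces it:

```python
import re
from collections import defaultdict

# Hypothetical trace lines of the form:
#   send 0 -> 3 : 524288 bytes
# Adapt this pattern to the real output of your profiling tool.
LINE = re.compile(r"send (\d+) -> (\d+) : (\d+) bytes")

def tally(lines):
    """Sum bytes transferred per (source, destination) rank pair."""
    totals = defaultdict(int)
    for line in lines:
        m = LINE.search(line)
        if m:
            src, dst, nbytes = int(m.group(1)), int(m.group(2)), int(m.group(3))
            totals[(src, dst)] += nbytes
    return dict(totals)

if __name__ == "__main__":
    trace = [
        "send 0 -> 1 : 1024 bytes",
        "send 0 -> 1 : 2048 bytes",
        "send 1 -> 0 : 512 bytes",
    ]
    print(tally(trace))  # per-pair byte totals
```

A per-pair breakdown like this makes it easy to spot rank pairs whose traffic crosses a switch boundary.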
I'm not aware of any utility to measure the parallel data transfer.
A couple of hints/questions:
Armin |
Thanks for your reply Armin,
To answer your questions:
1) No, I am using the standard OpenFOAM solvers, utilities, etc. that come with openfoam-2.3.1.
2) Between 300k and 1M, which I think should be OK.
3) I don't write any data, and I don't read any either (I start from steady boundary conditions)!
4) I am running this test at the moment, I will let you know!
5) Execution time and clock time are very similar; should I see a major difference? |
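For context on point 2, the 300k to 1M figure discussed here is the average cell count per MPI rank. A trivial check (the mesh size and rank count below are hypothetical, just to show the arithmetic):

```python
def cells_per_rank(total_cells, n_ranks):
    """Average cell count per MPI rank for a given decomposition."""
    return total_cells / n_ranks

# Hypothetical case: a 24M-cell mesh decomposed across 48 ranks.
print(cells_per_rank(24_000_000, 48))  # 500000.0, inside the 300k-1M window
```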
Meaning, the closer they are, the more time you actually spend computing and the less time is spent on other things like I/O. At least that's how it typically goes; there are exceptions, though. |
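The comparison above can be checked directly from a solver log: OpenFOAM prints both times on one line each time step, so the ratio of execution time to clock time is easy to extract. A small sketch (the sample line mirrors the usual OpenFOAM log format):

```python
import re

# OpenFOAM solver logs print lines like:
#   ExecutionTime = 123.45 s  ClockTime = 130 s
TIMES = re.compile(r"ExecutionTime = ([\d.]+) s\s+ClockTime = ([\d.]+) s")

def cpu_fraction(log_line):
    """Ratio of CPU (execution) time to wall-clock time.

    A value close to 1.0 means the run spends most of its time
    computing rather than waiting on I/O or communication.
    """
    m = TIMES.search(log_line)
    if m is None:
        raise ValueError("no timing info in line")
    exec_t, clock_t = float(m.group(1)), float(m.group(2))
    return exec_t / clock_t

print(cpu_fraction("ExecutionTime = 123.45 s  ClockTime = 130 s"))
```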
Hello, I am coming back to you with more information.

I have run OpenFOAM's Test-Parallel utility, and the output looks fine to me. Here is an example of the log file:

Just as a quick reminder, we observe this behaviour: running on a single switch, the case runs as expected at, let's say, 80 seconds per iteration. Running the same job across multiple switches, each iteration takes 250 seconds, so about 3 times more. I want to emphasize that the IB fabric seems to work correctly, as we don't observe any issue running commercial-grade CFD applications. We have built mpich3.1.3 from source and observe exactly the same behaviour as with openmpi (slow across switches, fast on a single switch), which suggests it is not MPI-related. Has anyone experienced this behaviour running parallel OpenFOAM jobs? Any pointer would be greatly appreciated! |
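For the record, the "3 times more" quoted above works out as a simple ratio of the two per-iteration timings reported in the thread:

```python
def slowdown(fast_s_per_iter, slow_s_per_iter):
    """Factor by which each iteration slows down across switches."""
    return slow_s_per_iter / fast_s_per_iter

# 80 s/iter on one switch vs 250 s/iter across switches.
print(slowdown(80, 250))  # 3.125, roughly the "3 times more" quoted
```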