High system load running OF-2.2 on cluster, but not OF-2.1
Recently we updated our compute cluster from CentOS 5.5 to Debian 7, mostly because we had trouble compiling OpenFOAM 2.2.x with the GCC version available.
However, we are now seeing a large drop-off in performance, but only for cases running OF-2.2.x; OF-2.1.x performs, roughly, as before. We suspect there is a communication problem between the nodes, as jobs using only one (multi-cpu, multi-core) node do not show this behaviour. Gigabit ethernet is used for the interconnect.
Is anyone else experiencing this?
P.S. Both OF versions were built with the system GCC (version 4.7.2) and the third-party OpenMPI shipped with OpenFOAM (version 1.5.3 for OF-2.1.x and version 1.6.3 for OF-2.2.x). No changes were made to any OpenFOAM settings or compile flags.
Did you solve this problem? Have you tried other MPI libraries?
Unfortunately we have, despite considerable effort, not been able to solve our problems.
Besides the MPI supplied with OpenFOAM, we have tried several system MPIs. None of them made a difference in performance.
Good luck in your troubles and please let us know if you find a solution.
Here is my setup:
To test the runtime I used the motorbike tutorial:
(I used the 2.1.1 version of the tutorial for both OpenFOAM versions, because the newer tutorial does not work with 2.1.1.) The solver was simpleFoam, in its own version for 2.1.1 and for 2.2.2.
For both versions of OpenFOAM the test case was decomposed into 6 processor domains. The test ran on two nodes of a cluster, with each node using 3 processors. The nodes are connected with both gigabit ethernet and InfiniBand. I think mvapich2 uses InfiniBand automatically.
The jobs were started from the queuing system using mpiexec 0.84 from https://www.osc.edu/~djohnson/mpiexec/:
The runtime for OF 2.1.1 was:
ExecutionTime = 232.49 s ClockTime = 243 s
The runtime for OF 2.2.2 was:
ExecutionTime = 229.14 s ClockTime = 239 s
So it looks like everything is OK: the two runtimes differ by less than 2%.
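For anyone repeating this comparison on their own cluster, here is a minimal sketch in Python of how the two timing lines could be compared automatically. It assumes the solver log contains lines in the same `ExecutionTime = ... s ClockTime = ... s` form as quoted above. The function names `last_timing` and `compare` are made up for this illustration and are not part of OpenFOAM. A ClockTime that is much larger than ExecutionTime means the processes spent a lot of wall-clock time waiting rather than computing, which is one thing to look at when the interconnect is suspected.

```python
import re

# Timing line in the form quoted in this thread, e.g.
# "ExecutionTime = 232.49 s ClockTime = 243 s"
TIMING = re.compile(
    r"ExecutionTime\s*=\s*([\d.]+)\s*s\s+ClockTime\s*=\s*([\d.]+)\s*s"
)


def last_timing(log_text):
    """Return (execution_time, clock_time) from the last timing line in a log."""
    matches = TIMING.findall(log_text)
    if not matches:
        raise ValueError("no ExecutionTime/ClockTime line found")
    cpu, wall = matches[-1]
    return float(cpu), float(wall)


def compare(log_a, log_b):
    """Compare the final timings of two runs (hypothetical helper)."""
    a_cpu, a_wall = last_timing(log_a)
    b_cpu, b_wall = last_timing(log_b)
    return {
        # Negative value: run B finished faster than run A.
        "wall_change_percent": 100.0 * (b_wall - a_wall) / a_wall,
        # Ratio near 1: little time spent waiting outside the solver.
        "a_wall_over_cpu": a_wall / a_cpu,
        "b_wall_over_cpu": b_wall / b_cpu,
    }


if __name__ == "__main__":
    # The two timing lines quoted in this post.
    run_a = "ExecutionTime = 232.49 s ClockTime = 243 s"
    run_b = "ExecutionTime = 229.14 s ClockTime = 239 s"
    for key, value in compare(run_a, run_b).items():
        print(f"{key}: {value:.2f}")
```

In practice the two strings would be replaced by the contents of the solver log files from each run.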
In the past I also used gcc and it seemed to work. On the other hand, I also noticed that OpenFOAM sometimes hung the whole network of the cluster. In top, a lot of "kworker" and "migration" processes appeared, using lots of CPU time (on the head node and on the compute nodes).
I would guess that in your case the problem lies in the configuration of the networking hardware, combined with configuration problems in MPI. In my case, using the mpiexec mentioned above, which lets you specify the communication channel, seemed to help. (OK, this is all very vague.)