OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?
I am using a cluster (Red Hat 6.4) that does not support openmpi, so I compiled OF with mvapich2 (gcc 4.4.7 and mvapich2-1.9). I managed to run a small job using up to 64 cores (4 nodes, 16 cores/node) without any error. BUT when I tried to use 128 cores or more, I got the error shown below. I then re-compiled OF on another cluster that supports both openmpi and mvapich2 (I can run on more than 512 cores on this cluster using openmpi). A similar error appeared: I cannot run on more than 64 cores using mvapich2!
It is really weird. Has anyone seen this error before? How can I fix it? Thanks in advance!
BTW: I compiled OF-2.1.1 with mvapich2 on Red Hat 6.4 as follows:
In etc/bashrc, change WM_MPLIB=OPENMPI to WM_MPLIB=MPI.
In etc/config/settings.sh, replace:
The rest of the compilation is the same as with openmpi, and there were no errors.
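The etc/bashrc change described above could be scripted roughly like this. It is only a sketch: FOAM_BASHRC and its default path are assumptions about where OpenFOAM is installed, and switch_mplib is a hypothetical helper name. It does not cover the etc/config/settings.sh edit.

```shell
#!/bin/sh
# Sketch: switch the MPI selection in OpenFOAM's etc/bashrc.
# FOAM_BASHRC is a hypothetical variable; point it at your own etc/bashrc.
FOAM_BASHRC="${FOAM_BASHRC:-$HOME/OpenFOAM/OpenFOAM-2.1.1/etc/bashrc}"

switch_mplib() {
    # $1 = path to bashrc; a backup is kept next to it as <file>.orig
    cp "$1" "$1.orig" &&
    sed 's/WM_MPLIB=OPENMPI/WM_MPLIB=MPI/' "$1.orig" > "$1"
}

# Only touch the file if it exists at the assumed location
if [ -f "$FOAM_BASHRC" ]; then
    switch_mplib "$FOAM_BASHRC"
    grep 'WM_MPLIB=' "$FOAM_BASHRC"
fi
```

After the change, source etc/bashrc again in a fresh shell before re-compiling so that the new WM_MPLIB value is picked up.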
This is an issue known to the MVAPICH2 team: it occurs when third-party libraries interact with MVAPICH2's internal memory (ptmalloc) library. They received similar reports earlier for MPI programs integrated with Perl and some other external libraries. The interaction causes the libc.so memory functions to appear before the MVAPICH2 library (libmpich.so) in the dynamic shared-library ordering, which leads to a ptmalloc initialization failure.
MVAPICH2 2.0a handles this case and only prints a warning instead of crashing. I know there is another way to avoid it, by changing the order of the linked libraries, but I don't remember exactly how it works.
For the time being, can you please try the run-time parameter MV2_ON_DEMAND_THRESHOLD=<your job size>? With this parameter, your application should continue without the registration cache feature, but this could lead to some performance degradation. You can also try MVAPICH2 2.0a.
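As a rough illustration of the run-time parameter suggestion, the launch line with mpirun_rsh might look like the sketch below. NP, HOSTFILE and SOLVER are placeholder values (the thread does not name the solver or the hostfile), and the script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: pass MV2_ON_DEMAND_THRESHOLD to an MVAPICH2 job via mpirun_rsh.
# NP, HOSTFILE and SOLVER are hypothetical placeholders; adjust to your job.
NP="${NP:-128}"
HOSTFILE="${HOSTFILE:-hosts}"
SOLVER="${SOLVER:-simpleFoam}"

build_cmd() {
    # mpirun_rsh accepts VAR=value pairs between the hostfile and the executable
    echo "mpirun_rsh -np $NP -hostfile $HOSTFILE MV2_ON_DEMAND_THRESHOLD=$NP $SOLVER -parallel"
}

CMD=$(build_cmd)
echo "$CMD"

# Launch only when explicitly requested and mpirun_rsh is available
if [ "${RUN:-0}" = "1" ] && command -v mpirun_rsh >/dev/null 2>&1; then
    $CMD
fi
```

Setting the threshold equal to the number of processes follows the "<your job size>" wording above; the parameter is given on the mpirun_rsh command line so that it reaches every rank.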
Many thanks for your reply! I re-compiled OpenFOAM with MVAPICH2 2.0a and it worked! As you said, there is just a warning, as follows, but no error. I ran a test and OpenFOAM runs fine on more than 256 cores!
Thanks again for your time!
Hi Ripperjack and Vienne,
I am new to Linux and I ran into a similar problem when running FVCOM (an ocean model) using mvapich2_intel, as follows:
My PBS script is as follows:
My .bash_profile is as follows:
Could you give me some advice on how to solve it? Thanks a lot!
[node27:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[node27:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node28:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[node28:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node27:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3374) terminated with signal 11 -> abort job
[node27:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node node27 aborted: MPI process error (1)
[node28:mpispawn_1][child_handler] MPI process (rank: 4, pid: 3240) terminated with signal 11 -> abort job