CFD Online Discussion Forums: OpenFOAM Installation
OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?
(https://www.cfd-online.com/Forums/openfoam-installation/125812-of211-mvapich2-redhat-cluster-error-when-using-more-than-64-cores.html)

ripperjack October 31, 2013 12:45

OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?
 
Hi guys,

I am using a cluster (RedHat 6.4) that does not support OpenMPI, so I compiled OpenFOAM with MVAPICH2 (gcc 4.4.7 and mvapich2-1.9). I managed to run small jobs on up to 64 cores (4 nodes, 16 cores/node) with no errors. But when I tried to use 128 cores or more, I got the error shown below. I then re-compiled OpenFOAM on another cluster that supports both OpenMPI and MVAPICH2 (on that cluster I can run on more than 512 cores using OpenMPI). A similar error came up: I cannot run on more than 64 cores using MVAPICH2!

It is really weird. Have you guys met this error before? How can I fix it? Thanks in advance!

Regards,

Code:

error output from the first cluster
[cli_47]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(436)...:
MPID_Init(371)..........: channel initialization failed
MPIDI_CH3_Init(285).....:
MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
....
.... many of these

[readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 43, pid: 92867) exited with status 1
][child_handler] MPI process (rank: 78, pid: 37914) exited with status 1
[readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 47, pid: 92871) exited with status 1
[child_handler] MPI process (rank: 69, pid: 37905) exited with status 1
][readline] Unexpected End-Of-File on file descriptor 16. MPI process died?

...
... many of these

Code:

error output from the other cluster
[cli_8]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_7]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_15]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_6]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_68]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_66]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[proxy:0:1@compute-0-72.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:1@compute-0-72.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-72.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:6@compute-0-75.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:6@compute-0-75.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6@compute-0-75.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:7@compute-0-76.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:7@compute-0-76.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:7@compute-0-76.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@compute-0-10.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:2@compute-0-10.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@compute-0-10.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@compute-0-37.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:3@compute-0-37.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@compute-0-37.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@compute-0-40.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:5@compute-0-40.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@compute-0-40.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@compute-0-6.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@compute-0-6.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
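
For reference, a 128-core OpenFOAM run of the kind described above is typically decomposed and launched along the lines of the sketch below. This is not taken from the thread; the solver name, decomposition method, and output redirection are placeholders, and on a cluster the job would normally go through the scheduler rather than a bare mpiexec.

Code:

# Sketch only: decompose the case into 128 subdomains and run in parallel.
# system/decomposeParDict must request the matching count, e.g.:
#   numberOfSubdomains 128;
#   method             scotch;
decomposePar                                    # writes processor0..processor127 directories
mpiexec -n 128 interFoam -parallel > log 2>&1   # interFoam is a placeholder solver name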


ripperjack October 31, 2013 12:48

BTW: I compiled OF-2.1.1 with MVAPICH2 on RedHat 6.4 as follows:

In etc/bashrc, change WM_MPLIB=OPENMPI to WM_MPLIB=MPI.

In etc/config/settings.sh, replace:

Code:

MPI)
    export FOAM_MPI=mpi
    export MPI_ARCH_PATH=/opt/mpi
    ;;

with:

Code:

MPI)
    export FOAM_MPI=mpi
    export MPI_HOME=/opt/apps/intel13/mvapich2/1.9
    export MPI_ARCH_PATH=$MPI_HOME
    _foamAddPath $MPI_ARCH_PATH/bin
    _foamAddLib $MPI_ARCH_PATH/lib
    ;;

All the other compilation steps are the same as for OpenMPI, and there were no build errors.
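
After switching WM_MPLIB like this, the MPI-dependent parts of OpenFOAM have to be rebuilt against MVAPICH2. The thread does not show that step; the sketch below is the usual procedure for OpenFOAM 2.1.1, with the installation path as a placeholder.

Code:

# Sketch only: rebuild the MPI layer after changing WM_MPLIB (install path is a placeholder)
source $HOME/OpenFOAM/OpenFOAM-2.1.1/etc/bashrc     # pick up WM_MPLIB=MPI and the new MPI_ARCH_PATH
cd $WM_PROJECT_DIR/src/Pstream && ./Allwmake        # rebuild the parallel communication library
cd $WM_PROJECT_DIR/src/parallel && ./Allwmake       # rebuild decomposition/redistribution libraries

# Quick check that a solver now pulls in MVAPICH2 rather than Open MPI
ldd $(which icoFoam) | grep -i mpi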

Vienne October 31, 2013 21:34

Hi Ripperjack,

This is a known issue for the MVAPICH2 team: it shows up when certain third-party libraries interact with MVAPICH2's internal memory (ptmalloc) library. They received similar reports earlier for MPI programs integrated with Perl and some other external libraries. The interaction causes the libc.so memory functions to appear before the MVAPICH2 library (libmpich.so) in the dynamic shared library ordering, which leads to the ptmalloc initialization failure.
MVAPICH2 2.0a can handle this and only prints a warning instead of crashing. I know there is another way to avoid it by changing the order of the linked libraries, but I don't remember exactly how it works.
For the time being, can you please try the run-time parameter MV2_ON_DEMAND_THRESHOLD=<your job size>? With this parameter your application should continue without the registration cache feature, though it could lead to some performance degradation. You can also try MVAPICH2 2.0a.

Thanks,
Jerome
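
For readers who want to try Jerome's suggestion, the run-time parameter can be passed at launch roughly as follows; the job size of 128, the hostfile name, and the solver are placeholders, not values from the thread.

Code:

# With MVAPICH2's mpirun_rsh, environment variables are given on the command line:
mpirun_rsh -np 128 -hostfile ./hosts MV2_ON_DEMAND_THRESHOLD=128 interFoam -parallel

# With the Hydra launcher (mpiexec), the variable can be forwarded explicitly:
mpiexec -n 128 -env MV2_ON_DEMAND_THRESHOLD 128 interFoam -parallel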

ripperjack October 31, 2013 21:52

Quote:

Originally Posted by Vienne (Post 460096)

Dear Jerome,

Many thanks for your reply! I re-compiled OpenFOAM with MVAPICH2 2.0a and it worked! As you said, there is just a warning, shown below, but no error. I ran a test and OpenFOAM runs fine on more than 256 cores!

Thanks again for your time!

Best regards,

Code:

WARNING: Error in initializing MVAPICH2 ptmalloc library. Continuing without InfiniBand registration cache support.

lvcheng August 30, 2014 03:47

Similar problem
 
3 Attachment(s)
hi, Ripperjack and Vienne,

I am a novice Linux user and I am running into a similar problem when I run FVCOM (an ocean model) with mvapich2_intel, as shown below:
Attachment 33418


My PBS script is as follows:
Attachment 33419

My .bash_profile is as follows:
Attachment 33420

Could you give me some advice on how to solve this? Thanks a lot!

lvcheng

[node27:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[node27:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node28:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[node28:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node27:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3374) terminated with signal 11 -> abort job
[node27:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node node27 aborted: MPI process error (1)
[node28:mpispawn_1][child_handler] MPI process (rank: 4, pid: 3240) terminated with signal 11 -> abort job
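
The three attachments (the error screenshot, the PBS script, and the .bash_profile) are not reproduced here. As a rough illustration only, an MVAPICH2 job of the kind being described is often submitted with a PBS script along these lines; this is not lvcheng's actual script, and the resource request, module name, and executable invocation are placeholders.

Code:

#!/bin/bash
#PBS -N fvcom_test
#PBS -l nodes=2:ppn=8
#PBS -l walltime=02:00:00
#PBS -j oe

# Placeholder environment setup; a real cluster uses its own module names/paths
module load mvapich2_intel

cd $PBS_O_WORKDIR

# 16 ranks via MVAPICH2's mpirun_rsh; FVCOM's own command-line arguments are omitted
mpirun_rsh -np 16 -hostfile $PBS_NODEFILE MV2_ON_DEMAND_THRESHOLD=16 ./fvcom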

