OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?

Old   October 31, 2013, 13:45
Default OF211 with mvapich2 on redhat cluster, error when using more than 64 cores?
  #1
Member
 
Ping
Join Date: Dec 2011
Posts: 63
Rep Power: 5
ripperjack is on a distinguished road
Hi guys,

I am using a cluster (Red Hat 6.4) that does not support Open MPI, so I compiled OpenFOAM with MVAPICH2 (GCC 4.4.7 and mvapich2-1.9). Small jobs on up to 64 cores (4 nodes, 16 cores per node) run without errors. But when I try to use 128 cores or more, I get the error shown below. I then recompiled OpenFOAM on another cluster that supports both Open MPI and MVAPICH2 (with Open MPI I can run on more than 512 cores there). A similar error appears: I cannot run on more than 64 cores with MVAPICH2!

It is really weird. Has anyone seen this error before? How can I fix it? Thanks in advance!

Regards,

Code:
Error output from the first cluster:
[cli_47]: aborting job:
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(436)...:
MPID_Init(371)..........: channel initialization failed
MPIDI_CH3_Init(285).....:
MPIDI_CH3I_CM_Init(1106): Error initializing MVAPICH2 ptmalloc2 library
....
.... many of these

[readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 43, pid: 92867) exited with status 1
][child_handler] MPI process (rank: 78, pid: 37914) exited with status 1
[readline] Unexpected End-Of-File on file descriptor 14. MPI process died?
[mtpmi_processops] Error while reading PMI socket. MPI process died?
[child_handler] MPI process (rank: 47, pid: 92871) exited with status 1
[child_handler] MPI process (rank: 69, pid: 37905) exited with status 1
][readline] Unexpected End-Of-File on file descriptor 16. MPI process died?

...
... many of these
Code:
Error output from the other cluster:
[cli_8]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_7]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_15]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_6]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_68]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[cli_66]: aborting job:
Fatal error in MPI_Init:
Other MPI error

[proxy:0:1@compute-0-72.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:1@compute-0-72.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@compute-0-72.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:6@compute-0-75.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:6@compute-0-75.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6@compute-0-75.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:7@compute-0-76.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:7@compute-0-76.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:7@compute-0-76.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@compute-0-10.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:2@compute-0-10.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@compute-0-10.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@compute-0-37.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:3@compute-0-37.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@compute-0-37.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@compute-0-40.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
[proxy:0:5@compute-0-40.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@compute-0-40.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@compute-0-6.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@compute-0-6.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@compute-0-6.local] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Old   October 31, 2013, 13:48
Default
  #2
Member
 
Ping
Join Date: Dec 2011
Posts: 63
Rep Power: 5
ripperjack is on a distinguished road
BTW, this is how I compiled OF-2.1.1 with mvapich2 on Red Hat 6.4:

In etc/bashrc, change WM_MPLIB=OPENMPI to WM_MPLIB=MPI.
In etc/config/settings.sh, replace:
MPI)
    export FOAM_MPI=mpi
    export MPI_ARCH_PATH=/opt/mpi
    ;;
with:
MPI)
    export FOAM_MPI=mpi
    export MPI_HOME=/opt/apps/intel13/mvapich2/1.9
    export MPI_ARCH_PATH=$MPI_HOME
    _foamAddPath $MPI_ARCH_PATH/bin
    _foamAddLib $MPI_ARCH_PATH/lib
    ;;

The rest of the compilation is the same as with Open MPI, and it finishes without errors.
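
After a rebuild like this, it can help to confirm which MPI library the solvers actually pick up at run time. The following is only a sketch: the helper name `mpi_libs` and the library name patterns are assumptions, and `simpleFoam` stands in for whichever solver is being run.

```shell
#!/bin/sh
# Filter `ldd`-style output (read from stdin) down to MPI-related libraries.
# The pattern list is an assumption; adjust it to the names on your system.
mpi_libs() {
    grep -E 'libmpi|libmvapich' || true
}

# Usage on the cluster (simpleFoam is used here only as an example solver):
#   ldd "$(command -v simpleFoam)" | mpi_libs
# The printed paths should point into the tree set as MPI_ARCH_PATH.
```

If the paths point at a different MPI installation than the one set in settings.sh, the environment was probably not re-sourced before building or running.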

Last edited by ripperjack; October 31, 2013 at 16:37.

Old   October 31, 2013, 22:34
Default
  #3
New Member
 
Jerome Vienne
Join Date: Oct 2013
Posts: 2
Rep Power: 0
Vienne will become famous soon enough
Hi Ripperjack,

This is an issue known to the MVAPICH2 team: it occurs when third-party libraries interact with MVAPICH2's internal memory (ptmalloc) library. They have received similar reports for MPI programs integrated with Perl and some other external libraries. The interaction causes the libc.so memory functions to appear before the MVAPICH2 library (libmpich.so) in the dynamic shared library ordering, which leads to the ptmalloc initialization failure.
MVAPICH2 2.0a handles this case and only prints a warning instead of crashing. I know there is another way to avoid it by changing the order of the linked libraries, but I don't remember exactly how it works.
For the time being, can you please try the run-time parameter MV2_ON_DEMAND_THRESHOLD=<your job size>? With this parameter your application should continue without the registration cache feature, but it could lead to some performance degradation. You can also try MVAPICH2 2.0a.
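
One way to pass the run-time parameter is on the `mpirun_rsh` command line, which accepts NAME=VALUE pairs before the executable. The sketch below only prints such a command; the helper name `build_launch_cmd`, the hostfile name `hosts`, and the solver are placeholders, not taken from the thread. The MVAPICH2 user guide lists a default of 64 for this threshold, as far as I can tell, which would be consistent with jobs failing only above 64 cores.

```shell
#!/bin/sh
# Print an mpirun_rsh command line that sets MV2_ON_DEMAND_THRESHOLD
# to the job size. Nothing is launched; the command is only echoed.
build_launch_cmd() {
    nprocs=$1      # total number of MPI ranks, e.g. 128
    hostfile=$2    # file listing the compute nodes (placeholder name)
    shift 2        # the remaining arguments are the program and its options
    printf 'mpirun_rsh -np %s -hostfile %s MV2_ON_DEMAND_THRESHOLD=%s %s\n' \
        "$nprocs" "$hostfile" "$nprocs" "$*"
}

# Example: show the command for a 128-rank OpenFOAM run.
#   build_launch_cmd 128 hosts simpleFoam -parallel
```

With the Hydra launcher (`mpiexec`, as in the second error log), the equivalent would be to export the variable in the job script or pass it with `-genv MV2_ON_DEMAND_THRESHOLD 128`.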

Thanks,
Jerome
__________________
Jerome Vienne, Ph.D
HPC Software Tools Group
Texas Advanced Computing Center (TACC)
viennej@tacc.utexas.edu | Phone: (512) 475-9322
Office: ROC 1.455B | Fax: (512) 475-9445

Old   October 31, 2013, 22:52
Default
  #4
Member
 
Ping
Join Date: Dec 2011
Posts: 63
Rep Power: 5
ripperjack is on a distinguished road
Dear Jerome,

Many thanks for your reply! I recompiled OpenFOAM with MVAPICH2 2.0a and it worked! As you said, there is just the warning shown below, but no error. I ran a test and OpenFOAM runs fine on more than 256 cores!

Thanks again for your time!

Best regards,

Code:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
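
To tell the two outcomes apart in a job log, a small check along these lines may be useful. It is a sketch that matches only the two ptmalloc messages quoted in this thread; the function name `classify_ptmalloc` and the labels it prints are made up for illustration.

```shell
#!/bin/sh
# Classify an MVAPICH2 job log (read from stdin) by the ptmalloc messages
# quoted in this thread. Prints one of: fatal, degraded, clean.
classify_ptmalloc() {
    log=$(cat)
    case "$log" in
        # Message seen with mvapich2-1.9: MPI_Init aborts.
        *"Error initializing MVAPICH2 ptmalloc2 library"*) echo fatal ;;
        # Message seen with MVAPICH2 2.0a: the job continues without the cache.
        *"Continuing without InfiniBand registration cache support"*) echo degraded ;;
        *) echo clean ;;
    esac
}

# Usage (the log file name is a placeholder):
#   classify_ptmalloc < log.simpleFoam
```

A "degraded" result means the job runs but without the registration cache, so some performance loss is possible, as noted earlier in the thread.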

Old   August 30, 2014, 03:47
Default similar problems
  #5
New Member
 
lvcheng
Join Date: Aug 2014
Posts: 1
Rep Power: 0
lvcheng is on a distinguished road
Hi Ripperjack and Vienne,

I am new to Linux and I run into a similar problem when I run FVCOM (an ocean model) using mvapich2_intel, as shown here:
mistake.jpg

My PBS script is as follows:
PBS.jpg

My .bash_profile is as follows:
baprofile.jpg

Could you give me some advice on how to solve it? Thanks a lot!

lvcheng

[node27:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
[node27:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node28:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[node28:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[node27:mpispawn_0][child_handler] MPI process (rank: 0, pid: 3374) terminated with signal 11 -> abort job
[node27:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node node27 aborted: MPI process error (1)
[node28:mpispawn_1][child_handler] MPI process (rank: 4, pid: 3240) terminated with signal 11 -> abort job
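
For logs like this one, it may help to pull out which ranks died and with which signal. The sketch below is written only against the `child_handler` lines quoted above; the function name `crashed_ranks` is hypothetical.

```shell
#!/bin/sh
# Print "rank signal" pairs from mpispawn child_handler lines read on stdin.
# Lines that do not match the expected wording are ignored.
crashed_ranks() {
    sed -n 's/.*MPI process (rank: \([0-9][0-9]*\), pid: [0-9][0-9]*) terminated with signal \([0-9][0-9]*\).*/\1 \2/p'
}

# Usage (the log file name is a placeholder):
#   crashed_ranks < job.err
```

Signal 11 on Linux is SIGSEGV, so these ranks crashed with a segmentation fault rather than aborting in MPI_Init, which suggests this may be a different problem from the ptmalloc failure discussed earlier in the thread.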
