Error when running in parallel (> 512 cores) using MVAPICH2

#1 | December 7, 2013, 18:21
Jack (ripperjack) | Member | Join Date: Dec 2011 | Posts: 94
Hi guys,

I am using a cluster to run OF 2.1.1 in parallel with MVAPICH2. I managed to run the job with up to 512 cores, but when I tried to run it with more than 512 cores, I got the error below. It is weird; could you please help? Thanks in advance!

Code:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
[0]
[0]
[0] --> FOAM FATAL IO ERROR:
[0] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[0]
[0] file: IOstream at line 0.
[0]
[0]     From function IOstream::fatalCheck(const char*) const
[0]     in file db/IOstreams/IOstreams/IOstream.C at line 114.
[0]
FOAM parallel run exiting
[0]
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
Regards,

#2 | December 8, 2013, 10:48
Jerome Vienne (Vienne) | New Member | Join Date: Oct 2013 | Posts: 2
Hi,

Could you please take a look at:
http://www.openfoam.org/mantisbt/view.php?id=296
__________________
Jerome Vienne, Ph.D
HPC Software Tools Group
Texas Advanced Computing Center (TACC)
viennej@tacc.utexas.edu | Phone: (512) 475-9322
Office: ROC 1.455B | Fax: (512) 475-9445

#3 | December 10, 2013, 12:05
Jack (ripperjack) | Member | Join Date: Dec 2011 | Posts: 94
Quote:
Originally Posted by Vienne
Hi,

Could you please take a look at:
http://www.openfoam.org/mantisbt/view.php?id=296

Hi Jerome,

Thanks for the link. I took a look at it, but it seems that user could not run any parallel job at all. In my case, I can run parallel jobs with up to 512 cores; the error only appears when I try to run the same job with more than 512 cores. It is weird. The same error comes up when I use the "sample" utility in parallel up to 512 cores.

Best regards

#4 | December 10, 2013, 16:38
Bruno Santos (wyldckat) | Retired Super Moderator | Location: Lisbon, Portugal | Join Date: Mar 2009 | Posts: 10,974
Greetings to all!

@ripperjack: The problem might not be with MVAPICH2; it might be with the case itself.
I wrote a post some time ago with a few questions... Ah, here we go:
Quote:
Originally Posted by wyldckat
Either way, I've got a blog post where I'm gathering information on how to run in parallel with OpenFOAM (it's accessible from the link on my signature): Notes about running OpenFOAM in parallel
The ones that might interest you:
Knowing a bit more about the large case might help in trying to isolate the problem, namely:
  • Which decomposition method are you using?
  • What's the solver being used, or if it's a customized version, which solver(s) is it based on?
  • What kinds of patches are involved? Any cyclic, wedge, baffle, etc...
  • What kind of turbulence models are being used, if any? RAS, LES, laminar or something else?
  • Have you tried gradually scaling up the size of your case? If so, did you take into account the respective calibration of the parameters in controlDict?
Last but not least, any chance of also trying OpenFOAM 2.0.1 or 2.0.x? Because if you are triggering a bug, it'll be easier to get help on this problem on the dedicated bug tracker.
In your specific case, trying OpenFOAM 2.1.x and 2.2.2, or even 2.2.x, would make it easier to isolate the possible source of the problem or, more specifically, to check whether this is a bug that has already been fixed.
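
For anyone gathering answers to the questions above, the decomposition settings can be read straight from the case. Here is a minimal Python sketch, assuming a standard `system/decomposeParDict` with `numberOfSubdomains` and `method` entries. The helper name `read_decomposition` and the sample dictionary text are hypothetical, and the regex only handles plain `key value;` entries, not the full OpenFOAM dictionary syntax.

```python
import re


def read_decomposition(dict_text):
    """Return (numberOfSubdomains, method) from decomposeParDict text.

    Only handles plain `key value;` entries; comments are stripped first.
    """
    # Drop /* ... */ block comments and // line comments.
    text = re.sub(r"/\*.*?\*/", "", dict_text, flags=re.S)
    text = re.sub(r"//[^\n]*", "", text)

    def entry(key):
        match = re.search(r"^\s*" + key + r"\s+([^;\s]+)\s*;", text, flags=re.M)
        return match.group(1) if match else None

    n = entry("numberOfSubdomains")
    return (int(n) if n is not None else None, entry("method"))


# Hypothetical dictionary content, for illustration only.
sample = """
numberOfSubdomains 1024;
method          scotch;
"""
n_sub, method = read_decomposition(sample)
print(n_sub, method)
```

The subdomain count read this way can then be compared with the number of ranks actually requested from the MPI launcher, since the two are expected to match.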

Best regards,
Bruno

#5 | April 25, 2014, 12:08
Jack (ripperjack) | Member | Join Date: Dec 2011 | Posts: 94
Quote:
Originally Posted by wyldckat
Greetings to all!

@ripperjack: The problem might not be with MVAPICH2; it might be with the case itself.
I wrote a post some time ago with a few questions... Ah, here we go:

In your specific case, trying OpenFOAM 2.1.x and 2.2.2, or even 2.2.x, would make it easier to isolate the possible source of the problem or, more specifically, to check whether this is a bug that has already been fixed.

Best regards,
Bruno

Dear Bruno,

Thanks for your suggestions. I know this is an old thread, but the problem has still not been solved, which is frustrating. I just compiled OF 2.2.2 and 2.3 on the cluster and the error is still there. I actually started another thread about this issue earlier (see here). At that time I could run jobs with 16, 32, and 64 cores, but with more than 64 cores I got an error (different from the current one). Someone pointed out that this was a known bug in mvapich2-1.9 that was fixed in mvapich2-2.0, so I recompiled OF with mvapich2-2.0 and that problem went away. However, when I try to use more than 512 cores, I get the error shown in my first post. It is really weird, and I don't know whether it comes from MPI or from OF. Has anyone managed to run a large job (> 1024 cores) using mvapich2? I need to run some big jobs (~100 million cells), so a 512-core parallel run is not enough. I would greatly appreciate any help in fixing this issue. Many thanks in advance.
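
To put those numbers in perspective, here is the average load per core for a mesh of about 100 million cells. This is plain arithmetic in Python; the helper name `cells_per_core` is just for illustration, and it assumes a perfectly even decomposition, which a real decomposition only approximates.

```python
def cells_per_core(total_cells, cores):
    """Average number of cells each core would hold under an even split."""
    return total_cells / cores


total_cells = 100_000_000  # "~100 million" from the post, treated as exact here
for cores in (512, 1024, 2048):
    print(f"{cores:5d} cores: {cells_per_core(total_cells, cores):,.0f} cells per core")
```

At 512 cores that is a little under 200,000 cells per core, and about half that at 1024 cores.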

Best regards,
Ping

#6 | April 25, 2014, 12:36
Bruno Santos (wyldckat) | Retired Super Moderator | Location: Lisbon, Portugal | Join Date: Mar 2009 | Posts: 10,974
Greetings Ping,

Since I don't have any experience with MVAPICH, here's what I can suggest.
I had a look at their website and my first suggestion would be to contact their mailing list: http://mailman.cse.ohio-state.edu/ma...apich-discuss/

I had a look into their User Guide, and the following environment variables look suspicious (default values of 256 and 512) and might benefit from larger values:
  • MV2_VBUF_POOL_SIZE
  • MV2_VBUF_SECONDARY_POOL_SIZE
  • MV2_SRQ_SIZE
Any other buffer-related variables could probably use larger values as well.

Beyond that, I would also try larger values for any of the other variables.

Also, check the chapter "8 Scalability features and Performance Tuning for Large Scale Clusters" in the User Guide.
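
One way to experiment with larger values for the variables listed above is to set them in the environment handed to the launcher. The Python sketch below only builds that environment. The numeric values are placeholders rather than recommendations, the helper name `with_mv2_overrides` is hypothetical, and how the variables get forwarded to remote ranks depends on the launcher, so the MVAPICH2 User Guide remains the reference for a given setup.

```python
import os


def with_mv2_overrides(base_env, overrides):
    """Return a copy of base_env with the given MV2_* settings applied as strings."""
    env = dict(base_env)
    for name, value in overrides.items():
        if not name.startswith("MV2_"):
            raise ValueError(f"not an MVAPICH2 variable: {name}")
        env[name] = str(value)
    return env


# Placeholder values for experimentation only; the thread does not say
# which values (if any) resolve the error.
overrides = {
    "MV2_VBUF_POOL_SIZE": 1024,
    "MV2_VBUF_SECONDARY_POOL_SIZE": 512,
    "MV2_SRQ_SIZE": 1024,
}
env = with_mv2_overrides(os.environ, overrides)
for name in overrides:
    print(f"{name}={env[name]}")

# The resulting dict could then be passed as subprocess.run(cmd, env=env),
# where cmd is whatever launch command the cluster uses.
```

Keeping the overrides in one dictionary makes it easy to change a single value between runs and record which combination was tested.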

Best regards,
Bruno

#7 | April 25, 2014, 13:29
Dan Kokron (dkokron) | Member | Join Date: Dec 2012 | Posts: 33
RipperJack

The first thing to try is running a simple HelloWorld with 512p. I just confirmed that I can scale this simple code to at least 1280p under my build of mv2-2.0b.

http://www.dartmouth.edu/~rc/classes..._world_ex.html
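
With several hundred ranks, the HelloWorld output is easier to check with a script than by eye. The Python sketch below reports which ranks never printed anything. It assumes each rank prints a line containing "rank R of N"; that format, the helper name `missing_ranks`, and the sample output are all hypothetical, so the pattern needs adjusting to whatever the test program actually prints.

```python
import re


def missing_ranks(output):
    """Return (expected_size, sorted list of ranks that never reported).

    Assumes each rank prints a line containing "rank R of N"; that format
    is hypothetical, so adjust the pattern to the actual test program.
    """
    seen = set()
    size = 0
    for match in re.finditer(r"rank (\d+) of (\d+)", output):
        seen.add(int(match.group(1)))
        size = max(size, int(match.group(2)))
    return size, sorted(set(range(size)) - seen)


# Made-up output from a 4-rank run in which rank 2 never reported.
sample = "Hello from rank 0 of 4\nHello from rank 3 of 4\nHello from rank 1 of 4\n"
print(missing_ranks(sample))  # prints (4, [2])
```

An empty list of missing ranks at 512p and above would suggest that the MPI layer itself starts up correctly at that scale.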

Dan

#8 | June 25, 2015, 03:12
Lee Howe (leehowe) | New Member | Join Date: Dec 2014 | Posts: 2
Quote:
Originally Posted by ripperjack
Dear Bruno,

Thanks for your suggestions. I know this is an old thread, but the problem has still not been solved, which is frustrating. I just compiled OF 2.2.2 and 2.3 on the cluster and the error is still there. I actually started another thread about this issue earlier (see here). At that time I could run jobs with 16, 32, and 64 cores, but with more than 64 cores I got an error (different from the current one). Someone pointed out that this was a known bug in mvapich2-1.9 that was fixed in mvapich2-2.0, so I recompiled OF with mvapich2-2.0 and that problem went away. However, when I try to use more than 512 cores, I get the error shown in my first post. It is really weird, and I don't know whether it comes from MPI or from OF. Has anyone managed to run a large job (> 1024 cores) using mvapich2? I need to run some big jobs (~100 million cells), so a 512-core parallel run is not enough. I would greatly appreciate any help in fixing this issue. Many thanks in advance.

Best regards,
Ping

Hi, did you ever solve this weird problem? I get exactly the same error when I try to run on more than 128 processors with MVAPICH2-2.0 and OF 2.3.
