CFD Online Discussion Forums

Error when running in parallel (> 512 cores) using MVAPICH2 (https://www.cfd-online.com/Forums/openfoam-solving/127280-error-when-running-parallel-512-cores-using-mvapich2.html)

ripperjack December 7, 2013 18:21

Error when running in parallel (> 512 cores) using MVAPICH2
 
Hi guys,

I am using a cluster to run OF 2.1.1 in parallel using MVAPICH2. I managed to run the job with up to 512 cores, but when I tried to run it with more than 512 cores, I got the error below. It is weird. Could you please help? Thanks in advance!

Code:

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
[0]
[0]
[0] --> FOAM FATAL IO ERROR:
[0] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[0]
[0] file: IOstream at line 0.
[0]
[0]    From function IOstream::fatalCheck(const char*) const
[0]    in file db/IOstreams/IOstreams/IOstream.C at line 114.
[0]
FOAM parallel run exiting
[0]
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

Regards,

Vienne December 8, 2013 10:48

Hi,

Could you please take a look at:
http://www.openfoam.org/mantisbt/view.php?id=296

ripperjack December 10, 2013 12:05

Quote:

Originally Posted by Vienne (Post 465341)
Hi,

Could you please take a look at:
http://www.openfoam.org/mantisbt/view.php?id=296

Hi Jeorme,

Thanks for this link. I took a look at it, but it seems that the person there simply cannot run any parallel job. In my case, I can run a parallel job with up to 512 cores, but when I try to run the same job with more than 512 cores, I get this error. It is weird. The same error comes up when I use the "sample" utility in parallel with up to 512 cores.

Best regards

wyldckat December 10, 2013 16:38

Greetings to all!

@ripperjack: The problem might not be with MVAPICH2; it might be with the case itself.
I know I wrote a post some time ago with a few questions... Ah, here we go:
Quote:

Originally Posted by wyldckat (Post 324934)
Either way, I've got a blog post where I'm gathering information on how to run in parallel with OpenFOAM (it's accessible from the link in my signature): Notes about running OpenFOAM in parallel
The ones that might interest you:
Knowing a bit more about the large case might help in trying to isolate the problem, namely:
  • Which decomposition method are you using?
  • What's the solver being used, or if it's a customized version, which solver(s) is it based on?
  • What kinds of patches are involved? Any cyclic, wedge, baffle, etc...
  • What kind of turbulence models are being used, if any? RAS, LES, laminar or something else?
  • Have you tried gradually scaling up the size of your case? If so, did you take into account the respective calibration of the parameters in controlDict?
Last but not least, any chance of also trying OpenFOAM 2.0.1 or 2.0.x? Because if you are triggering a bug, it'll be easier to get help on this problem on the dedicated bug tracker.

In your specific case, trying OpenFOAM 2.1.x and 2.2.2, or even 2.2.x, would make it easier to isolate the possible source of the problem or, more specifically, to find out whether this is a bug that has already been fixed.

Best regards,
Bruno
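
Editorial sketch: the first question in the list above (which decomposition method is in use) can be answered by reading the case's system/decomposeParDict. The Python snippet below is a minimal sketch, not a diagnosis of this error. It assumes the dictionary uses plain `keyword value;` entries, and the sample text and the 1024-core request are hypothetical values chosen for illustration.

```python
import re

# Hypothetical excerpt of system/decomposeParDict; the values are illustrative only.
SAMPLE_DICT = """
// decomposition settings
numberOfSubdomains 1024;
method          scotch;
"""


def strip_comments(text):
    """Remove /* ... */ block comments and // line comments."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", text)


def read_entry(text, keyword):
    """Return the value of a simple `keyword value;` entry, or None if absent."""
    pattern = rf"^\s*{re.escape(keyword)}\s+([^;]+);"
    match = re.search(pattern, strip_comments(text), flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def summarize_decomposition(text, requested_cores):
    """Collect the decomposition method and compare the subdomain count with the request."""
    subdomains = read_entry(text, "numberOfSubdomains")
    count = int(subdomains) if subdomains is not None else None
    return {
        "method": read_entry(text, "method"),
        "subdomains": count,
        "matches_request": count == requested_cores,
    }


summary = summarize_decomposition(SAMPLE_DICT, requested_cores=1024)
print(summary)
```

Posting this summary together with the solver name, patch types and turbulence model covers most of the questions in the list.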

ripperjack April 25, 2014 12:08

Quote:

Originally Posted by wyldckat (Post 465801)
Greetings to all!

@ripperjack: The problem might not be with MVAPICH2; it might be with the case itself.
I know I wrote a post some time ago with a few questions... Ah, here we go:

In your specific case, trying OpenFOAM 2.1.x and 2.2.2, or even 2.2.x, would make it easier to isolate the possible source of the problem or, more specifically, to find out whether this is a bug that has already been fixed.

Best regards,
Bruno

Dear Bruno,

Thanks for your suggestions. I know this is an old thread, but the problem has still not been solved, which is frustrating. I just compiled OF 2.2.2 and 2.3 on the cluster and the error is still there. Actually, I started another thread about this issue earlier (see here). At that time, I could run jobs with 16, 32, and 64 cores, but with more than 64 cores a different error appeared. Someone pointed out that this was a known bug in mvapich2-1.9 that was fixed in mvapich2-2.0. So I recompiled OF using mvapich2-2.0, and that problem was fixed. However, when I tried to use more than 512 cores, I got the error shown in my first post. It is really weird, and I don't know whether the cause is MPI or OF. Has anyone managed to run a large job (> 1024 cores) using mvapich2? I need to run some big jobs (meshes of ~100 million cells), so a 512-core parallel run is not enough. I would greatly appreciate any help with this issue. Many thanks in advance.

Best regards,
Ping

wyldckat April 25, 2014 12:36

Greetings Ping,

Since I don't have any experience with MVAPICH, here's what I can suggest.
I had a look at their website and my first suggestion would be to contact their mailing list: http://mailman.cse.ohio-state.edu/ma...apich-discuss/

I had a look at their User Guide, and the following environment variables seem suspicious (default values of 256 and 512) and might benefit from larger values:
  • MV2_VBUF_POOL_SIZE
  • MV2_VBUF_SECONDARY_POOL_SIZE
  • MV2_SRQ_SIZE
Any other buffer related values could probably use larger values as well.

Beyond that, I would also try larger values for any of the other variables.

Also, check the chapter "8 Scalability features and Performance Tuning for Large Scale Clusters" in the User Guide.

Best regards,
Bruno
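
Editorial sketch: one way to experiment with the three variables named above is to pass them on the launcher command line. The Python snippet below only builds and prints such a command; it does not run anything. It assumes the MVAPICH2 `mpirun_rsh` launcher with its `VAR=value` convention before the executable. The value 1024, the host file name `hosts` and the solver `pisoFoam` are placeholders picked for illustration, not recommendations from the User Guide.

```python
import shlex

# Placeholder values: the post only says the defaults are 256 and 512, so these
# larger numbers are a starting point for experiments, not tuned settings.
TUNING = {
    "MV2_VBUF_POOL_SIZE": 1024,
    "MV2_VBUF_SECONDARY_POOL_SIZE": 1024,
    "MV2_SRQ_SIZE": 1024,
}


def build_command(cores, hostfile, solver, tuning):
    """Build an mpirun_rsh command line that sets the MV2_* variables for one run."""
    env_args = [f"{name}={value}" for name, value in sorted(tuning.items())]
    return [
        "mpirun_rsh", "-np", str(cores), "-hostfile", hostfile,
        *env_args,
        solver, "-parallel",
    ]


# "hosts" and "pisoFoam" are hypothetical; substitute the real host file and solver.
command = build_command(1024, "hosts", "pisoFoam", TUNING)
print(shlex.join(command))
```

Changing one variable at a time between runs makes it easier to tell which setting, if any, affects the failure.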

dkokron April 25, 2014 13:29

RipperJack

The first thing to try is running a simple HelloWorld with 512p. I just confirmed that I can scale this simple code to at least 1280p under my build of mv2-2.0b.

http://www.dartmouth.edu/~rc/classes..._world_ex.html

Dan
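
Editorial sketch: the linked example is not reproduced here. The Python snippet below is a comparable hello-world test, assuming the mpi4py package is installed and built against the same MVAPICH2 library as OpenFOAM. The serial fallback and the "ranks reported" summary line are additions for illustration.

```python
import socket


def greeting(rank, size, host):
    """Format the line that each MPI rank reports."""
    return f"Hello from rank {rank} of {size} on {host}"


def main():
    try:
        # Assumption: mpi4py is available and linked against the cluster's MVAPICH2.
        from mpi4py import MPI
    except ImportError:
        # Serial fallback so the script still runs where mpi4py is missing.
        print(greeting(0, 1, socket.gethostname()) + " (mpi4py not available)")
        return

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Gather one line per rank on rank 0 so that a missing rank is easy to spot.
    lines = comm.gather(greeting(rank, size, MPI.Get_processor_name()), root=0)
    if rank == 0:
        print("\n".join(lines))
        print(f"{len(lines)} of {size} ranks reported")


if __name__ == "__main__":
    main()
```

If a script like this starts cleanly on more than 512 ranks, the MPI installation itself is less likely to be the cause, and attention can shift to the OpenFOAM case.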

leehowe June 25, 2015 03:12

Quote:

Originally Posted by ripperjack (Post 488214)
Dear Bruno,

Thanks for your suggestions. I know this is an old thread, but the problem has still not been solved, which is frustrating. I just compiled OF 2.2.2 and 2.3 on the cluster and the error is still there. Actually, I started another thread about this issue earlier (see here). At that time, I could run jobs with 16, 32, and 64 cores, but with more than 64 cores a different error appeared. Someone pointed out that this was a known bug in mvapich2-1.9 that was fixed in mvapich2-2.0. So I recompiled OF using mvapich2-2.0, and that problem was fixed. However, when I tried to use more than 512 cores, I got the error shown in my first post. It is really weird, and I don't know whether the cause is MPI or OF. Has anyone managed to run a large job (> 1024 cores) using mvapich2? I need to run some big jobs (meshes of ~100 million cells), so a 512-core parallel run is not enough. I would greatly appreciate any help with this issue. Many thanks in advance.

Best regards,
Ping

Hi, did you ever solve this weird problem? I get exactly the same error when I try to run on more than 128 processors with MVAPICH2-2.0 in OF 2.3.


All times are GMT -4.