Stops when I run in parallel
Hello everyone,
When I run a parallel case it stops (or sometimes it succeeds) without any error message. It seems to be busy (all CPUs at 100%) but there is no progress. It happens at the beginning or later on; it's a kind of random error. I'm using OpenFOAM 1.6.x on Ubuntu 9.10 with gcc 4.4.1 as the compiler. I have no problem when I run a case on a single processor. Does anyone have an idea of what is happening? Here is a case which runs and then stops; I just modified the number of processors from the tutorial case. Thank you for your help. Nolwenn

Code:

OpenFOAM sourced
mecaflu@monarch01:~$ cd OpenFOAM/mecaflu-1.6.x/run/damBreak/
mecaflu@monarch01:~/OpenFOAM/mecaflu-1.6.x/run/damBreak$ mpirun -np 6 interFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6.x                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-605bfc578b21
Exec   : interFoam -parallel
Date   : May 04 2010
Time   : 18:46:25
Host   : monarch01
PID    : 23017
Case   : /media/teradrive01/mecaflu-1.6.x/run/damBreak
nProcs : 6
Slaves : 5 ( monarch01.23018 monarch01.23019 monarch01.23020 monarch01.23021 monarch01.23022 )
Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time
Create mesh for time = 0
Reading g
Reading field p
Reading field alpha1
Reading field U
Reading/calculating face flux field phi
Reading transportProperties
Selecting incompressible transport model Newtonian
Selecting incompressible transport model Newtonian
Selecting turbulence model type laminar
time step continuity errors : sum local = 0, global = 0, cumulative = 0
DICPCG:  Solving for pcorr, Initial residual = 0, Final residual = 0, No Iterations 0
time step continuity errors : sum local = 0, global = 0, cumulative = 0
Courant Number mean: 0 max: 0
Starting time loop |
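For context, the standard parallel workflow for the damBreak tutorial goes roughly like this (a sketch; the -np count must match numberOfSubdomains in system/decomposeParDict):

Code:

blockMesh                            # build the mesh
setFields                            # initialise the alpha1 field
decomposePar                         # split the case into processor* dirs
mpirun -np 6 interFoam -parallel     # run the solver in parallel
reconstructPar                       # reassemble the results afterwards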
Hi Nolwenn
I want to ask you: are you using distributed parallelization, or are you on a single machine giving the command mpirun -np 6 ..... ? Regards Jay |
Hi Jay,
I am using a single machine with mpirun -np 6 interFoam -parallel. When I run with 2 processors, it seems to run more iterations than with 4 or more... Regards Nolwenn |
Greetings Nolwenn,
It could be a memory issue. OpenFOAM is known to crash and/or freeze Linux boxes when there isn't enough memory. Check this post (or the whole thread it's on) for more on it: mpirun problems post #3. Also, try using the parallelTest utility - information is available in this post: OpenFOAM updates post #19. The parallelTest utility (it's part of OpenFOAM's test utilities) can help you sort out the more basic MPI problems, like communication problems, missing environment settings, or libraries not found, without running any particular solver functionality. For example: for some weird reason, there might be something missing in the mpirun command to allow the 6 cores to work properly together! Best regards, Bruno |
Hello Bruno,
Thank you for your answer. I ran parallelTest and obtained this: Code:
Executing: mpirun -np 6 /home/mecaflu/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log

I have 8 GiB of memory and 3 GiB of swap, so memory seems to be OK! Best regards Nolwenn |
Greetings Nolwenn,
That is a strange output... it seems a bit out of sync :( It has happened to me once some time ago, but the OpenFOAM header always came first! Doesn't the script foamJob work for you? Or does it output the exact same thing? Another possibility is that it could actually reveal a bug in OpenFOAM! So, how did you decompose the domain for each processor? Best regards, Bruno |
Hello Bruno,
Here is the result of foamJob; I can't find a lot of information in it! Code:
/*---------------------------------------------------------------------------*\ Code:
// The FOAM Project // File: decomposeParDict

I use the gcc compiler; is it possible that another compiler would solve this? Thank you for your help Bruno! Best regards, Nolwenn |
How many processors or cores does your machine have?
I would presume that if you have 8 GB you probably only have a quad-core machine, hence I would only partition the domain into 4 volumes. If you have a dual-core machine, that would explain why it is OK with 2 processors, because that is all you have. Please post your machine specs so that we can try to be more helpful. Cheers, Scott |
Hello Scott!
I have 8 processors on my machine. I tried to find the specs: Code:
r3@monarch01:~$ cat /proc/cpuinfo

Best regards Nolwenn |
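A quick way to summarize that cpuinfo output (a sketch, assuming a standard Linux /proc layout; field names can vary by kernel):

Code:

# count logical processors
grep -c ^processor /proc/cpuinfo
# list unique physical sockets and cores per socket
grep -E "physical id|cpu cores" /proc/cpuinfo | sort -u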
Have you tried with all 8 processors?
I don't have this problem on mine when I use all of the processors. Make sure you use decomposePar to get 8 partitions before you try. Scott |
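For reference, a minimal decomposeParDict sketch for 8 subdomains (the (2 2 2) split is just one possibility; adjust n to suit the geometry):

Code:

// system/decomposeParDict
numberOfSubdomains 8;

method          simple;

simpleCoeffs
{
    n               (2 2 2);
    delta           0.001;
}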
Also, are these 8 processes all on the same machine or are they on different machines? I.e., is it a small cluster?
I haven't done this on a cluster setup before, so I can't be of any help with that. I was assuming that you had two quad-core processors on a single motherboard, but I just went through it again and it's either 8 dual-core processors, or 4 dual-core processors reporting a process for each core. Can you confirm exactly what it is, and maybe someone else can help you. If it's a cluster then you may have load issues, interconnect problems, or questionable installations on the other machines. Cheers, Scott |
Sorry, I am not very familiar with machine specs!
It is a single machine with 4 dual-core processors. When I run with all processors the problem is the same: Code:
Parallel processing using OPENMPI with 8 processors

Best regards Nolwenn |
I have encountered the same problem as Nolwenn. I use 12 cores on a single machine. parallelTest works fine and prints out results in a reasonable order. But when I run foamJob, the computation hangs on solving the first UEqn. All cores are at 100% but nothing is happening. |
solver: simpleFoam
case: pitzDaily
decomposition: simple
OpenFOAM 1.6.x

Here is the output from mpirun -H localhost -np 12 simpleFoam -parallel Code:
/*---------------------------------------------------------------------------*\

I tried mpirun's verbose mode, but that delivered no useful information either. Unfortunately I have no profiling tools at hand for parallel code. If anyone of you has Vampir or something similar and could try this out, that would be great. |
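One cheap way to see where hung ranks are stuck, without a profiler (a sketch, assuming gdb is installed; <PID> is a placeholder for a rank's process ID taken from ps):

Code:

# find the PIDs of the spinning solver processes
ps aux | grep simpleFoam
# attach to one rank, dump all thread stacks, then detach
gdb -batch -p <PID> -ex "thread apply all bt"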
Greetings to all,
Well, this is quite odd. The only solution that comes to mind is to test the same working conditions with other build scenarios (for example, the pre-built 1.6 binaries, or a rebuild using the ThirdParty gcc instead of the system gcc), because the only reason I can think of for the solvers to just jam up and not do anything productive is that something didn't get built the way it is supposed to be. As for the output from parallelTest coming out swapped, it could be a stdout buffering issue, where mpirun outputs the text from the slaves prior to the master's output, because the master's output didn't fill up fast enough to trigger the character-count limit that forces a flush. Best regards, Bruno |
Hello everyone,
Now everything seems to be OK for me! I went back to OF 1.6 (pre-built) on Ubuntu 8.04 and I no longer have the problem. Thank you again for your help Bruno! Cheers, Nolwenn |
Hello everyone,
I'm experiencing the very same problem with openSUSE. I tried the pre-compiled 1.6 version and it worked! My problem arises again when I recompile OpenMPI. I do this in order to add the torque (batch system) and OFED options. Since we have a small cluster, these options are necessary for running cases on more than one node. Even if I recompile OpenMPI without these options (and just recompile it, nothing else), I get the same problem! (Calculations stop, sometimes earlier, sometimes later, and sometimes at the beginning, without any error message and keeping all CPUs at 100%.) This is quite strange - I would be glad if someone has further ideas... I'll keep you informed if I make some progress. regards Gonzalo |
Greetings Gonzalo,
Let's see... here are my questions for you:
The easiest way to avoid these issues would be to use the same version of the distros that the pre-built binaries came from (namely, if I'm not mistaken, Ubuntu 9.04 and openSUSE 11.0 or 11.1), because they have gcc 4.3.3 as their system compiler. Best regards, Bruno |
Hello Bruno, hello all,
thanks for your comments. I have now compiled OpenMPI again and it worked! I first tried to compile it with the system's gcc (4.4.1) of openSUSE 11.2, which apparently caused the problems. Now I've tried it again with the ThirdParty gcc (4.3.3) and it works! In both cases I compiled it with Allwmake from the ThirdParty-1.6 directory, after uncommenting the openib and openib-libdir options and adding the --with-tm option for torque. Then I deleted the openmpi-1.3.3/platform dir and executed Allwmake in ThirdParty-1.6. After this, it wasn't necessary to recompile OpenFOAM. I have now run first tests with 2 nodes and a total of 16 processes (the finer damBreak tutorial) and it seems to work fine! It still seems strange to me, since I did the same for 1.6.x and it didn't work! I'll try the system compiler for both OpenFOAM-1.6.x and ThirdParty when I have more time. Thanks again! Gonzalo |
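A sketch of the rebuild sequence described above (directory names assume a standard ThirdParty-1.6 layout; the exact paths and the --with-tm location are installation-specific):

Code:

cd ~/OpenFOAM/ThirdParty-1.6
# edit Allwmake first: uncomment the openib and openib-libdir options
# and add --with-tm for torque support
rm -rf openmpi-1.3.3/platform    # remove the previous build output
./Allwmake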
parallel problem
Hi,
I've got a problem running a case in parallel (one machine, quad core). I'm using the OpenFOAM 1.6 pre-built binaries on Fedora 12. The error I get is:

/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6-f802ff2d6c5a
Exec   : interFoam -parallel
Date   : May 28 2010
Time   : 12:27:10
Host   : blue
PID    : 23136
Case   : /home/bunni/OpenFOAM/OpenFOAM-1.6/tutorials/quartcyl
nProcs : 2
Slaves : 1 ( blue.23137 )
Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time
Create mesh for time = 0
[blue:23137] *** An error occurred in MPI_Bsend
[blue:23137] *** on communicator MPI_COMM_WORLD
[blue:23137] *** MPI_ERR_BUFFER: invalid buffer pointer
[blue:23137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 23137 on node blue exiting
without calling "finalize". This may have caused other processes in the
application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[blue:23135] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[blue:23135] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

So I take it the program is crashing in the mesh part? It seems to run fine on a single processor (and another geometry I had ran fine in parallel). I've meshed a quarter of a cylinder, with the cylinder aligned on the z-axis, and done a simple decomposition along the z-axis, thinking that the circular geometry might be causing the problem. Above, Bruno mentioned the scripts runParallel and parallelTest. Where are those scripts? Cheers |
Greetings bunni,
Quote:
Best regards, Bruno |
damBreak tutorial
Hi all!
I'm running through the tutorials and have problems with parallel running in the damBreak tutorial. This is the error I get: Quote:
Quote:
Regards Marco |
Hi Marco!
This has happened to me more than once :rolleyes: Quote:
Code:
cd ..

By the way, personally I've grown used to using the script foamJob (it runs the solver for you and writes the output to a file named "log"), so in your case I would use:

Code:
foamJob -s -p interFoam

Best regards, Bruno |
1 Attachment(s)
Ok, I'm back. After having installed OpenFOAM 1.6.x, I'm having exactly the same problem running in parallel as I was before. It runs fine on a single processor.
I've tried to attach the output from the screen, which is what I posted above. I'll try to post the details of the case in another message. |
tgz with the file stuff
1 Attachment(s)
Here is the data to recreate the case. You'll need to run blockMesh on it, but hopefully the rest of the files are there. I've saved it as quart.tgz.gz because the uploader would not take quart.tgz. Therefore:
step 1: $ mv quart.tgz.gz quart.tgz
step 2: $ tar xvfz quart.tgz

and a directory tree called qcyl should be created. You can descend into it to run blockMesh, etc. It has been running for days with 1 proc, but crashes immediately with 2 or more. I've got a simple decomposition through the z-plane with 2 procs in the decomposeParDict. Anyway, thanks for any ideas. |
Hi Bunni,
Well, the same things that happened to you have happened to me too. I've confirmed that my OpenMPI is working with OpenFOAM, by testing parallelTest and the interFoam/laminar/damBreak case with dual-core parallel execution. edit: I forgot to mention that I used Ubuntu 8.04 i686 and OpenFOAM 1.6.x. I've managed to solve the error you get, in part. Just edit the file "OpenFOAM-1.6.x/etc/settings.sh", find the variable minBufferSize and increase the buffer size: Code:
# Set the minimum MPI buffer size (used by all platforms except SGI MPI)

But from the tests I've made of increasing the buffer size, it only seems to postpone the crash. interFoam always seems to crash during the preparation part of the case. Now I know why other users report that it freezes... in fact, it just takes really long playing around with meshes and memory and MPI messages, at least AFAIK, and sooner or later it will just crash without doing anything useful :( edit2: yep, 400 MB of buffer and it still crashes... So, Bunni, I suggest that you try increasing that buffer variable, in an attempt to avoid the crashing. But my best bet is what I've said to you previously about OpenFOAM 1.6: this seems to be a bug in OpenFOAM, which apparently is yet to be fixed! So please post a bug report on the bug report part of the OpenFOAM forum: http://www.cfd-online.com/Forums/openfoam-bugs/ If you want to save some time, you can refer to your post #23 from here onward! Best regards, Bruno |
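For reference, the block being edited in etc/settings.sh looks roughly like this (a sketch from memory of 1.6.x; the default value may differ between versions). OpenMPI's buffered sends use the exported MPI_BUFFER_SIZE, so raising minBufferSize raises that limit:

Code:

# Set the minimum MPI buffer size (used by all platforms except SGI MPI)
minBufferSize=20000000

if [ "${MPI_BUFFER_SIZE:=$minBufferSize}" -lt $minBufferSize ]
then
    MPI_BUFFER_SIZE=$minBufferSize
fi
export MPI_BUFFER_SIZE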
thanks
Thanks for checking that. I will post a bug report. I'm running on Fedora and CentOS. I'll check out that variable change. As for right now, it's been running on a single processor without crashing for 4 days, so the run itself is stable.
|
Hello,
I have recently encountered the same problem as Nolwenn and Gonzalo: the solver stops at the first time loop without any error message, or the job exits the queue (procs still occupied at 100%). I am on OF 1.7.1 (with gcc 4.5.1) on a cluster with RHEL 5.4. The issue only occurs when running large cases; smaller cases work perfectly fine, and there is plenty of memory per node even for the large cases (~50 GB). The parallelTest utility reports fine, as suggested above. Is there any known fix for this issue besides switching compilers? If not, which compiler should I switch to for OF 1.7.1, given that there is no default compiler? Thanks in advance for any helpful advice, Perry |
Greetings Perry,
Before I answer you, I just want to wrap up the solution to bunni's predicament - the thread where the solution is, is this one: http://www.cfd-online.com/Forums/ope...-parallel.html Now back to you, Perry: when it comes to the issue of compiler version, there are two or three other libraries whose versions are also important, namely GMP, MPFR and MPC. For example, from my experience, MPFR 3.0.0 doesn't work very well, so I still hang on to the older 2.4.2 version. As for gcc 4.5.1, it should work just fine with OpenFOAM 1.7.1. You might, on the other hand, be triggering a couple of old bugs that have been solved since then. As I vaguely remember, they were related to some issue with a cyclic or wedge or some other special type of patch that would crash the solver when used in parallel. Aside from such old bugs, one still needs to use (if I'm not mistaken) the "preservePatches" parameter in decomposeParDict. Either way, I've got a blog post where I'm gathering information on how to run in parallel with OpenFOAM (it's accessible from the link in my signature): Notes about running OpenFOAM in parallel. Several of the entries there might interest you.
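A minimal sketch of that preservePatches entry (the patch names here are placeholders; list the actual cyclic pair from your own case):

Code:

// in system/decomposeParDict
// keep both halves of the cyclic patch pair on the same processor
preservePatches ( cyclicHalf0 cyclicHalf1 );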
Best regards, Bruno |
Quote:
2) I'm using a custom solver based on simpleFoam (with an extra equation for passive scalar transport), but have also tested on simpleFoam itself with no difference.
3) I have one cyclic patch, 3 directMapped patches, and a number of inlets, outlets, and walls.
4) Right now, I'm using RAS, k-w SST.
5) I have not tried scaling the geometry, but I have run the same geometry on a coarser mesh successfully with the same boundary conditions. I only experience this problem on my fine mesh.

Quote:
Thanks very much for your help, Perry |
Hi Perry,
Quote:
Oh, here is the link to the makeGcc file on ThirdParty 2.0.x: https://github.com/OpenFOAM/ThirdPar...master/makeGcc - as you can see, Gcc 4.5.x needs MPFR, GMP and MPC to build properly. Quote:
Quote:
2) OK, then the problem must be elsewhere...
3) Are the directMapped patches also protected by the preservePatches parameter?
4) OK, seems pretty standard...
5) Have you tried visualizing the sub-domains in ParaView, to check where things are being split? Have you executed checkMesh on the fine-resolution mesh before decomposing, to verify that the mesh is OK?

There is an environment variable that OpenMPI uses that is defined in settings.sh... ah, line 347: https://github.com/OpenCFD/OpenFOAM-...ttings.sh#L347 - try increasing that value, perhaps 10x, although this is only a valid solution in some cases. And I know I've seen more reports like this before... and if I'm not mistaken, most were related to patches being split between sub-domains, but my memory hasn't been very trustworthy lately :( If my memory gets better, I'll search for what I've read in the past and post here. ...Wait... maybe it's the nonBlocking flag: https://github.com/OpenCFD/OpenFOAM-...ntrolDict#L875 - have you tried the other possibilities for the commsType parameter? I know there was a bug report a while back that was fixed in 2.0.x... here we go, it's in fact related to "directMappedPatch", although it might not affect your case: http://www.openfoam.com/mantisbt/view.php?id=280 Best regards and good luck! Bruno |
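For reference, the commsType switch mentioned above lives in the global etc/controlDict and looks roughly like this (a sketch per 1.7.x/2.0.x; blocking and scheduled are the alternatives to try):

Code:

OptimisationSwitches
{
    // options: blocking, nonBlocking, scheduled
    commsType       nonBlocking;
}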
Quote:
It simply stops after building the mesh and fields, when starting the first time loop ("Time = 1"), as if it is taking hours to complete the U-eqn; it also stops here when running serially (just tested).
Preserving patches for directMapped does not make sense to me, since a directMapped patch is not a shared-boundary situation, but rather a case where the inlet looks at the nearest interior cell to a given offset location and takes the value there. Is there a good way to visualize all of the sub-domains in one ParaView session? Quote:
1) Two regions not connected by any faces (which is purposeful for my simulation, e.g. one region feeds the other via a directMappedPatch).
2) 156 non-orthogonal faces, but it still says OK (max 86.4).
3) 3 skew faces; it says that the mesh fails, but this has not stopped me in the past.

Would you think this problem could be related to the 3 skewed faces (max skewness 4.66)? It doesn't seem like skew cells could prevent the solver from running, but I could be wrong...?

Quote:
Thanks for your continued ideas, Perry |
Hi Perry,
Quote:
Best regards and good luck! Bruno |
Bruno,
Quote:
I have narrowed it down to one of the directMapped B.C.s (the other two are fine). When I switch it to fixedValue, everything works fine. Switch it back to directMapped, and it stalls at the first time loop. The checkMesh utility gives me a cellToRegions file, suggesting that I should use the splitMeshRegions utility and use two different regions. The problematic directMapped boundary pulls its data from a separate domain of the flow that is not connected geometrically. Could this explain the problems I am having? (I am completely unfamiliar with multiple regions in OF, other than the little reading I have done today. :D) Quote:
Regards, Perry |
Hi Perry,
Mmm... well, at least MPFR doesn't seem to be the one to blame... for now :) And this is getting further into details that I'm not very familiar with either. Did checkMesh on the coarser mesh give you the same information, namely that it should divide the mesh into two separate regions? For multiple regions, I only know about two solvers that should support this (I don't know if all other solvers support this or not); they are, if I'm not mistaken, chtMultiRegionFoam and chtMultiRegionSimpleFoam, with their respective tutorials under tutorials/heatTransfer.
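And if splitting the mesh does turn out to be the way to go, the step that checkMesh suggested is a single command (a sketch; run splitMeshRegions -help to see the flags available in your version):

Code:

# split the mesh into its disconnected regions,
# writing each region as a separate mesh
splitMeshRegions -overwrite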
Best regards and Good luck! Bruno |
Bruno,
I really appreciate all your help with this issue, and the side tips along the way. I narrowed it down to the influence of the directMapped boundary condition with the fine mesh I was using, so I played around with the meshing until I got one to work. I seem to have resolved the issue just by using a different mesh. Sincere regards, Perry |
non-interactive parallel run
greetings all!
I have installed OF-2.4.0 (with gcc-4.8.1, gmp-5.1.2, mpc-1.0.1, mpfr-3.1.2) on a cluster with CentOS 6.5, following these instructions: HTML Code:
https://openfoamwiki.net/index.php/Installation/Linux/OpenFOAM-2.3.0/CentOS_SL_RHEL

I use this command: Code:
$ nohup foamJob -s -p simpleFoam &

nohup.out looks like this: Code:
3 total processes killed (some possibly by mpirun during cleanup) |
2 Attachment(s)
Quote:
Hi Bruno, I am using OF v2012 on Ubuntu 20.04 Focal Fossa. I could not find the file "settings.sh" in the "OpenFOAM-v2012/etc" folder; I only found configuration files inside the "OpenFOAM-v2012/etc/config.sh" and "OpenFOAM-v2012/etc/config.csh" folders, and unfortunately I could not find the string "minBufferSize" in them. I wonder whether my OpenFOAM installation is correct or not. I followed this instruction: http://openfoamwiki.net/index.php/In...M-v1806/Ubuntu (changing the version string v1806 to v2012). I attach the settings files. I would greatly appreciate it if you could give me some hints. Thank you in advance. |