CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   -   mpirun problems: exited on signal 11 (segmentation fault) (https://www.cfd-online.com/Forums/openfoam-solving/75740-mpirun-problems-exited-signal-11-segmentation-fault.html)

vaina74 May 4, 2010 08:01

mpirun problems: exited on signal 11 (segmentation fault)
 
I installed OpenFOAM-1.6.x and something strange happened. If I launch a parallel run:
Code:

foamJob -p -s simpleFoam
I obtain
Code:

mpirun noticed that process rank 1 with PID [4 digits] on node xxx-laptop
exited on signal 11 (segmentation fault)

and Ubuntu freezes!
Then I followed a test procedure (see here, posts 19-20) and the output seemed correct. I ran the case in parallel mode again and all was OK. A heisenbug, it was suggested.
Now the problem has come back. The parallel test output is:
Code:

Parallel processing using OPENMPI with 2 processors
Executing: mpirun -np 2 /home/giulia/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log
Building on  2  cores
Building on  2  cores
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6.x                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-069803848c44
Exec  : parallelTest -parallel
Date  : May 04 2010
Time  : 13:44:38
Host  : giulia-laptop
PID    : 2150
Case  : /home/giulia/OpenFOAM/giulia-1.6.x/run/hydrofoil_0
nProcs : 2
Slaves :
1
(
giulia-laptop.2151
)

Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

[1] [0]
Starting transfers
[1]
[1] slave sending to master 0
[1] slave receiving from master 0

Starting transfers
[0]
[0] master receiving from slave 1
[0] (0 1 2)
[0] master sending to slave 1
[1] (0 1 2)
End

Finalising parallel run

but when I run my case I always obtain
Code:

mpirun noticed that process rank 1 with PID [4 digits] on node  xxx-laptop
exited on signal 11 (segmentation fault)

Please, help me!

vaina74 May 4, 2010 08:19

Hmm. Maybe it's a memory issue, but I can't understand why I had no problems before. I'm not an Ubuntu expert; can anyone help me?

wyldckat May 4, 2010 09:41

Hello Maurizio, it's me again :)

Uhm, you didn't elaborate on what happened last time. Possibly it's a swap problem; read the Swap FAQ at help.ubuntu and increase your Ubuntu's swap size.
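
(For reference, a minimal sketch of adding a swap file on Ubuntu; the path and the 512 MB size are illustrative only:)
Code:

# create and enable a 512 MB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=512
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# to keep it across reboots, append this line to /etc/fstab:
#   /swapfile  none  swap  sw  0  0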

Then try again to crash your Ubuntu ;)

Best regards,
Bruno

PS: later in the day I'll review the post you made on how to have a side-by-side OpenFOAM 1.6 + 1.6.x installation ;)

vaina74 May 4, 2010 10:48

You are my angel, do you know that? :D
I expanded the notebook's memory by adding a 512 MB swap file, and now mpirun works! Well, I was afraid of having to install my (few and not so smart) neurones on my notebook ;)
Thank you very much, Bruno.

wyldckat May 4, 2010 17:19

You're welcome, and I'm glad it actually wasn't a heisenbug :D

By the way, I don't remember seeing this written on OpenFOAM's forum or on the unofficial openfoamwiki.net, but in my experience there is a minimum amount of RAM required for doing a full build of OpenFOAM. The magic number is somewhere between 1.3 GiB and 1.5 GiB of RAM, and swap won't cover that necessity!!

Best regards,
Bruno

rgarcia May 31, 2011 06:20

Hey guys!

I want to run a simulation through a bash script. The aim is to simulate wind coming from 16 different directions. When I run the simulation (16 cases one after another) with a first-order scheme for divergence, there is no problem. Nevertheless, when I run the same simulation with a second-order scheme, my computer stops responding and I have to reboot it. The error I obtain is:

-----------------------------------------------------------------------------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 1890 on node cener-desktop exited on signal 11 (segmentation fault)
-----------------------------------------------------------------------------------------------------------------------------------------------

As I understand from your messages, it could be a problem of RAM or swap memory, although I have 15 GB of RAM and 12 GB of swap space, so memory shouldn't be a problem!

Do you have any idea???

Thanks a lot!

PS: I don't know if it matters, but I use "mpirun -np 8 simpleFoam -parallel" to run the simulations

wyldckat May 31, 2011 06:33

Greetings rgarcia and welcome to the forum!

Mmm, have you tried running in serial to see if that scheme order works at all?
Try monitoring how much RAM the simpleFoam processes are using, and see whether the crash happens while their memory occupation is growing. Another possible problem is insufficient contiguous memory, i.e., allocating 3 GB for a single matrix when the RAM is already fragmented by various other processes... although I haven't seen many problems like that lately...
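
(A quick way to do that monitoring from a second terminal; this is standard Linux tooling, nothing OpenFOAM-specific:)
Code:

# refresh the memory usage of all simpleFoam processes every 2 seconds
watch -n 2 "ps -C simpleFoam -o pid,rss,vsz,%mem,cmd"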

Yet another possibility is that there isn't enough MPI buffer length for communication. That's definable in "OpenFOAM-*/etc/settings.sh", if I'm not mistaken. I would have to verify the variable name, but right now I can't.

Good luck!
Bruno

rgarcia May 31, 2011 08:25

Hey Bruno!

Thanks for your quick reply!

In serial it works fine... but it takes very long! The thing I don't understand is that for some directions it works in parallel, but sometimes it doesn't...

There has to be a reason, but it seems random! It's driving me crazy!
Apparently, the problem is the combination of second order and parallel running... (any second-order scheme works well for the 16 directions when run in serial)

If you have any more suggestions, I'll be glad to hear them! In any case, thank you very much! :)

wyldckat June 1, 2011 04:03

Hi rgarcia,

  • Edit the file OpenFOAM*/etc/settings.sh;
  • Find the lines that have this:
    Code:

    # Set the minimum MPI buffer size (used by all platforms except SGI MPI)
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    : ${minBufferSize:=20000000}


    if [ "${MPI_BUFFER_SIZE:=$minBufferSize}" -lt $minBufferSize ]
    then
        MPI_BUFFER_SIZE=$minBufferSize
    fi
    export MPI_BUFFER_SIZE

  • Change 20000000 to 200000000.
  • Save the file.
  • Start a new terminal and try running it in parallel again.
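
(A quick sanity check in that new terminal, once the OpenFOAM environment has been sourced again:)
Code:

# confirm the new value is exported
echo $MPI_BUFFER_SIZE    # should now print 200000000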
Another possibility is to try dividing the mesh into fewer or more sub-domains.
And have you checked the sanity of the mesh by running checkMesh?

Other than these, it could have to do with boundary conditions or some configuration you're overlooking, something like maxCo or some other setting like that ;)

Best regards,
Bruno

rgarcia June 2, 2011 04:59

Quote:

Originally Posted by wyldckat (Post 310076)
[quoted instructions for increasing MPI_BUFFER_SIZE in settings.sh; see the post above]

Hi Bruno,

As you recommended, I'm trying to change the settings.sh file, but I can't:

Code:

rgarcia@cener-desktop:/opt/openfoam171/etc$ chmod +x settings.sh
chmod: changing permissions of 'settings.sh': Operation not permitted
rgarcia@cener-desktop:/opt/openfoam171/etc$ chmod +w settings.sh
chmod: changing permissions of 'settings.sh': Operation not permitted

I can copy the file to another folder and change it there, but then I'm not able to paste it back...

I had already tried the other suggestions you made, and the problem doesn't seem to have anything to do with those!

Thanks again Bruno!

nlc June 3, 2011 09:31

Hi rgarcia

How many cells do you have?
It works with first order in parallel but not with second order in parallel; is that what you're saying?
Are you using some custom code? Did you try without it?

Regards

Bruno, I'd like to ask you a question:

What is the meaning of minBufferSize:=20000000?
What does it limit?

Regards

Nicolas Lussier

wyldckat June 4, 2011 08:21

Greetings to all!

@rgarcia: you should run like this:
Code:

sudo chmod o+w settings.sh
The sudo command will request your password and run the application as the superuser, namely root. This is necessary because /opt is a system folder, from which everyone can read and execute but only the root user can change files.
As for "o+w", this gives you the permission to edit the file directly, without sudo. After changing the file, you can use "o-w" to revert the permission change.
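
(For instance, the full round trip; the editor is just an example:)
Code:

cd /opt/openfoam171/etc
sudo chmod o+w settings.sh
gedit settings.sh        # any text editor will do
sudo chmod o-w settings.sh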


@Nicolas: MPI_BUFFER_SIZE indicates the minimum message size in bytes required for communications between MPI processes.
I'm suggesting this solution in an attempt to check if it's an MPI related problem or an OpenFOAM problem.


@rgarcia: On a related note, you might also want to create a small case in the meantime that reproduces this same problem, because it might be necessary to report this as a bug once we've tried to isolate it.
But still, after increasing the message size, trying with fewer cores is also a good idea, in an attempt to isolate the problem.

Best regards,
Bruno

rgarcia June 6, 2011 03:40

Greetings Nicolas and Bruno!

@Bruno: Finally I could change the MPI_BUFFER_SIZE, but apparently it's not a problem of message size. Where should I write a report for my bug?

@Nicolas: I tried two cases, one very simple (50,000 cells) and the other with 500,000 cells. I wrote a bash script that allows me to do a wind rose study. The study begins at direction 0 (setting the velocity components in 0/U) and after 1000 iterations it changes to direction 22.5, etc. (16 directions in total). I'm using a custom turbulence model.
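
(For context, a minimal sketch of what such a loop might look like; the velocity magnitude, the sed edit of 0/U and the solver call are assumptions, not the actual script:)
Code:

#!/bin/bash
# sweep 16 wind directions, 22.5 degrees apart (illustrative only)
Umag=10
for i in $(seq 0 15); do
    angle=$(echo "$i * 22.5" | bc -l)
    # bc -l: s() = sine, c() = cosine, 4*a(1) = pi
    Ux=$(echo "$Umag * c($angle * 4 * a(1) / 180)" | bc -l)
    Uy=$(echo "$Umag * s($angle * 4 * a(1) / 180)" | bc -l)
    # write the new velocity into 0/U (assumes a single-line entry)
    sed -i "s|^internalField.*|internalField uniform ($Ux $Uy 0);|" 0/U
    mpirun -np 8 simpleFoam -parallel > log.simpleFoam.$i 2>&1
done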

The case works in second order on both the coarser and the finer grid... but only up to 4 processors! If I run the case with 5, 6, 7 or 8 processors it doesn't work! And the message that always appears is "mpirun... signal 11 (segmentation fault)".

wyldckat June 6, 2011 16:09

Hi rgarcia,

OK, you can report the possible bug here: http://www.openfoam.com/bugs/
Giving a small test case and making a full description of the problem is the best thing to do.

On a side note, OpenFOAM has some issues with patches that are divided between sub-domains. I suspect that this may be the problem that is occurring here.

I vaguely remember that there is an option for enforcing patches to not be split apart... you can start reading here: http://www.cfd-online.com/Forums/ope...tml#post305687
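
(For reference, that kind of constraint goes in system/decomposeParDict and looks roughly like this; the patch names are placeholders, and the exact keyword and behaviour depend on the OpenFOAM version:)
Code:

numberOfSubdomains  8;
method              scotch;

// try to keep faces of the listed patches together during decomposition
preservePatches     (inlet outlet);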

Best regards,
Bruno

KayGhana April 18, 2017 10:05

mpirun noticed
 
Hello All, Hello Bruno,

Sorry to resurrect this old thread again.

Has the problem been solved?

I tried to mesh CT data using snappyHexMesh and I get this error:

=>mpirun noticed that process rank 0 with PID 22908 on node bophy102 exited on signal 9 (Killed).

I followed the suggested procedures, thus:
changed minBufferSize := 200000000

and used only 4 cores, but that does not solve the problem. Please let me know if there is a solution already.
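
(Side note: signal 9 means the process was killed from outside, on Linux often by the kernel's out-of-memory killer; if that is the cause, the kernel log will say so:)
Code:

# look for OOM-killer entries in the kernel log
dmesg | grep -i -E "out of memory|killed process"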

Regards,

K.

wyldckat April 24, 2017 13:24

Greetings KayGhana,

Quote:

Originally Posted by KayGhana (Post 645332)
Sorry to resurrect this old thread again.

As long as it is on topic, it's preferred here on CFD-Online to re-use threads for the same specific topic, instead of starting a new one.

Quote:

Originally Posted by KayGhana (Post 645332)
Has the problem been solved?

I tried to mesh CT data using snappyHexMesh and I get this error:

=>mpirun noticed that process rank 0 with PID 22908 on node bophy102 exited on signal 9 (Killed).

Unfortunately, back then I didn't know as much as I do today, and I forgot to ask for more details about the case, specifically the error messages and output that the solver gave in that situation.

Therefore, please provide more details, specifically the "log" file from the snappyHexMesh run that resulted in that error message.
mpirun only tells you what it noticed happening to the application; it cannot specify why exactly it happened. That's why we should look at the application's output (the contents of the log file) to see what happened before the crash.
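
(For example, a way to capture that output when running in parallel; the core count is illustrative:)
Code:

# keep the full mesher output for post-mortem analysis
mpirun -np 4 snappyHexMesh -parallel 2>&1 | tee log.snappyHexMesh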

Best regards,
Bruno

Eko February 5, 2018 10:45

I also have a problem while using mpirun. I was running the chtMultiRegionFoam solver and got the following message:

Code:

mpirun noticed that process rank 22 with PID 36705 on node cserver exited on signal 8 (Floating point exception).

It's my first time using mpirun, so I have no clue what to do.
What is the problem and how do I solve it?

Hojjat March 30, 2023 12:53

I had a similar problem with the same error.

Parallel processing for one particular case wasn't working, although other cases had no problems and I could easily run parallel simulations. Running the same case on a single processor (a serial simulation) was also OK and worked with no problem.

So I thought my problem was caused by the decomposition. I changed my decomposition method from "scotch" to "simple" (in my decomposeParDict), and that solved my issue.
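
(A minimal sketch of that change in system/decomposeParDict; the subdomain count and the simpleCoeffs split are illustrative:)
Code:

numberOfSubdomains  4;

// method           scotch;     // previous method
method              simple;

simpleCoeffs
{
    n       (2 2 1);    // 2 x 2 x 1 split = 4 sub-domains
    delta   0.001;
}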

