CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   parallel simpleFoam freezes the whole system (http://www.cfd-online.com/Forums/openfoam-solving/101704-parallel-simplefoam-freezes-whole-system.html)

vangelis May 11, 2012 07:56

parallel simpleFoam freezes the whole system
 
Dear all,

I am trying to run a parallel simpleFoam case but I am encountering serious problems.

Nine out of ten times, starting the simulation leads to a complete freeze of my system (Fedora 12).
It seems that one (random) process gets a segmentation violation.

I have pasted the messages I get below.
I would appreciate any help!

Thanks!
Vangelis
__________________________________________________


[vangelis@midas OF]$ mpirun -np 12 simpleFoam -case tetra_layers -parallel > log.txt &
[1] 2758
[vangelis@midas OF]$ gnuplot residuals -
Message from syslogd@localhost at May 11 13:57:11 ...
kernel:------------[ cut here ]------------

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Stack:

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Call Trace:

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 75 c1 2a 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 cf bb ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 2766 on node midas exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

wyldckat May 11, 2012 08:37

Greetings Vangelis,

That's very little information about your system, so here are a few questions (a sketch of commands for collecting these answers follows the list):
  1. Which OpenFOAM version are you using?
  2. Which Gcc version did you use for building OpenFOAM?
  3. Which MPI toolbox are you using with OpenFOAM?
  4. Is it a single machine?
    1. Does it really have 1 or 2 CPU(s) with 12 cores in total?
    2. Or is it a 6 core CPU with hyper-threading?
  5. Does your machine have enough RAM and does it have a working swap?
  6. Have you tried using less parallel processes?
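If it helps, a quick sketch of shell commands that should collect most of these answers (assuming the OpenFOAM environment is already sourced from its bashrc):
Code:

echo $WM_PROJECT_VERSION            # OpenFOAM version, as set by OpenFOAM's etc/bashrc
gcc --version | head -n 1           # compiler available on the system
mpirun --version 2>&1 | head -n 1   # MPI version found in the PATH
grep -c '^processor' /proc/cpuinfo  # number of logical CPUs (hyper-threads included)
free -m                             # RAM and swap, in MB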
Best regards,
Bruno

vangelis May 11, 2012 08:47

Dear Bruno,

Thank you for your reply!
Please give me some time to collect the answers to the questions you have (correctly) addressed to me.

Best regards
Vangelis

vangelis May 14, 2012 04:09

Dear Bruno,

Here are some details on the problem I have.
(By the way, I have just managed to make it run
without changing anything; the problem is random
but happens very often.)

1) OpenFOAM version 2.0
2) gcc 4.4.4
3) OpenMPI 1.4.1
4) Single machine with 2 Xeon X5680 3.33 GHz processors with hyper-threading enabled: 12 physical cores, 24 virtual
5) I have 96 GB of RAM and I am only solving a problem with around 8 million cells
6) I have only tried 12 processes

Hope this helps
Best regards,

Vangelis

wyldckat May 14, 2012 04:46

Hi Vangelis,

OK, which exact version of OpenFOAM 2.0? 2.0.0, 2.0.1 or 2.0.x?

It would be ideal to upgrade at least to 2.0.x or even 2.1.0 or 2.1.x, because a lot has been fixed since 2.0.0 and 2.0.1 were released.
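For reference, a rough sketch of fetching one of the newer versions from the public git repository (repository name assumed; the full build still follows the official source installation instructions):
Code:

cd ~/OpenFOAM
git clone https://github.com/OpenFOAM/OpenFOAM-2.1.x.git   # development line with the latest bug fixes
cd OpenFOAM-2.1.x && git log -1 --format=%ci               # shows how recent the checkout is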

Best regards,
Bruno

vangelis May 14, 2012 10:58

Hi Bruno

I was using 2.0.0

I have just switched to 2.1.0,
but unfortunately, again in a random fashion,
parallel runs tend to freeze my system.
When this happens OpenFOAM immediately reports something like:

mpirun noticed that process rank 7 with PID 2766 on node midas exited on signal 11 (Segmentation fault).

I have had to hardware reset my system too many times lately.

Any suggestions?

Best regards,

Vangelis

MartinB May 14, 2012 11:13

Hi Vangelis,

I had similar problems with a dual Xeon system a while ago: the memory chips got too hot... I opened the case and used ordinary fans to cool the memory banks down, and the simulations were stable. In the end I installed dedicated memory coolers and everything was fine...

Martin

vangelis May 14, 2012 13:39

I am not sure if this is also related
http://www.cfd-online.com/Forums/ope...taneously.html

vangelis May 14, 2012 14:18

I believe this is the exact same problem:

http://www.cfd-online.com/Forums/ope...tml#post257519

wyldckat May 14, 2012 17:16

Hi Vangelis,

Well, with the current information:
  • It's either what Martin wrote about, namely the memory overheating, or maybe even the CPUs overheating.
  • Or it's what is indicated in one of those posts: missing swap space. You can check how much swap space you have on your machine by running free. Example:
    Code:

    ~$ free
                total      used      free    shared    buffers    cached
    Mem:      5598500    1458128    4140372          0      86176    622508
    -/+ buffers/cache:    749444    4849056
    Swap:      8385924          0    8385924

    The last line, in particular the total swap value, shouldn't be zero. If it is zero, then you need swap (a sketch for adding a swap file follows)!
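    If swap really is missing, a rough sketch of adding a swap file (needs root; the size and path are only an example):
    Code:

    sudo dd if=/dev/zero of=/swapfile bs=1M count=16384   # create a 16 GB file
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    free                                                  # the Swap line should now be non-zero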
For further diagnosis:
  • You should resort to an isolate-and-conquer strategy: try running the case with 2, 4, 6, 8, 10 or 12 cores.
  • Monitor the amount of RAM occupied - with top or htop - and the temperature of the CPUs - with sensors (a combined monitoring sketch follows this list).
  • Are you able to run any other software successfully on that machine using all of the CPU power, i.e. 12 cores?
  • stressapptest would be a good test to run: http://code.google.com/p/stressapptest/
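For reference, a rough sketch of the monitoring and stress-test commands mentioned above (the stressapptest options are only an example; check its --help):
Code:

watch -n 2 sensors              # CPU/motherboard temperatures, refreshed every 2 s
htop                            # per-core load and RAM usage (plain top also works)
stressapptest -s 300 -M 4096    # example: stress memory and CPU for 300 s using 4 GB of RAM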
Other possible issues:
  • If NUMA is configured in the machine's BIOS but not in the Linux OS, that could be a problem (a quick check is sketched after this list).
  • If the machine isn't properly configured in the BIOS, it might overheat or not be able to cope with running at full power.
  • The BIOS might need an update. I've seen one workstation have a crazy case of hyperventilation due to a mis-calibrated sensor-based fan control system: the fans were pushed to maximum speed every 30-40 s, even when everything was cool enough.
    After updating the BIOS, the controls ran a lot smoother. In your case, the fans might be running too slow.
  • If the Linux OS isn't compatible with your hardware, then that could be a real problem :(
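And a quick sketch of the NUMA check mentioned in the first point (numactl is available in most distributions' repositories):
Code:

numactl --hardware      # a dual-socket Xeon should show two nodes, each with its own memory
dmesg | grep -i numa    # shows whether the kernel detected the NUMA layout at boot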
Best regards,
Bruno

vangelis May 15, 2012 05:48

Thank you all for your replies.

I do not think it is a temperature issue with the memory, as on this machine I have run parallel Fluent simulations for days without a problem.

My swap size is 8 GB, so that should not be the issue either.

I saw in another post that I may have to make this change:
  • Edit the file OpenFOAM*/etc/settings.sh and find the line:
${minBufferSize:=20000000}
  • Change 20000000 to 200000000.
  • Save the file.
  • Start a new terminal and try running it in parallel again.
Do you think this will help?
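For reference, a rough sketch of applying that change from the shell (the path follows the instructions above; the file is backed up first):
Code:

cp $WM_PROJECT_DIR/etc/settings.sh $WM_PROJECT_DIR/etc/settings.sh.bak
sed -i 's/minBufferSize:=20000000/minBufferSize:=200000000/' $WM_PROJECT_DIR/etc/settings.sh
grep minBufferSize $WM_PROJECT_DIR/etc/settings.sh   # confirm the new value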

vangelis May 15, 2012 09:28

Changing the minBufferSize did not help.

It cannot be a temperature issue,
as OpenFOAM freezes the system
immediately upon startup.

This is very frustrating indeed.

MartinB May 15, 2012 10:11

Hi Vangelis,

another idea: disable the CPU power management options in the BIOS, such as Intel SpeedStep, Turbo Mode, C1E... from your first posting it seems to be the frequency scaling that fails.

Or bind the processes to specific CPU cores, so that they are not hopping from core to core, with:
Code:

numactl --physcpubind=0,2,4,6,8,10,12,14,16,18,20,22 mpirun -np 12 simpleFoam -case tetra_layers -parallel > log.txt &
Martin

wyldckat May 15, 2012 16:00

Hi Martin and Vangelis,

@Martin: mpirun is meant to come first, right?
Code:

mpirun -np 12 numactl --physcpubind=0,2,4,6,8,10,12,14,16,18,20,22 simpleFoam -case tetra_layers -parallel > log.txt &
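An alternative sketch, letting Open-MPI 1.4/1.5 do the binding itself instead of numactl (the option names may differ in other MPI versions):
Code:

mpirun -np 12 --bind-to-core --report-bindings simpleFoam -case tetra_layers -parallel > log.txt &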
@Vangelis:
  1. Taking into account Martin's comment... Vangelis, does the motherboard of your workstation support more than 2 physical CPUs? Because the error message could also make sense if one or two CPU sockets were empty, namely socket 0 in this case...
  2. You mentioned that the installed Open-MPI version is 1.4.1, but that doesn't necessarily mean it's the exact same version that OpenFOAM was built with! What does this command output:
    Code:

    echo $WM_MPLIB
    If it outputs "SYSTEMOPENMPI", then it should have been built with the exact same version. But if it outputs "OPENMPI", then the problem might be a conflict between the default Open-MPI 1.5.3 that comes with OpenFOAM and the 1.4.1 on your system!
  3. Another useful option to be added to the mpirun arguments/options is this:
    Code:

    --mca btl_tcp_if_exclude lo,virbr0
    With this example it excludes "lo" (the loopback network) and "virbr0" (the libvirt virtual bridge) from the links used for MPI communication. You can use one of the following commands to see the list of possible exclusions (other examples are "eth0" and "wlan0"); a combined launch line is sketched after this list:
    Code:

    ifconfig
    /sbin/ifconfig
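Putting it together, a sketch of the launch line with the exclusion option added (the interface names are just the example from above; pick yours from the ifconfig output):
Code:

mpirun --mca btl_tcp_if_exclude lo,virbr0 -np 12 simpleFoam -case tetra_layers -parallel > log.txt &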


Best regards,
Bruno

vangelis May 16, 2012 05:12

Dear Bruno and Martin,

Thank you very much for your help.

I do not think I have a system MPI.

The output of echo $WM_MPLIB
is OPENMPI.

When I run OF 2.0 the MPI is 1.4.1;
when I run OF 2.1 the MPI is 1.5.3, as you have stated.

It seems that the problem is the CPU frequency setting,
which is set to "ondemand" and which I cannot change,
as I do not have admin privileges on this PC.

As a workaround I have tried to run something
that peaks the CPU frequency to 3.33GHz just before
executing the mpirun of OpenFOAM and,
up to now at least, it runs without freezing the system.
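For reference, a rough sketch of checking the governor without root, and of the change an administrator could make (the sysfs path matches the kernel message from my first post):
Code:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # currently "ondemand"
# with admin rights, every core could be switched to "performance" to avoid the frequency ramping:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance | sudo tee $g; done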

Vangelis

