parallel simpleFoam freezes the whole system

#1   May 11, 2012, 07:56
Vangelis Skaperdas (vangelis), Senior Member
Join Date: Mar 2009
Location: Thessaloniki, Greece
Posts: 287
Dear all,

I am trying to run a parallel simpleFoam case but I am encountering serious problems.

Nine times out of ten, starting the simulation completely freezes my system (Fedora Core 12).
It seems that one (random) MPI process gets a segmentation fault.

I have pasted the messages I get below.
I would appreciate any help!

Thanks!
Vangelis
__________________________________________________


[vangelis@midas OF]$ mpirun -np 12 simpleFoam -case tetra_layers -parallel > log.txt &
[1] 2758
[vangelis@midas OF]$ gnuplot residuals -
Message from syslogd@localhost at May 11 13:57:11 ...
kernel:------------[ cut here ]------------

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Stack:

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Call Trace:

Message from syslogd@localhost at May 11 13:57:11 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 75 c1 2a 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 cf bb ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 2766 on node midas exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
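When the crash hits a different rank on each run, it can help to pull the rank, PID and signal out of every saved log so the runs are easy to compare. The following is a minimal sketch, assuming the mpirun message keeps the wording shown above; the function name failed_rank is hypothetical. Note that mpirun writes this message to stderr, so the original command would need "2>&1" added for it to end up in log.txt.

```shell
#!/bin/sh
# Hypothetical helper: extract "rank pid signal" from mpirun abort messages.
# Assumes the Open MPI wording shown in the log above.
# Reads the files given as arguments, or stdin when there are none.
failed_rank() {
    sed -n 's/.*process rank \([0-9][0-9]*\) with PID \([0-9][0-9]*\) .* exited on signal \([0-9][0-9]*\).*/rank=\1 pid=\2 signal=\3/p' "$@"
}

# Usage, after a run started with "> log.txt 2>&1":
#   failed_rank log.txt
```

Lines that do not match the pattern are simply skipped, so the function can be pointed at a whole solver log.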

#2   May 11, 2012, 08:37
Bruno Santos (wyldckat), Retired Super Moderator
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Greetings Vangelis,

That's very little information about your system. So here are a few questions:
  1. Which OpenFOAM version are you using?
  2. Which GCC version did you use for building OpenFOAM?
  3. Which MPI implementation (and version) are you using with OpenFOAM?
  4. Is it a single machine?
    1. Does it really have 1 or 2 CPU(s) with 12 cores in total?
    2. Or is it a 6 core CPU with hyper-threading?
  5. Does your machine have enough RAM and does it have a working swap?
  6. Have you tried using fewer parallel processes?
Best regards,
Bruno
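Most of the answers to the questions above can be collected in one go from a terminal where the OpenFOAM environment has been sourced. This is only a sketch: the function names report and system_summary are made up for illustration, and it assumes a Linux system with /proc/meminfo and the usual WM_PROJECT_VERSION and WM_MPLIB variables set by OpenFOAM.

```shell
#!/bin/sh
# Print one "label: value" line using the first line of a command's output;
# prints "not available" if the command is missing or prints nothing.
report() {
    label=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        value=$("$@" 2>&1 | head -n 1)
    else
        value=
    fi
    printf '%s: %s\n' "$label" "${value:-not available}"
}

system_summary() {
    # WM_* variables are set by sourcing the OpenFOAM environment.
    printf 'OpenFOAM version: %s\n' "${WM_PROJECT_VERSION:-unset}"
    printf 'MPI selection (WM_MPLIB): %s\n' "${WM_MPLIB:-unset}"
    report 'Compiler' gcc --version
    report 'mpirun' mpirun --version
    report 'Logical CPUs' getconf _NPROCESSORS_ONLN
    report 'Memory' grep MemTotal /proc/meminfo
    report 'Swap' grep SwapTotal /proc/meminfo
}

system_summary
```

The logical CPU count includes hyper-threads, so on a machine with hyper-threading it will be twice the number of physical cores; a tool such as lscpu shows the split between sockets, cores and threads.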

#3   May 11, 2012, 08:47
Vangelis Skaperdas (vangelis), Senior Member
Dear Bruno,

Thank you for your reply!
Please give me some time to collect the answers
to the questions you have (correctly) asked me.

Best regards
Vangelis

#4   May 14, 2012, 04:09
Vangelis Skaperdas (vangelis), Senior Member
Dear Bruno,

Here are some details on the problem I have.
(By the way, I have just managed to make it run
without changing anything. The problem is random
but happens very often.)

1) OpenFOAM version 2.0
2) gcc 4.4.4
3) OpenMPI 1.4.1
4) Single machine with 2 Xeon X5680 3.3 GHz processors with hyper-threading enabled: 12 physical cores, 24 virtual
5) I have 96 GB of RAM and I am only solving a problem with around 8 million cells
6) I have only tried 12 processes

Hope this helps
Best regards,

Vangelis

#5   May 14, 2012, 04:46
Bruno Santos (wyldckat), Retired Super Moderator
Hi Vangelis,

OK, which exact version of OpenFOAM 2.0? 2.0.0, 2.0.1 or 2.0.x?

It would be ideal to upgrade at least to 2.0.x or even 2.1.0 or 2.1.x, because a lot has been fixed since 2.0.0 and 2.0.1 were released.

Best regards,
Bruno

#6   May 14, 2012, 10:58
Vangelis Skaperdas (vangelis), Senior Member
Hi Bruno

I was using 2.0.0

I have just switched to 2.1.0,
but unfortunately parallel runs still tend
to freeze my system in a random fashion.
When this happens, OpenFOAM immediately reports something like:

mpirun noticed that process rank 7 with PID 2766 on node midas exited on signal 11 (Segmentation fault).

I have had to hardware reset my system too many times lately.

Any suggestions?

Best regards,

Vangelis

#7   May 14, 2012, 11:13
Martin (MartinB), Senior Member
Join Date: Oct 2009
Location: Aachen, Germany
Posts: 255
Hi Vangelis,

I had similar problems with a dual Xeon system a while ago: the memory chips got too hot. I opened the system and used simple fans to cool the memory banks down, and the simulations were stable. In the end I installed dedicated memory coolers and everything was fine.

Martin

#8   May 14, 2012, 13:39
Vangelis Skaperdas (vangelis), Senior Member
I am not sure whether this is also related:
http://www.cfd-online.com/Forums/ope...taneously.html

#9   May 14, 2012, 14:18
Vangelis Skaperdas (vangelis), Senior Member
I believe this is exactly the same problem:

http://www.cfd-online.com/Forums/ope...tml#post257519

#10   May 14, 2012, 17:16
Bruno Santos (wyldckat), Retired Super Moderator
Hi Vangelis,

Well, with the current information:
  • It's either what Martin wrote about, namely the memory overheating, or maybe even the CPUs overheating.
  • Or it's like it is indicated in one of those posts: missing swap space. You can check how much swap space you have on your machine by running free. Example:
    Code:
    ~$ free
                 total       used       free     shared    buffers     cached
    Mem:       5598500    1458128    4140372          0      86176     622508
    -/+ buffers/cache:     749444    4849056
    Swap:      8385924          0    8385924
    The last line (Swap), in particular its total value, shouldn't be zero. If it is zero, then you need swap!
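The same swap check can be scripted instead of read off by eye. This is a small sketch, assuming a Linux system where the kernel exposes SwapTotal in /proc/meminfo; the function names swap_total_kb and check_swap are hypothetical, and the optional file argument exists only so the parsing can be tried on a saved copy.

```shell
#!/bin/sh
# Print the total swap in kB as reported in a meminfo-style file.
# The file argument defaults to /proc/meminfo (Linux only).
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2; found = 1 }
         END { if (!found) print 0 }' "${1:-/proc/meminfo}"
}

# Report whether any swap is configured; returns non-zero if there is none.
check_swap() {
    kb=$(swap_total_kb "$@")
    if [ "$kb" -gt 0 ]; then
        echo "swap configured: ${kb} kB"
    else
        echo "no swap configured"
        return 1
    fi
}

# Usage:
#   check_swap
```

A non-zero total only shows that swap exists; whether it is actually being used during a run still has to be watched with free, top or htop while the solver is running.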
For further diagnosis:
  • You should resort to an isolate-and-conquer strategy, by trying to run the case with 2, 4, 6, 8, 10 or 12 cores.
  • Monitor the amount of RAM occupied - with top or htop - and monitor the temperature of the CPUs - with sensors.
  • Are you able to run any other software successfully on that machine while using all of the CPU power, i.e. 12 cores?
  • stressapptest would be a good test to run: http://code.google.com/p/stressapptest/
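The first point, running the case with an increasing number of processes, could be organised along these lines. This is a hedged sketch rather than a tested procedure: the function name plan_runs is made up, it assumes the process count is set through numberOfSubdomains in system/decomposeParDict, and it assumes decomposePar accepts -force to replace existing processor directories. It only prints the commands, since a run that freezes the machine is better started by hand after reviewing them.

```shell
#!/bin/sh
# Print (not run) the commands for one test at each process count.
# The case name and the list of counts are examples; adjust to your setup.
plan_runs() {
    case_dir=$1; shift
    for n in "$@"; do
        echo "sed -i 's/^numberOfSubdomains.*/numberOfSubdomains $n;/' $case_dir/system/decomposeParDict"
        echo "decomposePar -case $case_dir -force > $case_dir/log.decomposePar.$n 2>&1"
        echo "mpirun -np $n simpleFoam -case $case_dir -parallel > $case_dir/log.simpleFoam.$n 2>&1"
    done
}

plan_runs tetra_layers 2 4 6 8 10 12
```

If the decomposition method uses explicit splits per direction (for example the simple method), those coefficients would also have to be changed so that they multiply to the new process count; the sketch above does not handle that.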
Other possible issues:
  • If NUMA is configured in the machine's BIOS, but not on the Linux OS, then that could be a problem.
  • If the machine isn't properly configured in the BIOS, it might overheat or not be able to cope with running at full power.
  • The BIOS might need an update. I've seen one workstation with a crazy case of hyperventilation due to a mis-calibrated sensor-based fan control system: the fans were pushed to the maximum speed every 30-40s, even when everything was cool enough.
    After updating the BIOS, the controls ran a lot smoother. In your case, the fans might be running too slowly.
  • If the Linux OS isn't compatible with your hardware, then that could be a real problem.
Best regards,
Bruno

#11   May 15, 2012, 05:48
Vangelis Skaperdas (vangelis), Senior Member
Thank you all for your replies.

I do not think it is a memory temperature issue, as I have run Fluent parallel simulations on this machine for days without a problem.

My swap size is 8 GB, so that should not be the issue either.

I saw in another post that I may have to make this change:
  • Edit the file OpenFOAM*/etc/settings.sh and find this line:
    ${minBufferSize:=20000000}
  • Change 20000000 to 200000000.
  • Save the file.
  • Start a new terminal and try running it in parallel again.
Do you think this will help?
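For context, the ${name:=value} form in that line is ordinary shell default assignment: the value on the right is used only when the variable is not already set. The sketch below demonstrates just that mechanism with a hypothetical function name; how settings.sh then hands the value on to MPI (as far as I recall, through the MPI_BUFFER_SIZE environment variable) is an assumption that should be checked against your own copy of the file.

```shell
#!/bin/sh
# Demonstrates shell default assignment as used for minBufferSize.
# effective_buffer_size is a made-up name for illustration only.
effective_buffer_size() {
    # Run in a subshell so the caller's variables are left untouched.
    (
        minBufferSize=$1
        : "${minBufferSize:=20000000}"   # only assigns when unset or empty
        echo "$minBufferSize"
    )
}

effective_buffer_size ""           # prints 20000000
effective_buffer_size 200000000    # prints 200000000
```

After editing the file and opening a new terminal, printing the variable that the solver environment actually exports is a quick way to confirm that the change took effect.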

#12   May 15, 2012, 09:28
Vangelis Skaperdas (vangelis), Senior Member
Changing the minimum buffer size did not help.

It cannot be a temperature issue,
as OpenFOAM freezes the system
immediately upon startup.

This is very frustrating indeed.

#13   May 15, 2012, 10:11
Martin (MartinB), Senior Member
Hi Vangelis,

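Another idea: disable the CPU power management options in the BIOS, such as Intel SpeedStep, Turbo Mode, C1E... From your first post, it seems to be the frequency setting that fails (the kernel message names /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq as the last sysfs file).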

Or bind the processes to specific CPU cores, so that they are not hopping from core to core, with:
Code:
numactl --physcpubind=0,2,4,6,8,10,12,14,16,18,20,22 mpirun -np 12 simpleFoam -case tetra_layers -parallel > log.txt &
Martin
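The CPU list passed to --physcpubind does not have to be typed by hand. A minimal sketch, with a hypothetical function name, that builds the "every second logical CPU" list used in the command above:

```shell
#!/bin/sh
# Build a comma-separated list of every second logical CPU: 0,2,4,...
# Usage: every_other_cpu <number_of_processes>
every_other_cpu() {
    count=$1
    list=
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list:+$list,}$((i * 2))"
        i=$((i + 1))
    done
    echo "$list"
}

every_other_cpu 12    # prints 0,2,4,6,8,10,12,14,16,18,20,22
```

Whether the even-numbered logical CPUs really are twelve distinct physical cores depends on how the kernel enumerates hyper-threads on a given machine, so it is worth checking the "physical id" and "core id" fields in /proc/cpuinfo (or the output of lscpu) before relying on this pattern.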

#14   May 15, 2012, 16:00
Bruno Santos (wyldckat), Retired Super Moderator
Hi Martin and Vangelis,

@Martin: mpirun is meant to come first, right?
Code:
mpirun -np 12 numactl --physcpubind=0,2,4,6,8,10,12,14,16,18,20,22 simpleFoam -case tetra_layers -parallel > log.txt &
@Vangelis:
  1. Taking into account Martin's comment... Vangelis, does the motherboard of your workstation support more than 2 physical CPUs? Because the error message could also make sense if one or two CPU sockets were empty, namely the 0 socket in this case...
  2. You mentioned that the Open-MPI version installed is 1.4.1, but that doesn't necessarily mean that it's the same exact version that OpenFOAM was built with! What does this command output:
    Code:
    echo $WM_MPLIB
    If it outputs "SYSTEMOPENMPI", then it should have been built with the same exact version. But if it outputs "OPENMPI", then the problem might be caused by a conflict between the default Open-MPI 1.5.3 that comes with OpenFOAM and the 1.4.1 on your system!
  3. Another useful option to be added to the mpirun arguments/options is this:
    Code:
    --mca btl_tcp_if_exclude lo,virbr0
    This example excludes "lo" (the loopback interface) and "virbr0" (a virtual bridge interface, as typically created by libvirt) from the links to be used for MPI communication. You can use one of the following commands to see the list of interfaces that could be excluded (other examples are "eth0" and "wlan0"):
    Code:
    ifconfig
    /sbin/ifconfig

Best regards,
Bruno
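The MPI checks from points 2 and 3 can be combined into one short script. This is a sketch under assumptions: the function name mplib_hint is hypothetical, the two WM_MPLIB values are interpreted as described in the post above, and the interface listing assumes a Linux system that exposes /sys/class/net.

```shell
#!/bin/sh
# Interpret the value of WM_MPLIB as described in the post above.
mplib_hint() {
    case "$1" in
        SYSTEMOPENMPI) echo "built against the system Open MPI" ;;
        OPENMPI)       echo "built against the Open MPI shipped with OpenFOAM" ;;
        "")            echo "WM_MPLIB is unset: OpenFOAM environment not sourced?" ;;
        *)             echo "other MPI selection: $1" ;;
    esac
}

mplib_hint "${WM_MPLIB:-}"

# Which mpirun is first on PATH; it should belong to the same installation.
command -v mpirun || echo "mpirun not found on PATH"

# Interface names that could be passed to btl_tcp_if_exclude (Linux sysfs).
ls /sys/class/net 2>/dev/null || true
```

If the mpirun found on PATH lives outside the OpenFOAM installation while WM_MPLIB reports OPENMPI, that mismatch would be the first thing to sort out.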

#15   May 16, 2012, 05:12
Vangelis Skaperdas (vangelis), Senior Member
Dear Bruno and Martin,

Thank you very much for your help.

I do not think I have a system MPI.

The output of echo $WM_MPLIB
is OPENMPI

When I run OF2.0, MPI is 1.4.1;
when I run OF2.1, MPI is 1.5.3, as you have stated.

It seems that the CPU frequency setting is the problem.
It is set to "on demand" and I cannot change it,
as I do not have admin privileges on my PC.

As a workaround, I have tried running something
that pushes the CPU frequency up to 3.33 GHz just before
executing the mpirun of OpenFOAM and,
up to now at least, it runs without freezing the system.

Vangelis
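The governor and current frequency of each CPU can be read without admin privileges from the same sysfs tree that appears in the kernel message of the first post. A minimal sketch, assuming a Linux kernel with cpufreq enabled; the function name show_cpufreq is hypothetical, and the optional directory argument is there only so the function can be tried on a copy of the tree.

```shell
#!/bin/sh
# Print "cpuN governor current_kHz" for each CPU that exposes cpufreq data.
# The optional argument is the sysfs CPU directory (default: the real one).
show_cpufreq() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        # Skip CPUs (or an unmatched glob) without readable cpufreq files.
        [ -r "$dir/cpufreq/scaling_governor" ] || continue
        gov=$(cat "$dir/cpufreq/scaling_governor")
        cur=$(cat "$dir/cpufreq/scaling_cur_freq" 2>/dev/null)
        echo "${dir##*/} $gov ${cur:-unknown}"
    done
}

show_cpufreq
```

Running this just before and just after the warm-up step would show whether the cores have actually reached full frequency before mpirun starts. The post does not say what was used for the warm-up; a few seconds of busy loops, one per core, would be one crude possibility.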