CFD Online Discussion Forums

CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM (http://www.cfd-online.com/Forums/openfoam/)
-   -   stop when I run in parallel (http://www.cfd-online.com/Forums/openfoam/75760-stop-when-i-run-parallel.html)

Nolwenn May 4, 2010 13:06

stop when I run in parallel
 
Hello everyone,

When I run a parallel case it stops (or sometimes it succeeds) without any error message. It seems to be busy (all CPUs at 100%) but there is no progress. It happens at the beginning or later, a kind of random error.
I'm using OpenFOAM 1.6.x on Ubuntu 9.10 with gcc 4.4.1 as the compiler.
I have no problem when I run a case with a single processor.

Does anyone have an idea of what happens?

Here is a case which runs and stops. I just modified the number of processors from the tutorial case.

Thank you for your help.

Nolwenn

OpenFOAM environment sourced:
mecaflu@monarch01:~$ cd OpenFOAM/mecaflu-1.6.x/run/damBreak/
mecaflu@monarch01:~/OpenFOAM/mecaflu-1.6.x/run/damBreak$ mpirun -np 6 interFoam -parallel
/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6.x                                |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build : 1.6.x-605bfc578b21
Exec : interFoam -parallel
Date : May 04 2010
Time : 18:46:25
Host : monarch01
PID : 23017
Case : /media/teradrive01/mecaflu-1.6.x/run/damBreak
nProcs : 6
Slaves :
5
(
monarch01.23018
monarch01.23019
monarch01.23020
monarch01.23021
monarch01.23022
)

Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create mesh for time = 0


Reading g
Reading field p

Reading field alpha1

Reading field U

Reading/calculating face flux field phi

Reading transportProperties

Selecting incompressible transport model Newtonian
Selecting incompressible transport model Newtonian
Selecting turbulence model type laminar
time step continuity errors : sum local = 0, global = 0, cumulative = 0
DICPCG: Solving for pcorr, Initial residual = 0, Final residual = 0, No Iterations 0
time step continuity errors : sum local = 0, global = 0, cumulative = 0
Courant Number mean: 0 max: 0

Starting time loop

jayrup May 6, 2010 00:19

Hi Nolwenn
I want to ask you: are you using distributed parallelization, or are you giving mpirun -np 6 ..... on a single machine?
Regards
Jay

Nolwenn May 6, 2010 03:36

Hi Jay,

I am using a single machine with mpirun -np 6 interFoam -parallel. When I run with 2 processors it seems to run more iterations than with 4 or more...

Regards

Nolwenn

wyldckat May 6, 2010 20:40

Greetings Nolwenn,

It could be a memory issue. OpenFOAM is known to crash and/or freeze Linux boxes when there isn't enough memory. Check this post (or the whole thread it's on) for more on it: mpirun problems post #3

Also, try using the parallelTest utility - information available on this post: OpenFOAM updates post #19
The parallelTest utility (it's part of OpenFOAM's test utilities) can aid you in sorting out the more basic MPI problems, like communication problems, missing environment settings, or libraries not found, without running any particular solver functionality. For example: for some weird reason, there might be something missing in the mpirun command to allow the 6 cores to work properly together!
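
For readers looking for the utility itself: parallelTest ships with OpenFOAM's test utilities rather than as a regular solver, so it usually has to be compiled once with wmake. A minimal sketch (the install prefix and the exact test subdirectory name are assumptions that vary between versions; adjust to your installation):

```shell
# Sketch: locate OpenFOAM's test utilities (parallelTest lives there),
# build the utility once, then run it on a decomposed case.
# A sourced OpenFOAM environment defines WM_PROJECT_DIR; the fallback
# path below is an assumption for a default 1.6.x user install.
TESTS="${WM_PROJECT_DIR:-$HOME/OpenFOAM/OpenFOAM-1.6.x}/applications/test"
echo "test utilities expected under: $TESTS"
# Typical steps on a real install (commented out here, since they need
# a working OpenFOAM environment; the subdirectory name may differ):
#   cd "$TESTS"/parallel*  && wmake
#   cd <your decomposed case> && mpirun -np 6 parallelTest -parallel
```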

Best regards,
Bruno

Nolwenn May 7, 2010 04:33

Hello Bruno,

Thank you for your answer. I ran parallelTest and obtained this:


Code:

Executing: mpirun -np 6 /home/mecaflu/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log
[0]
Starting transfers
[0]
[0] master receiving from slave 1
[0] (0 1 2)
[0] master receiving from slave 2
[0] (0 1 2)
[0] master receiving from slave 3
[0] (0 1 2)
[0] master receiving from slave 4
[0] (0 1 2)
[0] master receiving from slave 5
[0] (0 1 2)
[0] master sending to slave 1
[0] master sending to slave 2
[0] master sending to slave 3
[0] master sending to slave 4
[0] master sending to slave 5
[1]
Starting transfers
[1]
[1] slave sending to master 0
[1] slave receiving from master 0
[1] (0 1 2)
[2]
Starting transfers
[2]
[2] slave sending to master 0
[2] slave receiving from master 0
[2] (0 1 2)
[3]
Starting transfers
[3]
[3] slave sending to master 0
[3] slave receiving from master 0
[3] (0 1 2)
[4]
Starting transfers
[4]
[4] slave sending to master 0
[4] slave receiving from master 0
[4] (0 1 2)
/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6.x                                |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-605bfc578b21
Exec  : parallelTest -parallel
Date  : May 07 2010
Time  : 10:09:41
Host  : monarch01
PID    : 4344
Case  : /media/teradrive01/mecaflu-1.6.x/run/mine/7
nProcs : 6
Slaves :
5
(
monarch01.4345
monarch01.4346
monarch01.4347
monarch01.4348
monarch01.4349
)

Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

End

Finalising parallel run
[5]
Starting transfers
[5]
[5] slave sending to master 0
[5] slave receiving from master 0
[5] (0 1 2)

The test on processor 5 comes after the end; I don't know if it could be a reason for the stopping...
I have 8 GiB of memory and 3 GiB of swap, so memory seems to be OK!
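
As a rough sanity check of that conclusion, memory per rank can be estimated as MemTotal divided by the rank count. A sketch (assumption: per-rank usage is comparable to the single-processor run; the echoed line is a hard-coded sample standing in for the live /proc/meminfo):

```shell
# Estimate memory available per MPI rank: MemTotal / rank count.
# The echoed sample mimics an 8 GiB machine; on a live Linux system,
# read the real file instead:  awk -v np=6 '...' /proc/meminfo
est=$(echo "MemTotal:        8388608 kB" |
      awk -v np=6 '/^MemTotal:/ { printf "about %d MiB per rank", $2/1024/np }')
echo "$est"
```

For 8 GiB split over 6 ranks this prints "about 1365 MiB per rank", comfortably more than a damBreak-sized case needs, which supports the view that memory is not the bottleneck here.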

Best regards

Nolwenn

wyldckat May 7, 2010 14:34

Greetings Nolwenn,

That is a strange output... it seems a bit out of sync :( It has happened to me once some time ago, but the OpenFOAM header always came first!

Doesn't the script foamJob work for you? Or does it output the exact same thing?

Another possibility is that it could actually reveal a bug in OpenFOAM! So, how did you decompose the domains for each processor?

Best regards,
Bruno

Nolwenn May 10, 2010 03:57

Hello Bruno,

Here is the result of foamJob; I can't find a lot of information in it!

Code:

/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6.x                                |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-605bfc578b21
Exec  : parallelTest -parallel
Date  : May 07 2010
Time  : 10:09:41
Host  : monarch01
PID    : 4344
Case  : /media/teradrive01/mecaflu-1.6.x/run/mine/7
nProcs : 6
Slaves :
5
(
monarch01.4345
monarch01.4346
monarch01.4347
monarch01.4348
monarch01.4349
)

Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

End

Finalising parallel run

For the decomposition I didn't create anything; it is the one from a tutorial.

Code:

// The FOAM Project // File: decomposeParDict
/*
-------------------------------------------------------------------------------
 =========        | dictionary
 \\      /        |
  \\    /          | Name:  decomposeParDict
  \\  /          | Family: FoamX configuration file
    \\/            |
    F ield        | FOAM version: 2.1
    O peration    | Product of Nabla Ltd.
    A nd            |
    M anipulation  | Email: Enquiries@Nabla.co.uk
-------------------------------------------------------------------------------
*/
// FoamX Case Dictionary.

FoamFile
{
    version        2.0;
    format          ascii;
    class          dictionary;
    object          decomposeParDict;
}

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //


numberOfSubdomains 6;

method          hierarchical;
//method          metis;
//method          parMetis;

simpleCoeffs
{
    n              (2 1 2);
    delta          0.001;
}

hierarchicalCoeffs
{
    n              (3 1 2);
    delta          0.001;
    order          xyz;
}

manualCoeffs
{
    dataFile        "cellDecomposition";
}

metisCoeffs
{
    //n                  (5 1 1);
    //cellWeightsFile    "constant/cellWeightsFile";
}


// ************************************************************************* //

When I first had this problem I re-installed OF 1.6.x, but the problem was the same.

I use the gcc compiler; is it possible that another compiler would solve this?

Thank you for your help Bruno!

Best regards,

Nolwenn

scott May 10, 2010 22:58

How many processors or cores does your machine have?

I would presume if you have 8 GB that you probably only have a quad-core machine, hence I would only partition the domain into 4 volumes.
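
Scott's advice amounts to matching numberOfSubdomains to the real core count. As a sketch based on the decomposeParDict posted above, a 4-way hierarchical split could look like this (the n values are an assumption; their product must equal numberOfSubdomains):

```
numberOfSubdomains 4;

method          hierarchical;

hierarchicalCoeffs
{
    n               (2 2 1);
    delta           0.001;
    order           xyz;
}
```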

If you have a dual-core machine then that would explain why it is OK with 2 processors, because that's all you have.

Please post up your machine specs so that we can try and be more helpful.

Cheers,

Scott

Nolwenn May 11, 2010 04:28

Hello Scott!

I have 8 processors on my machine; I tried to find the specs:

Code:

r3@monarch01:~$ cat /proc/cpuinfo
processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3599.34
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 1
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 0
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 1
initial apicid    : 1
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.11
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 2
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 1
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 2
initial apicid    : 2
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.10
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 3
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 1
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 3
initial apicid    : 3
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.11
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 4
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 2
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 4
initial apicid    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.10
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 5
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 2
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 5
initial apicid    : 5
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.10
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 6
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 3
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 6
initial apicid    : 6
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.10
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor    : 7
vendor_id    : AuthenticAMD
cpu family    : 15
model        : 33
model name    : Dual Core AMD Opteron(tm) Processor 865
stepping    : 0
cpu MHz        : 1799.670
cache size    : 1024 KB
physical id    : 3
siblings    : 2
core id        : 1
cpu cores    : 2
apicid        : 7
initial apicid    : 7
fpu        : yes
fpu_exception    : yes
cpuid level    : 1
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good extd_apicid pni lahf_lm cmp_legacy
bogomips    : 3600.11
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

When I run parallelTest with 4 processors I have the same problem: the OpenFOAM header comes almost at the end of the test...

Best regards

Nolwenn

scott May 11, 2010 17:34

Have you tried with all 8 processors?

I don't have this problem on mine when I use all of the processors. Make sure you use decomposePar to get 8 partitions before you try.

Scott

scott May 11, 2010 17:42

Also, are these 8 processes all on the same machine or are they on different machines? I.e., is it a small cluster?

I haven't done this on a cluster setup before so I can't be of any help with that. I was assuming that you had two quad-core processors on a single motherboard, but I just went through it again and it's either 8 dual-core processors, or it is 4 dual-core processors reporting a process for each core.

Can you confirm exactly what it is, and maybe someone else can help you.

If it's a cluster then you may have load issues, interconnect problems, or questionable installations on other machines.
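
The package-versus-core question can be settled directly from /proc/cpuinfo by counting distinct physical ids and distinct (package, core) pairs. A sketch (the here-document stands in for the real file and mimics the machine discussed here, 4 packages of 2 cores each; on a live box feed /proc/cpuinfo to the same awk program):

```shell
# Count distinct physical packages and distinct (package, core) pairs.
# On a real system, run the awk program with /proc/cpuinfo as its input
# file instead of the sample here-document below.
summary=$(awk -F':[ \t]*' '
    /^physical id/ { p = $2; if (!(p in pkg))      { pkg[p] = 1;  np++ } }
    /^core id/     { k = p "," $2; if (!(k in cr)) { cr[k] = 1;   nc++ } }
    END { printf "packages=%d cores=%d", np, nc }
' <<'EOF'
physical id : 0
core id : 0
physical id : 0
core id : 1
physical id : 1
core id : 0
physical id : 1
core id : 1
physical id : 2
core id : 0
physical id : 2
core id : 1
physical id : 3
core id : 0
physical id : 3
core id : 1
EOF
)
echo "$summary"
```

For the cpuinfo listing Nolwenn pasted above, this prints "packages=4 cores=8": four dual-core packages and no hyperthreading, i.e. eight real cores.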

Cheers,

Scott

Nolwenn May 12, 2010 04:34

Sorry, I am not very familiar with machine specs!
It is a single machine with 4 dual-core processors.

When I run with all processors the problem is the same:

Code:

Parallel processing using OPENMPI with 8 processors
Executing: mpirun -np 8 /home/mecaflu/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log
[0]
Starting transfers
[0]
[0] master receiving from slave 1
[0] (0 1 2)
[0] master receiving from slave 2
[0] (0 1 2)
[0] master receiving from slave 3
[0] (0 1 2)
[0] master receiving from slave 4
[0] (0 1 2)
[0] master receiving from slave 5
[0] (0 1 2)
[0] master receiving from slave 6
[0] (0 1 2)
[0] master receiving from slave 7
[0] (0 1 2)
[0] master sending to slave 1
[0] master sending to slave 2
[0] master sending to slave 3
[0] master sending to slave 4
[0] master sending to slave 5
[0] master sending to slave 6
[0] master sending to slave 7
[1]
Starting transfers
[1]
[1] slave sending to master 0
[1] slave receiving from master 0
[1] (0 1 2)
[2]
Starting transfers
[2]
[2] slave sending to master 0
[2] slave receiving from master 0
[2] (0 1 2)
[3]
Starting transfers
[3]
[3] slave sending to master 0
[3] slave receiving from master 0
[3] (0 1 2)
[4]
Starting transfers
[4]
[4] slave sending to master 0
[4] slave receiving from master 0
[4] (0 1 2)
[5]
Starting transfers
[5]
[5] slave sending to master 0
[5] slave receiving from master 0
[5] (0 1 2)
[6]
Starting transfers
[6]
[6] slave sending to master 0
[6] slave receiving from master 0
[6] (0 1 2)
[7]
Starting transfers
[7]
[7] slave sending to master 0
[7] slave receiving from master 0
[7] (0 1 2)
/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6.x                                |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-605bfc578b21
Exec  : parallelTest -parallel
Date  : May 12 2010
Time  : 10:22:27
Host  : monarch01
PID    : 4894
Case  : /media/teradrive01/mecaflu-1.6.x/run/mine/9
nProcs : 8
Slaves :
7
(
monarch01.4895
monarch01.4896
monarch01.4899
monarch01.4919
monarch01.4922
monarch01.4966
monarch01.4980
)

Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

End

Finalising parallel run

And in my log file I have just the end (from the OpenFOAM header).

Best regards

Nolwenn

bfa May 12, 2010 05:03

I encounter the same problem as Nolwenn. I use 12 cores on a single machine. parallelTest works fine and prints out results in a reasonable order. But when I run foamJob, the computation hangs on solving the first UEqn. All cores are at 100% but nothing is happening.
solver: simpleFoam
case: pitzDaily
decomposition: simple
OpenFOAM 1.6.x

Here is the output from mpirun -H localhost -np 12 simpleFoam -parallel
Code:

/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6.x                                |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build  : 1.6.x-1d1db32a12b0
Exec  : simpleFoam -parallel
Date  : May 12 2010
Time  : 09:31:45
Host  : brahms
PID    : 11694
Case  : /home/fabritius/OpenFOAM/OpenFOAM-1.6.x/tutorials/incompressible/simpleFoam/pitzDaily
nProcs : 12
Slaves :
11
(
brahms.11695
brahms.11696
brahms.11697
brahms.11698
brahms.11699
brahms.11700
brahms.11701
brahms.11702
brahms.11703
brahms.11704
brahms.11705
)

Pstream initialized with:
    floatTransfer    : 0
    nProcsSimpleSum  : 0
    commsType        : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create mesh for time = 0

Reading field p

Reading field U

Reading/calculating face flux field phi

Selecting incompressible transport model Newtonian
Selecting RAS turbulence model kEpsilon
kEpsilonCoeffs
{
    Cmu            0.09;
    C1              1.44;
    C2              1.92;
    sigmaEps        1.3;
}


Starting time loop

Time = 1

...and there it stops!

I tried mpirun's verbose mode but that delivered no useful information either. Unfortunately I have no profiling tools at hand for parallel code. If any of you has Vampir or something similar and could try this out, that would be great.

wyldckat May 15, 2010 12:12

Greetings to all,

Well, this is quite odd. The only solutions that come to mind are to test the same working conditions with other build scenarios, namely:
  • use the system's OpenMPI, which is a valid option in $WM_PROJECT_DIR/etc/bashrc in OpenFOAM 1.6.x;
  • try using the pre-built OpenFOAM 1.6 available on www.openfoam.com;
  • build OpenFOAM 1.6.x with gcc 4.3.3, which comes in the ThirdParty folder.
If you manage to get it running with one of the above or with some other solution, please tell us about it.

Because the only reason that comes to mind for the solvers to just jam up and not do anything productive is that something didn't get built the way it is supposed to be.


As for the output from parallelTest coming out with the order swapped: it could be a stdout buffering issue, where mpirun outputs the text from the slaves prior to the master's, because the master's output buffer didn't fill up fast enough to reach the character limit that triggers flushing.
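
When diagnosing, that buffering effect can be tamed (a sketch, not specific to OpenFOAM): force line-buffered stdout so each line is flushed as soon as it is printed, and let mpirun label each line with its originating rank.

```shell
# stdbuf (GNU coreutils) forces line-buffered stdout even when output
# goes to a pipe, so short-lived output isn't held back in a full buffer:
stdbuf -oL printf 'header flushed immediately\n'
# With Open MPI, tagging each output line with its rank also makes
# interleaved output readable (commented out: needs an MPI install):
#   mpirun --tag-output -np 6 parallelTest -parallel
```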


Best regards,
Bruno

Nolwenn May 17, 2010 03:50

Hello everyone,

Now everything seems to be OK for me! I went back to OF 1.6 (prebuilt) with Ubuntu 8.04 and I no longer have the problem.

Thank you again for your help Bruno!

Cheers,

Nolwenn

gtampier May 27, 2010 15:29

Hello everyone,

I'm experiencing the very same problem with openSUSE. I've tried the pre-compiled 1.6 version and it worked! My problem arises again when I recompile OpenMPI. I do this in order to add the Torque (batch system) and OFED options. Since we have a small cluster, these options are necessary for running cases on more than one node. Even if I recompile OpenMPI without these options (and just recompile it, nothing else), I get the same problem! (Calculations stop, sometimes earlier, sometimes later and sometimes at the beginning, without any error message and keeping all CPUs at 100%.) This is quite strange - I would be glad if someone has further ideas... I'll keep you informed if I make some progress.

regards
Gonzalo

wyldckat May 27, 2010 20:41

Greetings Gonzalo,

Let's see... here are my questions for you:
  • How did you rebuild OpenMPI? Did you add the build options to the Allwmake script available in the folder $HOME/OpenFOAM/ThirdParty-1.6? Or did you rebuild the library by hand (./configure then make)?
  • What version of gcc did you use to rebuild OpenMPI? OpenFOAM's gcc 4.3.3 or openSUSE's version?
  • Then, after rebuilding the OpenMPI library, did you rebuild OpenFOAM as well? If not, something may be mis-linked somehow.
  • Do you know if your installed openSUSE version has a version of OpenMPI available in YaST (the Software Package Manager part of it) that has the characteristics you need? Because if it does, OpenFOAM 1.6.x has an option to use the system's OpenMPI version instead of the one that comes with OpenFOAM! And even if you need to stick to OpenFOAM 1.6, it should be as easy as copying bashrc and settings.sh from the OpenFOAM-1.6.x/etc folder to the OpenFOAM-1.6/etc folder!
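
The first two questions above can be answered quickly from the shell: check which compiler and which MPI the current environment (with or without OpenFOAM sourced) actually resolves to. A sketch; either tool may legitimately be absent, so the checks are guarded:

```shell
# Show which gcc and mpirun the current shell picks up. Run once before
# and once after sourcing the OpenFOAM environment and compare.
command -v gcc    >/dev/null && gcc --version         | head -n 1 || echo "gcc: not found"
command -v mpirun >/dev/null && mpirun --version 2>&1 | head -n 1 || echo "mpirun: not found"
echo "toolchain check done"
```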
I personally haven't had the time to try and reproduce these odd MPI freezing issues, but I also think it won't be very easy to reproduce them :(

The easiest way to avoid these issues, would be to use the same version of distros as the pre-built binaries came from, namely, if I'm not mistaken, Ubuntu 9.04 and openSUSE 11.0 or 11.1, because they have gcc 4.3.3 as their system compiler.

Best regards,
Bruno

gtampier May 28, 2010 03:09

Hello Bruno, hello all,

thanks for your comments. I now compiled OpenMPI again and it worked! I was first trying to compile it with the system's gcc (4.4.1) of openSUSE 11.2, which apparently caused the problems. Now I've tried it again with the ThirdParty gcc (4.3.3) and it works!
In both cases I compiled it with Allwmake from the ThirdParty-1.6 directory, after uncommenting the openib and openib-libdir options and adding the --with-tm option for Torque. Then I deleted the openmpi-1.3.3/platform dir and executed Allwmake in ThirdParty-1.6. After this, it wasn't necessary to recompile OpenFOAM again.
Now I have run first tests with 2 nodes and a total of 16 processes (the finer damBreak tutorial) and it seems to work fine!
It still remains strange to me, since I did the same for 1.6.x and it didn't work! I'll now try the system's compiler for both OpenFOAM-1.6.x and ThirdParty when I have more time.
Thanks again!
Gonzalo

bunni May 28, 2010 15:39

parallel problem
 
Hi,

I've got a problem running a code in parallel (one machine, quad core). I'm using the OpenFOAM 1.6 prebuilt binaries on Fedora 12.

The error I get is:

/*---------------------------------------------------------------------------*\
| =========                |                                                |
| \\      /  F ield        | OpenFOAM: The Open Source CFD Toolbox          |
|  \\    /  O peration    | Version:  1.6                                  |
|  \\  /    A nd          | Web:      www.OpenFOAM.org                      |
|    \\/    M anipulation  |                                                |
\*---------------------------------------------------------------------------*/
Build : 1.6-f802ff2d6c5a
Exec : interFoam -parallel
Date : May 28 2010
Time : 12:27:10
Host : blue
PID : 23136
Case : /home/bunni/OpenFOAM/OpenFOAM-1.6/tutorials/quartcyl
nProcs : 2
Slaves :
1
(
blue.23137
)

Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create mesh for time = 0

[blue:23137] *** An error occurred in MPI_Bsend
[blue:23137] *** on communicator MPI_COMM_WORLD
[blue:23137] *** MPI_ERR_BUFFER: invalid buffer pointer
[blue:23137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 23137 on
node blue exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[blue:23135] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[blue:23135] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

- so I take it the program is crashing in the mesh part? It seems to run fine on a single proc (and another geometry I had ran fine in parallel jobs). I've meshed a quarter of a cylinder, with the cylinder aligned on the z-axis. I've done a simple decomposition along the z-axis, thinking that the circular geometry might be causing the problem.

Above, Bruno mentioned the scripts runParallel and parallelTest. Where are those scripts?

Cheers

wyldckat May 28, 2010 21:41

Greetings bunni,

Quote:

Originally Posted by bunni (Post 260763)
- so I take it the program is crashing in the mesh part? It seems to run fine on a single proc. (and another geometry I had ran fine for parallel jobs). I've meshed a quarter of a cylinder, with the cylinder aligned on the z-axis. I've done simple decomposition along the z-axis, thinking that the circular geometry might be causing the problem.

You might be hitting an existing bug in OpenFOAM 1.6, that could already be solved in OpenFOAM 1.6.x. For building OpenFOAM 1.6.x in Fedora 12, check this post: Problem Installing OF 1.6 Ubuntu 9.10 (64 bit) - How to use GCC 4.4.1 post #11

Quote:

Originally Posted by bunni (Post 260763)
Above, bruno mentioned the scripts: runParallel, parallelTest. Where are those scripts?

Check my post #4 in this current thread.

Best regards,
Bruno


All times are GMT -4. The time now is 23:46.