
CFD Online Discussion Forums (http://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (http://www.cfd-online.com/Forums/openfoam-solving/)
-   -   decomposed case to 2-cores (Not working) (http://www.cfd-online.com/Forums/openfoam-solving/84084-decomposed-case-2-cores-not-working.html)

pkr January 19, 2011 18:18

decomposed case to 2-cores (Not working)
 
I am working on the interFoam/laminar/damBreak case. The generated mesh has 955000 cells. To run in parallel, the mesh is decomposed using the METIS method.

When decomposed and run on 4 cores (quad-core Xeon E5620), it works perfectly fine.

On changing the decomposed case to 2 cores, the system hangs after running for some time and displays the following error: "mpirun noticed that process rank 0 with PID 3758 on node exited on signal 11 (Segmentation fault)."


Please suggest. Thanks.

romant January 20, 2011 04:37

Did you delete all the previous processor* folders before decomposing again?

pkr January 20, 2011 10:01

Oh yes, I deleted the processor* directories. Any other clues?

santiagomarquezd January 20, 2011 10:22

Hi, sometimes changing the decomposition method fixes the problem; try with simple.

Best.

pkr January 20, 2011 10:31

Thanks for your response. You are right, changing the decomposition method to simple fixed the problem. However, my requirement is to reduce the number of boundary faces shared between the cores, so that I can reduce the communication cost. This is the reason I shifted from simple to METIS.

Do you think the problem is related to the MPI buffer size?

wyldckat January 22, 2011 13:53

Greetings to all!

@pkr: Wow, you've been asking about this in a few places!
Like Santiago implied, the problem is related to how the cells are distributed through the sub-domains. And this isn't a new issue; it has been happening for quite a while now. One case that I followed for a bit led to this bug report: http://www.cfd-online.com/Forums/ope...-parallel.html - If you follow the story on that thread and the related link, you'll learn about an issue with cyclic patches and so on.

Nonetheless, if your case is just a high-resolution version of the damBreak case, without anything particular added - like cyclic patches, wedges and so on - then the problem should be related to a few cells that are split up between processors when they should be kept together. Also, if it's just the damBreak case, AFAIK decomposing with METIS will not minimize the interfaces any more than the simple or hierarchical methods do. The proof is the face count returned by decomposePar, which always shows METIS producing something like 50% more faces interfacing between domains than the simple or hierarchical methods. My experience with high-resolution versions of the damBreak and cavity cases, in attempts to do some benchmarks with OpenFOAM on a multi-core machine, has led me to conclude that the simple and hierarchical methods are more than enough, and in fact better, for situations like these where the meshes are so simple. METIS and Scotch are for more complex meshes, with no clear indication of where the best places to split the mesh are.

Now, if you still want to use METIS, then also try Scotch, which is usually available with the latest versions of OpenFOAM. It's conceptually similar to METIS, but has a far more permissive software license. It will likely produce a different distribution of cells between sub-domains; with luck, the wrong cells won't end up apart from each other.
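Switching between these methods is just a matter of editing the method entry in "system/decomposeParDict". A minimal sketch for 2 sub-domains (the coefficient values here are only illustrative, not tuned for this case):

```
numberOfSubdomains 2;

method          simple;     // alternatives: hierarchical, metis, scotch

simpleCoeffs
{
    n           (2 1 1);    // illustrative: split the domain in two along x
    delta       0.001;
}

metisCoeffs
{
    processorWeights (1 1); // optional per-processor load-balancing weights
}
```

Only the coeffs sub-dictionary matching the chosen method is read, so the others can stay in the file while you experiment.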
Also, if you run the following command in the tutorials and applications folders of OpenFOAM, you can find out a bit more about decomposition options from other dictionaries:
Code:

find . -name "decomposeParDict"
Best regards and good luck!
Bruno

PS: By the way, another conclusion was that on a single machine with multiple cores, sometimes over-scheduling the processors leads to higher processing power; one such case was a modified 3D cavity using the icoFoam solver and about 1 million cells, where on my AMD 1055T (6 cores) 16 sub-domains led to a rather better run time than 4 or 6 sub-domains! But still, I have yet to achieve linear speed-up or anything near it :( even from a CPU computation power point of view (i.e. 6x the power with a 6-core machine, no matter how many sub-domains).

santiagomarquezd January 22, 2011 16:41

Hey Bruno, thx for the explanation. I have a related problem, working with interFoam and METIS too. We have a parallel facility with a server and diskless nodes which read the OS through the net via NFS. When I use METIS and run, for example, a) 2 threads, each one on a core of the server, things go well. Then, if I do the same on b) a node (server and nodes have 8 cores each), the problem is decomposed correctly but only one core carries the load and the problem runs very slowly.
Another case, c) launching from the server but sending one thread to node1 and the other to node2: correct decomposition, balanced load, all OK.
Finally, d) launching from the server sending two threads to the same node: same problem as b). It is very weird; it sounds like the nodes don't like multi-core processing with OpenFOAM.

Regards.

wyldckat January 22, 2011 20:34

Hi Santiago,

Yes, I know, it's really weird! Here's another piece of evidence I picked up from this forum, a draft report by Gijsbert Wierink: Installation of OpenFOAM on the Rosa cluster
If you look at Figure 1 in that document, you'll see the case can't speed up unless it's unleashed onto more than one machine! I've replicated the case used, and the timings with my AMD 1055T (6 cores) are roughly the same. That information is what led me to try over-scheduling 16 processes onto the 6 processors, and I managed to get rather better performance than using only 6 processes.
Basically, the timings reported in that draft indicate a lousy speed-up of almost 4 times on an 8-core machine (4 cores per socket, dual-socket machine if I'm not mistaken), but when 16 and 32 cores (3-4 nodes) are used, the speed-ups are 10 and 20 times! Above that, it saturates due to the cell/core count dropping too far below the 50k cells/core estimate.

With this information, along with the information in the report "OpenFOAM Performance Benchmark and Profiling" and the estimated minimum limit of 50k cells/core, my deductions are:
  1. They cheated in the benchmark report (the second report, not the first one) by using the 2D cavity case, since in a 1 million cell case they have something like 1000x fewer cells interfacing between processors when compared with a 3D cavity case with the same number of cells.
  2. I also tested with the 2D cavity case with my single machine and ironically it ends up with the same speed up scenario as the 3D case! Sooo... maybe it's not the amount communicated between each processor that is the bottleneck.
  3. There might be some calibration missing for shared memory transactions between MPI based processes, because the automatic calibration made for multiple machines works just fine.
  4. The other possibility is that the CPU L1/2/3 caches play a rather constricting role when the communication has to be done only locally, therefore not allowing for proper data interlacing when MPI_Allreduce (see second paper, table in page 16) has to gather data together. When more than one machine is used, it seems that there is a better knitting capability when MPI_Allreduce is called.
    This theory also plays well with over scheduling, since there might be a crazy alignment of cache<->memory accesses, therefore reducing the overall timings.
  5. Load balancing (using the coeffs option for processor weights in the metis method) on a single machine doesn't seem to be worth anything. But on old Pentium 4 HyperThreading CPUs this might have played a key role.
  6. Finally: there must be something that no one is telling us...
People who go to the OpenFOAM workshops might be given information about this issue, but personally I've never been to any. But I would really like to know what on digital earth is going on here...

edit: I forgot to mention, if my memory isn't failing me, that here on the forum there is some limited information about configuring the shared memory defined by the kernel, which can play a rather important role in local runs, but I've never actually been successful in properly tuning these parameters.

Best regards,
Bruno

pkr January 24, 2011 11:57

Thanks Bruno. I am working on your suggestions.

I am also trying to make the parallel case running across the machines. To test the parallel solution, I followed the steps mentioned in other post http://www.cfd-online.com/Forums/ope...tml#post256927.

If I run the parallel case with 2 processes on a single machine, the parallelTest utility works fine:
Quote:

Parallel processing using OPENMPI with 2 processors
Executing: mpirun -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel | tee log
[0]
Starting transfers
[0]
[0] master receiving from slave 1
[0] (0 1 2)
[0] master sending to slave 1
[1]
Starting transfers
[1]
[1] slave sending to master 0
[1] slave receiving from master 0
[1] (0 1 2)
/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.6 |
| \\ / A nd | Web: www.OpenFOAM.org |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 1.6-f802ff2d6c5a
Exec : parallelTest -parallel
Date : Jan 24 2011
Time : 11:52:13
Host : fire1
PID : 25559
Case : /home/rphull/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak
nProcs : 2
Slaves :
1
(
fire1.25560
)

Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

End

On the other hand, if I split the processing across 2 machines, then the system hangs after "Create time":
Quote:

Parallel processing using OPENMPI with 2 processors
Executing: mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel | tee log
/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.6 |
| \\ / A nd | Web: www.OpenFOAM.org |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 1.6-f802ff2d6c5a
Exec : parallelTest -parallel
Date : Jan 24 2011
Time : 11:54:36
Host : fire1
PID : 26049
Case : /home/rphull/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak
nProcs : 2
Slaves :
1
(
fire3.31626
)

Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time
Please suggest what might be going wrong. Thanks.


P.S. OpenFOAM version 1.6 is used on both machines.

wyldckat January 24, 2011 17:41

Hi pkr,

Wow, I love it when the parallelTest application does some crazy time travel and flushes the buffer in a crazy order :D

As for the second run:
  • Is the path "/home/rphull/OpenFOAM/OpenFOAM-1.6" visible in both machines?
  • Is either one of the folders on this path shared via NFS or sshfs or something like that?
  • Or at the very least, is the folder structure pretty much identical in both machines?
If neither one of these is true, then the problem is the missing files.
On the other hand, if at least one of them is true, then you should check how the firewall is configured on those two machines. The other possibility is that the naming convention for the IP addresses isn't being respected in both machines. For example, if the first machine has defined in "/etc/hosts" that:
  • 192.192.192.1 machine1
  • 192.192.192.2 machine2
But then the second machine has:
  • 192.192.192.10 machine1
  • 192.192.192.2 machine2
In that case, something nasty is going to happen :(

My usual trick to try to isolate these cases is to:
  1. Run parallelTest with mpirun, but without the "-parallel" argument. If it runs on both of them, then it's a port access problem, i.e. firewall.
  2. Try and run with the same scenario (path, case and command), but starting from the other machine.
Last but not least: your user name and password might not be identical in both machines, although it should have complained at least a bit after launching mpirun :confused:

Best regards and good luck!
Bruno

santiagomarquezd January 24, 2011 20:40

Bruno, some comments,

Quote:

Originally Posted by wyldckat (Post 291725)
If you see the Figure 1 in that document, you'll see the case can't speed up unless it's unleashed into more than one machine! I've replicated the case used and the timings with my AMD 1055T x6 are roughly the same. It was with that information that lead me to try do over-scheduling of 16 processes into the 6 processors and managed to get a rather better performance than using only 6 processes.
  • There might be some calibration missing for shared memory transactions between MPI based processes, because the automatic calibration made for multiple machines works just fine.
  • The other possibility is that the CPU L1/2/3 caches play a rather constricting role when the communication has to be done only locally, therefore not allowing for proper data interlacing.
  • Load balancing (using the coeffs option for processor weights in the metis method) in a single machine, doesn't seem to be worth anything. But in old Pentium 4 HyperThreading CPUs this might have played a key role.
  • Finally: there must be something that no one is telling us...

The problem is even a little bit weirder because, as I posted, the asymmetrical CPU load appears only when two or more threads run on the same node; nevertheless, on the server I can do multi-core runs without problems. If it is an Open MPI problem, it only appears on the nodes.

Regards.

pkr January 24, 2011 22:08

Thanks for your response Bruno. I tried your suggestions, but still no progress in solving the problem.

Quote:

Is the path "/home/rphull/OpenFOAM/OpenFOAM-1.6" visible in both machines?
Yes, it's visible.



Quote:

Is either one of the folders on this path shared via NFS or sshfs or something like that?
Not sure on this.



Quote:

Or at the very least, is the folder structure pretty much identical in both machines?
Folder structure is identical.



Quote:

On the other hand, if at least one of them is true, then you should check how the firewall is configured on those two machines.
I even tried with Firewall disabled. Still not working.



Quote:

The other possibility is that the naming convention for the IP addresses isn't being respected in both machines. For example, if the first machine has defined in "/etc/hosts" that:
  • 192.192.192.1 machine1
  • 192.192.192.2 machine2
But then the second machine has:
  • 192.192.192.10 machine1
  • 192.192.192.2 machine2
In that case, something nasty is going to happen :(
I checked this, the machine IPs are fine.



Apart from this, I tried a simple OpenMPI program, which works fine. The code and output are as follows:
Quote:

Code:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

    MPI_Finalize();
    return 0;
}


Execution:
rphull@fire1:~$ mpirun -np 2 -machinefile machine mpi_prog
Process 0 on fire1 out of 2
Process 1 on fire3 out of 2


Quote:

Run parallelTest with mpirun, but without the "-parallel" argument. If it runs both of them, then it's a port accessing problem, i.e. firewall.
I guess you asked me to execute
mpirun -np 2 parallelTest ==> Works
mpirun --hostfile machines -np 2 parallelTest ==> Not working



Quote:

Try and run with the same scenario (path, case and command), but starting from the other machine.
Tried, but still not working.


Do you think it might be a problem with the version I am using? I am currently working with OpenFOAM 1.6. Shall I move to OpenFOAM 1.6.x?
Please suggest some other things I can check.

Thanks.

pkr January 25, 2011 01:29

Hi Bruno,

Another query: Please comment on the process I am following to execute parallelTest across the machines.

1. machine1 as master and machine2 as slave.
2. On machine1, change system/decomposeParDict for 2 processes.
3. Execute decomposePar on machine1, which creates two directories, processor0 and processor1.
4. Create a machines file on machine1 containing machine1 and machine2 as entries.
5. Copy the processor0 and processor1 directories from machine1 to machine2. (Directory: OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak)
6. Launch "foamJob -p -s parallelTest" on machine1.
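For step 4, since foamJob passes the machines file straight to "mpirun -hostfile machines", it follows Open MPI's hostfile format. A sketch (one slot per machine is an assumption here):

```
# hostfile read by "mpirun -hostfile machines"; '#' lines are comments
machine1 slots=1
machine2 slots=1
```

Increasing a slots count tells mpirun how many ranks it may place on that host.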


After following these steps, the output gets stuck at "Create time", as follows:
Quote:

Parallel processing using OPENMPI with 2 processors
Executing: mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log
/*---------------------------------------------------------------------------*\
| ========= | |
| \\ / F ield | OpenFOAM: The Open Source CFD Toolbox |
| \\ / O peration | Version: 1.6.x |
| \\ / A nd | Web: www.OpenFOAM.org |
| \\/ M anipulation | |
\*---------------------------------------------------------------------------*/
Build : 1.6.x-5e99c7bac54c
Exec : parallelTest -parallel
Date : Jan 25 2011
Time : 01:25:31
Host : fire2
PID : 30454
Case : /home/rphull/OpenFOAM/OpenFOAM-1.6.x/tutorials/multiphase/interFoam/laminar/damBreak
nProcs : 2
Slaves :
1
(
fire3.18782
)

Pstream initialized with:
floatTransfer : 0
nProcsSimpleSum : 0
commsType : nonBlocking
SigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Please confirm whether I am following the right process for executing the parallelTest application across the machines.

pkr January 25, 2011 15:10

Hi Bruno,

Yet another query:
It seems that the problem might be due to the setting of some environment variables on the slave. Please suggest.

The OpenFOAM project directory is visible on the slave side:
rphull@fire3:~$ echo $WM_PROJECT_DIR
/home/rphull/OpenFOAM/OpenFOAM-1.6

1. When complete path of executable is not specified:
Quote:

rphull@fire2:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun --hostfile machines -np 2 interFoam -parallel
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not find an executable:

Executable: interFoam
Node: fire3

while attempting to start process rank 1.
--------------------------------------------------------------------------


2. The case when the complete executable path is specified:
Quote:

rphull@fire2:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun --hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/applications/bin/linux64GccDPOpt/interFoam -parallel
/home/rphull/OpenFOAM/OpenFOAM-1.6/applications/bin/linux64GccDPOpt/interFoam: error while loading shared libraries: libinterfaceProperties.so: cannot open shared object file: No such file or directory

From the second case, it looks like the machine tried to launch the application but failed because it could not figure out the path to the shared object (libinterfaceProperties.so in this case). Any suggestions to fix this?

wyldckat January 25, 2011 18:29

Hi pkr,

Mmm, that's a lot of testing you've been doing :)

OK, let's see if I don't forget anything:
  • Creating a test application was a very good call! I didn't suggest that because I don't know any test case that can be retrieved from the internet for quick testing.
    There is only one test missing from your application: data transfer from one machine to the other using MPI!
  • Using NFS to share the OpenFOAM folder between the two machines is the quickest deployment on various machines. But for the best operating conditions, at least the folder where the case is should be shared between the machines. I suggest you search the internet for how to set up NFS on the Linux distribution you are using. A more generic guide is available here: Linux NFS-HOWTO
    If you still don't know what NFS is: Network File System (protocol)
  • The reason why simply giving the application's path won't work is simple: the OpenFOAM environment is only activated in your terminal; it's not activated automatically when you launch the application remotely on the other machine! That's why foamJob uses foamExec with mpirun:
    Code:

    mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel | tee log
  • So, without "-parallel" things still don't work. Try running similarly to the code above, but like this:
    Code:

    mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest | tee log
    If this still doesn't work, then it's foamExec who isn't working properly.
  • Quote:

    mpirun -np 2 parallelTest ==> Works
    mpirun --hostfile machines -np 2 parallelTest ==> Not working
    Quote:

    Try and run with the same scenario (path, case and command), but starting from the other machine.
    Tried, but still not working.
    Did you test this on both machines? Because from what you wrote, it feels that either you only tested the second command line on the slave, or both command lines failed on the slave machine.
Here are two other tests you can try:
Code:

mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD

Run both lines on each machine, when you're inside the test case damBreak, as you have done so far.
These two tests can help you isolate where the problem is, since they only launch one instance of parallelTest on the remote machine, without the need for explicit communication between processes via MPI.

OK, I hope I didn't forget anything.

Best regards and good luck!
Bruno

pkr January 27, 2011 12:48

Hi Bruno,

Quote:

Creating a test application was a very good call! I didn't suggest that because I don't know any test case that can be retrieved from the internet for quick testing. There is only one more testing missing from your test application: it's missing data transfer from one machine to the other using MPI!
I created a test application to exchange data between 2 processes on different machines; the test case works absolutely fine.



Quote:

Using NFS to share the OpenFOAM folder between the two machines is the quickest deployment on various machines. But for the best operation conditions, at least the folder where the case is should be share between the machines. I suggest you search the internet on how to setup NFS on the Linux distribution you are using. A more generic guide is available here: Linux NFS-HOWTO
If you still don't know what NFS is: Network File System (protocol)
The machines I am using do not share the OpenFOAM folder through NFS. Is it a must to have the folders shared for the parallel case to work?
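(For reference, sharing the folder over NFS amounts to an export on the master plus a mount on the slave; the subnet below is only an assumed example:)

```
# /etc/exports on the master (the subnet is an assumption)
/home/rphull/OpenFOAM  192.168.1.0/24(rw,sync,no_subtree_check)

# then on the master:  exportfs -ra
# and on the slave:
#   mount -t nfs fire1:/home/rphull/OpenFOAM /home/rphull/OpenFOAM
```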



Quote:

Code:
mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest | tee log
If this still doesn't work, then it's foamExec who isn't working properly.
This experiment works.
Does putting "-parallel" make it run in a master-slave framework?



Quote:

Here is two other tests you can try:
Code:
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
Run both lines on each machine, when you're inside the test case damBreak, as you have done so far. These two tests can help you isolate where the problem is, since they only launch one instance of parallelTest on the remote machine, without the need for explicit communication between processes via MPI.
Here is an interesting observation:
Both cases work fine on machine1.
When I try the same on machine2, the following case fails:
rphull@fire3:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
exec: 128: parallelTest: not found

Is this the root cause for the problem? Any suggestions to fix this?

wyldckat January 27, 2011 13:14

Hi pkr,

Quote:

Originally Posted by pkr (Post 292512)
I created a test application to exchange data between 2 processed on different machines, the test case works absolutely fine.

Nice! With this application you can sort out whether the problem is related to the network or to MPI. In your case, it's neither a network issue nor an MPI issue.

Quote:

Originally Posted by pkr (Post 292512)
The machines I am using does not involve any sharing of OPENFOAM folder through NFS. Is it must to have folders shared for the parallel case to work?

It's not 100% necessary, but as time goes by you'll start feeling the annoyance of having to copy the "processor*" folders back and forth for each machine. For example, if you have 4 sub-domains of your mesh and 2 go to the slave machine, you'll have to copy the folders "processor2" and "processor3" from the slave machine to the master when the simulation is complete.
When using NFS or sshfs, this is sort of done automatically for you, except that the real files reside only on a single machine, instead of being physically replicated on both machines.

Quote:

Originally Posted by pkr (Post 292512)
This experiment works.
Does putting "-parallel" makes it to run in master-slave framework?

Yes, the "-parallel" flag makes them work together. It's valid for any solver/utility in OpenFOAM that is "parallel aware". Those that aren't will complain if you try to use this argument ;)

Quote:

Originally Posted by pkr (Post 292512)
Here is an interesting observation:
Both the cases works fine on machine1.
When I try the same on machine2, the following case fails:
rphull@fire3:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
exec: 128: parallelTest: not found

Is this the root cause for the problem? Any suggestions to fix this?

:eek: Yes I would say that we are getting closer to the root of the problem!
OK, the first thing that comes to my mind is that "parallelTest" is only available on one of the machines. To confirm this, run on both machines:
Code:

which parallelTest
The returned path should be the same.

Now, when I try to think more deeply about this, I get the feeling that there is something else slightly different on one of the machines, but I can't put my finger on it... it feels like either the Linux distribution versions aren't identical, or something about bash isn't working the same exact way. Perhaps it's how "~/.bashrc" is defined on both machines... check whether there are any big differences between the two files. Any changes to the variables "PATH" and "LD_LIBRARY_PATH" inside "~/.bashrc" that differ in some particular way can lead to very different working environments!
The other possibility would be how ssh is configured on both machines...

Best regards,
Bruno

pkr January 27, 2011 14:05

Thanks Bruno.

Quote:

Yes I would say that we are getting closer to the root of the problem!
OK, the first thing that comes to my mind is that "parallelTest" is only available on one of the machines. To confirm this, run on both machines:
Code:
which parallelTest
The returned path should be the same.
On one of the machines, parallelTest was being picked up from the debug version of OpenFOAM. Now that I have changed that, both of the tests below work fine.
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD

But still the system hangs when I try to run parallelTest with the -parallel keyword across 2 machines :(


Quote:

Now, when I try to think more deeply about this, I get the feeling that there is something else that is slightly different on one of the machines, but I can't put my finger on it... it feels that it's either the Linux distribution version that isn't identical... or something about bash not working the same exact way. Perhaps it's how "~/.bashrc" is defined on both machines... check if there are any big differences between the two files. Any changes to the variables "PATH" and "LD_LIBRARY_PATH" inside "~/.bashrc" which are different in some particular way, can lead to very different working environments!
The other possibility would be how ssh is configured on both machines...
"echo $PATH" and "echo $LD_LIBRARY_PATH" produce the same results on both machines, as follows:
Quote:

/home/rphull/OpenFOAM/ThirdParty-1.6/paraview-3.6.1/platforms/linux64Gcc/bin:/home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/bin:/home/rphull/OpenFOAM/rphull-1.6/applications/bin/linux64GccDPOpt:/home/rphull/OpenFOAM/site/1.6/bin/linux64GccDPOpt:/home/rphull/OpenFOAM/OpenFOAM-1.6/applications/bin/linux64GccDPOpt:/home/rphull/OpenFOAM/OpenFOAM-1.6/wmake:/home/rphull/OpenFOAM/OpenFOAM-1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

/home/rphull/OpenFOAM/OpenFOAM-1.6/lib/linux64GccDPOpt/openmpi-1.3.3:/home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib:/home/rphull/OpenFOAM/rphull-1.6/lib/linux64GccDPOpt:/home/rphull/OpenFOAM/site/1.6/lib/linux64GccDPOpt:/home/rphull/OpenFOAM/OpenFOAM-1.6/lib/linux64GccDPOpt:/usr/local/cuda/lib64/

wyldckat January 27, 2011 18:38

Hi pkr,

Quote:

Originally Posted by pkr (Post 292519)
In one of the machine, the parallelTest was being picked up from the debug version of OpenFoam. Now have changed that both of the below mentioned tests work fine.
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD

But still the system hangs when I try to run the openmpi parallelTest with -parallel keyword for 2 machines :(

:confused: What the... OK, we will have to review the commands you are using and what tests you have done, to confirm that everything is being done correctly!
  1. Have you tried running your test application from/to both machines?
  2. What are the commands you use to launch your test application that only depends on OpenMPI? Including for launching between machines.
  3. What are the commands you use to launch parallelTest with the "-parallel" argument?
  4. Do you use foamJob to launch it, or the whole command line we've been using in the latest tests? Either way, try both! And write here the commands you used.
Last but not least, also try running the latest tests without the argument "-case $PWD". You might already know this, but this argument tells the OpenFOAM application that the desired case folder is located at "$PWD", which expands to the current folder.

Another test you can try, which is launching a parallel case to work solely on the remote machine:
Code:

mpirun -np 2 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel -case $PWD
mpirun -np 2 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel -case $PWD

This way the machine only has to communicate with itself!

It feels like we are really close to getting this to work, and yet it seems so far...

OK, another test for trying to isolate what on digital earth is going on - run this using the two machines, from/to either one of them and also only locally and only remotely:
Code:

foamJob -s -p bash -c export
cat log | sort -u > log2
export | sort -u > loglocal
diff -Nur loglocal log2 > log.diff

mpirun -np 4 bash -c export > log.simple
cat log.simple | sort -u > log2.simple
diff -Nur loglocal log2.simple > log.simple.diff

Now, what these commands do:
  • Gather a list of the variables defined on each remote environment;
  • Sort and keep only the unique lines of the contents of each list, namely the remote list and the local list;
  • Compare the two lists and save the differences on the file "log.diff".
  • Then gather the list of variables defined on each remote machine, without activating the OpenFOAM environment remotely.
  • For this list, also sort it out.
  • Finally, compare the clean remote environments with the local activated environment and save the differences in the file "log.simple.diff".
OK, now open these two "*.diff" files in a text editor, preferably gedit, Kate, or some other editor that gives you a coloured display of the differences in each file. Lines that are added to the file on the right start with "+" and the lines removed start with "-".
This is like the last resort, which should help verify what the environment looks like on the remote machine, when using mpirun to launch the process remotely. The things to keep an eye out for are:
  • when running only remotely, whether there are any OpenFOAM variables that exist locally but are missing remotely.
  • the "log.simple.diff" file will help you discover what exactly is activated on your local environment that isn't activated remotely, or that is indeed activated remotely, but only because the remote is actually the local machine.
    In other words, if your two machines picked for mpirun are the local machine, then the environment is possibly automagically copied from one launch command to the other! And such magic won't happen when launching to a remote machine, as you already know.
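
A minimal local illustration of this sort/diff comparison, with two hypothetical environment dumps standing in for the local and remote lists (no MPI involved):

```shell
# "loglocal" stands in for the sorted local environment dump,
# "log2" for the sorted remote one (hypothetical variables).
printf 'declare -x FOO="1"\ndeclare -x WM_PROJECT="OpenFOAM"\n' | sort -u > loglocal
printf 'declare -x FOO="1"\n' | sort -u > log2
# WM_PROJECT exists only locally, so in the diff it appears prefixed
# with "-": exactly the kind of missing-variable line to look for.
diff -Nur loglocal log2 > log.diff || true
cat log.diff
```

Any OpenFOAM variable (WM_*, FOAM_*) showing up only on the "-" side means the remote shell never activated the OpenFOAM environment.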
Hopefully it won't be necessary to go this far, but if you do need to get this far, then I hope you can figure out what's missing.
Right now I'm too tired to figure out any more tests and/or possibilities.

Best regards and good luck!
Bruno

pkr January 28, 2011 00:01

Thanks for your response. I am listing all the commands in this post. I still have to try the commands for diffing the remote machine configuration. I will get back to you soon on that.

Quote:

Have you tried running your test application from/to both machines?
What are the commands you use to launch your test application that only depends on OpenMPI? Including for launching between machines.
What are the commands you use to launch parallelTest with the "-parallel" argument?
Do you use foamJob to launch it or the whole command line we've been using on the latest tests? Either way, try both! And write here what were the commands you used.
Last but not least, also try running the latest tests without the argument "-case $PWD". You might already know this, but this argument tells the OpenFOAM application that the desired case folder is located at "$PWD", which expands to the current folder.
Yes, I have tried running it from/to both machines, i.e., I change to the damBreak directory on machine 1 and launch my test application; in the same way I try it from machine 2.
Commands to launch:
1. Without -parallel but with machine file
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest ====> Works fine from both the machines

2. With -parallel but without any machine file
mpirun -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel ====> Works fine from both the machines

3. With -parallel and with machine file
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel ====> Does not work from either machine

With foamJob, I try the following:
foamJob -p -s parallelTest ==> This works when the machines file is not present in the current directory; otherwise it fails

All of the following commands work fine from both machines.
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 2 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
mpirun -np 2 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel




When running with -parallel across the machines, once in a while I see the following error message. Have you seen it before?
Quote:

Executing: mpirun -np 2 -hostfile machines /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel | tee log
[fire2][[42609,1],1][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)


I also tried debugging with gdb. Here is the call stack where program gets stuck when running with -parallel and across the machines:
Quote:

(gdb) where
#0 0x00007f1ee2f3f8a8 in *__GI___poll (fds=0x19b6040, nfds=7, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007f1ee20eed23 in poll_dispatch () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
#2 0x00007f1ee20edc5b in opal_event_base_loop () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
#3 0x00007f1ee20e1041 in opal_progress () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
#4 0x00007f1ee25be92d in ompi_request_default_wait_all () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libmpi.so.0
#5 0x00007f1eddb30838 in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/openmpi/mca_coll_tuned.so
#6 0x00007f1ee25d3762 in MPI_Allreduce () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libmpi.so.0
#7 0x00007f1ee2c683d4 in Foam::reduce (Value=@0x7fff4742deb8, bop=<value optimized out>) at Pstream.C:276
#8 0x00007f1ee3d083e7 in Foam::Time::setControls (this=0x7fff4742e070) at db/Time/Time.C:147
#9 0x00007f1ee3d0990e in Time (this=0x7fff4742e070, controlDictName=..., rootPath=<value optimized out>, caseName=<value optimized out>,
systemName=<value optimized out>, constantName=<value optimized out>) at db/Time/Time.C:240
#10 0x0000000000401c01 in main (argc=2, argv=0x7fff4742e8c8) at /home/rphull/OpenFOAM/OpenFOAM-1.6/src/OpenFOAM/lnInclude/createTime.H:12
(gdb) n
93 in ../sysdeps/unix/sysv/linux/poll.c
(gdb) n
0x00007f1ee20eed23 in poll_dispatch () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
Current language: auto
The current source language is "auto; currently asm".
(gdb) n
Single stepping until exit from function poll_dispatch,
which has no line number information.
0x00007f1ee20edc5b in opal_event_base_loop () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
(gdb) n
Single stepping until exit from function opal_event_base_loop,
which has no line number information.
0x00007f1ee20e1041 in opal_progress () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libopen-pal.so.0
(gdb) n
Single stepping until exit from function opal_progress,
which has no line number information.
0x00007f1ee25be92d in ompi_request_default_wait_all () from /home/rphull/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linux64GccDPOpt/lib/libmpi.so.0
(gdb) n
Single stepping until exit from function ompi_request_default_wait_all,
which has no line number information.

