CFD Online Discussion Forums


xpqiu May 2, 2013 04:03

Something weird encountered when running OpenFOAM in parallel on multiple nodes
 
Hello everyone,
Recently I encountered something weird when running OpenFOAM in parallel on multiple nodes. I have two nodes (A123 and A122), each with 8 cores. After a lot of trial and error, my case ran successfully in parallel with 16 processes. However, when I tried to monitor the network traffic between A123 and A122 during the run, something happened that I find strange.
Let me describe my situation in detail; I hope you can give me some advice.


The following steps depict how I implemented parallel running on multiple nodes:

1
Both A123 and A122 run 64-bit CentOS 5.4 with OpenFOAM 2.1.1 installed. The OpenFOAM package was downloaded from centFOAM (http://sourceforge.net/projects/centfoam/files/5.x/); I just unpacked the tarball and sourced the bashrc in ../etc/. The ThirdParty package was used, with Open MPI version 1.5.3. With these settings OpenFOAM works fine: both serial and parallel cases run successfully on a single node.

2
Then I changed the SSH settings so that A123 and A122 can log in to each other without a password (i.e. on A123 I can log in to A122 by typing "ssh A122", and vice versa, with no password prompt).

3
I made a directory named "shares" on both A123 and A122 (mkdir -p ~/shares). On A122, I used "mount" to mount the shares directory of A123 on "~/shares" (mount -t nfs A123:$HOME/shares ~/shares, as root). So the directory "~/shares" on A123 is shared by A123 and A122.

4
I made a hostfile (named "hosts_2-8") with the following contents:
A123 cpu=8
A122 cpu=8

5
After that, I copied my case to "~/shares" on A123, ran blockMesh, set up decomposeParDict, ran decomposePar, and finally ran "mpirun --hostfile hosts_2-8 -np 16 pisoFoam -parallel".

So far everything looked fine. From the log, I can see that the CFD domain was split into 16 subdomains and that the case was computed with 16 processes (8 on A123 and 8 on A122; one of the processes on A123 was the master, the rest were slaves). :)
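Steps 4 and 5 above can be sketched as a small dry-run script. This is only an illustration under stated assumptions: the host names, the "cpu=8" entries, the file name and the final mpirun command are taken from the post, while the helper functions hostfile_slots and launch_plan are hypothetical names introduced here. The script writes the hostfile, checks that the slots add up to the requested process count, and prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of steps 4 and 5; host names, slot counts and file name come from the post.
HOSTFILE=hosts_2-8
NP=16

# Step 4: one line per node, 8 processes each.
printf '%s\n' 'A123 cpu=8' 'A122 cpu=8' > "$HOSTFILE"

# Sum the cpu= values so a mismatch with -np is caught before launching.
# (hostfile_slots is a hypothetical helper, not an Open MPI tool.)
hostfile_slots() {
    awk -F'cpu=' 'NF > 1 {s += $2} END {print s + 0}' "$1"
}

# Step 5: print the command sequence instead of executing it (dry run).
launch_plan() {
    echo "blockMesh"
    echo "decomposePar"
    echo "mpirun --hostfile $1 -np $2 pisoFoam -parallel"
}

if [ "$(hostfile_slots "$HOSTFILE")" -eq "$NP" ]; then
    launch_plan "$HOSTFILE" "$NP"
else
    echo "hostfile provides $(hostfile_slots "$HOSTFILE") slots, but -np is $NP" >&2
fi
```

The same number (16 here) would also have to match numberOfSubdomains in decomposeParDict, which this sketch does not check.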

Then I tried to monitor the network traffic between A123 and A122. I installed and ran "iftop" (http://www.ex-parrot.com/~pdw/iftop/) on A122, and it turned out that data packets were exchanged between A123 and A122 only at the moments when the run began and ended. In between, not even a byte of data exchange could be observed! :eek:

As far as I know, when OpenFOAM runs in parallel there should be data exchange between processes, so there should be network traffic between A123 and A122 all the time.

Did I make a mistake, either in setting up the parallel run on multiple nodes or in my understanding of MPI? Or should I use other software to monitor the network traffic? By the way, I also tried Wireshark, and the result was the same. I even tried pulling out the Ethernet cable during the run, and the run aborted, as expected.

I hope you can give me some advice. Thank you! :D

xpqiu
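On the question of whether another tool could be used: one possible cross-check, independent of iftop and Wireshark, is to read the kernel's per-interface byte counters from /proc/net/dev twice and take the difference. The sketch below assumes a Linux node; the interface name eth0 is an assumption, and iface_bytes and traffic_delta are hypothetical helper names. The counters cover all traffic on the chosen interface, not only traffic between A123 and A122, and traffic that goes through a different interface or bypasses the kernel network stack may not show up in them.

```shell
#!/bin/sh
# Print "rx_bytes tx_bytes" for one interface from a /proc/net/dev style file.
# $1 = interface name (eth0 is an assumed name), $2 = file (default /proc/net/dev).
iface_bytes() {
    # Older kernels print "eth0:1234" with no space after the colon,
    # so turn the first colon into a space before splitting into fields.
    sed 's/:/ /' "${2:-/proc/net/dev}" |
        awk -v ifc="$1" '$1 == ifc {print $2, $10}'
}

# Print how many bytes the interface received and sent during $2 seconds.
traffic_delta() {
    before=$(iface_bytes "$1")
    sleep "$2"
    after=$(iface_bytes "$1")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "no counters found for interface: $1" >&2
        return 1
    fi
    # Positional parameters become: rx_before tx_before rx_after tx_after.
    set -- $before $after
    echo "rx_bytes=$(( $3 - $1 )) tx_bytes=$(( $4 - $2 ))"
}

# Example, to be run on A122 while the solver is running:
#   traffic_delta eth0 10
```

If the difference stays near zero on every interface while the solver is clearly progressing on both nodes, that would point towards the data travelling over a path these counters do not see.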

haakon May 2, 2013 04:30

I guess one possible explanation is that Open MPI might not use the TCP/IP protocol (actually SSH over TCP/IP) for anything other than establishing and closing network links on a lower layer of the OSI model. By establishing low-level data links it is possible to get much better performance than with IP-based protocols. Such communication might not be spotted by iftop or similar software, which probably sits on top of the IP protocol.
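One way to test this guess would be to restrict Open MPI to its TCP transport on a known interface and then watch that interface. The sketch below only builds and prints such a command; it does not run it. The MCA parameters "btl" and "btl_tcp_if_include" are Open MPI options, but the interface name eth0 is an assumption, and tcp_only_cmd is a hypothetical helper name. The hostfile name, process count and solver are taken from the original post.

```shell
#!/bin/sh
# Build (but do not run) an mpirun command that restricts Open MPI to the
# TCP, shared-memory (sm) and loopback (self) transports.
# $1 = hostfile, $2 = process count, $3 = network interface (assumed name).
tcp_only_cmd() {
    echo "mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include $3 --hostfile $1 -np $2 pisoFoam -parallel"
}

tcp_only_cmd hosts_2-8 16 eth0

# To list the transports (BTL components) an Open MPI installation provides:
#   ompi_info | grep btl
```

If traffic becomes visible on that interface with this restriction in place, the earlier run was probably using a different interface or transport.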

Do the lights on your network interfaces (at the back of the machine) blink? If so, I guess everything is working. If not, you should check that the computations are running on both machines and not just one. Try running ps -A | grep Foam on both machines to see all running OpenFOAM applications. Both machines should show the same number of applications (8).
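The process check can be scripted so that both nodes are queried from one terminal. This is a minimal sketch, assuming passwordless SSH between the nodes as described in the first post; count_procs and check_hosts are hypothetical helper names. Unlike grep Foam, it matches the command name exactly.

```shell
#!/bin/sh
# Read `ps -A` style output on stdin and count the lines whose last field
# (the command name) is exactly $1.
count_procs() {
    awk -v name="$1" '$NF == name {n++} END {print n + 0}'
}

# Query each host over ssh and print "<host>: <count>".
# $1 = command name, remaining arguments = host names.
check_hosts() {
    name=$1
    shift
    for h in "$@"; do
        echo "$h: $(ssh "$h" ps -A | count_procs "$name")"
    done
}

# Example for the setup in this thread (not executed here):
#   check_hosts pisoFoam A123 A122
```

For the 16-process run described above, both lines would be expected to report 8.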

xpqiu May 2, 2013 04:59

Thanks haakon,
The network works fine, and after typing "ps -A | grep Foam" I can see 8 processes named "pisoFoam" on both A123 and A122, so I think the run is normal.
As for the protocol, I am not familiar with that, so I will search the net. Thank you for the information.

xpqiu


All times are GMT -4.