
Segmentation fault in interFoam run through openMPI

September 22, 2011, 03:28   #1
Luca Giannelli (voingiappone) - Member
Hello everybody...

here I go with a hard question. I hope somebody out there would like to help me figure out a solution.

Long (loooong) story short: I have 2 boxes with OpenFOAM 1.6 installed and I want them to run in parallel:

kumori ---> i386
PS3 ---> PPC64

It took me a long time to compile OpenFOAM on the PS3 and now it is working like a charm. I'm still trying to run the damBreak tutorial; however, when I launch the mpirun command it spits out an error like this:
Code:
 
mpirun -np 1 -host kumori interFoam -parallel : -np 1 -host ps3 interFoam -parallel


/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6-53b7f692aa41
Exec   : interFoam -parallel
Date   : Sep 22 2011
Time   : 16:13:07
Host   : kumori
PID    : 32267
[PS3:29504] *** Process received signal ***
[PS3:29504] Signal: Segmentation fault (11)
[PS3:29504] Signal code: Address not mapped (1)
[PS3:29504] Failing at address: 0xa28c2c7a
[PS3:29504] [ 0] [0xfff82960418]
[PS3:29504] [ 1] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam8IPstream4readERNS_5tokenE-0x3a21ec) [0xfff80e947d4]
[PS3:29504] [ 2] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam5tokenC1ERNS_7IstreamE-0x3b7e4c) [0xfff80e7d2e4]
[PS3:29504] [ 3] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4FoamrsERNS_7IstreamERNS_6stringE-0x3d24d0) [0xfff80e613d0]
[PS3:29504] [ 4] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4FoamrsINS_6stringEEERNS_7IstreamES3_RNS_4ListIT_EE-0x3e1908) [0xfff80e50ae0]
[PS3:29504] [ 5] /home/piota/OpenFOAM/OpenFOAM-1.6/lib/linuxPPC64GccDPOpt/libOpenFOAM.so(_ZN4Foam7argListC1ERiRPPcbb-0x3eb320) [0xfff80e471b8]
[PS3:29504] [ 6] interFoam() [0x1001f7fc]
[PS3:29504] [ 7] /lib64/libc.so.6(+0x4f5e8) [0xfff808875e8]
[PS3:29504] [ 8] /lib64/libc.so.6(__libc_start_main-0x1534f8) [0xfff80887800]
[PS3:29504] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 29504 on node ps3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The OpenFOAM installation on the PS3 has been compiled against lib64/libc.so.6, so this should not be the problem.

Final hints... maybe the problem lies here:

1)
Both machines' Open MPI installations had to be recompiled with the option "--enable-heterogeneous" because of the difference between the architectures. Maybe the library linked into libOpenFOAM.so during the previous compile is outdated and that triggers the SIGSEGV? (See the rebuild sketch after these hints.)

2)
If in the above command I remove the "-parallel" switch, the program runs flawlessly, but only on the remote PC.
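For reference, the heterogeneous rebuild of Open MPI from hint 1 went roughly like this (a sketch; the version number and install prefix here are illustrative, not necessarily the exact ones I used):
Code:
# on each box, from the unpacked Open MPI source tree:
cd openmpi-1.3.3
./configure --prefix=/usr/local/openmpi --enable-heterogeneous
make -j2
sudo make install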

Please help me get out of this trap!

Thank you!

Luca


September 24, 2011, 07:32   #2
Bruno Santos (wyldckat) - Retired Super Moderator
Greetings Luca,

Nice! PS3 for running OpenFOAM... too bad it doesn't have much memory by default...

Sadly I don't have experience (yet) with running OpenFOAM in parallel on a hybrid platform, but here are a few links that may help you. The test application parallelTest from the first link should help you do some testing first! The "Notes" can then help you further.

If there is something you don't understand about the content of some of the links that I've provided here, feel free to ask!

Best regards,
Bruno

September 25, 2011, 23:34   #3
Luca Giannelli (voingiappone) - Member
Howdy Bruno

and thanks for your message. I read the posts you mentioned and I followed all the suggestions there. What I did:

- compiled the parallelTest app on both PCs (kumori and ps3)
- ran it separately ---> it works flawlessly, with the output that you pointed out in those threads.
- ran it through foamJob, including in the folder the original "machines" file I was passing before ---> the output is *EXACTLY* the same error as above.

To be more precise: thanks to this test I found a problem that is probably linked to the different architectures. It was looking for the "orted" executable in the wrong directory, so I made a symlink to the right one and that part went fine. Even though it is using the right file now, it still spits out the very same error.
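In case it helps anyone, something like this shows where each side looks for orted (a sketch; the ssh command runs in a non-interactive shell, which resolves PATH the same way mpirun does when launching the remote daemon):
Code:
which orted
ssh ps3 'which orted'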


I just came to the lab and I'm planning to make my "next move"... I suppose (based purely on my vastly unpredictable imagination) that the problem stems from the lack of the proper libraries during compilation. Why do I think so?

1) Years of Linux compilations against the wrong libs (my fault, of course)
2) The i386 version is the precompiled binary distribution
3) The PS3 version is compiled using the bundled Open MPI

----> To include the heterogeneous arch support I had to recompile Open MPI, and thus the libraries may have changed (maybe rendering the previously compiled binaries useless for parallel runs).

I cannot, however, understand why it works on standalone machines... maybe because there are no messages to exchange between different architectures? Who knows...


I don't really want to recompile everything from scratch, but I think I need to... what a pain! I'll report back if I get it working this way, but meanwhile, if you have any idea (or you believe that what I said is stupid), please drop a line.


Thanks
Luca


P.S. Yes... the PS3 has little memory and is quite slow. It is a real pity that we cannot access the Cell...

September 26, 2011, 01:45   #4
Luca Giannelli (voingiappone) - Member
Ran the compilation again... but without results. The error is still the same. BTW, the libs were up to date, so there's no need to recompile when changing the heterogeneous support in Open MPI.


I'm stuck.

September 26, 2011, 02:50   #5
Luca Giannelli (voingiappone) - Member
Wow... 3rd post in a row! I should think better before writing.


BTW
I actually noticed something *REALLY* weird... I'll describe everything I did, to make it easy to understand:

1) I ran the interFoam app on both machines separately, prefixing the time command to check the result.
2) I ran the mpirun command but forgot to write -parallel for both nodes and....


== MAGIC ==


__IT IS RUNNING!__


==END OF MAGIC==


The odd thing is that it is actually running the interFoam command on both machines, but executing the complete case instead of the decomposed one! Even more interesting: without any further request from the user, it automatically selects all the cores of all the CPUs for the calculations (I watched them with my xfce4 task manager). So I practically ran a case, without breaking up the mesh, on a cluster of 3 cores across 2 PCs... without knowing why or how. Quite exciting/sad.


BTW, it still saves roughly 30% of the time over standalone execution, so I can get some advantage from the parallelization; however, I would like to know why I see this behavior.

September 26, 2011, 03:42   #6
Bruno Santos (wyldckat) - Retired Super Moderator
Hi Luca,

It should only run in parallel if you use the "-parallel" argument, along with mpirun. If you use:
Code:
foamJob -s -p interFoam
The "-p" argument with activate the "-parallel" argument automatically.

Both OpenFOAM installations should be recompiled with the dedicated options. Otherwise, it's unlikely it'll work.

Post #21 in the thread http://www.cfd-online.com/Forums/ope...tml#post292700 shows how to avoid using certain network connections. It might be useful in your case.

Another thing that's worrying me: the fact that the PC is 32-bit and the PS3/PPC is 64-bit; that's just adding to the confusion of architectures. If only the PC were 64-bit...

Additionally, how is the case folder being shared between the machines? Is it placed in the same exact path?

Best regards and good luck!
Bruno

September 26, 2011, 22:37   #7
Luca Giannelli (voingiappone) - Member
Hello Bruno,


thanks for your precious help... I made a "small" step forward thanks to the info in the thread that you posted. I excluded the wireless interface and the segfault error disappeared (exactly as the original poster suggested). The command was:


Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 "executable file" -parallel
Whether I run parallelTest or interFoam, a new situation arises: the command prints the OpenFOAM header (with the nabla etc.), gives some information and then hangs there, virtually forever...
Code:
piota@ps3's password: 
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6-53b7f692aa41
Exec   : parallelTest -parallel
Date   : Sep 27 2011
Time   : 10:49:38
Host   : kumori
PID    : 2366
If I kill it with Ctrl+C, it eventually turns into a ghost process (but only on the remote machine).


Quote:
Originally Posted by wyldckat View Post
It should only run in parallel if you use the "-parallel" argument, along with mpirun. If you use:
Code:
foamJob -s -p interFoam
The "-p" argument with activate the "-parallel" argument automatically.
I know that what I wrote sounds strange, but I still have the feeling that even without -parallel the PCs cooperate in some way. Of course it is not running in parallel, as the decomposed case is not executed, but it seems like MPI is "sharing" the executable on both machines and executing the complete case. For this reason it gets faster than a single machine alone. I am not an expert and I cannot figure out what is going on there...


Quote:
Both OpenFOAM installations should be recompiled with the dedicated options. Otherwise, it's unlikely it'll work.
Do you know the switches for enabling the options you mention? I googled and didn't come up with anything relevant...

Quote:
Another thing that's worrying me: the fact that the PC is 32-bit and the PS3/PPC is 64-bit; that's just adding to the confusion of architectures. If only the PC were 64-bit...
Do you think this is relevant? Don't forget that I had to recompile Open MPI with the --enable-heterogeneous option, which *should* avoid issues from different archs...

Quote:
Additionally, how is the case folder being shared between the machines? Is it placed in the same exact path?
Exactly the same... same /home/piota/OpenFOAM/..., same user with the same password, on the very same OpenFOAM installation (except, of course, that I had to compile the PPC64 one...).


Another thing I was thinking about: kumori is single-CPU, single-core, but the PS3 is dual-core. The case is decomposed in two parts and I am asking MPI to run them separately on the two PCs. Maybe the fact that only one of the PS3's cores is working is an issue? I saw a lot of threads where multicore CPUs gave problems when executed without using all the cores...


I will try to decompose it in 3 parts with scotch... I hope it will succeed!
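For reference, the relevant entries in system/decomposeParDict would then look roughly like this (a minimal sketch of the two entries that change; the rest of the dictionary stays as in the damBreak tutorial):
Code:
numberOfSubdomains 3;

method          scotch;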


Luca

October 1, 2011, 11:23   #8
Bruno Santos (wyldckat) - Retired Super Moderator
Hi Luca,

Sorry for the late reply, but here goes, in a somewhat random order of replies:
1. I've never tested the "--enable-heterogeneous" option, so I don't know what limitations it has. This is why I think that both machines should be 64-bit, but hopefully that won't be necessary.

2.
Quote:
Both OpenFOAM installations should be recompiled with the dedicated options. Otherwise, it's unlikely it'll work.
I think I meant that both needed to be properly compiled, which I guess is what you must have done, since you managed to build the PPC64 version. The idea was simply that the default values wouldn't work, since the defaults are usually set for "x86_64".
The other detail was about environment variables, which I'll write about later on in this post.


3. When parallelTest and other OpenFOAM applications lock up during parallel execution, it's likely related to a problem with the reverse lookup of the master machine. In other words: kumori has an IP, ps3 has another IP; but while kumori might know the IP associated with ps3, ps3 might not know the IP associated with kumori!
Basically, check the file "/etc/hosts" on each machine, where both machines should have something like this:
Code:
10.11.12.1 kumori
10.11.12.2 ps3
Namely, the alias name for each IP.
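A quick way to verify that the lookup works in both directions is to run, on each machine, something like this (a sketch; getent should print back the entries from /etc/hosts above):
Code:
getent hosts kumori ps3
ping -c 1 kumori && ping -c 1 ps3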


4. Since you have a multi-architecture setup, a certain detail is very important: the environment shell variables must be properly set for each remote process. A way to check this is by running something like this:
Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; export > $HOME/log.\$HOSTNAME"
This will generate a log file "log.hostname" in your home folder on each machine. This way you can check how the environment is set up on each machine. Both should show you the respective OpenFOAM environment, namely "linuxGcc" on kumori and "linuxPPC64" on ps3.

Another test, a bit simpler and more precise, is this:
Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; which icoFoam"
It should show you the right places for each architecture build.

Yet another test, for checking if all machines have the necessary case files, run this from within the case folder:
Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; ls -l \$PWD"
You can also try this:
Code:
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 bash -c "echo \$HOSTNAME; ls -l $PWD"
The difference is just that the first one will show you the current folder on each machine, while the second will show you the contents of the path expanded on the local machine. Or in other words:
  1. The first one can show you the contents of different folders.
  2. The second one must show you the contents of the path specified in the mpirun command, i.e. the contents of the path defined in $PWD locally:
    Code:
    echo $PWD


5. The paths for running in parallel must be the same, at least for the simulation case itself. Sorry about not making myself clearer in my previous post. Well, they could be in different places, but that would complicate a bit the tests that are being made right now.
OK, what happens when running OpenFOAM applications in parallel is this (by default, but it can be changed in "decomposeParDict"):
  • The case folder must be located in the same path on all machines;
  • The case folder must at least have the respective processorN folders on each machine.
This means that either one of the following scenarios must be true:
  • You decompose the case and then manually copy the case folder to each machine, on the same path; e.g.: $HOME/OpenFOAM/$USER-1.6.x/run/myparallelcase
  • Or the path where the case is located is shared via NFS, Samba or SSHFS. Example of such a mount via sshfs (source); execute this command on the slave machine prior to running in parallel:
    Code:
    sshfs user@mastermachine:OpenFOAM/$USER-1.6.x/run/ $HOME/OpenFOAM/$USER-1.6.x/run/



6. When problems arise, a divide-and-conquer method is usually the best way to go; this is what I've been describing so far. So, once the details above have been checked and/or solved, I would first test running two parallel processes on each machine independently. This would isolate the problem to either the different architectures or the general setup.
To test this, modify the "machines" file you've been using with mpirun, for the following scenarios (see the hostfile sketches after the list):
  1. Run directly on kumori with two parallel processes.
  2. Run directly on ps3 with two parallel processes.
  3. Launch from kumori the parallel run on ps3.
  4. Launch from ps3 the parallel run on kumori.
Only then should you test the hybrid launch!
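The "machines" file for the first two scenarios might look like this (a sketch; the slot counts are illustrative, and kumori would be oversubscribed, since it only has one core):
Code:
# scenario 1: both processes on kumori
kumori slots=2

# scenario 2: both processes on ps3
ps3 slots=2
For scenarios 3 and 4, keep only the remote machine in the hostfile and launch mpirun from the other one.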

_____________________________

OK, I'm not sure, but I think I've answered most of the problems that were described... Now it's up to you to do the tests!

Best regards and good luck!
Bruno

October 2, 2011, 22:29   #9
Luca Giannelli (voingiappone) - Member
Hello Bruno,

thank you very much for the complete and detailed explanation. That's a lot of testing I have to do, so I will try it and then report back... I hope that one of these suggestions will do the trick!

October 9, 2011, 07:33   #10
Bruno Santos (wyldckat) - Retired Super Moderator
Hi Luca,

Apparently there was a minor detail I forgot/didn't know before: one must be very careful about where mpirun is located!

For example, when you're running on a single machine the path is the same for all processes, so the following would be a quick fix:
Code:
`which mpirun` -np 2 `which foamExec` icoFoam -parallel
This is already done automatically by the foamJob script. Not providing the full path to mpirun can lead to a communications lock-up, since mpirun is not actually launched on the remote machine!

But since you are using machines with different builds, a few possible rules apply:
  • Either you have installed, and are building OpenFOAM with, a system-wide Open MPI installation (WM_MPLIB=SYSTEMOPENMPI, which is available since OpenFOAM 1.7.0, if I'm not mistaken).
    In other words, mpirun would be in the same place on both machines, such as "/usr/bin/mpirun". This would allow the previous command to work just fine.
  • Or you will have to use the "--prefix" option in mpirun, as described here: http://linux.die.net/man/1/mpirun
    Now, I'm not 100% certain about this, but I believe you can add "--prefix" options on each line of the machines file. Example:
    Code:
    kumori slots=1 prefix=/usr/local
    ps3 slots=2 prefix=/usr
Mental note: I've got to try and find some time to compile all of this information into a single page at openfoamwiki.net...

Best regards,
Bruno

October 10, 2011, 23:30   #11
Luca Giannelli (voingiappone) - Member
Hello Bruno,

first of all let me thank you for the efforts you are making in helping me out with this issue.
I was a bit late in answering, as I was trying to tame some Python scripts for ParaView that didn't want to do what I wanted... I won.




I have executed all the tests that you suggested and the results are:


3) Hosts file
I have modified it in every way I could... I found that on one machine (kumori) the name of the other host was lowercase... I changed it to be UPPERCASE on both machines; you never know...
It didn't help though.


4) Environment variables
I ran the command you specified and it indeed generated the log files where the variables are listed. I don't see anything wrong there, but something may be hiding (or even be missing), making the identification of the problem quite tough.
Of course all the needed files are there, and they show up with the command that you pointed out.


Cross-running on the hosts
If I run the parallel process (I am using the damBreak case) from one machine on the other, it finishes without problems:


kumori ----> ps3 (executes all the threads on the remote PS3 and succeeds)
PS3 ----> kumori (executes all the threads on the remote kumori and succeeds)
Of course "on machine" parallelization works flawlessly...


Still, multi-arch runs are not working. I suppose the problem is related to the OpenFOAM executable (interFoam in this case). I say this because the program hangs right after being launched:


Code:
piota@kumori:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ `which mpirun` --mca btl_tcp_if_exclude eth1 -hostfile machines -np 3 `which foamExec` interFoam -parallel
piota@ps3's password: 
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  1.6                                   |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : 1.6-53b7f692aa41
Exec   : interFoam -parallel
Date   : Oct 11 2011
Time   : 12:19:29
Host   : kumori
PID    : 26361
Moreover, interFoam is running on both cores of the PS3, as you can see in the task manager (see attachment). It just doesn't execute anything and hangs there.


Probably the problem comes from the mixed architectures, which *maybe* are not supported by OpenFOAM.
Attached Images
File Type: jpg Screenshot.jpg (57.1 KB, 31 views)

October 11, 2011, 16:34   #12
Bruno Santos (wyldckat) - Retired Super Moderator
Hi Luca,

Quote:
Originally Posted by voingiappone View Post
first of all let me thank you for the efforts you are making in helping me out with this issue.
You're welcome! And I'm also interested in knowing whether this works or not!

Quote:
Originally Posted by voingiappone View Post
I was a bit late in answering, as I was trying to tame some Python scripts for ParaView that didn't want to do what I wanted... I won.



Quote:
Originally Posted by voingiappone View Post
kumori ----> ps3 (executes all the threads on the remote PS3 and succeeds)
PS3 ----> kumori (executes all the threads on the remote kumori and succeeds)
Of course "on machine" parallelization works flawlessly...
OK, this is good to know and can be considered good news!


Quote:
Originally Posted by voingiappone View Post
Still, multi-arch runs are not working. I suppose the problem is related to the OpenFOAM executable (interFoam in this case). I say this because the program hangs right after being launched:
OK, this problem is very similar to one I get even when running between two nearly identical machines, specifically when I forget to tell it where mpirun is. So the question is: is Open MPI installed in the exact same path on both machines? If not, check my previous post.

Another thing, what about parallelTest? What happens when you use parallelTest instead of interFoam?

Another test that should be made is with a small and simple program built to work with Open MPI only, i.e. without any links to OpenFOAM. The problem is that I don't know of any good ready-made test app for Open MPI. I know they have some examples in Open MPI's source code, but I haven't tested any of them. Something along the lines of the sketch below should do.
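For example, a minimal self-contained ring test could be written and launched like this; this is only a sketch I have not tried on a hybrid setup (the token pass at least forces a byte-order conversion of an int between the two architectures):
Code:
cat > mpiring.c << 'EOF'
/* minimal Open MPI test, no OpenFOAM involved: every rank reports its
   host, then an integer token travels once around the ring of ranks */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, token;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d running on %s\n", rank, size, host);

    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("token made it back to rank 0: %d\n", token);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
EOF
# compile separately on each architecture, at the same path on both machines:
mpicc mpiring.c -o mpiring
mpirun --mca btl_tcp_if_exclude eth1 -hostfile machines -np 2 ./mpiring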

Best regards,
Bruno

October 31, 2011, 03:17   #13
Luca Giannelli (voingiappone) - Member
OMG.... I forgot to reply after executing the tests!

The result is exactly the same as above, with the program stuck where it assigns the PID. I can, however, tell you that you cannot add the "prefix" entry in the hosts file: it complains about an unknown option, so you have to specify it manually on the command line.


Getting back on the PS3, I decided to launch "which mpirun" directly on the PS3 itself (not remotely) and, unsurprisingly, I found out that it is in a different location than on the "kumori" node. I was naive to think that simply making a symlink to a directory with the same name on both machines could solve the problem, which probably is indeed this one. I understand, however, that the PATH to Open MPI is automatically set by the bashrc script from OF, and I don't know how to change it. That can be the last try before giving up on the whole thing...
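For the record, the command-line form I mean looks roughly like this (a sketch; the prefix path is illustrative):
Code:
mpirun --prefix /usr/local/openmpi --mca btl_tcp_if_exclude eth1 -hostfile machines -np 3 `which foamExec` interFoam -parallel
Note that mpirun applies that single --prefix to the remote nodes, so with Open MPI in different places on the two machines (as I just found) one value probably cannot fit both.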


Any suggestions?

October 31, 2011, 07:11   #14
Bruno Santos (wyldckat) - Retired Super Moderator
wyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to all
Hi Luca,

If you examine the file "OpenFOAM-1.6/etc/settings.sh" and search for "OPENMPI", you'll find out how the path for it is set.

My advice is to copy the build you have on each node into the same system-wide local folder, such as "/usr/local/OpenMPI" (see the sketch below).
In other words, on each node, move/copy the build of Open MPI you've got somewhere in "ThirdParty-1.6/platforms" on that node onto "/usr/local/OpenMPI".
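A minimal sketch of what I mean, to be run on each node (the source path below is illustrative; check where the Open MPI build actually landed under ThirdParty-1.6 on that node, and remember that "settings.sh" must then be pointed at the new location):
Code:
sudo mkdir -p /usr/local/OpenMPI
sudo cp -a $HOME/OpenFOAM/ThirdParty-1.6/openmpi-1.3.3/platforms/linuxGcc/* /usr/local/OpenMPI/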

Let me know if you manage to follow the instructions I've written above.


I'm also getting very curious about how to make this work in general... I'm going to have to set up two virtual machines - one 32-bit and the other 64-bit - build the hybrid Open MPI on each, and see for myself how to make things happen.
I'll write about it after I make the tests...

Best regards,
Bruno

November 1, 2011, 08:16   #15
Bruno Santos (wyldckat) - Retired Super Moderator
wyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to all
Hi Luca,

Well... I've done some tests with the latest OpenFOAM 2.0.x and reached the conclusion that, although I have two builds in double precision, the simple fact remains that having one in 32-bit and the other in 64-bit leads to a messy issue.
I haven't looked deep enough to figure out the problem, but my guess is that the integers are to blame here: they don't have the same number of bytes and therefore won't jump well from one side to the other.

Anyway, since you are using two 64-bit builds, it might work as intended (edit: I didn't remember when I wrote this part). I've unpacked the old 1.6 packages to check how the folders were organized back then, and the following trick should work to make the different Open MPI builds look like they are on the same path.

Go to the "ThirdParty-1.6" folder on each machine and run something like this:
Code:
for a in */platforms; do ln -s $WM_OPTIONS $a/linuxOtherDPOpt ; done

for a in */platforms; do ls -l $a; done
The first command will create symbolic links so that the local build of Open MPI looks like the one on the other machine; the name linuxOtherDPOpt is a placeholder that is to be changed to the desired name.
The second command will list the folders where the links were created, so you can confirm that it's all OK.

So, just in case my explanation wasn't very clear, here's what I think you should run:
  • At kumori:
    Code:
    for a in */platforms; do ln -s $WM_OPTIONS $a/linuxPPC64DPOpt ; done

    for a in */platforms; do ls -l $a; done
  • At ps3:
    Code:
    for a in */platforms; do ln -s $WM_OPTIONS $a/linuxGccDPOpt ; done

    for a in */platforms; do ls -l $a; done
Er, wait... now I remember, after briefly browsing this thread again: the PC is in 32-bit mode and the PS3 in 64-bit mode. Although I tested with OpenFOAM 2.0.x and you are using OpenFOAM 1.6, I'm led to believe that this might actually not work as intended. Not unless one of your future steps is to change the code on the 32-bit side to ensure variable-size coherence.

Any chance you can build a "PPC32" version to run on the PS3? Or get a 64-bit PC with a 64-bit Linux?

Best regards,
Bruno

November 2, 2011, 02:17   #16
Luca Giannelli (voingiappone) - Member
Thank you Bruno,

The linking part is one thing I had already done, back when orted was complaining about that, and it did not work out as it should.


BTW, I have actually decided to do as you suggest, moving from the 32-bit to a 64-bit Linux PC... It makes damn good sense that the different number lengths can screw everything up, even with the multi-arch feature enabled. I wanted to just make a test with a simple case (the damBreak) and opted for the most inconvenient setup ever seen... Blame me...


Now I am trying to compile OF 2.0.x on the new 64-bit PC and it is refusing to compile due to an unknown error... worse: it compiles 90% of the binaries, just not the ones I want... but that is another story.


So, what should we do now? Close the thread stating that it is not possible to mix 32/64-bit and Intel/PPC architectures?

November 2, 2011, 06:49   #17
Bruno Santos (wyldckat) - Retired Super Moderator
wyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to allwyldckat is a name known to all
Hi Luca,

I think that since you're probably still going to try to work with the PS3, we can keep using this thread.

As for the building problems with 2.0.x, tell me a few things:
  • What's the commit of OpenFOAM you are using?
    Code:
    git log -n1
  • Which Linux distribution and version are you using?
  • If you can't figure out what's missing or wrong, run Allwmake like this:
    Code:
    ./Allwmake > make.log 2>&1
    Then edit out anything you might not want on the web and then pack the file:
    Code:
    tar -czf make.log.tar.gz make.log
    Finally, attach it to your next post.
Best regards,
Bruno


Tags
openfoam, openmpi, parallel, segmentation fault
