CFD Online Discussion Forums

Wang April 13, 2004 04:35

MPI and parallel computation
Hi All,

When I run my code on an Origin 3700/2000 machine, everything is fine as long as the grid is below 200x200x200 cells. However, when the grid is 256x256x256, the code crashes. Analysis with TotalView shows that the crash occurs in MPI_Wait when using non-blocking message passing, and in MPI_Recv when using blocking message passing. The error output is as follows:

MPI: Program ./lbm, Rank 15, Process 11849507 received signal SIGSEGV(11)
MPI: --------stack traceback------- 11849507(5):
FATAL: Protocol version of Server /merged/2.9.1 does not match version of Client /merged/2.1
Your command referenced dbx but env var TOOLROOT is set to /sanopt/dbx/7.3.3
Perhaps try $TOOLROOT/usr/bin/dbx
MPI: dbx version 7.3.2 73509_May21 MR May 21 2001 17:15:31
MPI: -----stack traceback ends-----
MPI: Program ./lbm, Rank 15, Process 11849507: Dumping core on signal SIGSEGV(11) into directory /sanhp/scrijw/lbm/lbm16mar
MPI: MPI_COMM_WORLD rank 15 has terminated without calling MPI_Finalize()
MPI: aborting job
MPI: Received signal 11

I cannot make sense of the output file for this case, which I have attached. Could you explain it to me if you have had a similar experience?

Thank you very much in advance.

Tom April 13, 2004 04:44

Re: MPI and parallel computation

A very simple question: are you trying to use more memory than is available? The error messages can be quite cryptic in that case.
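A rough way to check this before running is to estimate the array footprint from the grid size. The sketch below is only an illustration: the thread does not say how the lbm code stores its data, so 19 values per cell (a D3Q19 lattice), 8-byte reals, and two copies of the field are all assumptions, and `lattice_bytes` is a hypothetical helper. The only number taken from the thread is that rank 15 exists, so the job uses at least 16 ranks.

```python
GIB = 1024 ** 3


def lattice_bytes(nx, ny, nz, values_per_cell=19, bytes_per_value=8, copies=2):
    """Estimate the bytes needed for the main field arrays.

    values_per_cell, bytes_per_value and copies are assumptions
    (D3Q19, double precision, two copies); adjust them to the real code.
    """
    return nx * ny * nz * values_per_cell * bytes_per_value * copies


ranks = 16  # assumption: the log shows rank 15, so there are at least 16 ranks

for n in (200, 256):
    total = lattice_bytes(n, n, n)
    per_rank = total / ranks
    print(f"{n}^3 grid: {total / GIB:.2f} GiB total, "
          f"{per_rank / GIB:.2f} GiB per rank (even split, no halos)")
```

Whatever the storage layout, going from 200x200x200 to 256x256x256 multiplies the number of cells by about 2.1, so a run that fits comfortably at the smaller size can still exceed the available memory at the larger one.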


Wang April 13, 2004 05:08

Re: MPI and parallel computation
Hi Tom,

Thank you very much for your reply. I am confused about that. If it were a memory problem, the code should crash when the matrices are initialised. However, the crash occurs after all of the matrices have been initialised. In addition, I cannot understand what the error output means. Could you help explain it?

Tom April 13, 2004 06:55

Re: MPI and parallel computation

The error does not have to be related to MPI. The first error message says there was a segmentation violation (a bad memory access), which may be caused by a bug in the program. Try enabling array bounds checking, or use the stop command to find out where the error occurs. You can also use the size command to see whether the executable is too large for the machine.


Wang April 14, 2004 06:34

Re: MPI and parallel computation
Hi Tom,

Thank you very much for your help. I tried debugging with TotalView. Before MPI_Send or MPI_Recv, the array variables are fine, but the code cannot get past MPI_Send or MPI_Recv. That is to say, the code stops during message passing. Do you have any idea how to check the variables in a case like this? Furthermore, how can I see whether the executable is too large for the machine?

Guest April 14, 2004 10:35

Re: MPI and parallel computation
Perhaps it would be worth checking whether the loop structures are trying to send more information than is available. Look at the arguments that go into MPI_SEND and MPI_RECV and see whether they are what you expect. It possibly stops when information is being sent or received that does not 'exist'.
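The kind of check described here can be sketched without MPI at all: before each send or receive, confirm that the offset and count stay inside the array, and that the receiver provides room for at least as many elements as the sender transmits. The helpers `message_fits` and `counts_compatible` and the block sizes below are hypothetical, chosen only to illustrate the idea, and are not taken from the original code.

```python
def message_fits(buffer_len, offset, count):
    """True if elements [offset, offset + count) all lie inside the buffer."""
    return offset >= 0 and count >= 0 and offset + count <= buffer_len


def counts_compatible(send_count, recv_count):
    """A receive must offer room for at least as many elements as are sent."""
    return recv_count >= send_count


# Hypothetical local block: assumed sizes, one xy-plane exchanged at a time.
nx, ny, nz_local = 256, 256, 16
buffer_len = nx * ny * nz_local
plane = nx * ny

# Offset of the last valid plane, and of a plane one step past the end.
last_plane_offset = (nz_local - 1) * plane
past_end_offset = nz_local * plane

print("last plane fits:", message_fits(buffer_len, last_plane_offset, plane))
print("plane past end fits:", message_fits(buffer_len, past_end_offset, plane))
print("counts compatible:", counts_compatible(send_count=plane, recv_count=plane))
```

In the real program the same comparison can be made on the actual arguments just before each MPI_SEND and MPI_RECV call, printing the rank, offset, and count whenever the check fails.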

tippo April 14, 2004 11:17

Re: MPI and parallel computation
I would suggest the problem is either a memory issue or, as 'Guest' says, a problem with the size of the array being sent/received. The question is how to distinguish between these two problems.

I would say you should make one of your 225x225x225 dimensions smaller, i.e. 225x225x20. This makes the problem much smaller and eliminates memory as an issue.

If you are sending/receiving information separately, one dimension at a time, make the problem 225x225x20. You can then see whether the send and receive work in each direction.

If you are sending information as one 225x225x225 block, you can try sending many 225x225x20 blocks at once, e.g. 12 of them, to check whether the buffer size is above your 225x225x225 limit.
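The arithmetic behind the "12 of them" suggestion can be checked with a few lines. This is only a sketch of the cell counts; `cells` is a hypothetical helper, and the sizes are the ones quoted in the post above.

```python
def cells(nx, ny, nz):
    """Number of cells in an nx x ny x nz block."""
    return nx * ny * nz


full_block = cells(225, 225, 225)
slab = cells(225, 225, 20)
n_slabs = 12

# Smallest number of slabs whose combined size reaches the full block
# (integer ceiling division).
min_slabs = -(-full_block // slab)

print("full block cells:", full_block)
print("12 slabs cells:  ", n_slabs * slab)
print("12 slabs exceed full block:", n_slabs * slab > full_block)
print("minimum slabs needed:", min_slabs)
```

Twelve slabs of 225x225x20 together hold slightly more cells than one 225x225x225 block, so if sending them works, the message size itself is unlikely to be the limit.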

Wang April 15, 2004 11:25

Re: MPI and parallel computation
Hi All,

Thank you all very much for your help. The problem is now solved. It was a machine memory problem: when the array size is smaller, the code has no problems.

After I removed some of the arrays and optimised the memory usage, the code runs.

All times are GMT -4.