999999 (../../src/mpsystem.c@1123):mpt_read: failed:errno = 11

UDS_rambler · November 21, 2011, 09:51

Hi everybody!

I'm facing a serious problem trying to simulate a complex multiphase species transport model in an axisymmetric domain. To model it I'm using 3 different UDFs: 2 applied as boundary conditions (consumption terms) and 1 executed at the end of each time step, which computes variables used by the other two UDFs.
These UDFs compile without errors, and when I run the simulation in serial they work fine. The error arises when I run the simulation in parallel. In particular, at the end of the first time step this error pops up:
================================================== ============================
Stack backtrace generated for node id 4 on signal 11 :

================================================== ============================
Stack backtrace generated for node id 5 on signal 11 :
MPI Application rank 4 killed before MPI_Finalize() with signal 11
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
[....]
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....

999999 (../../src/mpsystem.c@1123): mpt_read: failed: errno = 11

999999: mpt_read: error: read failed trying to read 4 bytes: Resource temporarily unavailable

I'm running Fluent on a 64-bit Linux cluster on 8 processors (lnamd64 architecture). When I run the same simulation on a 32-bit Linux cluster on 4 processors, the error doesn't occur.

The "mpt_read: error: read failed trying to read 4 bytes" message makes me think of a 32- vs 64-bit library problem (since 4 bytes are 32 bits)...
Could you help me?
The simulation is really heavy and I need to run it on as many processors as I can...

Thank you in advance

UDS_rambler
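
A note on the log above: on Linux, errno 11 is EAGAIN, and its message is "Resource temporarily unavailable". It means a non-blocking read found no data to read; it does not by itself say anything about 32- vs 64-bit libraries. Signal 11 is a segmentation fault, so the read failure is likely just the host waiting on compute nodes that had already crashed. A minimal sketch in plain C (not Fluent code; the function name is made up for illustration) that produces the same errno by reading 4 bytes from an empty non-blocking pipe:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to read 4 bytes from an empty, non-blocking pipe
   and return the errno that read() sets (EAGAIN, which is 11 on Linux).
   Returns -1 if the pipe could not be set up, 0 if read() did not fail. */
int errno_from_empty_nonblocking_read(void)
{
    int fds[2];
    char buf[4];
    int saved = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Switch the read end to non-blocking mode. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Nothing was written to the pipe, so there is no data to read. */
    if (read(fds[0], buf, sizeof buf) == -1)
        saved = errno;

    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

Calling it from your own main() and printing strerror() of the result should show the same "Resource temporarily unavailable" text as in the Fluent log.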

ronaldalau · November 22, 2011, 09:23

We've seen this error.

Our cluster is a 64-bit Windows HPC system on a GigE network. We've been told by cluster experts that the MPI system depends on network latency, not bandwidth. A GigE network will have a latency of ~5 ms. An InfiniBand network has a latency of microseconds.

We've also been told to use 'Message Passing' for the DPM parallel scheme.

And if you haven't done so, compile your UDFs for 64-bit when running on the 64-bit cluster.

Hope this helps

R.
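
To illustrate the latency point: a common first-order model puts the time for one message at latency + size / bandwidth. A rough sketch in plain C, assuming that simple model; the function names are made up, and whatever numbers you plug in are your own assumptions about the network:

```c
/* First-order cost model for sending one message:
   time = latency + size / bandwidth. */
double message_time_s(double latency_s, double bytes,
                      double bandwidth_bytes_per_s)
{
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Fraction of the message time spent on latency (between 0 and 1). */
double latency_share(double latency_s, double bytes,
                     double bandwidth_bytes_per_s)
{
    return latency_s /
           message_time_s(latency_s, bytes, bandwidth_bytes_per_s);
}
```

For a 4-byte message, latency_share(5e-3, 4.0, 125e6) (the ~5 ms figure above and a nominal 125 MB/s for GigE) comes out above 0.99, so under this model a faster link with the same latency would barely change the time per small message.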

UDS_rambler · November 22, 2011, 09:46

Thank you so much Ronald.

I was going mental because of this problem. Now I know I should ask the people who maintain the network for the information you mentioned above. I'll give you feedback when I learn more about it.

Thank you again

G.
