CFD Online Discussion Forums


UDS_rambler November 21, 2011 10:51

999999 (../../src/mpsystem.c@1123):mpt_read: failed:errno = 11
 
Hi everybody!

I'm facing a serious problem simulating a complex multiphase species transport model in an axisymmetric domain. To model it I'm using 3 different UDFs: 2 applied as boundary conditions (consumption terms) and 1 executed at the end of each time step, which computes the variables used by the other two UDFs.
The UDFs compile without errors, and when I run the simulation in serial they work correctly. The error arises when I run the simulation in parallel. In particular, at the end of the first time step the following error appears:
==============================================================================
Stack backtrace generated for node id 4 on signal 11 :

==============================================================================
Stack backtrace generated for node id 5 on signal 11 :
MPI Application rank 4 killed before MPI_Finalize() with signal 11
node 999999 retrying on zero socket read.....
[....]
node 999999 retrying on zero socket read.....

999999 (../../src/mpsystem.c@1123): mpt_read: failed: errno = 11

999999: mpt_read: error: read failed trying to read 4 bytes: Resource temporarily unavailable
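
For reference, the two numbers in this log can be decoded with the standard C library. The sketch below assumes a Linux system, where errno 11 is EAGAIN (the source of the "Resource temporarily unavailable" text) and signal 11 is SIGSEGV. The helper names are made up for illustration and are not part of Fluent.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Returns 1 if the errno value printed in the log is EAGAIN on this platform. */
int errno_is_eagain(int log_errno)
{
    return log_errno == EAGAIN;
}

/* Returns 1 if the signal number printed in the log is SIGSEGV on this platform. */
int signal_is_segv(int log_signal)
{
    return log_signal == SIGSEGV;
}

/* Message text the C library associates with an errno value. */
const char *errno_text(int log_errno)
{
    return strerror(log_errno);
}
```

Calling these helpers with 11 on a Linux machine should confirm both mappings. If so, the mpt_read failure may be a secondary symptom: the read gives up only after ranks 4 and 5 have already been killed by a segmentation fault.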

I'm running Fluent on a 64-bit Linux cluster with 8 processors (lnamd64 architecture). When I run the same simulation on a 32-bit Linux cluster with 4 processors, the error doesn't occur.

The "mpt_read: error: read failed trying to read 4 bytes" message makes me suspect a mismatch between 32-bit and 64-bit libraries (since 4 bytes are 32 bits).
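
One way to check this reasoning is to look at the type sizes of a build directly. The sketch below is only an illustration with hypothetical helper names. It assumes the usual Linux data models (ILP32 for 32-bit builds, LP64 for 64-bit builds), in which a plain int is 4 bytes in both cases and only long and pointers change size.

```c
#include <limits.h>
#include <stddef.h>

/* Size of a plain int in bytes for the current build. */
size_t int_bytes(void)
{
    return sizeof(int);
}

/* Size of a long in bytes; this differs between ILP32 and LP64. */
size_t long_bytes(void)
{
    return sizeof(long);
}

/* Pointer width in bits: 32 or 64 on typical Linux targets. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

If int_bytes() returns 4 on both clusters, a 4-byte read does not by itself point to a 32-bit library. pointer_bits() is the value that tells the two builds apart.
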
Could you help me?
The simulation is really heavy, and I need to run it on as many processors as I can.

I thank you in advance

UDS_rambler

ronaldalau November 22, 2011 10:23

We've seen this error.

Our cluster is a 64-bit Windows HPC system on a GigE network. We've been told by cluster experts that MPI performance depends on network latency, not bandwidth. A GigE network will have a latency of ~5 ms, while an InfiniBand network has a latency of microseconds.
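
A rough way to see why latency dominates for small MPI messages is the simple cost model time = latency + bytes / bandwidth. The sketch below is a back-of-the-envelope illustration under that assumed model, not a measurement. The function names are hypothetical, and 125e6 bytes/s is just 1 Gbit/s expressed in bytes.

```c
/* Simple transfer-time model: time = latency + bytes / bandwidth. */
double message_time_s(double latency_s, double bandwidth_bytes_per_s, double bytes)
{
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Fraction of the total transfer time that is spent on latency alone. */
double latency_share(double latency_s, double bandwidth_bytes_per_s, double bytes)
{
    return latency_s / message_time_s(latency_s, bandwidth_bytes_per_s, bytes);
}
```

With the ~5 ms figure quoted above and a 4-byte message, latency_share(5e-3, 125e6, 4.0) is very close to 1: the bandwidth term is only 32 ns, so nearly all of the cost is latency. Lowering the latency to a few microseconds shortens each small message by orders of magnitude even if the bandwidth stays the same.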

We've also been told to use 'Message Passing' for the DPM parallel scheme.

And if you haven't done so, compile your UDFs for 64-bit when running on the 64-bit cluster.

Hope this helps

R.

UDS_rambler November 22, 2011 10:46

Thank you so much Ronald.

I was going mental because of this problem. Now I know I should ask the staff who maintain the network for the information you mentioned above. I'll report back when I learn more about it.

Thank you again

G.

