CFD Online Discussion Forums


Ben Aga September 23, 2005 14:51

parallel fluent runs being killed at partitioning
 
We have suddenly started seeing parallel Fluent runs on our cluster die very early, generally during or right after partitioning.

We are running Red Hat e3 on a 64-bit Opteron cluster. We use a beefy head node to host our runs and farm out the gmpi processes to compute nodes. We use PBSPro as our scheduler. I've got a ticket in to Fluent, and they are concerned about the OS causing this issue but haven't been too specific as to why. PBS's vendor thinks the kernel on the head nodes may be running out of memory and killing these jobs to preserve itself.

We've been running Fluent jobs this way for several months with no problems. This issue cropped up Tuesday and kills jobs intermittently. There seems to be no rhyme or reason to what can run and what can't. It seems like once a job starts iterating, it's OK (unless it starts to partition again). Below is the output we get on stdout when these processes are killed. It looks pretty much identical to what happens when someone kills one or more of the MPI processes from the command line while a job is running. Has anyone here run into this issue themselves, or does anybody have any possible culprits?

Thanks, r/ben

--------------------------------------------------

Parallel variables... Building...

grid,

auto partitioning mesh by Principal Axes,

distributing mesh

parts..,

faces..,

nodes..,

cells..,

materials,

interface,

domains,

mixture

liquid-phase

vapor-phase

interaction

zones,

fluid (liquid-phase)

outlet (liquid-phase)

inlet (liquid-phase)

internal.5 (liquid-phase)

symm2 (liquid-phase)

symm1 (liquid-phase)

wall (liquid-phase)

default-interior (liquid-phase)

fluid (vapor-phase)

outlet (vapor-phase)

inlet (vapor-phase)

internal.5 (vapor-phase)

symm2 (vapor-phase)

symm1 (vapor-phase)

wall (vapor-phase)

default-interior (vapor-phase)

default-interior

wall

symm1

symm2

internal.5

inlet

outlet

fluid

parallel,

shell conduction zones, Done.

>
:

> iter continuity x-velocity y-velocity z-velocity k epsilon vf-vapor-p

node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
...
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....

999999 (mpsystem.c@1228): mpt_read: failed: errno = 11

999999: mpt_read: error: read failed trying to read 8 bytes: Resource temporarily unavailable
/apps/Fluent/Fluent.Inc/bin/fluent: line 3875: 6678 Killed $NO_RUN $EXE_CMD $MPI_ENABLED_OPTIONS
[bt] Execution path:
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Process_Stackframe+0x17) [0x9f6e97]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_error+0x109) [0x9e50e9]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_read+0xc6) [0x9e88b6]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_raw+0x28) [0x9ea408]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_all+0x28) [0x9ec948]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(MPT_crecv_double+0x112) [0x9d6ee2]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x5e81d8]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Models_Send_update_solve+0xbe) [0x56769e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Flow_Iterate+0x19e) [0x4e143e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x546788]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x773) [0xa27403]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x860) [0xa274f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x460) [0xa270f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x49a) [0xa2712a]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0xa2873c]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval_errprotect+0x32) [0xa280d2]
The fluent process could not be started.

time/iter
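
One way to test the PBS vendor's theory (the kernel on the head node running out of memory and killing the job) is to look for out-of-memory reports in the head node's kernel log around the time a run died. The following is only a rough sketch: the function name `find_oom_lines` is made up here, the search patterns are an assumption because the exact wording of the kernel message differs between kernel versions, and the sample lines are invented for illustration rather than taken from the cluster in question.

```python
import re

# Assumed wording: kernel OOM reports usually mention "out of memory"
# or "oom-killer", but the exact text depends on the kernel version.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def find_oom_lines(log_lines):
    """Return the log lines that look like kernel out-of-memory reports."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


# Invented sample lines, for illustration only (not real output).
sample = [
    "Sep 20 10:01:02 head kernel: Out of Memory: Killed process 1234 (example).\n",
    "Sep 20 10:01:03 head sshd[4321]: session opened for user someone\n",
]

for hit in find_oom_lines(sample):
    print(hit)
```

To scan a real log, pass an open file object in place of `sample`, using whatever file the head node's syslog writes kernel messages to. A hit whose timestamp matches the "Killed" line in the Fluent output would support the out-of-memory explanation; no hits would not rule it out, since the log may have rotated.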

Vinod Dhiman September 23, 2005 17:11

Re: parallel fluent runs being killed at partitioning
 
Hi

I am getting the same kind of error; it reports mpt_connect_error. It is related to hardware, that is for sure. I use METIS and socket connections on P-IV nodes. Check your hardware connections: it may start working again once you make sure that all the connections are physically sound.

Vinod Dhiman

Razvan September 23, 2005 20:37

Re: parallel fluent runs being killed at partitioning
 
In my limited experience running parallel jobs on Opterons, I came across this error only when re-reading several cases one after another without restarting the parallel session, or when re-partitioning the same case several times.

Anyway, my advice is this: since you have Opterons and a 64-bit OS, I can't accept the idea that you do not have at least one workstation capable of reading your cases by itself in serial mode (4GB of memory for one process must be enough to at least read the mesh). So, read the damn mesh in the serial solver, partition it using the best method you can find, which is NOT always Metis!!!, and then write a case. Re-read it, this time in parallel mode, and make the rest of the settings you need.

I found this to be the best method for running parallel cases, because parallel solvers do not partition the same way serial solvers do! I observed big differences between the results obtained with the two; usually parallel partitioning using the SAME algorithm gives a higher number of interface faces.

Best wishes, Razvan
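
The "number of interface faces" mentioned above is the count of cell faces whose two neighbouring cells end up in different partitions, and it depends on how the mesh is split. A toy illustration, assuming a plain 8 x 8 structured grid of cells and two hand-written partition functions (`strips` and `blocks` are hypothetical names, and this is not how Fluent or Metis partitions a mesh):

```python
def count_interface_faces(part, nx, ny):
    """Count interior faces of an nx-by-ny cell grid whose two
    neighbouring cells belong to different partitions."""
    count = 0
    for i in range(nx):
        for j in range(ny):
            # Face shared with the neighbour in the +i direction.
            if i + 1 < nx and part(i, j) != part(i + 1, j):
                count += 1
            # Face shared with the neighbour in the +j direction.
            if j + 1 < ny and part(i, j) != part(i, j + 1):
                count += 1
    return count


def strips(i, j):
    """Four strips, each 2 cells wide in the i direction."""
    return i // 2


def blocks(i, j):
    """Four square blocks of 4 x 4 cells."""
    return (i // 4) * 2 + (j // 4)


nx = ny = 8
print("strips:", count_interface_faces(strips, nx, ny))  # strips: 24
print("blocks:", count_interface_faces(blocks, nx, ny))  # blocks: 16
```

Both layouts give four partitions of 16 cells each, yet the strip layout has 24 interface faces and the block layout only 16. Fewer interface faces generally means less data exchanged between compute nodes per iteration, which is why comparing partitions before writing the case can be worthwhile.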

Solarberiden June 8, 2012 10:40

Suspected solution for this problem
 
Increase your pagefile size as follows:

My Computer > Properties > Advanced > Performance > Configuration > Advanced > Modify

Set the pagefile size to 1.0 to 2.0 times the size of physical memory.

