CFD Online (www.cfd-online.com)
Home > Forums > FLUENT

parallel fluent runs being killed at partitioning


September 23, 2005, 14:51   #1
parallel fluent runs being killed at partitioning
Ben Aga (Guest, Posts: n/a)
We have suddenly started seeing parallel Fluent runs on our cluster die very early in their runs, generally during or right after partitioning.

We are running Red Hat Enterprise Linux 3 on a 64-bit Opteron cluster. We use a beefy head node to host our runs and farm the GMPI processes out to compute nodes, with PBS Pro as our scheduler. I've got a ticket in with Fluent; they are concerned the OS is causing this issue but haven't been too specific as to why. PBS's vendor thinks the kernel on the head node may be running out of memory and killing these jobs to preserve itself.

We've been running Fluent jobs this way for several months with no problems. The issue cropped up Tuesday and kills jobs intermittently; there seems to be no rhyme or reason to what can run and what can't. It seems like once a job starts iterating it's OK (unless it starts to partition again).

Below is the output we get on stdout when these processes are killed. It looks pretty much identical to what happens when someone kills one or more of the MPI processes from the command line while a job is running. Has anyone here run into this issue themselves, or does anybody have any possible culprits?
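If the PBS vendor's out-of-memory theory is right, the kernel's OOM killer should have left a trace in the head node's log. A quick way to check, as a sketch (the log path is the usual Red Hat location, and the message wording varies by kernel, so treat both as assumptions):

```shell
#!/bin/sh
# Count OOM-killer messages in the kernel log (path varies by distro;
# /var/log/messages is typical on Red Hat). Errors are suppressed if
# the log is missing or unreadable.
matches=$(grep -icE 'out of memory|oom-killer|killed process' /var/log/messages 2>/dev/null || true)
echo "OOM-killer hits in kernel log: ${matches:-0}"

# Also worth watching free memory on the head node while a job partitions:
free -m 2>/dev/null | head -n 2 || true
```

Running this while (or right after) a job dies would show whether the kernel is the one doing the killing.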

Thanks, r/ben

--------------------------------------------------

Parallel variables...
Building...
     grid,
     auto partitioning mesh by Principal Axes,
     distributing mesh
          parts..,
          faces..,
          nodes..,
          cells..,
     materials,
     interface,
     domains,
          mixture
          liquid-phase
          vapor-phase
          interaction
     zones,
          fluid (liquid-phase)
          outlet (liquid-phase)
          inlet (liquid-phase)
          internal.5 (liquid-phase)
          symm2 (liquid-phase)
          symm1 (liquid-phase)
          wall (liquid-phase)
          default-interior (liquid-phase)
          fluid (vapor-phase)
          outlet (vapor-phase)
          inlet (vapor-phase)
          internal.5 (vapor-phase)
          symm2 (vapor-phase)
          symm1 (vapor-phase)
          wall (vapor-phase)
          default-interior (vapor-phase)
          default-interior
          wall
          symm1
          symm2
          internal.5
          inlet
          outlet
          fluid
     parallel,
     shell conduction zones,
Done.

>
> iter  continuity  x-velocity  y-velocity  z-velocity  k  epsilon  vf-vapor-p  time/iter

node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
...
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....

999999 (mpsystem.c@1228): mpt_read: failed: errno = 11
999999: mpt_read: error: read failed trying to read 8 bytes: Resource temporarily unavailable
/apps/Fluent/Fluent.Inc/bin/fluent: line 3875:  6678 Killed  $NO_RUN $EXE_CMD $MPI_ENABLED_OPTIONS
[bt] Execution path:
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Process_Stackframe+0x17) [0x9f6e97]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_error+0x109) [0x9e50e9]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_read+0xc6) [0x9e88b6]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_raw+0x28) [0x9ea408]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(mpt_tcpip_crecv_all+0x28) [0x9ec948]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(MPT_crecv_double+0x112) [0x9d6ee2]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x5e81d8]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Models_Send_update_solve+0xbe) [0x56769e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(Flow_Iterate+0x19e) [0x4e143e]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0x546788]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x773) [0xa27403]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x860) [0xa274f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x460) [0xa270f0]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval+0x49a) [0xa2712a]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16 [0xa2873c]
[bt] /apps/Fluent/Fluent.Inc/fluent6.2.16/lnamd64/3ddp_host/fluent.6.2.16(eval_errprotect+0x32) [0xa280d2]
The fluent process could not be started.
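For what it's worth, the failure signature in that transcript is easy to watch for automatically, e.g. from a PBS epilogue. A minimal sketch (the transcript file name is hypothetical; the strings scanned for are taken verbatim from the output above):

```shell
#!/bin/sh
# Build a sample transcript containing the tell-tale lines from the
# output above, then scan it the way a post-job hook could scan the
# real run log. 'run.log' is a placeholder name.
cat > run.log <<'EOF'
node 999999 retrying on zero socket read.....
999999 (mpsystem.c@1228): mpt_read: failed: errno = 11
EOF

if grep -q 'mpt_read: failed' run.log; then
    echo "MPI read failure detected - job likely killed during/after partitioning"
fi
rm -f run.log
```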

September 23, 2005, 17:11   #2
Re: parallel fluent runs being killed at partitioning
Vinod Dhiman (Guest, Posts: n/a)
Hi

I am getting the same kind of error; it writes mpt_connect_error. It is related to hardware, that is for sure. I use METIS and socket connections on P-IV nodes. Check your hardware connections: the runs may start working again once you make sure all the connections are physically sound.
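Along the lines of that suggestion, a quick reachability sweep of the compute nodes can rule out a flaky link before you start reseating cables. The host names below are placeholders for your own node list:

```shell
#!/bin/sh
# Ping each compute node once; a node that intermittently drops packets
# would match the intermittent job deaths described above.
# node01..node03 are placeholder hostnames.
for host in node01 node02 node03; do
    if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        echo "$host: reachable"
    else
        echo "$host: UNREACHABLE"
    fi
done
```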

Vinod Dhiman

September 23, 2005, 20:37   #3
Re: parallel fluent runs being killed at partitioning
Razvan (Guest, Posts: n/a)
In my not-so-big experience running parallel jobs on Opterons, I have come across this error, but only when re-reading several cases one after another without restarting the parallel session, or when re-partitioning the same case several times.

Anyway, my advice is this: since you have Opterons and a 64-bit OS, I can't believe you don't have at least one workstation capable of reading your cases by itself in serial mode (4 GB of memory for one process should be enough to at least read the mesh). So read the mesh in the serial solver, partition it there using the best method you can find (which is NOT always METIS!), and write a case file. Then re-read that case in parallel mode and make the rest of the settings you need.

I have found this to be the best way to run parallel cases, because the parallel solver does not partition the same way the serial solver does. I have observed big differences between results obtained with the two; parallel partitioning with the SAME algorithm usually gives a higher number of interface faces.
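That serial-partition-then-reload workflow can be scripted as a Fluent journal file. The TUI menu paths below are assumptions from memory for Fluent 6.x and may differ in your version (check them against your text command list); the case names and partition count are placeholders:

```
; Run in the SERIAL solver. Journal file; TUI paths are assumptions.
/file/read-case mycase.cas                     ; read the mesh serially
/parallel/partition/method principal-axes 8    ; partition into 8 pieces
/file/write-case mycase-part8.cas              ; write the partitioned case
exit
```

The written case is then read back in a parallel session, where the solver uses the stored partitions instead of re-partitioning.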

Best wishes, Razvan

June 8, 2012, 10:40   #4
Suspect solution for this problem
Solarberiden (New Member; TCH; Join Date: Jul 2010; Location: Beijing City; Posts: 15)
Increase your pagefile as follows:

My Computer > Properties > Advanced > Performance > Configuration > Advanced > Modify

Set the maximum pagefile size to 1.0-2.0 times the size of your physical memory.
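As a worked example of that sizing rule (the 4096 MB figure is just an assumed amount of physical RAM, not anything from the thread):

```shell
#!/bin/sh
# Pagefile sizing rule of thumb from the post above: 1.0x to 2.0x
# physical memory. ram_mb is an assumed example value.
ram_mb=4096                          # assumed physical RAM in MB
echo "pagefile initial size: $((ram_mb * 1)) MB"
echo "pagefile maximum size: $((ram_mb * 2)) MB"
```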

