Parallel Fluent Error in Batch Mode
I am getting an error message when trying to run a Fluent job in parallel via a PBS batch system on a Unix cluster.
The case will load in interactive mode, but when I try to launch it in batch mode something goes wrong and it gives an error before loading the grid.
Here is the error from the fluent output file:
Multicore processors detected. Processor affinity set!
MPI Application rank 0 killed before MPI_Finalize() with signal 9
node 999999 retrying on zero socket read.....
node 999999 retrying on zero socket read.....
999999 (../../src/mpsystem.c@1123): mpt_read: failed: errno = 11
999999: mpt_read: error: read failed trying to read 4 bytes: Resource temporarily unavailable
Here is the batch file:
#PBS -N parallel_fluent
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=4
#PBS -l software=fluent:fluentpar+4
#PBS -j oe
#PBS -m ae
#PBS -S /bin/csh
set echo on
hostname
module load fluent
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE | sort > pnodes
set ncpus=`cat pnodes | wc -l`
fluent 3ddp -t$ncpus -pinfiniband.ofed -cnf=pnodes -g < Case1Fullgrdck.input
And the input file:
file/read-case Case1FullStack.cas
grid/check
solve/initialize/initialize-flow
file/write-data Case1FullStack.dat
exit
yes
Any input on what I am doing wrong would be greatly appreciated.
Thanks,
Justin
Re: Parallel Fluent Error in Batch Mode
Never mind, I figured it out. The job was using too much RAM for a single compute node, so I switched to one core per node spread across multiple nodes.
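For anyone landing here later, the change described above probably amounts to something like the following in the batch file. This is only a sketch: the node count of 4 is an assumption chosen to keep four processes in total, and the right number depends on how much memory the case needs and how much each node has.

```shell
# Assumed replacement for "#PBS -l nodes=1:ppn=4":
# request 4 separate nodes with 1 core each, so each process gets a node's memory to itself.
#PBS -l nodes=4:ppn=1
# License request left unchanged (still 4 parallel processes in total).
#PBS -l software=fluent:fluentpar+4
```

The rest of the script should be able to stay as it is, since pnodes is built from $PBS_NODEFILE and would then list four different hosts instead of the same host four times.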
Hi Justin, could you elaborate on how you solved this problem? I just ran into the same thing.
The funny thing is that the simulation ran for 12 time steps before hitting this problem.
Hi guys, did you figure it out? I am having the same problem. I guess it is also related to memory.
MPI Application rank 4 killed before MPI_Finalize() with signal 9
Node 12: Process 22312: Received signal SIGTERM.
Node 13: Process 22313: Received signal SIGTERM.
Node 14: Process 22314: Received signal SIGTERM.
Node 8: Process 22308: Received signal SIGTERM.
Node 5: Process 22305: Received signal SIGTERM.
Node 2: Process 22302: Received signal SIGTERM.
Node 11: Process 22311: Received signal SIGTERM.
===============Message from the Cortex Process================================
Fatal error in one of the compute processes.