Parallel Fluent Error in Batch Mode
I am getting an error when trying to run a Fluent job in parallel through a PBS batch system on a Unix cluster.
The case loads in interactive mode, but when I launch it in batch mode something goes wrong and it fails before loading the grid.

Here is the error from the Fluent output file:

    Multicore processors detected. Processor affinity set!
    Reading "Case1FullStack.cas"...
    MPI Application rank 0 killed before MPI_Finalize() with signal 9
    node 999999 retrying on zero socket read.....
    node 999999 retrying on zero socket read.....
    999999 (../../src/mpsystem.c@1123): mpt_read: failed: errno = 11
    999999: mpt_read: error: read failed trying to read 4 bytes: Resource temporarily unavailable

Here is the batch file:

    #PBS -N parallel_fluent
    #PBS -l walltime=1:00:00
    #PBS -l nodes=1:ppn=4
    #PBS -l software=fluent:fluentpar+4
    #PBS -j oe
    #PBS -m ae
    #PBS -S /bin/csh
    set echo on
    hostname
    module load fluent
    cd $PBS_O_WORKDIR
    cat $PBS_NODEFILE | sort > pnodes
    set ncpus=`cat pnodes | wc -l`
    fluent 3ddp -t$ncpus -pinfiniband.ofed -cnf=pnodes -g < Case1Fullgrdck.input

And the input file:

    file/read-case Case1FullStack.cas
    grid/check
    solve/initialize/initialize-flow
    file/write-data Case1FullStack.dat
    exit
    yes

Any input on what I am doing wrong would be greatly appreciated.

Thanks,
Justin
Re: Parallel Fluent Error in Batch Mode
Never mind, I figured it out. The case was using too much RAM for a single compute node, so I switched to one core per node across multiple nodes.
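For anyone finding this thread later, the change described above might look roughly like the following in the batch file. This is only a sketch: the node count of 4 is an assumption (the post does not say how many nodes were used), and the software line is copied unchanged from the original script.

```shell
# Original request: 4 cores on one node, all sharing that node's RAM.
#   nodes=1:ppn=4
# Revised request: 1 core on each of 4 nodes (the node count is an assumption).
#PBS -l nodes=4:ppn=1
#PBS -l software=fluent:fluentpar+4
```

The rest of the script should not need to change, since it builds pnodes from $PBS_NODEFILE and counts $ncpus from that file. With this request the node file should list four different hosts, so Fluent would still start four processes, but each one on its own node.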
Hi Justin, could you elaborate on how you solved this problem? I have just run into the same thing. The odd part is that the simulation ran for 12 time steps before hitting the error. Thanks!
Hi guys, did you figure it out? I am having the same problem, and I suspect it is also related to memory.
    MPI Application rank 4 killed before MPI_Finalize() with signal 9
    Node 12: Process 22312: Received signal SIGTERM.
    Node 13: Process 22313: Received signal SIGTERM.
    Node 14: Process 22314: Received signal SIGTERM.
    Node 8: Process 22308: Received signal SIGTERM.
    Node 5: Process 22305: Received signal SIGTERM.
    Node 2: Process 22302: Received signal SIGTERM.
    Node 11: Process 22311: Received signal SIGTERM.
    ===============Message from the Cortex Process================================
    Fatal error in one of the compute processes.
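If memory is the suspect, one rough check is to compare a node's physical RAM with the number of Fluent processes placed on it. The sketch below is written for sh/bash rather than the csh used in the batch file above, assumes a Linux compute node (it reads /proc/meminfo), and uses a hypothetical rank count of 4. It says nothing about how much memory the case actually needs; it only shows the budget available per process.

```shell
# Assumption: number of Fluent processes sharing this node (hypothetical value).
ranks=4

# Total physical memory of this node in kB, as reported by the Linux kernel.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

# Rough upper bound on memory per process if the node is shared evenly.
per_rank_mb=$(( total_kb / 1024 / ranks ))

echo "Node RAM: $(( total_kb / 1024 )) MB, about ${per_rank_mb} MB per process for ${ranks} processes"
```

If the case file needs more than this per process, spreading the same number of processes over more nodes (fewer per node) is one way to give each more room.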