Fluent parallel in CentOS
I'm running a case with a lot of CPU and RAM needs.
My simulation is transient, pure Eulerian, with 5 granular Eulerian phases (~20 UDFs).
I have two 4-core machines and I run this simulation in parallel across the two interconnected machines. I'm running my simulations through Fluent 13 in text mode.
Everything is OK, but after some hours of calculation I get the following message for no obvious reason (the case converges just fine and there seems to be nothing wrong with the calculations).
fluent_mpi.13.0.0: Rank 0:10: MPI_Allreduce: MPI BUG: MPI_Recv: request not done
MPI Application rank 10 exited before MPI_Finalize() with status 1
MPI Application rank 34 exited before MPI_Finalize() with status 0
bash: line 0: kill: (8481) - No such process
bash: line 0: kill: (8482) - No such process
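To see at a glance which ranks died and with what status, the exit lines can be pulled out of the log with a small script. This is only a sketch: the regular expression is written against the exact wording of the lines above, and the function names `exited_ranks` and `first_failure` are made up for illustration.

```python
import re

# Sample lines copied from the error output above.
LOG = """\
fluent_mpi.13.0.0: Rank 0:10: MPI_Allreduce: MPI BUG: MPI_Recv: request not done
MPI Application rank 10 exited before MPI_Finalize() with status 1
MPI Application rank 34 exited before MPI_Finalize() with status 0
"""

# Matches the "exited before MPI_Finalize()" lines as they appear in this log.
EXIT_RE = re.compile(
    r"MPI Application rank (\d+) exited before MPI_Finalize\(\) with status (\d+)"
)


def exited_ranks(text):
    """Return (rank, status) tuples for every rank that exited early."""
    return [(int(rank), int(status)) for rank, status in EXIT_RE.findall(text)]


def first_failure(text):
    """Return the first rank that exited with a nonzero status, or None."""
    for rank, status in exited_ranks(text):
        if status != 0:
            return rank
    return None


if __name__ == "__main__":
    print(exited_ranks(LOG))
    print(first_failure(LOG))
```

The rank with a nonzero status is usually the more interesting one to look at first, since the others may only have exited because that one did.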
When I read the last autosaved data and continue the calculation, everything is OK, but after some hours I get the same message again and Fluent crashes.
On the same machines, simulations that are less CPU- and RAM-intensive run without problems.
I use -mpi=hp, but I also tried -mpi=openmpi (I got the same behavior but a different error message, saying that it couldn't allocate 30 GB of RAM).
The OS is CentOS, and my RAM is enough for this case.
Thank you in advance!
P.S. I have some user-defined memory allocated (C_UDMI()), but as I said, I calculated the RAM needs and it is enough even for running the simulation on just one of the servers.
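For reference, a rough way to estimate the storage taken by the UDM values alone is cells × locations × bytes per value. The sketch below assumes one real value per cell per UDM location and 8 bytes per value (double precision); the cell count of 2,000,000 is purely hypothetical, since the mesh size is not given here, and the solver's own per-cell data comes on top of this.

```python
def udm_memory_bytes(n_cells, n_udm, bytes_per_real=8):
    """Estimate the bytes needed for UDM values only.

    Assumes one real per cell per UDM location; 8 bytes is double
    precision, 4 bytes would be single precision.
    """
    return n_cells * n_udm * bytes_per_real


if __name__ == "__main__":
    n_cells = 2_000_000  # hypothetical mesh size, not from the post
    n_udm = 11           # number of UDMIs mentioned in the thread
    total = udm_memory_bytes(n_cells, n_udm)
    print(f"{total} bytes = {total / 2**30:.3f} GiB")
```

With numbers of this order the UDM storage is small compared with a 30 GB allocation request, so an estimate like this mainly helps to rule UDM out as the dominant memory cost.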
I installed some updates for CentOS...
Any hints, anyone? Even ideas about what may be wrong would help me.
I haven't used Fluent 13 on Linux yet, but with Fluent 6.3.26 I had some problems somewhat similar to yours. I had a system with 5 RAM modules installed and one slot free. I checked and found that the free slot was slot No. 4, so I rearranged the modules so that the free slot was slot No. 6 (the last slot). After that the problem was almost solved! Some similar RAM-related problems occurred at the university HPC centre.
Thank you for your reply!
Following your suggestion, I tested the same case running on one machine only, and there were no crashes. I did that on both machines. So I don't believe it's a memory problem.
I'll now try deleting all the UDMIs. Does anybody know what the User Defined Node Memory Locations setting is? I run Fluent in parallel on 2 machines and I set this value to 0, while User Defined Memory Locations is set to 10 (11 UDMIs).
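One thing that may be worth checking before deleting the UDMIs: as far as I know, Fluent numbers UDM locations from 0, so N allocated locations cover indices 0 to N-1. If the UDFs really use 11 UDMIs (indices 0 to 10), a setting of 10 would leave the last index unallocated. The sketch below scans UDF source text for literal `C_UDMI`/`F_UDMI` indices and reports how many locations they require. The function names and the sample source are hypothetical, and indices given as variables or macros are not detected.

```python
import re

# Matches C_UDMI(c, t, <integer>) and F_UDMI(f, t, <integer>) with literal indices.
UDMI_RE = re.compile(r"\b[CF]_UDMI\s*\(\s*\w+\s*,\s*\w+\s*,\s*(\d+)\s*\)")


def max_udmi_index(source):
    """Return the largest literal UDM index used in the source, or None."""
    indices = [int(i) for i in UDMI_RE.findall(source)]
    return max(indices) if indices else None


def udm_locations_needed(source):
    """Return the number of UDM locations needed, assuming 0-based indexing."""
    highest = max_udmi_index(source)
    return 0 if highest is None else highest + 1


# Hypothetical UDF fragment, for illustration only.
SAMPLE = """
C_UDMI(c, t, 0) = C_VOF(c, t);
C_UDMI(c,t,10) = 0.0;
"""

if __name__ == "__main__":
    allocated = 10  # value set in the case, as stated in the post
    needed = udm_locations_needed(SAMPLE)
    print(f"needed={needed}, allocated={allocated}, ok={needed <= allocated}")
```

If the number needed is larger than the number allocated, the UDFs would be writing outside the allocated memory, which can produce crashes that look unrelated to the UDFs themselves.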