Regarding error While Running simulation using Multicore in HPCE |
October 7, 2023, 07:36 |
|
#1 |
Member
Bhargav
Join Date: Aug 2022
Posts: 30
Rep Power: 4 |
Hello REEF3D team,
I am currently using REEF3D-23.03/REEF3D-release_candidate/ and working on a wave-structure interaction problem in a three-dimensional numerical wave tank. On my personal computer, the simulation runs on 12 cores without any issues. On the High-Performance Computing Environment (HPCE), it also runs smoothly with 8 cores, but with a higher number of cores it fails with the error message provided below. I have also tried the latest release candidate (RC), and the same issue occurs when using a higher number of cores. I have attached the relevant files; could you please examine them and help me identify the problem? Thank you for your assistance. |
|
October 7, 2023, 11:27 |
|
#2 |
Member
Bhargav
Join Date: Aug 2022
Posts: 30
Rep Power: 4 |
Please review the error I mentioned earlier.
[cn249:17924] *** Process received signal ***
[cn249:17924] Signal: Segmentation fault (11)
[cn249:17924] Signal code: Address not mapped (1)
[cn249:17924] Failing at address: 0x357e6
[cn249:17924] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b16af07b5d0]
[cn249:17924] [ 1] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x8ec000]
[cn249:17924] [ 2] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x53f81f]
[cn249:17924] [ 3] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5982c3]
[cn249:17924] [ 4] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x597ef5]
[cn249:17924] [ 5] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x85ce84]
[cn249:17924] [ 6] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5b3abf]
[cn249:17924] [ 7] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5b4711]
[cn249:17924] [ 8] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x41c2dd]
[cn249:17924] [ 9] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b16af2aa3d5]
[cn249:17924] [10] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x41f55f]
[cn249:17924] *** End of error message ***
[cn249:17933] *** Process received signal ***
[cn249:17933] Signal: Segmentation fault (11)
[cn249:17933] Signal code: (128)
[cn249:17933] Failing at address: (nil)
[cn249:17933] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab5b6f2e5d0]
[cn249:17933] [ 1] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x8ec0ec]
[cn249:17933] [ 2] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x53f691]
[cn249:17933] [ 3] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5982c3]
[cn249:17933] [ 4] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x597ef5]
[cn249:17933] [ 5] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x85ce84]
[cn249:17933] [ 6] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5b3abf]
[cn249:17933] [ 7] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x5b4711]
[cn249:17933] [ 8] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x41c2dd]
[cn249:17933] [ 9] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab5b715d3d5]
[cn249:17933] [10] /lfs/usrhome/phd/oe19d201/REEF3D-code/REEF3D-master-23.08/bin/REEF3D[0x41f55f]
[cn249:17933] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 54 with PID 0 on node cn249 exited on signal 11 (Segmentation fault).
-------------------------------------------------------------------------- |
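The backtrace above only shows raw return addresses inside the REEF3D binary. One possible way to make it more readable is to pass those addresses to the GNU binutils tool addr2line. The sketch below is not part of REEF3D: the helper names `frames_by_pid` and `addr2line_command` are hypothetical, and the result is only informative if the binary was compiled with debug information, which is an assumption here.

```python
import re

# One Open MPI backtrace frame, e.g.
# [cn249:17924] [ 1] /path/to/bin/REEF3D[0x8ec000]
FRAME = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "   # [host:pid]
    r"\[\s*(?P<idx>\d+)\] "                  # [ frame index ]
    r"(?P<obj>[^\s\[(]+)"                    # path of the binary or library
    r"(?:\([^)]*\))?"                        # optional (symbol+offset)
    r"\[(?P<addr>0x[0-9a-f]+)\]"             # [return address]
)

def frames_by_pid(log_text, binary_name="REEF3D"):
    """Collect the addresses that lie inside the given binary, grouped by PID."""
    result = {}
    for m in FRAME.finditer(log_text):
        if m.group("obj").endswith("/" + binary_name):
            result.setdefault(m.group("pid"), []).append(m.group("addr"))
    return result

def addr2line_command(binary_path, addresses):
    """Build an addr2line call (-f: function names, -C: demangle C++ names)."""
    return ["addr2line", "-f", "-C", "-e", binary_path] + list(addresses)

# Three frames copied from the log above.
sample = (
    "[cn249:17924] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b16af07b5d0] "
    "[cn249:17924] [ 1] /lfs/usrhome/phd/oe19d201/REEF3D-code/"
    "REEF3D-master-23.08/bin/REEF3D[0x8ec000] "
    "[cn249:17924] [ 2] /lfs/usrhome/phd/oe19d201/REEF3D-code/"
    "REEF3D-master-23.08/bin/REEF3D[0x53f81f]"
)

for pid, addrs in frames_by_pid(sample).items():
    print(pid, " ".join(addr2line_command("bin/REEF3D", addrs)))
```

Running the printed command on the cluster, with the path replaced by the actual location of the binary, would list a function name and source location for each frame if debug symbols are available.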
|
October 8, 2023, 03:59 |
|
#3 |
Senior Member
Arun Kamath
Join Date: Nov 2014
Location: Trondheim, Norway
Posts: 265
Rep Power: 13 |
My guess is that this is a memory allocation problem on the HPC. Check how much total RAM you are allocated on the HPC for your 80 cores. The simulation probably needs more than that.
Do you get 8 cores and 80 cores allocated on the same node, or do you get more nodes when you increase the number of cores to 80? Memory is generally attached to the nodes, and each node typically has maybe 16 or 32 cores, depending on the machine. Also, you say you are using the RC, while the path shows master. Are you sure about the version you are using? You also don't need M 20 2 in ctrl.txt.
__________________
Arun X years with REEF3D |
|
October 9, 2023, 07:47 |
|
#4 |
New Member
Keshav Pathak
Join Date: Jul 2022
Posts: 27
Rep Power: 4 |
We faced the same problem with both the REEF3D RC and master versions. The job is run on 2 nodes, each with 40 cores and 192 GB of RAM.
No. of nodes = 2, no. of cores = 40 x 2 = 80, total RAM = 192 x 2 = 384 GB |
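As a quick check on these numbers, the memory available to each MPI process can be estimated as below. This is a minimal sketch that assumes one MPI rank per core and that the scheduler grants the job the full physical RAM of each node; neither is confirmed in this thread, and the function name `memory_per_rank_gb` is made up for illustration.

```python
def memory_per_rank_gb(nodes, cores_per_node, ram_per_node_gb):
    """Return (total ranks, total RAM in GB, RAM per rank in GB).

    Assumes one MPI rank per core and that the whole node memory
    is available to the job.
    """
    total_ranks = nodes * cores_per_node
    total_ram_gb = nodes * ram_per_node_gb
    return total_ranks, total_ram_gb, total_ram_gb / total_ranks

# Values quoted in the post above: 2 nodes, 40 cores and 192 GB per node.
ranks, total_gb, per_rank_gb = memory_per_rank_gb(2, 40, 192)
print(f"{ranks} ranks, {total_gb} GB total, {per_rank_gb:.1f} GB per rank")
```

If the batch system imposes a lower memory limit per job or per core than the physical RAM, the limit reported by the scheduler is the figure to compare against the simulation's requirement.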