Running shape_optimization.py with MPI in a Cluster |
#1
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Hello everybody,
I am trying to use the Adjoint Method in SU2. I am using a SLURM batch script to send the job to the cluster. When I use the command below, it repeats everything as many times as the number of cores. When I delete the "-n 110" part, it runs on a single core. Code:
....
#SBATCH -N 2
#SBATCH --ntasks-per-node=110
....
shape_optimization.py -n 110 -f unsteady_naca0012_opt.cfg
Can anybody help me with how to run the Adjoint with a SLURM file?
#2
Senior Member
bigfoot
Join Date: Dec 2011
Location: Netherlands
Posts: 782
Rep Power: 21
Hi,
When you see everything repeated N times when running on N cores, your SU2 executable was not correctly compiled with MPI support. Make sure you configure it like this: Code:
./meson.py build -Dwith-mpi=enabled
and check in the configure output that the compiler is using OpenMPI.
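Not from the post above, just a sketch of a full build-and-check sequence under the usual SU2 meson workflow (the install prefix is a placeholder, and the exact wording of meson's dependency summary can differ between SU2 versions): Code:
# configure with MPI enabled; the summary printed at the end should list an MPI dependency as found
./meson.py build -Dwith-mpi=enabled --prefix=/path/to/SU2/install
# the configure log can also be searched afterwards
grep -i "mpi" build/meson-logs/meson-log.txt
# compile and install
./ninja -C build install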
#3
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Thanks for the answer. The thing is, it only repeats everything when I put the command in the SLURM file and send the job to the cluster. When I run the same command directly in the terminal, everything seems to work well.
I have also run cases other than the Adjoint with MPI and there is no such problem. I tried to adjust the processor number inside the shape_optimization.py file, but it repeats as before. Any idea how I can fix this?
#4
Member
Josh Kelly
Join Date: Dec 2018
Posts: 57
Rep Power: 8
Is there something in your shell script job file that is overriding the bash settings you have in the terminal environment?
Maybe a better option would be to use FADO (https://github.com/pcarruscag/FADO) instead of the python scripts. |
#5
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Code:
#!/bin/bash
#SBATCH -p
#SBATCH -A
#SBATCH -J NEMO
#SBATCH --nodes=4
#SBATCH --ntasks=216
#SBATCH --cpus-per-task=1
#SBATCH -C weka
#SBATCH --time=12:10:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
module purge
module load comp/gcc/14.1.0
module load lib/openmpi/5.0.0
module load lib/hdf5/1.14.3-openmpi-5.0.0
PATH=/arf/home/../pcre2/bin:$PATH
PATH=/arf/home/../swig/bin:$PATH
PATH=/arf/home/../swig:$PATH
export PATH=/home/../cmake-3.31.2/bin:$PATH
export PATH=/home/../cmake-3.31.2/:$PATH
PATH=/../programs/re2c-4.0.2/bin:$PATH
PATH=../programs/re2c-4.0.2:$PATH
export SU2_RUN=/arf/../SU2-V8.1.0/su2-2/bin
export SU2_HOME=/arf/../SU2-V8.1.0/su2-2
export PATH=$PATH:$SU2_RUN
export PYTHONPATH=$PYTHONPATH:$SU2_RUN,
export MPP_DATA_DIRECTORY=$SU2_HOME/subprojects/Mutationpp/data
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SU2_HOME/build/subprojects/Mutationpp
shape_optimization.py -f ramp.cfg
sstat -j $SLURM_JOB_ID
exit
I don't know anything about FADO but I will look into it. Thank you for the answer.
#6
Senior Member
bigfoot
Join Date: Dec 2011
Location: Netherlands
Posts: 782
Rep Power: 21
When you run this on the cluster, is there an actual speedup in the end? SLURM needs to copy the Python script to all the allocated cores, so you should see some things being copied ncores times. So in the end, does the job itself run n times on 1 core, or not? Does SLURM produce a single output file that contains the time per iteration? Then you can quickly find out whether the speedup is correct.
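Not part of the post above, just one possible way to compare two runs with standard SLURM accounting tools (the job IDs are placeholders, and seff is a contributed tool that may not be installed on every cluster): Code:
# wall time and task count for, e.g., a 10-core and a 110-core job
sacct -j 123456,123457 --format=JobID,NTasks,Elapsed,State
# quick CPU-efficiency summary for a finished job, if seff is available
seff 123456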
#7
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
I had to delete most of it, but the log_direct.out file looks something like this when I use "shape_optimization.py -n 110 -f ramp.cfg" in the SLURM file.
It repeats everything ncores times, as you said would happen when the OpenMPI installation is wrong. Since this does not occur when I run other kinds of cases, I thought that shouldn't be the issue. I can see a speedup between using 10 cores and 110 cores, but I did not measure it with, say, the wrt_perf parameter. When I don't use "-n 110" it clearly runs on a single core: I can solve the baseline geometry within 5 hours with the same ncores using "mpirun SU2_CFD", but with the adjoint the first DSN_0001 didn't even finish after 3 days. Quote:
[log_direct.out excerpt not preserved]
#8
New Member
rutvik
Join Date: Mar 2024
Posts: 19
Rep Power: 3
I have experienced this in the past. For SLURM jobs, if the MPI command is not specified, SU2 by default attempts to use srun instead of mpiexec or mpirun. To specify the MPI command you need to add the following to your environment variables or SLURM script: Quote:
[quoted snippet not preserved]
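The quoted snippet above did not survive the page export. If I recall SU2's Python run interface correctly, the override is the SU2_MPI_COMMAND environment variable, along these lines (treat the exact launcher and flags as an assumption for your cluster): Code:
# format string used by the SU2 Python scripts to launch parallel runs:
# %i is replaced by the number of processes, %s by the SU2 command line
export SU2_MPI_COMMAND="mpirun -n %i %s"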
#9
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Thank you Rutvik. I added that part to the SLURM file, but then what should I write as the last command instead of "shape_optimization.py -n 110 -f ramp.cfg"? I tried a few things but it still doesn't work properly.
#10
New Member
rutvik
Join Date: Mar 2024
Posts: 19
Rep Power: 3
What do you mean by "it still doesn't work properly"? Does it run in serial again? Try mpirun instead of mpiexec.
I typically use FADO rather than shape_optimization.py. Though when I did use it, I would add the environment variables and the path to the shape_optimization.py script in the SLURM file:
python your/path/to/shape_optimization.py -n 110 -f ramp.cfg -g DISCRETE_ADJOINT
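Putting the two suggestions together, the end of the SLURM script might look roughly like this (the script path is a placeholder, and SU2_MPI_COMMAND is the override variable mentioned in post #8; adjust both to your setup): Code:
# force the SU2 Python scripts to launch with mpirun instead of srun
export SU2_MPI_COMMAND="mpirun -n %i %s"
# run the optimization with the discrete adjoint gradient
python /your/path/to/shape_optimization.py -n $SLURM_NTASKS -f ramp.cfg -g DISCRETE_ADJOINT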
#11
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Thank you for the info, Rutvik. I think I made it work the way you explained. However, now the solution doesn't move forward after completing DSN_0001/DIRECT. It looks like it is running, but no new files are created. I tried lowering the time iterations and inner iterations and running the case on my personal computer; there was no problem there, SU2 created the adjoint files as the simulation moved forward. Any idea on this?
#12
Member
Josh Kelly
Join Date: Dec 2018
Posts: 57
Rep Power: 8
Does your HPC file system use Lustre? In the past this has caused issues for me. I would highly recommend you use FADO; there is a tutorial on the SU2 website that demonstrates how to set up a FADO script.
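Not from the post above, just a quick way to check the filesystem type of the directory the job runs in, using standard Linux tools: Code:
# print mount point and filesystem type (e.g. lustre, nfs, ext4) for the working directory
df -Th .
# or query the type name directly
stat -f -c %T .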
#13
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Yes, it looks like it is Lustre. I am not familiar with FADO, but it looks like I will have to switch to it. I didn't have much time, so I had to try the Python scripts first. So you are saying that FADO can solve these problems, right?
#14
Member
Josh Kelly
Join Date: Dec 2018
Posts: 57
Rep Power: 8
I can't say it will solve all your issues, but it is definitely more powerful than the basic Python scripts. The issue with symbolic links on Lustre systems stems from OpenMPI (https://github.com/open-mpi/ompi/issues/12141).
|
#15
Member
Ercan Umut
Join Date: Aug 2022
Posts: 72
Rep Power: 4
Thank you so much for the answers. I would like to ask another question though, so I can report it to the HPC system administrators. If I use FADO, will this OpenMPI issue be solved? I followed the link you shared and the last entry says it would be fixed in the 5.0.x releases, but I just compiled SU2 with OpenMPI 5.0.4 and the same problem happens.