
Running shape_optimization.py with MPI in a Cluster

January 14, 2025, 02:42   #1
Running shape_optimization.py with MPI in a Cluster
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Hello everybody,

I am trying to use the adjoint method in SU2. I am using a SLURM batch script to submit the job to the cluster. When I use the command below, all the output is repeated once per core. When I delete the "-n 110" part, it runs on a single core.

Code:
....
#SBATCH -N 2 
#SBATCH --ntasks-per-node=110
....

 shape_optimization.py -n 110 -f unsteady_naca0012_opt.cfg
I tried the mpirun command and a few different variations of the input, but I couldn't get it to work.

Can anybody help me run the adjoint from a SLURM script?

January 14, 2025, 04:17   #2
bigfoot (bigfootedrockmidget), Senior Member (Join Date: Dec 2011, Location: Netherlands, Posts: 782)
Hi,
When you see everything repeated N times when running on N cores, your SU2 executable was not correctly compiled with MPI support.
Make sure you configure with:
Code:
./meson.py build -Dwith-mpi=enabled

and check in the configure output that the compiler is an MPI wrapper, e.g.:
Quote:
C++ compiler for the host machine: /usr/bin/mpicxx.openmp

January 14, 2025, 04:34   #3
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Thanks for the answer. The thing is, it only repeats everything when I use the command in the SLURM script and submit the job to the cluster. When I run the same command in a terminal, everything works well.

Also, I have run cases other than the adjoint with MPI, and there is no such problem. I tried to adjust the processor number from within the shape_optimization.py file, but it repeats as before.

Any idea how I can fix this?

January 14, 2025, 08:38   #4
Josh Kelly (joshkellyjak), Member (Join Date: Dec 2018, Posts: 57)
Is there something in your job script that is overriding the bash settings you have in your terminal environment?

Maybe a better option would be to use FADO (https://github.com/pcarruscag/FADO) instead of the Python scripts.

January 15, 2025, 03:13   #5
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Code:
#!/bin/bash

#SBATCH -p 
#SBATCH -A 
#SBATCH -J NEMO
#SBATCH --nodes=4
#SBATCH --ntasks=216
#SBATCH --cpus-per-task=1
#SBATCH -C weka
#SBATCH --time=12:10:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"

module purge


module load comp/gcc/14.1.0
module load lib/openmpi/5.0.0
module load lib/hdf5/1.14.3-openmpi-5.0.0

PATH=/arf/home/../pcre2/bin:$PATH
PATH=/arf/home/../swig/bin:$PATH
PATH=/arf/home/../swig:$PATH
export PATH=/home/../cmake-3.31.2/bin:$PATH
export PATH=/home/../cmake-3.31.2/:$PATH
PATH=/../programs/re2c-4.0.2/bin:$PATH
PATH=../programs/re2c-4.0.2:$PATH



export SU2_RUN=/arf/../SU2-V8.1.0/su2-2/bin
export SU2_HOME=/arf/../SU2-V8.1.0/su2-2
export PATH=$PATH:$SU2_RUN
export PYTHONPATH=$PYTHONPATH:$SU2_RUN
export MPP_DATA_DIRECTORY=$SU2_HOME/subprojects/Mutationpp/data
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SU2_HOME/build/subprojects/Mutationpp


shape_optimization.py -f ramp.cfg

sstat -j $SLURM_JOB_ID
exit
My SLURM file looks like this. I don't know if there is anything wrong with it, but there was no problem with other kinds of simulations submitted this way.

I don't know anything about FADO, but I will look into it. Thank you for the answer.

January 15, 2025, 04:30   #6
bigfoot (bigfootedrockmidget), Senior Member (Join Date: Dec 2011, Location: Netherlands, Posts: 782)
Quote:
below it repeats everything as the number of cores
What does this mean exactly?

When you run this on the cluster, is there an actual speedup in the end? SLURM needs to copy the Python script to all the allocated cores, so you should see things being copied ncores times. So in the end, does the job itself run n times on 1 core, or not? Does SLURM produce a single output file that contains the time per iteration? Then you can quickly find out whether the speedup is correct.
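That last check can be scripted: extract the per-iteration times from a serial and a parallel log and compare their means. The log format below is invented purely for illustration (SU2's actual history/output columns differ), so the regex would need adapting to the real log lines:

```python
import re

def mean_time_per_iter(log_text):
    # Pull every "time/iter: <seconds>" figure out of the log.
    # NOTE: this pattern is hypothetical; match it to your actual log format.
    times = [float(t) for t in re.findall(r"time/iter:\s*([\d.]+)", log_text)]
    return sum(times) / len(times)

# Tiny made-up logs standing in for a 1-core and a many-core run.
serial_log = "iter 1 time/iter: 10.0\niter 2 time/iter: 9.8\n"
parallel_log = "iter 1 time/iter: 1.1\niter 2 time/iter: 0.9\n"

speedup = mean_time_per_iter(serial_log) / mean_time_per_iter(parallel_log)
print(f"estimated speedup: {speedup:.1f}x")  # estimated speedup: 9.9x
```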

January 15, 2025, 05:13   #7
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
I had to delete most of it, but the log_direct.out file looks something like this when I use "shape_optimization.py -n 110 -f ramp.cfg" in the SLURM script. It repeats everything ncores times, as you said would happen when the OpenMPI installation is wrong. Since this does not occur in the other cases I have run, I thought it shouldn't be the issue. I can see a speedup between 10 cores and 110 cores, but I did not calculate it using, say, the wrt_perf parameter. When I don't use "-n 110", it clearly runs on a single core: I can solve the baseline geometry within 5 hours with the same number of cores using "mpirun SU2_CFD", but with the adjoint, DSN_0001 didn't even finish after 3 days.


Quote:
So in the end, does the job itself run n times on 1 core, or not?
I am not an HPC expert, so I am not sure, but my guess is that it is using ncores to repeat everything.
Attached Files
File Type: txt log_Direct.txt (190.0 KB, 4 views)

January 16, 2025, 11:36   #8
rutvik (R.K), New Member (Join Date: Mar 2024, Posts: 19)
I have experienced this in the past. For SLURM jobs, if the MPI command is not specified, SU2 by default attempts to use srun instead of mpiexec or mpirun. To specify the MPI command, you need to add the following to your environment variables or SLURM script:
Quote:
export SU2_MPI_COMMAND="mpiexec -n %i %s"
For reference, this logic is implemented in interface.py.
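R.K's point can be illustrated with a small sketch of that selection logic. This is a simplified paraphrase, not the verbatim interface.py code, and the exact environment variables SU2 checks for SLURM detection are an assumption here:

```python
import os

def build_mpi_command(exe, nprocs):
    # Simplified paraphrase of SU2's run-command selection (not verbatim):
    # an explicit SU2_MPI_COMMAND takes precedence; otherwise a detected
    # SLURM job falls back to srun, and a plain terminal gets mpirun.
    if 'SU2_MPI_COMMAND' in os.environ:
        template = os.environ['SU2_MPI_COMMAND']   # e.g. "mpiexec -n %i %s"
    elif 'SLURM_JOBID' in os.environ:
        template = 'srun -n %i %s'                 # default inside a SLURM job
    else:
        template = 'mpirun -n %i %s'
    return template % (nprocs, exe)

os.environ['SU2_MPI_COMMAND'] = 'mpiexec -n %i %s'
print(build_mpi_command('SU2_CFD ramp.cfg', 110))
# mpiexec -n 110 SU2_CFD ramp.cfg
```

This is why the same command behaves differently in a terminal and inside a batch job: the environment differs, so the launcher differs.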

January 17, 2025, 05:24   #9
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Quote:
Originally Posted by R.K View Post
I have experienced this in the past. For SLURM jobs, if the MPI command is not specified, SU2 by default attempts to use srun instead of mpiexec or mpirun. To specify the MPI command, you need to add the following to your environment variables or SLURM script:

For reference, this is written in interface.py.

Thank you, Rutvik. I added that part to the SLURM file, but what should I write as the last command instead of "shape_optimization.py -n 110 -f ramp.cfg"? I tried a few things, but it still doesn't work properly.

January 17, 2025, 06:22   #10
rutvik (R.K), New Member (Join Date: Mar 2024, Posts: 19)
What do you mean by "it still doesn't work properly"? Does it run in serial again? Try mpirun instead of mpiexec.

I typically use FADO rather than shape_optimization.py. Still, when I did use it, I would set the environment variables in the SLURM file and call the script with its full path:
Code:
python your/path/to/shape_optimization.py -n 110 -f ramp.cfg -g DISCRETE_ADJOINT
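Putting the two suggestions together, a minimal job-script fragment might look like the sketch below. The partition, module names, paths, and core counts are placeholders to adapt to your own cluster; only the SU2_MPI_COMMAND line and the final call come from this thread:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=110
#SBATCH --time=12:00:00

module load lib/openmpi/5.0.0          # whatever MPI module your cluster provides

export SU2_RUN=/path/to/SU2/bin        # placeholder install location
export PATH=$PATH:$SU2_RUN
export PYTHONPATH=$PYTHONPATH:$SU2_RUN

# Force the wrapper scripts to launch SU2 with mpirun instead of the srun default.
export SU2_MPI_COMMAND="mpirun -n %i %s"

python $SU2_RUN/shape_optimization.py -n 110 -f ramp.cfg -g DISCRETE_ADJOINT
```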

January 20, 2025, 06:31   #11
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Thank you for the info, Rutvik. I think I made it work the way you explained. However, now it doesn't move forward after completing DSN_0001/DIRECT. It looks like it is running, but no new files are created. I lowered the time iterations and inner iterations and ran the case on my personal computer; there was no problem, and SU2 created the adjoint files as the simulation moved forward. Any idea about this?

January 20, 2025, 06:33   #12
Josh Kelly (joshkellyjak), Member (Join Date: Dec 2018, Posts: 57)
Does your HPC file system use Lustre? In the past this has caused issues for me. I would highly recommend you use FADO; there is a tutorial on the SU2 website that demonstrates how to set up a FADO script.

January 20, 2025, 06:46   #13
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Quote:
Originally Posted by joshkellyjak View Post
Does your HPC file system use Lustre? In the past this has caused issues for me. I would highly recommend you use FADO; there is a tutorial on the SU2 website that demonstrates how to set up a FADO script.

Yes, it looks like it is Lustre. I am not familiar with FADO, but it looks like I have to switch to it. I didn't have much time, so I had to try the Python commands first. So you are saying that FADO can solve these problems, right?

January 20, 2025, 07:03   #14
Josh Kelly (joshkellyjak), Member (Join Date: Dec 2018, Posts: 57)
I can't say it will solve all your issues, but it is definitely more powerful than the basic Python scripts. The issue with symbolic links on Lustre systems stems from OpenMPI (https://github.com/open-mpi/ompi/issues/12141).

January 20, 2025, 07:48   #15
Ercan Umut (CleverBoy), Member (Join Date: Aug 2022, Posts: 72)
Quote:
Originally Posted by joshkellyjak View Post
I can't say it will solve all your issues, but it is definitely more powerful than the basic Python scripts. The issue with symbolic links on Lustre systems stems from OpenMPI (https://github.com/open-mpi/ompi/issues/12141).

Thank you so much for the answers. I would like to ask another question, though, so I can report it to the HPC system administrators. If I use FADO, will this OpenMPI issue be solved? I followed the link you shared, and the last entry says it would be fixed in the 5.0.x releases. I just compiled SU2 with OpenMPI 5.0.4, and the same problem happens.
