
Trouble running SU2 in parallel on cluster

November 8, 2016, 16:43
Trouble running SU2 in parallel on cluster
#1
New Member
 
Devin Gibson
Join Date: Nov 2016
Posts: 2
Rep Power: 0
devinmgibson is on a distinguished road
I am not sure which forum this is most appropriate for, but I figure the Hardware forum works because, as far as I can tell, the problem is with the hardware setup rather than with the software.

I work for one of my professors, and we are trying to run SU2 in parallel on a university-owned cluster that uses Slurm as its workload manager. The problem is that when we ssh into the cluster and run the command:

parallel_computation.py -f SU2.cfg

on a node assigned by Slurm (submitted with sbatch), the code hangs and won't run. The strange thing is that the same command works just fine on the login node. Does anyone know what the problem could be?

Here is some additional information:
- We talked with the IT guy in charge of the cluster and he doesn't have enough background to know what is going on.
- Some of our output files contained the escape sequence [!0134h; after we changed the terminal settings to get rid of it, the behavior was still the same as described above.
- We can run the code in serial (SU2_CFD "config file") just fine on both the login node and the compute nodes.
- We have tried running an interactive session on a node (using srun), with no change in behavior.

Any thoughts would be appreciated! We really want to be able to run the code in-house instead of outsourcing it.
devinmgibson is offline   Reply With Quote

November 10, 2016, 18:40
#2
New Member
 
California
Join Date: Nov 2016
Posts: 10
Rep Power: 9
nomad2 is on a distinguished road
I know it's only been two days since this post, but have you made any progress? I'm trying to run SU2 on a cluster with Slurm as well.

I can run it fine on the login node in serial, but I'm not sure how to submit it in parallel.
nomad2 is offline   Reply With Quote

November 10, 2016, 18:58
#3
New Member
 
Devin Gibson
Join Date: Nov 2016
Posts: 2
Rep Power: 0
devinmgibson is on a distinguished road
No progress...

I have been doing some more tests on the Kingspeak cluster at the University of Utah to see whether the error is consistent, and recently I have been getting the following error:

Code:
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 110, in <module>
    main()
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 61, in main
    options.compute      )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/parallel_computation.py", line 88, in parallel_computation
    info = SU2.run.CFD(config)
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 110, in CFD
    run_command( the_Command )
  File "/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2/run/interface.py", line 268, in run_command
    raise exception , message
RuntimeError: Path = /uufs/chpc.utah.edu/common/home/<uNID>/SU2-Tests/Users/2118/D2602EDB-0B2F-46C7-A93C-5290D2F8DA50/,
Command = mpirun -n 2 /uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD config_CFD.cfg
SU2 process returned error '127'
/uufs/chpc.utah.edu/sys/installdir/su2/4.0i/bin/SU2_CFD: symbol lookup error: /uufs/chpc.utah.edu/sys/installdir/intel/impi/5.1.1.109/intel64/lib/libmpifort.so.12: undefined symbol: MPI_UNWEIGHTED
That error is from the new cluster. I've reached out to a few people, and no one has been able to tell me what it means yet. I also don't know whether this error is related to the cluster at my own university or not.

In reference to your question about running in parallel, here is the Slurm batch script I am using with the sbatch command:

Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

module load openmpi
module load su2

parallel_computation.py -f SU2.cfg

#####################################################
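The symbol lookup error above is usually a sign of an MPI mismatch: the traceback resolves libmpifort.so.12 from an Intel MPI install, while the batch script above loads an openmpi module. A quick, hedged way to check which MPI the installed SU2_CFD actually links against once the modules are loaded (module and path names here are site-specific assumptions):

Code:
# Load the same environment the batch job uses (module names are site-specific)
module load openmpi su2

# Which SU2_CFD is on the PATH, and which MPI libraries does it resolve at run time?
which SU2_CFD
ldd "$(which SU2_CFD)" | grep -i mpi

# Which MPI modules are currently loaded?
module list
If ldd resolves the MPI libraries from the Intel MPI tree while mpirun comes from Open MPI, loading the matching MPI module (or asking the admins which MPI the SU2 install was built with) would be the first thing to try.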
devinmgibson is offline   Reply With Quote

December 13, 2016, 17:35
#4
Senior Member
 
Zach Davis
Join Date: Jan 2010
Location: Los Angeles, CA
Posts: 101
Rep Power: 16
RcktMan77 is on a distinguished road
In order to run SU2 in parallel, the code needs a few things:
  1. The SU2 executables need to be accessible to every node at the same location on the filesystem. (A shared volume mounted on each node would be best here.)
  2. The SU2 grid and configuration file need to be accessible to every node at the same location on the filesystem. (This means you may need a shared volume mounted on each of the nodes that you use as your run directory.)
  3. The SU2 run script, parallel_computation.py, needs to know how many processes to launch. You tell it this with the -n flag followed by the number of MPI processes to create (e.g. parallel_computation.py -n 32 -f my_su2_config_file.cfg).
  4. The actual launch command used with your compiled SU2 executables can be seen in $SU2_RUN/SU2/run/interface.py. Search this file for slurm, and modify the run command if needed for your environment (see the shell sketch after this list).
  5. You need passwordless ssh set up between the nodes (i.e. you need to be able to log in to each node from the head node, and vice versa, without being prompted for a password); the sketch below shows one common way to check and set this up.
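A short shell sketch for items 4 and 5, assuming a typical Linux cluster; the node name node02 is only a placeholder:

Code:
# Item 4: see how interface.py builds the parallel run command
grep -n -i slurm "$SU2_RUN/SU2/run/interface.py"

# Item 5: passwordless ssh between nodes (run from the head node)
ssh-keygen -t rsa        # accept the defaults and leave the passphrase empty
ssh-copy-id node02       # repeat for each compute node
ssh node02 hostname      # should return the hostname without a password prompt
On clusters where home directories are shared across nodes, the ssh-copy-id step usually only needs to be done once, since every node reads the same ~/.ssh/authorized_keys.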

Based on the error message you're receiving, SU2 is not detecting that the SLURM_JOBID environment variable is set, so it is defaulting to an mpirun command. Your Slurm batch script is also not passing the 12 MPI processes that you want to the parallel_computation.py command. It should look like:

parallel_computation.py -n 12 -f SU2.cfg > su2.out 2>&1

The redirect of standard output and standard error to a file named su2.out isn't necessary, but it is good practice for capturing the output from SU2. It appears you're currently running with:

mpirun -n 2 SU2_CFD config_CFD.cfg

which isn't what you want. This suggests that there may be another Slurm header directive you need to add to your script to indicate how many CPU cores are available on each node. Perhaps there is a machinefile that Slurm uses to determine this, but, as I mentioned above, the SLURM_JOBID environment variable isn't set, so SU2 is bypassing Slurm altogether.

I don't use Slurm, but if there is an environment variable that holds the --ntasks value, you could use it in the run command of your script instead of hard-coding 12 as in this example. That way you wouldn't have to update the value in two places for each run. PBS has such a variable, but I'm not sure whether Slurm does; see the sketch below.
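Slurm does provide such a variable, SLURM_NTASKS, alongside SLURM_JOBID. A revised version of the batch script from post #3 that uses it might look like the sketch below; the account, partition, and module names are carried over from that script and remain site-specific assumptions:

Code:
#!/bin/bash
#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
#SBATCH --job-name=NACA-2412
#SBATCH --nodes=2
#SBATCH --ntasks=12
#SBATCH --time=02:00:00
#SBATCH -o slurmjob-%j.out
#SBATCH -e slurmjob-%j.err

module load openmpi   # or whichever MPI module the SU2 install was actually built against
module load su2

# Sanity check: both variables should be set inside a Slurm job
echo "SLURM_JOBID=$SLURM_JOBID  SLURM_NTASKS=$SLURM_NTASKS"

# Pass the task count through so SU2 launches the right number of MPI processes
parallel_computation.py -n "$SLURM_NTASKS" -f SU2.cfg > su2.out 2>&1
With -n "$SLURM_NTASKS" in the run command, changing the #SBATCH --ntasks line is enough; the number of MPI processes follows it automatically.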

Best Regards,



Zach
saladbowl likes this.
RcktMan77 is offline   Reply With Quote

January 4, 2017, 14:58
#5
OVS
New Member
 
Oliver V
Join Date: Dec 2015
Posts: 17
Rep Power: 10
OVS is on a distinguished road
Hello,

I've been having the exact same error recently:

Code:
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47035,1],0]
  Exit code:    127
Any progress on that? Which version of SU2 are you using?

Oliver
OVS is offline   Reply With Quote


Tags
cfd, cluster, parallel, slurm, su2

