HPC issues--Starting STAR-CCM+ parallel server

Old   May 5, 2018, 01:06
Post HPC issues--Starting STAR-CCM+ parallel server
  #1
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Hello everyone!

I'm trying to run STAR-CCM+ under Slurm. On one node the software works correctly, but when I use 2 nodes the output only shows "Starting STAR-CCM+ parallel server" and the job appears to hang, without any error message.

Does anyone know the solution? Is there any way to get more detailed error messages?

Many thanks!

Here is my Slurm file:
Quote:
#!/bin/bash

#SBATCH -J STAR
#SBATCH -p cpu
#SBATCH -o outputs/%j.out
#SBATCH -e outputs/%j.err
#SBATCH --ntasks-per-node=1
#SBATCH -N 2

source /usr/share/Modules/init/bash
module purge
module load icc impi

ulimit -s unlimited
ulimit -l unlimited


cat /dev/null > machinefile
scontrol show hostname $SLURM_JOB_NODELIST > machinefile

starccm+ -power -batch-report -mpi intel -machinefile machinefile -np $SLURM_NTASKS -rsh ssh -cpubind -batch 'batch/run22s.java'

Old   May 5, 2018, 01:09
Default
  #2
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Here is the output when I scancel the slurm job:
Code:
Starting local server: /bin/star12/12.02.010/STAR-CCM+12.02.010/star/bin/starccm+ -power -mpi intel -machinefile machinefile -np 2 -cpubind -server -rsh ssh
Starting STAR-CCM+ parallel server
[mpiexec@node092] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)

Old   May 5, 2018, 08:49
Default
  #3
Senior Member
 
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Hi Colin,

Could you try redirecting the output of starccm+ into a separate log file and report back?
For example:

starccm+ -power -batch-report ....... > $SLURM_JOBID.star.log 2>&1


By the way, you might also need the -new argument if you don't open a .sim file.


Best regards,
Sebastian

Old   May 5, 2018, 21:35
Default
  #4
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Thank you for your reply!
However, I tried it and nothing changed. The output file still shows only "Starting STAR-CCM+ parallel server", with no further error message.

Old   May 6, 2018, 00:03
Default
  #5
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
I added -v to the start command and got much more detail, but it seems that no error occurs. It is driving me mad.
Attached Files
File Type: txt 1742440.txt (24.5 KB, 21 views)

Old   May 6, 2018, 03:52
Default
  #6
Senior Member
 
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
OK, I don't see any error either.



# What is the default time limit for a job in your Slurm setup? There is no -t line, such as
#SBATCH -t 001:00:00


# I am unsure whether this line works as intended:
scontrol show hostname $SLURM_JOB_NODELIST > machinefile

On my cluster we use something like this:
srun /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE


# Are you sure you have to set Intel MPI explicitly? Does starccm+ not select the right (InfiniBand) network by itself?
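
Whether the machinefile really contains one entry per allocated node can be checked inside the job before the solver starts. A minimal sketch, assuming a one-host-per-line machinefile; the helper names count_unique_hosts and machinefile_matches are made up for illustration and are not part of Slurm or STAR-CCM+:

```shell
#!/bin/bash
# Hypothetical helper: count the distinct, non-empty host names in a machinefile.
count_unique_hosts() {
    # Drop blank lines, then de-duplicate before counting.
    grep -v '^[[:space:]]*$' "$1" | sort -u | wc -l | tr -d '[:space:]'
}

# Hypothetical helper: succeed only if the machinefile lists exactly the
# expected number of distinct hosts (for example, $SLURM_JOB_NUM_NODES).
machinefile_matches() {
    local file=$1 expected=$2
    [ "$(count_unique_hosts "$file")" -eq "$expected" ]
}
```

In a job script this could be called as machinefile_matches machinefile "$SLURM_JOB_NUM_NODES" right after the machinefile is written, aborting early if the check fails.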

Old   May 6, 2018, 03:58
Default
  #7
Senior Member
 
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
One more addition:

I remember an obscure issue with a colleague's account which we could only resolve by deleting the starccm+ config in his user account.

Run ls -a in your home directory and delete the corresponding (or any) .star.... folder. They will be recreated cleanly in the next starccm+ session.
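
Moving the settings aside instead of deleting them keeps the option of restoring them later. A minimal sketch, assuming the relevant entries all match the glob .star* in the home directory (which is only a guess at the naming); the helper name backup_star_config and the backup folder name are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: move every ".star*" entry in a home directory into a
# timestamped backup folder and print how many entries were moved.
backup_star_config() {
    local home_dir=${1:-$HOME}
    local backup_dir="$home_dir/star-config-backup-$(date +%Y%m%d-%H%M%S)"
    local entry moved=0
    for entry in "$home_dir"/.star*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || continue
        mkdir -p "$backup_dir"
        mv -- "$entry" "$backup_dir"/
        moved=$((moved + 1))
    done
    echo "$moved"
}
```

Calling backup_star_config with no argument acts on $HOME; if the problem persists, the folders can simply be moved back.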

Old   May 6, 2018, 06:50
Default
  #8
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Your help is greatly appreciated. Many thanks!
There is no -t line because the batch job shuts down automatically when the simulation completes.
Code:
scontrol show hostname $SLURM_JOB_NODELIST > machinefile
srun  /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE
These two lines create the same machinefile.
If I don't set the MPI, STAR-CCM+ uses IBM MPI and an error such as "Host key verification failed" occurs.
I give up.

Old   May 6, 2018, 07:48
Default
  #9
Senior Member
 
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
OK, I searched through my notes from the meeting with my colleague.

He had the issue that STAR-CCM+ started flawlessly on our HPC with the -server and -macro options, but stopped responding after the "Starting STAR-CCM+ parallel server" line in -batch mode.

As a last attempt, deleting the .star directories in his home directory resolved the issue, but we didn't find the cause. I assumed there may have been some errors in the saved settings that made STAR-CCM+ fail to start, so I deleted them.

Old   May 6, 2018, 09:42
Default
  #10
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Yes! I'm facing exactly the same issue as your colleague. Unfortunately, I deleted the .star directory but the problem remains.
Anyway, I really appreciate your help.

Old   May 12, 2018, 17:35
Default
  #11
Senior Member
 
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Hi sky,


Could you resolve the issue?


I guess there must be one obvious step that I don't remember which might solve the issue.


One tripwire on our cluster: we have to set up an SSH key on the HPC for the HPC itself. Only that way can software communicate over SSH when Slurm does not provide the MPI environment.

Can you log into an HPC node without a password?



Best regards,
Sebastian
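
The passwordless-login question can be tested from inside a job, using the same machinefile that is handed to starccm+. A minimal sketch, assuming one host name per line; the function name check_passwordless_ssh and the SSH_BIN override are additions for illustration only. With BatchMode=yes, ssh fails instead of prompting, so both a missing key and an unconfirmed host key show up as FAIL:

```shell
#!/bin/bash
# Hypothetical helper: report which hosts in a machinefile refuse a
# non-interactive SSH login. SSH_BIN is an assumption added here so the
# command can be swapped out for testing; it defaults to plain "ssh".
check_passwordless_ssh() {
    local machinefile=$1 host failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue
        # BatchMode=yes makes ssh fail instead of prompting for input;
        # -n keeps ssh from consuming the rest of the machinefile on stdin.
        if "${SSH_BIN:-ssh}" -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$machinefile"
    return "$failed"
}
```

Running check_passwordless_ssh machinefile just before the starccm+ line would show in the job output whether every node is reachable without interaction.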

Old   May 13, 2018, 10:22
Default
  #12
New Member
 
Colin
Join Date: Mar 2018
Posts: 12
Thanks for your reply!
I believe that there must be a way to solve the issue, but maybe not now.
One thing that confuses me is that I can run the simulation successfully on one node, but when the number of nodes increases to 2, STAR-CCM+ stops responding. I guess something is wrong with the MPI setup.
I'm not sure whether Slurm can log into a node without a password. Of course, logging in to the HPC login node requires a password.

Old   April 20, 2020, 15:46
Default
  #13
New Member
 
Alex
Join Date: Feb 2020
Posts: 1
I have been running into the same issue that has been discussed here. The only solution I've been able to find is resubmitting the job, which occasionally fixes it. I would love to hear, though, if there's a more robust solution.
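
If resubmitting is the only workaround, a hung start can at least be detected automatically rather than holding the allocation. This is a rough sketch and not a fix for the underlying problem: the function name run_with_startup_watchdog is hypothetical, and it assumes a hang is recognizable by the log still ending with a marker line after a grace period. Killing the launcher may not clean up processes already started on other nodes.

```shell
#!/bin/bash
# Hypothetical watchdog: run a command with its output in a log file and
# kill it if, after a grace period, the log still ends with a given marker
# line (for example "Starting STAR-CCM+ parallel server").
run_with_startup_watchdog() {
    local log=$1 marker=$2 grace=$3
    shift 3
    "$@" > "$log" 2>&1 &
    local pid=$! waited=0
    # Poll once per second until the grace period ends or the command exits.
    while [ "$waited" -lt "$grace" ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null && [ "$(tail -n 1 "$log")" = "$marker" ]; then
        kill "$pid" 2>/dev/null
        wait "$pid" 2>/dev/null || true
        return 124   # same code the coreutils "timeout" command uses
    fi
    wait "$pid"
}
```

The arguments are the log file, the marker line, the grace period in seconds, and then the usual starccm+ command line; a return code of 124 could be used by the job script to decide whether to retry.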
