HPC issues--Starting STAR-CCM+ parallel server |
|
May 5, 2018, 01:06 |
HPC issues--Starting STAR-CCM+ parallel server
|
#1 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Hello everyone!
I'm trying to run STAR-CCM+ under Slurm. On one node the software works correctly, but when I use 2 nodes the output only shows "Starting STAR-CCM+ parallel server" and the job seems to hang, without any error message. Does anyone know a solution? Is there any way to get more error output? Many thanks! Here is my slurm file: Quote:
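Since the slurm file itself did not survive in the post, here is a minimal sketch of what a two-node STAR-CCM+ submission script typically looks like. Everything in it (task counts, the case name mycase.sim, the log name) is an assumption for illustration, not taken from the thread:

```shell
#!/bin/bash
#SBATCH --job-name=starccm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16          # assumed core count; match your hardware
#SBATCH -t 01:00:00                   # walltime limit

# One hostname per allocated node, in the format -machinefile expects
scontrol show hostname "$SLURM_JOB_NODELIST" > machinefile

# Launch the parallel server; mycase.sim and the log name are placeholders
starccm+ -power -batch run \
         -mpi intel -machinefile machinefile \
         -np "$SLURM_NTASKS" -rsh ssh \
         mycase.sim > "$SLURM_JOBID.star.log" 2>&1
```

This is a cluster job-script fragment, so it only runs under a real Slurm allocation.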
|
||
May 5, 2018, 01:09 |
|
#2 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Here is the output when I scancel the slurm job:
Code:
Starting local server: /bin/star12/12.02.010/STAR-CCM+12.02.010/star/bin/starccm+ -power -mpi intel -machinefile machinefile -np 2 -cpubind -server -rsh ssh
Starting STAR-CCM+ parallel server
[mpiexec@node092] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
|
May 5, 2018, 08:49 |
|
#3 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Hi Colin,
could you try to redirect the output of starccm into a separate log file and report back? Such as: starccm+ -power -batch-report ....... > $SLURM_JOBID.star.log 2>&1 Btw, you might also need the argument -new if you don't open a .sim file. Best regards, Sebastian |
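The redirection pattern itself can be checked with a stand-in command: `> file 2>&1` routes both stdout and stderr into the same log, which is where otherwise-lost MPI errors would land. The job ID and messages below are invented for the demonstration:

```shell
JOBID=12345   # stand-in for $SLURM_JOBID
# stdout goes to the log; 2>&1 then sends stderr to the same place
{ echo "Starting STAR-CCM+ parallel server"
  echo "write error (Bad file descriptor)" >&2
} > "$JOBID.star.log" 2>&1
cat "$JOBID.star.log"
```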
|
May 5, 2018, 21:35 |
|
#4 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
However, I tried it and nothing changed. The output file still only contains "Starting STAR-CCM+ parallel server", with no further error messages |
||
May 6, 2018, 00:03 |
|
#5 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
I added -v to the start command and got much more detail, but it seems that no error occurs. It's driving me mad
|
|
May 6, 2018, 03:52 |
|
#6 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Ok, I don't see any error either.
# What is the default time limit of a job in your Slurm setup? There is no -t line, such as: #SBATCH -t 001:00:00
# I am unsure whether this line works as intended: scontrol show hostname $SLURM_JOB_NODELIST > machinefile. On my cluster we use something like this: srun /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE
# Are you sure you have to set the Intel MPI? Does starccm not select the right (InfiniBand) network by itself? |
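The difference between the two machinefile recipes is easy to check offline: `scontrol show hostname` already prints each node once, while `srun /bin/hostname -s` prints one line per task, so `sort -u` is what collapses the repeats. A small simulation with made-up node names:

```shell
# Simulate what `srun /bin/hostname -s` prints for a 2-node job with two
# tasks per node: one line per task, so each node name appears twice
printf 'node092\nnode093\nnode092\nnode093\n' > srun_output.txt

# Collapse to one sorted, unique hostname per line, as a machinefile expects
sort -u srun_output.txt > machinefile
cat machinefile
```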
|
May 6, 2018, 03:58 |
|
#7 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
One more addition:
I remember an obscure issue with a colleague's account which we could only resolve by deleting the STAR-CCM+ configuration in his user account. Run ls -a in your home directory and delete the corresponding (or any) .star.... folder. It will be recreated cleanly on the next STAR-CCM+ session. |
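A safer variant of that cleanup is to move the settings folders aside instead of deleting them outright. The sketch below works on a made-up demo directory (demo_home and the folder name are assumptions) so it is harmless to try; substitute your real home directory and .star folder names:

```shell
HOMEDIR=demo_home                      # stand-in for $HOME
mkdir -p "$HOMEDIR/.starccm-12.02"     # pretend this is the stale settings folder

# Move every .star* folder aside; STAR-CCM+ recreates fresh ones on next start
for d in "$HOMEDIR"/.star*; do
    mv "$d" "$d.bak"
done
ls -a "$HOMEDIR"
```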
|
May 6, 2018, 06:50 |
|
#8 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
There is no -t line because the batch job shuts down automatically when the simulation completes. Code:
scontrol show hostname $SLURM_JOB_NODELIST > machinefile
srun /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE
If I don't set the MPI, star uses IBM MPI and an error occurs like "Host key verification failed". I give up |
||
May 6, 2018, 07:48 |
|
#9 | |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Quote:
He had the issue that star was starting flawlessly on our HPC with the -server and -macro options, but star stopped responding after the "Starting STAR-CCM+ parallel server" line in -batch mode. As a last attempt, deleting the .star directories in his home resolved the issue, but we never found the cause. I assumed there may have been some errors in the saved settings that made star fail to start, so we deleted them. |
||
May 6, 2018, 09:42 |
|
#10 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
Anyway, I really appreciate your help. |
||
May 12, 2018, 17:35 |
|
#11 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Hi sky,
could you resolve the issue? I guess there must be one obvious step, which I don't remember, that might solve it. One tripwire we have in our cluster: we have to set up an SSH key on the HPC for the HPC itself. Only that way can software communicate over SSH when Slurm does not provide the MPI environment. Can you log into an HPC node without a password? Best regards, Sebastian |
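For reference, a minimal sketch of intra-cluster passwordless SSH setup, assuming a shared home filesystem so every node sees the same authorized_keys. It runs against a demo directory (demo_ssh is made up) so it is safe to try here; on a real cluster you would use $HOME/.ssh, and node092 is just the node name from the earlier error message:

```shell
SSHDIR=demo_ssh                         # stand-in for $HOME/.ssh
mkdir -p "$SSHDIR" && chmod 700 "$SSHDIR"

# No passphrase (-N ""), so MPI/starccm can log in non-interactively
ssh-keygen -q -t rsa -N "" -f "$SSHDIR/id_rsa"

# Authorize the key for logins between nodes (shared home assumed)
cat "$SSHDIR/id_rsa.pub" >> "$SSHDIR/authorized_keys"
chmod 600 "$SSHDIR/authorized_keys"

# Note: "Host key verification failed" points at known_hosts, not the key
# pair; pre-populating it avoids the interactive prompt, e.g.:
#   ssh-keyscan node092 >> "$SSHDIR/known_hosts"

echo "key installed: $(test -s "$SSHDIR/authorized_keys" && echo yes)"
```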
|
May 13, 2018, 10:22 |
|
#12 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Thanks for your reply!
I believe there must be a way to solve the issue, but maybe not now. One thing that confuses me is that I can run the simulation successfully on one node, but when the number of nodes increases to 2, star stops responding. I guess something is wrong with the MPI. I'm not sure whether slurm can log into a node without a password. Of course, if I log in to the HPC login node, a password is required. |
|
April 20, 2020, 15:46 |
|
#13 |
New Member
Alex
Join Date: Feb 2020
Posts: 1
Rep Power: 0 |
I have been running across the same issue discussed here. The only solution I've found is re-submitting the job, which occasionally fixes it. I would love to hear, though, if there's a more robust solution.
|
|
|
|