HPC issues--Starting STAR-CCM+ parallel server |
|
May 5, 2018, 01:06 |
HPC issues--Starting STAR-CCM+ parallel server
|
#1 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Hello everyone!
I'm trying to run STAR-CCM+ under Slurm. On one node the software works correctly, but when I use 2 nodes the output only shows "Starting STAR-CCM+ parallel server" and the job seems to hang, without any error message. Does anyone know a solution? Is there any way to get more error output? Many thanks! Here is my slurm file: Quote:
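Since the slurm file itself did not survive in the post, here is a minimal sketch of what a two-node STAR-CCM+ submission script typically looks like. Everything in it (task counts, the case name mycase.sim, the log name) is an assumption for illustration, not taken from the thread:

```shell
#!/bin/bash
#SBATCH --job-name=starccm
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16          # assumed core count; match your hardware
#SBATCH -t 01:00:00                   # walltime limit

# One hostname per allocated node, in the format -machinefile expects
scontrol show hostname "$SLURM_JOB_NODELIST" > machinefile

# Launch the parallel server; mycase.sim and the log name are placeholders
starccm+ -power -batch run \
         -mpi intel -machinefile machinefile \
         -np "$SLURM_NTASKS" -rsh ssh \
         mycase.sim > "$SLURM_JOBID.star.log" 2>&1
```

This is a cluster job-script fragment, so it only runs under a real Slurm allocation.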
|
||
May 5, 2018, 01:09 |
|
#2 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Here is the output when I scancel the slurm job:
Code:
Starting local server: /bin/star12/12.02.010/STAR-CCM+12.02.010/star/bin/starccm+ -power -mpi intel -machinefile machinefile -np 2 -cpubind -server -rsh ssh
Starting STAR-CCM+ parallel server
[mpiexec@node092] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
|
May 5, 2018, 08:49 |
|
#3 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Hi Colin,
could you try to redirect the output of starccm into a separate log file and report back? Such as: starccm+ -power -batch-report ....... > $SLURM_JOBID.star.log 2>&1 Btw, you might also need the argument -new if you don't open a .sim file. Best regards, Sebastian |
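The redirection pattern itself can be checked with a stand-in command: `> file 2>&1` routes both stdout and stderr into the same log, which is where otherwise-lost MPI errors would land. The job ID and messages below are invented for the demonstration:

```shell
JOBID=12345   # stand-in for $SLURM_JOBID
# stdout goes to the log; 2>&1 then sends stderr to the same place
{ echo "Starting STAR-CCM+ parallel server"
  echo "write error (Bad file descriptor)" >&2
} > "$JOBID.star.log" 2>&1
cat "$JOBID.star.log"
```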
|
May 5, 2018, 21:35 |
|
#4 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
However, I tried it and nothing changed. The output file still only contains "Starting STAR-CCM+ parallel server", with no further error messages |
||
May 6, 2018, 00:03 |
|
#5 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
I added -v to the start command and got much more detail, but it seems that no error occurs. It's driving me mad
|
|
May 6, 2018, 03:52 |
|
#6 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Ok, I don't see any error either.
# What is the default time limit of a job in your Slurm setup? There is no -t line, such as: #SBATCH -t 001:00:00
# I am unsure whether this line works as intended: scontrol show hostname $SLURM_JOB_NODELIST > machinefile. On my cluster we use something like this: srun /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE
# Are you sure you have to set the Intel MPI? Does starccm not select the right (InfiniBand) network by itself? |
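The difference between the two machinefile recipes is easy to check offline: `scontrol show hostname` already prints each node once, while `srun /bin/hostname -s` prints one line per task, so `sort -u` is what collapses the repeats. A small simulation with made-up node names:

```shell
# Simulate what `srun /bin/hostname -s` prints for a 2-node job with two
# tasks per node: one line per task, so each node name appears twice
printf 'node092\nnode093\nnode092\nnode093\n' > srun_output.txt

# Collapse to one sorted, unique hostname per line, as a machinefile expects
sort -u srun_output.txt > machinefile
cat machinefile
```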
|
May 6, 2018, 03:58 |
|
#7 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
One more addition:
I remember an obscure issue with a colleague's account which we could only resolve by deleting the STAR-CCM+ configuration in his user account. Run ls -a in your home directory and delete the corresponding (or any) .star.... folder. It will be recreated cleanly on the next STAR-CCM+ session. |
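A safer variant of that cleanup is to move the settings folders aside instead of deleting them outright. The sketch below works on a made-up demo directory (demo_home and the folder name are assumptions) so it is harmless to try; substitute your real home directory and .star folder names:

```shell
HOMEDIR=demo_home                      # stand-in for $HOME
mkdir -p "$HOMEDIR/.starccm-12.02"     # pretend this is the stale settings folder

# Move every .star* folder aside; STAR-CCM+ recreates fresh ones on next start
for d in "$HOMEDIR"/.star*; do
    mv "$d" "$d.bak"
done
ls -a "$HOMEDIR"
```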
|
May 6, 2018, 06:50 |
|
#8 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
There is no -t line because the batch job shuts down automatically when the simulation completes. Code:
scontrol show hostname $SLURM_JOB_NODELIST > machinefile
srun /bin/hostname -s | sort -u > $WORKDIR/$MACHINEFILE
If I don't set the MPI, star uses IBM MPI and an error occurs like "Host key verification failed". I give up |
||
May 6, 2018, 07:48 |
|
#9 | |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Quote:
He had the issue that star was starting flawlessly on our HPC with the -server and -macro options, but star stopped responding after the "Starting STAR-CCM+ parallel server" line in -batch mode. As a last attempt, deleting the .star directories in his home resolved the issue, but we never found the cause. I assumed there may have been some errors in the saved settings that made star fail to start, so we deleted them. |
||
May 6, 2018, 09:42 |
|
#10 | |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Quote:
Anyway, I really appreciate your help. |
||
May 12, 2018, 17:35 |
|
#11 |
Senior Member
Sebastian Engel
Join Date: Jun 2011
Location: Germany
Posts: 566
Rep Power: 20 |
Hi sky,
could you resolve the issue? I guess there must be one obvious step, which I don't remember, that might solve it. One tripwire we have in our cluster: we have to set up an SSH key on the HPC for the HPC itself. Only that way can software communicate over SSH when Slurm does not provide the MPI environment. Can you log into an HPC node without a password? Best regards, Sebastian |
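For reference, a minimal sketch of intra-cluster passwordless SSH setup, assuming a shared home filesystem so every node sees the same authorized_keys. It runs against a demo directory (demo_ssh is made up) so it is safe to try here; on a real cluster you would use $HOME/.ssh, and node092 is just the node name from the earlier error message:

```shell
SSHDIR=demo_ssh                         # stand-in for $HOME/.ssh
mkdir -p "$SSHDIR" && chmod 700 "$SSHDIR"

# No passphrase (-N ""), so MPI/starccm can log in non-interactively
ssh-keygen -q -t rsa -N "" -f "$SSHDIR/id_rsa"

# Authorize the key for logins between nodes (shared home assumed)
cat "$SSHDIR/id_rsa.pub" >> "$SSHDIR/authorized_keys"
chmod 600 "$SSHDIR/authorized_keys"

# Note: "Host key verification failed" points at known_hosts, not the key
# pair; pre-populating it avoids the interactive prompt, e.g.:
#   ssh-keyscan node092 >> "$SSHDIR/known_hosts"

echo "key installed: $(test -s "$SSHDIR/authorized_keys" && echo yes)"
```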
|
May 13, 2018, 10:22 |
|
#12 |
New Member
Colin
Join Date: Mar 2018
Posts: 12
Rep Power: 8 |
Thanks for your reply!
I believe there must be a way to solve the issue, but maybe not now. One thing that confuses me is that I can run the simulation successfully on one node, but when the number of nodes increases to 2, star stops responding. I guess something is wrong with the MPI. I'm not sure whether slurm can log into a node without a password. Of course, if I log in to the HPC login node, a password is required. |
|
April 20, 2020, 15:46 |
|
#13 |
New Member
Alex
Join Date: Feb 2020
Posts: 1
Rep Power: 0 |
I have been running across the same issue discussed here. The only solution I've found is re-submitting the job, which occasionally fixes it. I would love to hear, though, if there's a more robust solution.
|
|
|
|