CFD Online Discussion Forums


ivanbuz July 22, 2009 16:33

problem of running parallel Fluent on linux cluster
 
The case runs fine if I request several processors on the SAME node, but if the processors are on different nodes, I get a "Connection refused" error.

I searched online and found that some people have had a similar problem, but I cannot find a solution to this specific one. The output from Fluent and the submission script are attached below.

Thanks in advance!


OUTPUT FROM FLUENT
-----------------------------------------
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 -pib -cnf=/var/spool/PBS/aux//3666504.cmgr01 -g 2ddp -t6 -i test2.jou
/opt/hpc/Fluent.Inc/fluent6.3.26/cortex/lnamd64/cortex.3.7.3 -f fluent -g -i test2.jou (fluent "2ddp -pib -host -r6.3.26 -t6 -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -path/opt/hpc/Fluent.Inc")
Loading "/opt/hpc/Fluent.Inc/fluent6.3.26/lib/fluent.dmp.114-64"
Done.
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 2ddp -pib -host -t6 -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -path/opt/hpc/Fluent.Inc -cx scw-029.i:59263:37434
Starting /opt/hpc/Fluent.Inc/fluent6.3.26/lnamd64/2ddp_host/fluent.6.3.26 host -cx scw-029.i:59263:37434 "(list (rpsetvar (QUOTE parallel/function) "fluent 2ddp -node -r6.3.26 -t6 -pib -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 ") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "6") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 0) (rpsetvar (QUOTE parallel/path) "/opt/hpc/Fluent.Inc") (rpsetvar (QUOTE parallel/hostsfile) "/var/spool/PBS/aux//3666504.cmgr01") )"
Welcome to Fluent 6.3.26
Copyright 2006 Fluent Inc.
All Rights Reserved
Loading "/opt/hpc/Fluent.Inc/fluent6.3.26/lib/flprim.dmp.1119-64"
Done.

Host spawning Node 0 on machine "scw-029" (unix).
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 2ddp -node -t6 -pib -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -mport 192.168.2.129:192.168.2.129:34193:0
Starting /opt/hpc/Fluent.Inc/fluent6.3.26/multiport/mpi/lnamd64/hp/bin/mpirun -prot -vapi -e MPI_HASIC_VAPI=1 -e MPI_USE_MALLOPT_SBRK_PROTECTION=1 -e MPI_USE_MALLOPT_AVOID_MMAP=1 -f /tmp/fluent-appfile.32087
192.168.2.135: Connection refused
mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.





SUBMISSION SCRIPT
-----------------------------------------
#!/bin/sh
#PBS -j oe
#PBS -l nodes=2:ppn=3
#PBS -q main
#PBS -l walltime=00:10:00
cd ${PBS_O_WORKDIR}
cat ${PBS_NODEFILE}
#Set variables for script
# What version of the solver to use
FLUENTSOLVER=2ddp
#HOW MANY CPUS - note that you'll still need to update the #PBS -l nodes line
CPUCOUNT=6
#Which input journal file to give fluent?
#INPUT=${PBS_O_WORKDIR}/${PBS_JOBNAME}
INPUT=test2.jou
#Where do we want to put the output?
OUTPUT=${PBS_O_WORKDIR}/${PBS_JOBID}.out

# Run Fluent with:
# -pib use Infiniband parallel
# -cnf=$PBS_NODEFILE get the list of machines PBS is running on from the server
# -t$CPUCOUNT use $CPUCOUNT CPUs total
# -g no graphics, batch mode
# -i read the file in $INPUT
# > $OUTPUT 2>&1 Redirect stdout and stderr to $OUTPUT in the submission directory.
fluent $FLUENTSOLVER -t$CPUCOUNT -pib -cnf=$PBS_NODEFILE -g -i $INPUT > $OUTPUT 2>&1
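Unrelated to the connection error, but since the script notes that CPUCOUNT has to be kept in sync with the #PBS -l nodes line by hand, here is a minimal sketch that reads the count from the node file instead. It assumes the usual PBS/Torque layout in which $PBS_NODEFILE has one line per allocated processor slot, so nodes=2:ppn=3 gives six lines on two distinct hosts. The helper names count_slots and count_hosts are hypothetical.

```shell
#!/bin/sh
# Sketch: derive the Fluent -t value from the PBS node file instead of
# hard-coding CPUCOUNT. Assumes one line per processor slot in the file.

# Print the number of processor slots (lines) in a node file.
count_slots() {
    # tr strips the leading padding that some wc implementations print
    wc -l < "$1" | tr -d ' '
}

# Print the number of distinct hosts in a node file.
count_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Only runs inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    CPUCOUNT=$(count_slots "$PBS_NODEFILE")
    echo "PBS allocated $CPUCOUNT slots on $(count_hosts "$PBS_NODEFILE") host(s)"
    # then pass -t$CPUCOUNT to fluent as before
fi
```

With this in place only the #PBS -l nodes line needs editing when the processor count changes.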

-mAx- July 23, 2009 01:32

Are you able to connect to your nodes with rsh or ssh?

ivanbuz July 23, 2009 03:15

Yes, I have access to all nodes using SSH.

-mAx- July 23, 2009 03:44

Passwords aren't required?

ivanbuz July 23, 2009 14:07

I log into the cluster before submitting the job, so there should be no problem with passwords.

ivanbuz July 23, 2009 14:10

If anyone from Fluent Inc. sees this thread, could you take a look?

-mAx- July 24, 2009 01:19

I am not a Linux or parallel-Fluent expert, but you said that you are using SSH.
SSH asks for a password unless key-based authentication is set up.
If you try a command on a node, what is the result?
> ssh ip-address ls
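A more thorough version of this check, as a sketch: run it from inside a batch job so that it tests every host PBS actually allocated, and make ssh fail instead of prompting. BatchMode and ConnectTimeout are standard OpenSSH client options; the function name check_ssh_hosts is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that every allocated host accepts a non-interactive ssh
# login, which is what an MPI launcher needs to start remote processes.
# BatchMode=yes makes ssh fail rather than ask for a password;
# ConnectTimeout keeps an unreachable node from hanging the loop.

check_ssh_hosts() {
    nodefile=$1
    failed=0
    for host in $(sort -u "$nodefile"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "ok:     $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return $failed
}

# Only runs inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    check_ssh_hosts "$PBS_NODEFILE"
fi
```

Any host reported as FAILED would need passwordless (key-based) ssh set up before processes can be started there from inside the job; logging in interactively beforehand does not cover that case.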

ivanbuz July 24, 2009 15:10

Hi mAx, there is no problem. The output is similar to the following:

bushivan@scw-097:~>ssh 192.166.2.195 ls
airfoil
airfoil_ins_eig
default_id.dbs
...

-mAx- July 24, 2009 16:35

Is there an administrator for your cluster, or did you set it up yourself?

ivanbuz July 24, 2009 17:43

The cluster is operated by the HPC center; I just submit my jobs there. I actually talked to the HPC staff, and they said something to the effect that the MPI used by the cluster is not compatible with Fluent, but I am not sure they are right. So I am posting my problem here to see if anyone has encountered the same problem.

-mAx- July 25, 2009 02:15

Then they are the right people to solve your problem.
But it would be a pity not to be able to use Fluent's parallel capabilities on this cluster.

ibnkureshi March 10, 2010 16:13

I know this is late, but you have to add the '-ssh' option to the fluent command line in the submission script. That forces Fluent to use ssh rather than rsh, which it uses by default.
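For the submission script in the first post, the fluent line would then look something like this. It is a sketch that keeps the other options unchanged, quotes the variables, and writes -cnf with its leading dash (the posted script had cnf= without it, although the Fluent log shows -cnf was passed).

```shell
# -ssh: start the remote Fluent processes over ssh instead of rsh
fluent "$FLUENTSOLVER" -t"$CPUCOUNT" -pib -ssh -cnf="$PBS_NODEFILE" -g -i "$INPUT" > "$OUTPUT" 2>&1
```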

