problem of running parallel Fluent on linux cluster
the case runs fine if I require several processors on the SAME node, but if the processors are on different nodes, I have the "Connection refused" problem.
I search online and see that some people have the similar problem, but I can not find a solution to this specific problem. the output from Fluent and the submission script are attached below.
Thanks in advance!
OUTPUT FROM FLUENT
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 -pib -cnf=/var/spool/PBS/aux//3666504.cmgr01 -g 2ddp -t6 -i test2.jou
/opt/hpc/Fluent.Inc/fluent6.3.26/cortex/lnamd64/cortex.3.7.3 -f fluent -g -i test2.jou (fluent "2ddp -pib -host -r6.3.26 -t6 -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -path/opt/hpc/Fluent.Inc")
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 2ddp -pib -host -t6 -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -path/opt/hpc/Fluent.Inc -cx scw-029.i:59263:37434
Starting /opt/hpc/Fluent.Inc/fluent6.3.26/lnamd64/2ddp_host/fluent.6.3.26 host -cx scw-029.i:59263:37434 "(list (rpsetvar (QUOTE parallel/function) "fluent 2ddp -node -r6.3.26 -t6 -pib -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 ") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "6") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 0) (rpsetvar (QUOTE parallel/path) "/opt/hpc/Fluent.Inc") (rpsetvar (QUOTE parallel/hostsfile) "/var/spool/PBS/aux//3666504.cmgr01") )"
Welcome to Fluent 6.3.26
Copyright 2006 Fluent Inc.
All Rights Reserved
Host spawning Node 0 on machine "scw-029" (unix).
/opt/hpc/Fluent.Inc/fluent6.3.26/bin/fluent -r6.3.26 2ddp -node -t6 -pib -mpi=hp -cnf=/var/spool/PBS/aux//3666504.cmgr01 -mport 192.168.2.129:192.168.2.129:34193:0
Starting /opt/hpc/Fluent.Inc/fluent6.3.26/multiport/mpi/lnamd64/hp/bin/mpirun -prot -vapi -e MPI_HASIC_VAPI=1 -e MPI_USE_MALLOPT_SBRK_PROTECTION=1 -e MPI_USE_MALLOPT_AVOID_MMAP=1 -f /tmp/fluent-appfile.32087
192.168.2.135: Connection refused
mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
#PBS -j oe
#PBS -l nodes=2:ppn=3
#PBS -q main
#PBS -l walltime=00:10:00
#Set variables for script
# What version of the solver to use
#HOW MANY CPUS- note that you'll still need to update the $PBS -l nodes line
#Which input journal file to use to give fluent?
#Where do we want to put output at?
# Run Fluent with:
# -pib use Infiniband parallel
# -cnf=$PBS_NODEFILE get the list of machines PBS is running on from the server
# -t$CPUCOUNT use $CPUCOUNT CPUs total
# -g no graphics, batch mode
# -i read the file in $INPUT
# > $OUTPUT 2>&1 Redirect program output to a file in your home directory.
fluent $FLUENTSOLVER -t$CPUCOUNT -pib cnf=$PBS_NODEFILE -g -i $INPUT > $OUTPUT 2>&1
Are you able to connect to your nodes with rsh or ssh?
yes, I have access to all nodes using SSH
passwords aren't required?
I log into the cluster before submitting the job. so there should be no problem with password.
If guys from Fluent.INC see this thread, can you take a look?
I am not an linux expert / fluent-parallel , but you said that you are using SSH.
SSH needs password.
if you try a command on a node, what is the result?
> ssh ip-address ls
Hi, mAx, There is no problem. the output is similar to the following
bushivan@scw-097:~>ssh 220.127.116.11 ls
Is there an administrator under your Cluster or did you set it yourself?
The cluster is operated by the HPC center. I just submit my job there. I actually talked to people of HPC, and they said something like the MPI used by the clusters is not compatible with Fluent, but I am not sure if they are right. So I just post my problem here and see if anyone has encountered the same problem.
Then they are the right people to solve your problem.
But it may be sad not being able to use fluent's parallel enhancement with this cluster
i know this is late, but you have to give the '-ssh' in the fluent command line in the submission file. that forces fluent to use ssh rather than rsh which it always goes to by default
|All times are GMT -4. The time now is 12:48.|