CFD Online Discussion Forums


aaguirre May 6, 2013 16:19

Remote cluster parallel solve without master, Ansys CFX 14.5
 
Hi.

I'm currently trying to configure Ansys CFX 14.5 to run on a Linux cluster (Rocks 6.1). I've completed the whole installation process, including setting the environment variables. Communication via ssh is working, and I'm using Platform MPI. I have configured the hostinfo.ccl file. I'm even able to run in distributed parallel mode using this syntax:

cfx5solve -def input.def -start-method "Platform MPI Distributed Parallel" -par-dist "master,node01*2,node02*2"
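The host list given to -par-dist can also be generated instead of typed by hand. This is only a minimal sketch, assuming the same number of partitions on every host; the function name build_par_dist is made up for illustration and is not part of CFX:

```shell
#!/bin/sh
# build_par_dist is a hypothetical helper, not a CFX command.
# Usage: build_par_dist PARTITIONS_PER_HOST HOST [HOST...]
# Prints a comma-separated list in the "host*N" form used above.
build_par_dist() {
    count=$1
    shift
    list=""
    for host in "$@"; do
        # Append "host*count", with a comma between entries.
        list="${list:+$list,}${host}*${count}"
    done
    printf '%s\n' "$list"
}

# Example host names taken from the command above.
build_par_dist 2 node01 node02
```

The printed string can then be passed as the argument of -par-dist.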

The problem is that I'm not allowed to run on the master node, because the cluster belongs to the university I work for.

I've tried to cheat CFX using, for example, "master*0" or by removing master from the list, but the program fails with the following message:

Unable to find the master host cluster.domain.edu in the host list: at least one partition must be assigned to the master host.

I've also tried launching the run from node01, but I got something like this:

MPI Application rank 0 exited before MPI_Finalize() with status 2
An error has occurred in cfx5remote on compute-2-0.local:

/share/apps/ansys_inc/v145/CFX/bin/linux-amd64/solver-pcmpi.exe was
interrupted by signal TERM (15)

An error has occurred in cfx5remote on compute-2-0.local:

/share/apps/ansys_inc/v145/CFX/bin/linux-amd64/solver-pcmpi.exe was
interrupted by signal TERM (15)

An error has occurred in cfx5remote on compute-2-0.local:

/share/apps/ansys_inc/v145/CFX/bin/linux-amd64/solver-pcmpi.exe was
interrupted by signal TERM (15)

An error has occurred in cfx5remote on compute-2-1.local:

/share/apps/ansys_inc/v145/CFX/bin/linux-amd64/solver-pcmpi.exe was
interrupted by signal TERM (15)

An error has occurred in cfx5remote on compute-2-1.local:

/share/apps/ansys_inc/v145/CFX/bin/linux-amd64/solver-pcmpi.exe was
interrupted by signal TERM (15)

An error has occurred in cfx5solve:

The ANSYS CFX solver could not be started, or exited with return code 255.
No results file has been created.


Running on at least one processor of the master is our last option. Users are allowed to log in to the master node and launch programs from it, but are not allowed to use the master's processors.

I also configured APDL, and I'm able to do something similar to the above using this syntax:

ansys145 -dis -b -machines compute-2-0:2:compute-2-1:2 < input.dat > output.out

Is it possible to do something similar with CFX?
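For comparison, the APDL -machines list and the CFX -par-dist list carry the same information (host name plus core count) in different notations. A rough sketch of the conversion follows; the helper name machines_to_par_dist is hypothetical, and it assumes the hosts are known to CFX under the same names:

```shell
#!/bin/sh
# machines_to_par_dist is a hypothetical helper for illustration only.
# It converts "hostA:2:hostB:2" (the -machines form above)
# into "hostA*2,hostB*2" (the -par-dist form).
machines_to_par_dist() {
    printf '%s\n' "$1" | awk -F: '{
        out = ""
        # Fields come in pairs: host name, then core count.
        for (i = 1; i + 1 <= NF; i += 2)
            out = out (out == "" ? "" : ",") $i "*" $(i + 1)
        print out
    }'
}

machines_to_par_dist "compute-2-0:2:compute-2-1:2"
```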

Regards,


A. Aguirre.

flomer August 27, 2013 06:34

Hello!

I just installed Rocks 6.1 on a small cluster to run Ansys CFX, and I am having the same problem: how to set up parallel runs without the head node.

Did you ever find a solution to this?

Best regards,

John

brunoc August 28, 2013 12:55

CFX requires that the computer you're logged in to be part of the simulation. There is a way to do what you want, called indirect start, but it involves editing some of the files from the CFX setup ('CFX/etc/start-methods.ccl') plus writing some scripts. It can be done, but it's a hassle, so skip it.

Instead, just send the solver command through SSH to one of the nodes that belong to the simulation. You'll need an additional option (-chdir) directing CFX to run the solver in a specified directory, though, or else it'll run in your home directory. Your command line will be something like this:

Code:

ssh node01 cfx5solve -def input.def -chdir /path/to/deffile -start-method \"Platform MPI Distributed Parallel\" -par-dist \"node01*2,node02*2\" -batch <other_options>
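Notice the '\' in front of the quotation marks: it stops your local shell from removing them, so the quoted arguments reach the remote shell intact.

That works fine (I use it here), as long as you've got SSH configured not to ask for passwords (which you probably already have).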
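If you launch runs often, the same idea can be wrapped in a small script. The following is only a sketch under a few assumptions: the function names, host names, path and .def file are placeholders, paths must not contain spaces, and DRY_RUN defaults to 1 so that the ssh command is printed rather than executed. The BatchMode=yes option makes ssh fail instead of prompting, which is a quick way to check that passwordless login works.

```shell
#!/bin/sh
# Hypothetical wrapper around the ssh launch described above.
# With DRY_RUN=1 (the default here) the command is only printed.
DRY_RUN=${DRY_RUN:-1}

# Succeeds only if key-based login works without a password prompt.
can_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Build the command line that the remote shell will parse.
# Arguments: working directory, .def file, -par-dist host list.
remote_command() {
    workdir=$1
    deffile=$2
    hosts=$3
    printf 'cfx5solve -def %s -chdir %s -start-method "Platform MPI Distributed Parallel" -par-dist "%s" -batch' \
        "$deffile" "$workdir" "$hosts"
}

# Arguments: node to launch from, then the arguments of remote_command.
launch() {
    first=$1
    shift
    cmd=$(remote_command "$@")
    if [ "$DRY_RUN" = 1 ]; then
        printf 'ssh %s %s\n' "$first" "$cmd"
    else
        can_login "$first" || { echo "no passwordless login to $first" >&2; return 1; }
        # The whole command is passed as one string, so the embedded
        # double quotes are interpreted by the remote shell.
        ssh "$first" "$cmd"
    fi
}

launch node01 /path/to/deffile input.def 'node01*2,node02*2'
```

Passing the command as a single quoted string has the same effect as escaping each quotation mark with '\' on the command line.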

Cheers

flomer August 29, 2013 01:34

Thanks, that worked!
 
Hello, Bruno!

The method you suggested works very well. Thank you!

On our setup the head node is the only node that can see the license server, so with the above method we manage to get licenses and then run the parallel computation on the nodes only. This is exactly what I was looking for. Great!

We want to avoid using the head node in the calculations because this node is presumably slower than the compute nodes (head has one E5-2620 @ 2.0 GHz versus dual E5-2643 @ 3.3 GHz on the nodes). We have not been able to do much testing yet, but we suspect that using cores from the head node in the calculations might slow things down. We have set the Relative Speed settings in hostinfo.ccl, but still think it is wise to avoid using the slower cores.

Do you know if this actually makes a difference?

Anyway, not using any cores on the head node gives it more resources to keep Ganglia running smoothly and keep the disk system happy :).

Speaking of Ganglia, do you know a way to get it to keep the head node out of the reporting and resource use displays? The head node is primarily there to help us with interfacing and utilizing the cluster, so I think it is a bit odd that by default it includes its own CPU cores in the statistics and reporting. Not a huge problem, but...

Now I just need to get the InfiniBand network up to speed...

Best regards,

John

Felipe Mendes January 27, 2015 10:15

Dear friends,
I have a similar problem.
We've got a cluster with 3 nodes: 1 master node with 12 procs and 2 computing nodes with 16 procs.
I've tried to launch CFX only on the computing nodes using the syntax posted earlier in this thread, but it doesn't find the license.
If I put at least one proc on the master node, it runs fine.
The license server is the master node.
In the ".out" file, the computing nodes seem to find the license server, but they search for an acfd_cfx license for the solver (which they do not find) instead of the correct name "acfx_cfx_solver".
Does anyone know what may be the problem?
Thanks!


All times are GMT -4.