CFX parallel multi-node jobs fail w/ SLURM on Ubuntu 10.04
I help maintain a cluster on which we've just installed ANSYS 14.0 for one of our users. We're running Ubuntu 10.04 LTS with SLURM for job control.
The installation reported a couple of 'unexpected operator' errors. Focusing on CFX first: despite the install errors, we've successfully run the pre-processor, solver and post-processor interactively, as well as single-node parallel solver jobs through SLURM, so we're mostly there.
The problem is running parallel jobs across multiple nodes via SLURM. When we submit a multi-node job to SLURM, the job fails, complaining that it cannot connect via RSH.
This isn't surprising, since we don't use RSH (it's too insecure). So the question is: does anyone know how best to set up CFX parallel so that it plays nicely with SSH and SLURM?
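To make the question concrete, here is a rough, untested sketch of the glue I imagine we'd need: turning the node names SLURM allocates into a host list for the solver. It assumes the hostnames come from `scontrol show hostnames "$SLURM_JOB_NODELIST"` (one name per line) and that CFX accepts a comma-separated list of the form `host*count`. That list format, and how to point CFX at ssh instead of rsh, are assumptions that need checking against the CFX documentation. The function names are my own.

```python
def parse_hostnames(scontrol_output):
    """Split the one-name-per-line output of `scontrol show hostnames`
    into a list, ignoring blank lines."""
    return [line.strip() for line in scontrol_output.splitlines() if line.strip()]


def build_host_list(hostnames, procs_per_node):
    """Build a 'host*count,host*count' string.

    The 'host*count' format is an assumption about what the CFX solver
    expects for a distributed run; verify it before relying on it.
    """
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")
    return ",".join(f"{host}*{procs_per_node}" for host in hostnames)


# Example with made-up node names:
nodes = parse_hostnames("node01\nnode02\n")
host_list = build_host_list(nodes, 4)
print(host_list)
```

The resulting string would then be handed to the solver launch inside the SLURM batch script, with the remote shell switched to ssh by whatever mechanism CFX provides.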
As a side note, CFX seems to scale terribly when running in parallel, at least in terms of CPU utilization. As an example, our nodes each have 4 CPUs with 12 cores each. When running CFX with 4 processes on one node, each process only utilizes 25% of the core it's running on. At 8 processes it's only 8% per core, and at 48 processes it's reported down at 1 or 2% per process! This is with no other processes competing for CPU cycles. Can anyone comment on their experience with how well the CFX parallel solver scales? Is this perhaps a configuration issue on our side? If this is the best CFX can do, then there's no point in worrying about getting multi-node parallel computing to work...
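A quick sanity check on those figures: multiplying the process count by the per-process utilization gives the total CPU in use, measured in cores. The sketch below uses only the numbers quoted above (taking 2% for the 48-process case). The total stays at or below roughly one core's worth however many processes we start, which would be consistent with all the processes sharing a single core rather than each getting its own. That interpretation is a guess on my part, not something we've confirmed.

```python
def cores_in_use(num_processes, percent_per_process):
    """Total CPU used by all solver processes, expressed in whole cores."""
    return num_processes * percent_per_process / 100.0


# (process count, reported % per process) as quoted in the post
observations = [(4, 25.0), (8, 8.0), (48, 2.0)]

for count, percent in observations:
    total = cores_in_use(count, percent)
    print(f"{count:2d} processes at {percent:4.1f}% each = {total:.2f} cores in use")
```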