
STAR-CCM+ on AWS Parallel Cluster not distributing workload across multiple nodes



October 27, 2020, 23:48   #1
Dave Wagoner (dwagoner), New Member
Summary
1. STAR-CCM+ jobs submitted will not run on more than one node in the cluster.
2. When jobs run on the first compute node (“local node”), they are significantly CPU-throttled. When the same jobs run on other nodes, they consume all available CPU as intended.


Detail
STAR-CCM+ runs on AWS ParallelCluster in batch mode. This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm. The implementation is documented here:
https://aws.amazon.com/blogs/compute/running-simcenter-star-ccm-on-aws/

The document omits some critical details that may take some investigation to work out. These undocumented details include:

  • FSx Lustre file system DNS references are not in the public DNS, as is the case for other EC2 resources like ALBs, NLBs, S3 buckets, EC2 instances, etc. Instead, resolution can be done only by the AWS-provided DNS service inside the VPC where the FSx Lustre file system is created. The pcluster utility uses a CloudFormation template and may fail if the DNS name of the FSx Lustre file system cannot be resolved so that it can be mounted on the master and all compute nodes. This resolution is served at the IP address x.x.x.2/32, where x.x.x are the first octets of the VPC CIDR block. The critical piece is that this IP address must be in the default DHCP options for the VPC in which the cluster is being created. The FSx Lustre DNS lookup can be done manually by specifying x.x.x.2 as the DNS server, but that manual step cannot be performed in the middle of the CloudFormation run that builds the cluster (see the resolver sketch after this list).
  • The master and compute nodes may not consistently mount the FSx Lustre partition, even if the IP address resolution is done correctly. The Lustre kernel modules do not always get loaded; installing the client modules that match the running kernel fixes this so that the FSx partition can be mounted (a verification sketch follows this list):
    apt update
    apt-get install -y lustre-client-modules-$(uname -r)
  • The master should have local entries for the compute nodes in /etc/hosts. Log in to each compute node, copy its relevant line from /etc/hosts, and add it to the master's /etc/hosts (see the sketch after this list). You may need to update the security group of the compute node to allow access to port 22 from the rest of your environment. You may also need to add ssh keys directly onto the compute nodes for the default non-root user (e.g. “ubuntu”, “centos”, or “ec2-user”).
  • The first compute node attempts to launch “remote” jobs on the other compute nodes, so the entries just added to the master's /etc/hosts should be copied to the compute nodes as well for consistency. Out of sheer paranoia, run “ssh compute-node-name uptime” from the first compute node to each of the others to ensure that the entries are correct, the host key has been accepted, and the ssh keys are present (see the sketch after this list). This ssh attempt should be done as the default non-root user, not as root.
  • Update the kernel parameter for ptrace: the default is “1” and it needs to be “0”. It can be set persistently in /etc/sysctl.d/10-ptrace.conf followed by a reboot; however, doing an init 6 on individual nodes causes pcluster to replace them rather than just reboot them. Either stop the cluster with pcluster stop cluster-name and start it again, or set the value dynamically (see the sketch after this list):
    sysctl -w kernel.yama.ptrace_scope=0

    Without this, there are likely to be various complaints mentioning btl_vader_single_copy_mechanism. (Thanks to Dennis.Kingsley@us.fincantieri.com for this valuable tidbit.)
  • DNS resolution for the FSx Lustre partition may not work consistently after the first system boot. It may be prudent to replace the DNS name in /etc/fstab with its IP address after the cluster has been created (see the resolver sketch after this list).
  • When using a machine file (specified with the “-machinefile” option), do not use FQDNs; each entry must match the output of “/bin/hostname” on the corresponding compute node.
  • Be sure to include the parameter “-mpi openmpi”. Without it, you are likely to obtain error messages like those below, and the suggestion to update /etc/security/limits.conf has nothing to do with the real problem:

    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: ibv_create_qp(left ring) failed
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: probably you need to increase pinnable memory in /etc/security/limits.conf
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: ibv_ring_createqp() failed
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: Can't initialize RDMA device
    starccm+: compute-st-c5n18xlarge-1:[pid-18967] Rank 0:38: MPI_Init: Internal Error: Cannot initialize RDMA protocol
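
For reference, a minimal sketch of the resolver check and the fstab pinning mentioned above; the FSx file system name, region, and VPC CIDR below are placeholders, not values from this cluster:

    # Assuming a VPC CIDR of 10.0.0.0/16, the AmazonProvidedDNS resolver sits at 10.0.0.2.
    dig +short fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com @10.0.0.2

    # Optionally pin the resolved IP into /etc/fstab so a later DNS hiccup
    # cannot break the mount at boot.
    FSX_DNS=fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
    FSX_IP=$(dig +short "$FSX_DNS" @10.0.0.2 | head -n1)
    sudo sed -i "s/$FSX_DNS/$FSX_IP/" /etc/fstab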
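
A quick sanity check after installing the client modules (a sketch; assumes the Ubuntu image used in the AWS write-up and that /fsx is the FSx mount point, as it is here):

    # Load the Lustre client module and confirm the /fsx mount is present.
    sudo modprobe lustre
    lsmod | grep lustre
    mount | grep /fsx || sudo mount -a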
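
The /etc/hosts entries and the ssh check look roughly like this; the addresses and hostnames are illustrative, not taken from this cluster:

    # Same entries on the master and on every compute node.
    printf '%s\n' \
        '10.0.1.11  compute-st-c5n18xlarge-1' \
        '10.0.1.12  compute-st-c5n18xlarge-2' \
        '10.0.1.13  compute-st-c5n18xlarge-3' | sudo tee -a /etc/hosts

    # From the first compute node, as the default non-root user:
    for h in compute-st-c5n18xlarge-2 compute-st-c5n18xlarge-3; do
        ssh -o BatchMode=yes "$h" uptime
    done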
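
To persist the ptrace setting without rebooting (and without pcluster replacing nodes), something along these lines should do; note that /etc/sysctl.d/10-ptrace.conf already exists on Ubuntu and is simply overwritten here:

    # Persist the setting and apply it immediately.
    echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf
    sudo sysctl -w kernel.yama.ptrace_scope=0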

Disclaimer on my background: I am not a user of STAR-CCM+, just an IT guy trying to help our mechanical engineers by setting this up. I have a large sample job (a .sim file) that I use for testing.

The above are the useful items I have been able to collect to date. Where I still need guidance:
1. getting workloads to actually run on multiple nodes instead of only one
2. getting the first compute node to stop throttling its CPU usage.


I created a cluster using three large compute nodes (48 vCPUs) and a master.

Test cases:
  1. I submitted a job with no -machinefile parameter, to let the workload placement default. I set -np (number of slots) to 48.
    All processes were placed on the first compute node and CPU was throttled to nearly zero. (I saw 2-3% when doing this test with slightly smaller compute nodes).

    Starting local server: /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ -power -podkey XXX -licpath 1999@flex.cd-adapco.com -np 48 -mpi openmpi -machinefile /fsx/machinefile -server /fsx/X.sim

  2. I changed the order of the compute nodes in the machine file, placing the second compute node at the top. Processes were started on the second compute node and were not throttled - the whole system ran at 100% CPU (desirable in this case). The first compute node executed the following statement to start the run on the second compute node:

    Starting remote server: ssh compute-st-c5d24xlarge-2 echo "Remote PID : $$"; exec /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ -power -podkey XX -licpath 1999@flex.cd-adapco.com -np 48 -mpi openmpi -machinefile /fsx/machinefile -server -rsh ssh /fsx/XXX.sim

In both cases, I had hoped to distribute the workload across all available compute nodes and to run the CPUs in an unconstrained fashion.

Any hints from folks who have traveled this road?

October 29, 2020, 00:19   #2
Dave Wagoner (dwagoner), New Member
Resolved
With the help of Dennis.Kingsley@us.fincantieri.com, I was able to get STAR-CCM+ workloads split across multiple nodes and have them run with unconstrained CPU utilization.

The one additional change beyond those enumerated earlier in the thread is that the machine file MUST contain the master (head) node, and this entry must appear first in the list of machines in that file. A sketch of the resulting file is below.
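
For anyone following along, the working machine file ends up looking roughly like this; the head-node hostname is illustrative, and every entry must match what /bin/hostname returns on that node:

    # Head node first, then the compute nodes, short hostnames only.
    printf '%s\n' \
        'ip-10-0-0-10' \
        'compute-st-c5d24xlarge-1' \
        'compute-st-c5d24xlarge-2' \
        'compute-st-c5d24xlarge-3' > /fsx/machinefile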

The next challenge to pursue is the optimal number of nodes in a cluster. After getting a healthy workload running, I noted that the vast majority of the CPU time is system time rather than user time. A bit of stracing showed that a great deal of polling is done to coordinate interprocess communication. There is also a significant amount of network traffic between the nodes, and driving it consumes CPU as well. Adding nodes may increase overhead and actually decrease throughput; a topic for subsequent testing.

October 30, 2020, 16:32   #3
Chaotic Water (cwl), Senior Member
I'd like to thank you for sharing your experience - these notes might save loads of time for someone in the future.

May 25, 2021, 02:39   #4
Philip Morris Jones (philip_m_jones), New Member
A lot of the above makes little sense in the context of a posting in cfd-online.

Obvious disclaimer: I work for Siemens who write STAR-CCM+ and I build and run clusters on AWS regularly and interact with the Amazon team.

When you are running CFD on a cluster you need to have a correctly configured cluster and then you need to operate the CFD code in a manner that reflects the cluster you are using.

If you want to run on AWS then ParallelCluster is a good way to get a cluster set up and running in very little time. If you have problems with running ParallelCluster then there are forums that are applicable to that.

Once you have a working cluster then you have material that is related to CFD.

STAR-CCM+ is batch-system aware. I see some confusion over batch systems:

"This implementation uses Sun Grid Engine (SGE) on Linux and uses slurm."

SGE and Slurm are both batch systems and are mutually exclusive, so you can have one or the other but not both.

Once you have one batch system (and if you are coming to this fresh, note that the latest versions of ParallelCluster are dropping SGE and adopting Slurm as the default), you simply run STAR-CCM+ with the appropriate flag, either

    -bs sge
    -bs slurm

These flags mean that STAR-CCM+ picks up the resources allocated to it via the batch system and starts the relevant processes:

Starting STAR-CCM+ parallel server
MPI Distribution : IBM Platform MPI-09.01.04.03
Host 0 -- ip-10-192-12-64.ec2.internal -- Ranks 0-35
Host 1 -- ip-10-192-12-123.ec2.internal -- Ranks 36-71
Host 2 -- ip-10-192-12-189.ec2.internal -- Ranks 72-107
Host 3 -- ip-10-192-12-161.ec2.internal -- Ranks 108-143
Process rank 0 ip-10-192-12-64.ec2.internal 46154
Total number of processes : 144
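
For illustration only, a Slurm submission script for a setup like the one described earlier in this thread might look as follows; the node and task counts are examples, and with -bs slurm there is no need for -np or -machinefile:

    #!/bin/bash
    #SBATCH --job-name=starccm
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=36
    #SBATCH --exclusive

    # STAR-CCM+ reads the node/core allocation from Slurm via -bs slurm.
    # -batch run executes the simulation non-interactively; adjust to your workflow.
    /fsx/Siemens/15.04.010/STAR-CCM+15.04.010/star/bin/starccm+ \
        -power -podkey XXX -licpath 1999@flex.cd-adapco.com \
        -bs slurm -batch run /fsx/X.sim

Submit it with sbatch and the server banner should then list one host per allocated node, as in the output above.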

Tags: aws, fsx, parallel cluster, starccm+

