CFD Online Logo CFD Online URL
www.cfd-online.com
[Sponsors]
Home > Forums > Software User Forums > Siemens > STAR-CCM+

Floating Point overflow and MPI tuning parms

Register Blogs Community New Posts Updated Threads Search

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old   September 1, 2019, 17:41
Default Floating Point overflow and MPI tuning parms
  #1
New Member
 
Join Date: Sep 2019
Posts: 1
Rep Power: 0
lstonebr is on a distinguished road
I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is version shipped with star ccm+)
Cisco UCS cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes
280 cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed


On runs less than 5 hours, everything works flawlessly and is quite fast.

However when running with 280 cores at or around 5 hours into a job, the longer jobs die with the floating point exception.
The same job completes fine with 140 cores.
Also I am using PBS Pro with 99 hour wall time

------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------

I have been doing some reading and some say that using other MPI are more stable with Star CCM.

I have not ruled out that I am missing some parameters or tuning with Intel MPI as this is a new cluster.

I am also trying to make Open MPI work. I have openmpi compiled and it runs, however only with very small number of CPU. Anything over about 2 cores per node it hangs indefinately.

I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what Star CCM version I am running supports. I am telling star to use the open mpi that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools. Note that star also ships with openmpi however

I am thinking that I need to tune OpenMPI, which was also requried with Intel MPI.

With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:

reference: https://software.intel.com/en-us/for...y/topic/542591
reference: https://software.intel.com/en-us/art...ced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parms I can scale to 280 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.

I am struggling trying to find equivelant turning parms for Open MPI.

I have listed all the MCA available with Open using MCA, and have tried setting these parms with no success.

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1


Does anyone have any advice or ideas for:

1.) The floating point overflow issue
and
2.) Know of equivelant tuning parms for Open MPI

Many thanks in advance

Last edited by lstonebr; September 5, 2019 at 14:08.
lstonebr is offline   Reply With Quote

Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are On



All times are GMT -4. The time now is 09:22.