CFD Online Forums > OpenFOAM > OpenFOAM Running, Solving & CFD

Weird performance problem between hosts


August 11, 2022, 11:47   #1
jmellipse
New Member
Join Date: Aug 2022
Posts: 1
Hi,

First of all: I was tasked with debugging this problem as a Linux administrator. I have very little experience with CFD, OpenFOAM, Open MPI, or HPC.

the problem:

I have 5 nodes (servers), and all of them run the same OpenFOAM test locally (standalone). The nodes are 100% identical hardware, yet one node is somehow a "supernode" and runs the test much faster (37 seconds vs. 205 seconds).

the specification:

Ubuntu 20.04
OpenFOAM 8 8-1c9b5879390b from: http://dl.openfoam.org/ubuntu focal main
mpirun (Open MPI) 4.0.3
32GB memory
2x Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz

my research:

Running stress-ng with all tests shows exactly the same numbers on every node:
(stress-ng --cpu 8 --cpu-method all --metrics-brief --perf -t 100)
That is why I suspect MPI.
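Since stress-ng reports identical raw CPU throughput, one piece of per-node state still worth capturing before blaming MPI is the CPU frequency governor (a turbo/powersave mismatch can hide from short synthetic tests). A minimal sketch using the standard Linux cpufreq sysfs paths; run it on each node and diff the output:

```shell
# Print each CPU policy's scaling governor; fall back to a note when the
# cpufreq sysfs interface is absent (e.g. inside a VM or container).
found=0
for g in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor; do
    [ -f "$g" ] || continue
    printf '%s: %s\n' "$g" "$(cat "$g")"
    found=1
done
[ "$found" -eq 1 ] || echo "no cpufreq interface on this node"
```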

To narrow it down I compare 2 nodes: a normalnode (there are 5 of them) and the supernode. On each I do the following:

All tests have an endTime of 0.1

Code:
decomposePar -allRegions > log.decomposePar
mpirun -n 24 chtMultiRegionFoam -parallel > log.chtMultiRegionFoam
grep "ExecutionTime" log.chtMultiRegionFoam | tail -n 1
The more cpu cores I use, the bigger the difference in time.

When I diff the two log files between hosts, the only difference is the execution time; all other output is identical.
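For the record, the slowdown factor can be pulled straight out of the two logs. A small sketch, using stand-in log files in the line format chtMultiRegionFoam prints (the filenames log.normalnode and log.supernode are made up for illustration):

```shell
# Stand-in logs with the final ExecutionTime lines from each host (assumed
# filenames; in practice these would be the real solver logs per node).
printf 'ExecutionTime = 205.16 s  ClockTime = 208 s\n' > log.normalnode
printf 'ExecutionTime = 37.04 s  ClockTime = 37 s\n'   > log.supernode

# Extract the last ExecutionTime value from each log ($3 is the seconds field)
# and compute the slowdown ratio of the normal node relative to the supernode.
t_normal=$(grep 'ExecutionTime' log.normalnode | tail -n 1 | awk '{print $3}')
t_super=$(grep 'ExecutionTime' log.supernode   | tail -n 1 | awk '{print $3}')
ratio=$(awk -v a="$t_normal" -v b="$t_super" 'BEGIN { printf "%.2f", a/b }')
echo "slowdown: ${ratio}x"
```

With the numbers from this thread it reports a slowdown of about 5.54x.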

When starting the run, the normal node outputs:

Code:
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           normalnode
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
ExecutionTime = 205.16 s  ClockTime = 208 s
On the supernode the output is:

Code:
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           supernode
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: rich-cicada
  Location: mtl_ofi_component.c:629
  Error: No such file or directory (2)
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
23 more processes have sent help message help-mtl-ofi.txt / OFI call fail
ExecutionTime = 37.04 s  ClockTime = 37 s
So the supernode hits an error, "Open MPI failed an OFI Libfabric library call (fi_endpoint)", and prints an extra line, "23 more processes have sent help message help-mtl-ofi.txt / OFI call fail".

questions:

mlx4_0 is the Mellanox NIC (Ethernet, no InfiniBand) in these nodes. As I understand it, MPI should use a shared-memory transport when all ranks run on the same host. Can the firmware version of the NIC cause this?
Is the Open MPI OFI error related? Even though it is an error, the run is still much faster.
What could I investigate further?
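One way to test whether MPI transport selection explains the gap (an assumption on my part, not a confirmed fix): force both node types onto Open MPI's on-node shared-memory path and exclude the OFI MTL, so neither openib nor libfabric can be selected. Component names are from Open MPI 4.0.x; the run command mirrors the one above:

```shell
# Force the ob1 PML with the self + vader (shared memory) BTLs and exclude the
# OFI MTL, so the openib/OFI components seen in the logs are out of the picture.
mca_args="--mca pml ob1 --mca btl self,vader --mca mtl ^ofi"
echo "mpirun $mca_args -n 24 chtMultiRegionFoam -parallel"

# Actually launch only where mpirun and a decomposed case are present.
if command -v mpirun >/dev/null 2>&1 && [ -d processor0 ]; then
    mpirun $mca_args -n 24 chtMultiRegionFoam -parallel > log.chtMultiRegionFoam.smem
fi
```

If the timings converge under this pinning, the difference comes from component selection (the fi_endpoint failure on the supernode presumably forces a fallback to a faster path); if they do not, the transport layer is exonerated.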

Thanks
