CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   OpenFOAM Running, Solving & CFD (https://www.cfd-online.com/Forums/openfoam-solving/)
-   Weird performance problem between hosts (https://www.cfd-online.com/Forums/openfoam-solving/244502-weird-performance-problem-between-hosts.html)

jmellipse August 11, 2022 10:47

Weird performance problem between hosts
 
Hi,

First of all, I was tasked with debugging this problem as a Linux administrator. I have very little experience with CFD/OpenFOAM/Open MPI/HPC.

the problem:

I have 5 nodes (servers) that all run the same OpenFOAM test locally (standalone). All nodes are 100% identical, yet one node is somehow a "supernode" and runs the test much faster (37 seconds vs. 205 seconds).

the specification:

Ubuntu 20.04
OpenFOAM 8 8-1c9b5879390b from: http://dl.openfoam.org/ubuntu focal main
mpirun (Open MPI) 4.0.3
32GB memory
2x Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz

my research:

Running stress-ng with all tests shows exactly the same numbers on all nodes:
(stress-ng --cpu 8 --cpu-method all --metrics-brief --perf -t 100)
That is why I suspect MPI.

To compare two nodes, a normal node (5 of them) and the supernode, I do the following:

All tests have an endTime of 0.1

Code:

decomposePar -allRegions > log.decomposePar
mpirun -n 24 chtMultiRegionFoam -parallel > log.chtMultiRegionFoam
grep "ExecutionTime" log.chtMultiRegionFoam | tail -n 1

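To see which transport each run actually picks, Open MPI's component-selection logging can be turned on per run. The flags below are standard Open MPI 4.x MCA options, but the exact log format varies between builds, so treat this as a diagnostic sketch rather than the procedure from the thread:

```shell
# List the BTL (byte transfer layer) components this Open MPI build can
# load; on a single host, 'vader' (shared memory) and 'self' are the
# components a local run should end up using.
ompi_info | grep -i btl

# Re-run the solver with verbose transport selection so each rank logs
# which BTL/PML it chose; compare this log between normal node and
# supernode instead of only the ExecutionTime line.
mpirun -n 24 --mca btl_base_verbose 100 chtMultiRegionFoam -parallel \
    > log.verbose 2>&1
grep -i "select" log.verbose | head
```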
The more CPU cores I use, the bigger the time difference becomes.

When I diff the two log files between hosts, the only difference is the execution time; all other parameters are identical.

When starting the run, the normal node outputs:

Code:

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:          normalnode
  Local device:        mlx4_0
  Local port:          1
  CPCs attempted:      udcm
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
ExecutionTime = 205.16 s  ClockTime = 208 s

On the supernode the output is:

Code:

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:          supernode
  Local device:        mlx4_0
  Local port:          1
  CPCs attempted:      udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: rich-cicada
  Location: mtl_ofi_component.c:629
  Error: No such file or directory (2)
--------------------------------------------------------------------------
23 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
23 more processes have sent help message help-mtl-ofi.txt / OFI call fail
ExecutionTime = 37.04 s  ClockTime = 37 s

The supernode reports an error: "Open MPI failed an OFI Libfabric library call (fi_endpoint)".
The supernode also outputs an extra line: "23 more processes have sent help message help-mtl-ofi.txt / OFI call fail".

questions:

mlx4_0 is the Mellanox NIC (no InfiniBand) on these nodes. As I understand it, MPI should use a shared-memory transport when all ranks run on the same host. Can the firmware version of the NIC cause this?
Is the Open MPI OFI error related? Even though it is an error, the run is still much faster.
What could I investigate further?
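One way to test the transport hypothesis is to force Open MPI onto the shared-memory path explicitly, bypassing both the openib BTL and the OFI MTL. The MCA parameters below are standard in Open MPI 4.x ('vader' is the shared-memory BTL in that series); this is a suggested experiment, not something the original poster ran:

```shell
# Force the ob1 PML with only the shared-memory ('vader') and loopback
# ('self') BTLs. If the normal nodes then match the supernode's ~37 s,
# the slowdown comes from transport selection, not from the hardware.
mpirun -n 24 --mca pml ob1 --mca btl vader,self \
    chtMultiRegionFoam -parallel > log.chtMultiRegionFoam
```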

Thanks

openFo July 7, 2023 04:06

Hello,

have you found an answer to your question? I am facing the same warnings and am now wondering whether they cause performance issues.
Thank you in advance!

EDIT:

I just fixed my issue by switching from SYSTEMOPENMPI to OPENMPI in OpenFOAM's etc/bashrc.
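For reference, OpenFOAM selects its MPI through the WM_MPLIB variable in its etc/bashrc; SYSTEMOPENMPI uses the distribution's Open MPI, while OPENMPI uses the ThirdParty build. A sketch of the change described above (the file path assumes a standard OpenFOAM 8 install; defaults vary by packaging):

```shell
# In $WM_PROJECT_DIR/etc/bashrc, change the MPI selection:
#   export WM_MPLIB=SYSTEMOPENMPI   # Ubuntu's own Open MPI (the default)
#   export WM_MPLIB=OPENMPI         # OpenFOAM's ThirdParty Open MPI
# then re-source the environment so the change takes effect:
source $WM_PROJECT_DIR/etc/bashrc
```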

