Weird performance problem between hosts
Hi,
First of all, I was tasked to debug this problem as a Linux administrator. I have very little experience with CFD/OpenFOAM/Open MPI/HPC.

the problem:
I have 5 nodes (servers) that all run the same OpenFOAM test locally (standalone). All nodes are 100% identical, yet one node is somehow a "supernode" and runs the test much faster (37 seconds vs 205 seconds).

the specification:
Ubuntu 20.04
OpenFOAM 8 (8-1c9b5879390b) from: http://dl.openfoam.org/ubuntu focal main
mpirun (Open MPI) 4.0.3
32 GB memory
2x Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz

my research:
stress-ng reports exactly the same numbers on all nodes (stress-ng --cpu 8 --cpu-method all --metrics-brief --perf -t 100). That is why I suspect MPI.
Considering two nodes, a normal node (there are 5 of them) and the supernode, I do the following. All tests have an endTime of 0.1:
Code:
decomposePar -allRegions > log.decomposePar
When I diff both log files between the hosts, the only difference is the execution time; all other parameters are identical. When starting the run, the normal node outputs:
Code:
No OpenFabrics connection schemes reported that they were able to be
The supernode prints the same warning:
Code:
No OpenFabrics connection schemes reported that they were able to be
plus an extra line: "23 more processes have sent help message help-mtl-ofi.txt / OFI call fail"

questions:
- mlx4_0 is the Mellanox NIC (no InfiniBand) on these nodes. As I understand it, MPI should use a shared-memory transport when running on localhost only. Can the firmware version of the NIC cause this?
- Is the Open MPI OFI error related? Even though it is an error, that node is still much faster.
- What could I further investigate?
Thanks
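One way to test whether transport selection explains the gap is to pin Open MPI to its shared-memory path on both node types and compare timings. A sketch of such an invocation, assuming a standard OpenFOAM parallel run (the solver name "simpleFoam" and "-np 24" are placeholders, not taken from the post):

```shell
# Force Open MPI 4.0.x onto the ob1 PML with only the self and
# shared-memory (vader) BTLs, bypassing the OFI/OpenFabrics paths
# that the warnings above refer to.
mpirun --mca pml ob1 --mca btl self,vader -np 24 simpleFoam -parallel

# List which transport components this Open MPI build actually provides,
# to compare between a normal node and the supernode:
ompi_info | grep -E 'btl|mtl|pml'
```

If both node types run at the same speed with the transports pinned like this, the difference likely lies in which transport Open MPI auto-selects on each host rather than in the hardware.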
Hello,
Have you found an answer to your question? I am facing the same warnings and am now wondering whether they cause performance issues. Thank you in advance!

EDIT: I just fixed my issue by switching from SYSTEMOPENMPI to OPENMPI in the etc/bashrc.
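For reference, the switch described in the EDIT is the WM_MPLIB selector in OpenFOAM's etc/bashrc, which chooses between the distro-packaged Open MPI and the ThirdParty build. A sketch of the change (the exact surrounding lines vary by OpenFOAM version):

```shell
# In $WM_PROJECT_DIR/etc/bashrc, change the MPI implementation selector.
# Before (uses the system-packaged Open MPI):
#   export WM_MPLIB=SYSTEMOPENMPI
# After (uses the OpenFOAM ThirdParty Open MPI build):
export WM_MPLIB=OPENMPI

# Re-source the environment afterwards so the change takes effect:
# source $WM_PROJECT_DIR/etc/bashrc
```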