
MPI_Send: MPI_ERR_COUNT: invalid count argument


December 24, 2023, 02:36   #1
wht (Ilya), New Member
Join Date: Jan 2021, Posts: 10
Dear Colleagues,

I have encountered a problem with MPI_Send while running a very large case on a cluster:

[lxbk1208:00000] *** An error occurred in MPI_Send
[lxbk1208:00000] *** reported by process [3960864768,1024]
[lxbk1208:00000] *** on communicator MPI_COMM_WORLD
[lxbk1208:00000] *** MPI_ERR_COUNT: invalid count argument
[lxbk1208:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1208:00000] ****** and MPI will try to terminate your MPI job as well)
(The full log is attached.)

I'm using chtMultiRegionSimpleFoam to solve a heat-transfer problem for a highly detailed multilayer PCB with vias.

The solver goes through the first few regions without problems, but when it proceeds to a very large region with 141 211 296 cells (I'm using 2048 processors, so that is about 68 950 cells per processor, which should be fine), it crashes with the error above. The decomposition method is hierarchical:

decomposeParDict:

numberOfSubdomains 2048;
method hierarchical;

coeffs
{
    n   (32 64 1);
}


The cluster I'm using is called Virgo. It uses Slurm for task scheduling; more information is available at https://hpc.gsi.de/virgo/. I submit the job with the following command:

sbatch --ntasks=2048 --mem-per-cpu=4G --hint=multithread --partition=main --mincpus=32 slurmScripts/chtMultiRegionSimpleFoam.sh &

chtMultiRegionSimpleFoam.sh:
srun chtMultiRegionSimpleFoam -parallel

OpenFOAM (ESI v2306) is compiled with WM_LABEL_SIZE=64, WM_MPLIB=SYSTEMOPENMPI, and WM_ARCH_OPTION=64.

Our support team suspects that this error appears because the solver calls MPI_Send with a negative count argument: the count argument is a 32-bit signed int, so it is likely overflowing in my case.
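
To illustrate the suspected mechanism, here is a minimal, hypothetical C++ sketch (not OpenFOAM source code) of how a 64-bit byte count for a whole-region message can wrap to a negative 32-bit MPI count:

// Hypothetical sketch of the suspected overflow; not OpenFOAM source code.
#include <cstdint>
#include <cstdio>

int main()
{
    // A vector field gathered over the whole PCB_Copper region:
    // 141 211 296 cells x 3 components x 8 bytes per double.
    const std::int64_t nBytes = 141211296LL * 3 * 8;  // 3 389 071 104 bytes

    // MPI_Send(buf, count, MPI_BYTE, ...) takes a 32-bit signed int count.
    // Narrowing wraps modulo 2^32 on typical platforms, giving a negative value.
    const int count = static_cast<int>(nBytes);

    std::printf("bytes = %lld, 32-bit count = %d\n",
                static_cast<long long>(nBytes), count);
    // Prints: bytes = 3389071104, 32-bit count = -905896192
    // A negative count is exactly what MPI_ERR_COUNT complains about.
    return 0;
}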

To solve this problem, I tried changing the MPI optimisation parameters in controlDict, but didn't achieve any success (the corresponding entries are sketched after the list):

1. Set pbufs.tuning to 1 to activate the new NBX algorithm.
2. Varied the nbx.min parameter between 1 and 100.
3. Tried setting nbx.tuning to 0 and 1.
4. Set maxCommsSize to 2147483647, which is 2^31 - 1.
5. Searched the forum for mentions of a similar problem.
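
For reference, the corresponding controlDict entries, roughly (one combination of the values listed above):

OptimisationSwitches
{
    pbufs.tuning    1;           // activate the new NBX algorithm
    nbx.min         1;           // varied between 1 and 100
    nbx.tuning      0;           // also tried 1
    maxCommsSize    2147483647;  // 2^31 - 1
}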

What could be the cause and how can this error be fixed?

Thank you for your help.

Best regards,
Ilya

---
CBM Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany
Attached Files
File Type: txt log.chtMultiRegionSimpleFoam.txt (16.1 KB, 2 views)


January 2, 2024, 08:06   #2
wht (Ilya), New Member
Join Date: Jan 2021, Posts: 10
Dear Colleagues,


I'm still working on a solution to this problem, but without success. So far I have tried a few more things:

1. I built openmpi-4.1.2 from ThirdParty-v2306 with the -m64 flag and ran the solver with it on the cluster.

etc/bashrc from OpenFOAM folder:
<...>
export WM_MPLIB=OPENMPI
<...>

The full set of OpenMPI configure flags (from ompi_info):
<...>
configure command line: 'CFLAGS=-m64' 'FFLAGS=-m64' 'FCFLAGS=-m64'
'CXXFLAGS=-m64'
'--prefix=/linux64Gcc/openmpi-4.1.2'
'--with-max-info-key=255' '--with-max-info-val=512'
'--with-max-object-name=128'
'--with-max-datarep-string=256'
'--with-wrapper-cflags=-m64'
'--with-wrapper-cxxflags=-m64'
'--with-wrapper-fcflags=-m64'
'--disable-orterun-prefix-by-default' '--with-pmix'
'--with-libevent' '--with-ompi-pmix-rte'
'--with-orte=no' '--disable-oshmem'
'--enable-shared' '--without-verbs' '--with-hwloc'
'--with-ucx=/lustre/cbm/users/elizarov'
'--with-slurm' '--enable-mca-no-build=btl-uct'
'--enable-shared' '--disable-static'
'--enable-mpi-fortran=none' '--with-sge'
<...>

My new wrapper script is:
#!/bin/bash
#SBATCH --job-name=solver
#SBATCH --time 8:00:00
#SBATCH --output Slurm-solver.out
orterun chtMultiRegionSimpleFoam -parallel

2. Tried to switch off multithreading:

sbatch --ntasks=2048 --mem-per-cpu=4G --hint=nomultithread --partition=main --mincpus=32 slurmScripts/chtMultiRegionSimpleFoam.sh &

3. Changed the linear solver from GAMG to PCG.

4. Renumbered the problematic region (PCB_Copper) with renumberMesh.

Best regards,
Ilya

---
CBM Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany


January 2, 2024, 11:21   #3
olesen (Mark Olesen), Senior Member
Join Date: Mar 2009, Location: https://olesenm.github.io/, Posts: 1,695
Don't start fiddling with the nbx tuning factors; they only really help for large problems with AMI, distributed mapping, etc., and that is not likely your case here.

Does the error arise immediately after trying to solve PCB_BasePlate, or during it? (The initial residual of zero could be suspicious.)

With MPI errors, it is not always clear when or where they arise. They can also be the result of something else: for example, a zero-size check is triggered on one rank but not on another, and when the MPI exchange occurs, the send/recv are completely mismatched.

After checking your case (possibly with different decompositions), the first thing to try is setting FOAM_ABORT=true, which will at least give you a stack trace and might help identify how things reached the failure point.

January 3, 2024, 06:35   #4
wht (Ilya), New Member
Join Date: Jan 2021, Posts: 10
Thanks for the reply, Mark!

I tried to get more information by setting FOAM_ABORT=true, adding the corresponding command to my run script:
<…>
export FOAM_ABORT=true
chtMultiRegionSimpleFoam -parallel >> log.chtMultiRegionSimpleFoam 2>&1

However, I don’t see any additional output in the log (below):
[lxbk1159:00000] *** An error occurred in MPI_Send
[lxbk1159:00000] *** reported by process [4110286848,512]
[lxbk1159:00000] *** on communicator MPI_COMM_WORLD
[lxbk1159:00000] *** MPI_ERR_COUNT: invalid count argument
[lxbk1159:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1159:00000] *** and MPI will try to terminate your MPI job as well)

In my first message I forgot to include the output from the task scheduler; here it is, in case it is helpful:
slurmstepd: error: *** STEP 17666910.0 ON lxbk0997 CANCELLED AT 2024-01-03T12:23:58 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: lxbk1003: tasks 266-297: Killed
srun: error: lxbk0999: tasks 92-137: Killed
srun: error: lxbk1001: tasks 184-229: Killed
srun: error: lxbk1160: tasks 434-475: Killed
srun: error: lxbk0998: tasks 46-91: Killed
srun: error: lxbk1000: tasks 138-183: Killed
srun: error: lxbk1188: tasks 724-777: Killed
srun: error: lxbk1170: tasks 536-591: Killed
srun: error: lxbk1187: tasks 668-723: Killed
srun: error: lxbk1159: tasks 374-433: Killed
srun: error: lxbk1171: tasks 592-667: Killed
srun: error: lxbk1155: tasks 298-373: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1002 [5]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:241]
srun: error: lxbk1002: tasks 230-265: Killed
srun: error: lxbk1233: tasks 778-861: Killed
srun: error: lxbk1235: tasks 986-1023: Killed
srun: error: lxbk0997: tasks 0-45: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1161 [10]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:486]
srun: error: lxbk1161: tasks 476-535: Killed
slurmstepd: error: mpi/pmix_v2: _errhandler: lxbk1234 [16]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.17666910.0:945]
srun: error: lxbk1234: tasks 862-985: Killed

I also tried setting the Pstream debug flag in controlDict:
DebugSwitches {Pstream 1;}

OpenFOAM acknowledges the flag:
<…>
Overriding DebugSwitches according to controlDict
Pstream 1;
<…>

But I don’t see any additional output either.

About your question:
Quote (originally posted by olesen):
Does the error arise immediately after trying to solve PCB_BasePlate, or during it? (The initial residual of zero could be suspicious.)
The error occurs when the solver proceeds to the next region after PCB_BasePlate, which is PCB_Copper (it has about 140 million cells).
The initial zero residuals in PCB_BasePlate are there because I applied a fixedTemperatureConstraint to this region to imitate isothermal cooling.
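
For context, a minimal sketch of how such a constraint looks in the region's fvOptions (v2306 syntax; the entry name and temperature value here are placeholders):

basePlateCooling
{
    type            fixedTemperatureConstraint;
    selectionMode   all;
    mode            uniform;
    temperature     constant 293.15;  // placeholder value
}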

Best regards,
Ilya

---
CBM Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany


January 4, 2024, 03:50   #5
olesen (Mark Olesen), Senior Member
Join Date: Mar 2009, Location: https://olesenm.github.io/, Posts: 1,695
Since you are in Darmstadt, you could see if you can harness some resources from https://www.mma.tu-darmstadt.de/mma_institute/mma_team/ to help you out (formally or informally).

January 8, 2024, 07:54   #6
wht (Ilya), New Member
Join Date: Jan 2021, Posts: 10
Dear Colleagues,

I have tried a few more things to solve this problem but, unfortunately, didn't succeed. I have run out of ideas, except for the one that Mark suggested (many thanks!).

1. Increased the number of subdomains from 2048 to 4096:

numberOfSubdomains 4096;
method simple;

coeffs
{
    n   (64 32 2);
}

For the PCB_Copper region, that is approximately 35 000 cells per processor now.

This makes me think that the problem comes not from the number of cells per processor but rather from the absolute number of cells in a region.
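
This would also fit the 32-bit count hypothesis: going from 2048 to 4096 subdomains halves the per-rank messages, but anything gathered over the entire region, e.g. a single vector field of 141 211 296 cells x 3 components x 8 bytes (about 3.4 GB), stays above the 2 147 483 647 limit of a 32-bit count regardless of the decomposition.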

2. Tried OpenFOAM 10 from the Foundation instead of ESI and got the same error:

Solving for solid region PCB_Copper
[lxbk0957:1650843] *** An error occurred in MPI_Send
[lxbk0957:1650843] *** reported by process [1978794742,512]
[lxbk0957:1650843] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[lxbk0957:1650843] *** MPI_ERR_COUNT: invalid count argument
[lxbk0957:1650843] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk0957:1650843] *** and potentially your MPI job)
slurmstepd: error: *** STEP 18002547.0 ON lxbk0824 CANCELLED AT 2024-01-08T02:21:37 ***

3. Used hierarchical decomposition for 1024, 2048, and 4096 subdomains, and also tried ptscotch with 1024 subdomains.

4. Tried using IOranks with -fileHandler hostCollated on the same case, but with 1024 subdomains:
<...>
processors1024_0-127
processors1024_128-255
processors1024_256-383
processors1024_384-511
processors1024_512-639
processors1024_640-767
processors1024_768-895
processors1024_896-1023
<...>

export FOAM_IORANKS='(0 128 256 384 512 640 768 896)'
chtMultiRegionSimpleFoam -parallel -fileHandler hostCollated >> log.chtMultiRegionSimpleFoam 2>&1

I have attached the log for this attempt.

5. Played around with the MPI_BUFFER_SIZE variable (and the same entry in etc/controlDict), setting it to 400 000 000 with no success; the default value is 20 000 000.

One more thing that may help: I noticed this warning while compiling OpenFOAM-10 with WM_LABEL_SIZE=64: "specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807". Note that 9223372036854775808 is 2^63, so this range corresponds to a negative signed 64-bit value reinterpreted as an unsigned size, which would fit the negative-count hypothesis.

<...>
In file included from /lustre/cbm/users/elizarov/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/List.H:316,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/HashTable.C:30,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/Istream.H:187,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/ISstream.H:39,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/IOstreams.H:38,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/VectorSpace.C:27,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/VectorSpace.H:232,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/Vector.H:44,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/vector.H:39,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/point.H:35,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/pointField.H:35,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/face.H:46,
from /lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/faceList.H:34,
from lnInclude/interpolation.H:35,
from lnInclude/interpolationCellPoint.H:36,
from interpolation/interpolation/interpolationCellPointWallModified/interpolationCellPointWallModified.H:44,
from interpolation/interpolation/interpolationCellPointWallModified/makeInterpolationCellPointWallModified.C:26:
In constructor 'Foam::List<T>::List(Foam::label, const T&) [with T = bool]',
inlined from 'void Foam::volPointInterpolation::interpolateUnconstrained(const Foam::GeometricField<Type, Foam::fvPatchField, Foam::volMesh>&, Foam::GeometricField<Type, Foam::pointPatchField, Foam::pointMesh>&) const [with Type = Foam::Vector<double>]' at lnInclude/volPointInterpolationTemplates.C:62:14:
/lustre/cbm/users/temp/OpenFOAM-10-compilation/OpenFOAM-10/src/OpenFOAM/lnInclude/List.C:72:39: warning: 'void* __builtin_memset(void*, int, long unsigned int)' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
72 | List_ELEM((*this), vp, i) = a;
<...>

The version of OpenFOAM 10 is https://github.com/OpenFOAM/OpenFOAM...s/tag/20230119

P.S. I have also attached the output of the ompi_info command run on the cluster.

Best regards,
Ilya

---
CBM Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany
Attached Files
File Type: txt log.chtMultiRegionSimpleFoam-IOranks.txt (67.7 KB, 0 views)
File Type: txt log.ompi_info.txt (13.1 KB, 0 views)


January 23, 2024, 15:53   #7
wht (Ilya), New Member
Join Date: Jan 2021, Posts: 10
Dear Colleagues,

If you have an opportunity to test my case on your system, it would be a great help. Meanwhile, I have filed a bug report at https://develop.openfoam.com/Develop.../-/issues/3092; however, the error is hard to reproduce, for obvious reasons.

My case can be found at https://sf.gsi.de/f/4db522c9b39b4125855f/?dl=1 (24.2 MB).

Requirements: 1024 CPUs (multithreading can be used), 4 GB RAM per processor, the Slurm workload manager, and OpenFOAM built with WM_LABEL_SIZE=64.

Simply run the ./Allrun script.

The case uses the collated file format and OpenFOAM v2306.

Best regards,
Ilya
---
CBM Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, Germany


Tags
big model, cluster computing, mpi error

