
AWS EC2 Cluster Running in Parallel Issues with v1612+

Old   January 22, 2020, 10:05
Default AWS EC2 Cluster Running in Parallel Issues with v1612+
  #1
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
Hi Community,


Reaching out because I'm having issues getting snappyHexMesh and simpleFoam, for example, to run on an AWS cloud cluster of EC2 c5 instances. I've compiled ESI OpenFOAM v1612+ from source on the master node, a Linux machine running Ubuntu 18.04 (bionic).



When I run this command in terminal:

Code:
>> mpirun --hostfile machines -np 12 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
Note that the machines file contains the private IP addresses of the master, slave1 and slave2 nodes.
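For reference, a minimal sketch of what such a hostfile can look like for Open MPI. The IP addresses and slot counts below are illustrative assumptions, not your actual values:

```shell
# Hypothetical 'machines' hostfile: one node per line, the node's
# private IP plus the number of MPI slots (ranks) it should receive.
cat > machines <<'EOF'
10.0.1.120 slots=4
10.0.1.37 slots=4
10.0.1.65 slots=4
EOF
cat machines
```

With 4 slots on each of the three nodes, `mpirun -np 12` can place four ranks per machine.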


snappyHexMesh seems to hang, and I get the error below:


Code:
-------------------------------------------------------------------------
[[12977,1],9]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-10-0-1-37

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1612+                                |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1612+
Exec   : snappyHexMesh -parallel -overwrite
Date   : Jan 22 2020
Time   : 01:24:17
Host   : "ip-10-0-1-120"
PID    : 14358
[ip-10-0-1-120:14343] 11 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-10-0-1-120:14343] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Based on my understanding, EC2 instances in AWS use Elastic Network Adapters (ENA), which I believe is a different interconnect than InfiniBand.

I think the issue has to do with 'mpirun' and how AWS uses it versus how another machine might. It looks like Amazon has developed its own MPI library for running jobs on EC2.
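Since c5 instances have ENA (Ethernet) rather than InfiniBand, the openib warning itself is harmless. As the message suggests, it can be silenced via MCA parameters; one way is through Open MPI's OMPI_MCA_ environment variables. This is only a sketch, and the transport list is an assumption for a TCP-only setup:

```shell
# Sketch: turn off the "component unused" warning and restrict Open MPI
# to TCP, shared-memory (vader) and self transports, skipping openib.
export OMPI_MCA_btl_base_warn_component_unused=0
export OMPI_MCA_btl=tcp,self,vader
# Then run as before, e.g.:
# mpirun --hostfile machines -np 12 snappyHexMesh -parallel -overwrite
```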



When I run this command I get the below output:

Code:
>> printenv | grep /opt/amazon/openmpi
 LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib
Also, when I check the mpirun installations on the master node I get the below output:

Code:
>> whereis mpirun
 mpirun: /usr/bin/mpirun.openmpi /usr/bin/mpirun /opt/amazon/openmpi/bin/mpirun /usr/share/man/man1/mpirun.1.gz
I'm not sure how to resolve this hanging problem with parallel runs on an AWS EC2 cluster using OF v1612+. Any input on getting this resolved would be appreciated. Thank you.

Old   January 22, 2020, 17:24
Default
  #2
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
What happens if you call:
Code:
mpirun --hostfile machines -np 12 hostname
You should see 12 lines with the internal host names of your nodes


Some additional ideas:
Is OpenFOAM set up correctly when MPI connects to the nodes by ssh? You can test this with:
Code:
ssh slave_ip env | grep PATH
ssh slave_ip env | grep LD_LIBRARY_PATH

If not, you have to set up $HOME/.bashrc accordingly: make sure that something like source /.../OpenFOAM../etc/bashrc is at the top of the file, so it really gets executed in a non-login shell.
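To illustrate the point about non-login shells: on Ubuntu the stock .bashrc returns early when the shell is not interactive, so the OpenFOAM source line has to sit above that guard. A sketch (written to a scratch file here for illustration; the OpenFOAM path is the one used elsewhere in this thread):

```shell
# Illustrative head of ~/.bashrc: source OpenFOAM before Ubuntu's
# "If not running interactively, don't do anything" early return.
cat > bashrc_head_example <<'EOF'
source /home/ubuntu/OpenFOAM/OpenFOAM-v1612+/etc/bashrc
# case $- in *i*) ;; *) return ;; esac   # the stock guard comes after
EOF
grep -c "etc/bashrc" bashrc_head_example
```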

Old   January 22, 2020, 22:40
Default
  #3
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
Hi jherb,


When I ran the first line I got the output below; it looks like the machines are properly defined (IP .120 is the master, .37 is slave1 and .65 is slave2).
Code:
ip-10-0-1-120:000N >> mpirun --hostfile machines -np 12 hostname
ip-10-0-1-37
ip-10-0-1-120
ip-10-0-1-120
ip-10-0-1-120
ip-10-0-1-37
ip-10-0-1-37
ip-10-0-1-65
ip-10-0-1-65
ip-10-0-1-65
ip-10-0-1-37
ip-10-0-1-120
ip-10-0-1-65
I made sure the .bashrc file on both master and slaves sources OF v1612+ by including this line:
Code:
source /home/ubuntu/OpenFOAM/OpenFOAM-v1612+/etc/bashrc
Then I ran grep PATH and grep LD_LIBRARY_PATH, and from the output it looks like both the Amazon MPI and the v1612+ libraries are being picked up:
Code:
ip-10-0-1-120:000N >> ssh 10.0.1.37 env | grep PATH
LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/lib64:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/openmpi-system:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib/openmpi-system:/usr/lib/x86_64-linux-gnu/openmpi/lib:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/dummy
MPI_ARCH_PATH=/usr/lib/x86_64-linux-gnu/openmpi
FFTW_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5
SCOTCH_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/scotch_6.0.3
CGAL_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9
PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/bin:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/bin:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/wmake:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
BOOST_ARCH_PATH=/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0
Code:
ip-10-0-1-120:000N >> ssh 10.0.1.37 env | grep LD_LIBRARY_PATH
   LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/fftw-3.3.5/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/CGAL-4.9/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/boost_1_62_0/lib64:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64Gcc/gperftools-2.5/lib64:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/openmpi-system:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib/openmpi-system:/usr/lib/x86_64-linux-gnu/openmpi/lib:/home/ubuntu/OpenFOAM/ubuntu-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/site/v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib:/home/ubuntu/OpenFOAM/ThirdParty-v1612+/platforms/linux64GccDPInt32/lib:/home/ubuntu/OpenFOAM/OpenFOAM-v1612+/platforms/linux64GccDPInt32Opt/lib/dummy
I then tried re-running snappyHexMesh and got the below output:
Code:
ip-10-0-1-120:000N >> mpirun --hostfile machines -np 12 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
--------------------------------------------------------------------------
[[32349,1],6]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-10-0-1-37

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
/*---------------------------------------------------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  v1612+                                |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
Build  : v1612+
Exec   : snappyHexMesh -parallel -overwrite
Date   : Jan 23 2020
Time   : 03:35:42
Host   : "ip-10-0-1-120"
PID    : 29939
Case   : /home/ubuntu/Projects/01_TestRuns/53StateSt/000N
nProcs : 12
Slaves : 
11
(
"ip-10-0-1-120.29940"
"ip-10-0-1-120.29941"
"ip-10-0-1-120.29942"
"ip-10-0-1-37.17118"
"ip-10-0-1-37.17119"
"ip-10-0-1-37.17120"
"ip-10-0-1-37.17121"
"ip-10-0-1-65.25946"
"ip-10-0-1-65.25947"
"ip-10-0-1-65.25948"
"ip-10-0-1-65.25949"
)

Pstream initialized with:
    floatTransfer      : 0
    nProcsSimpleSum    : 0
    commsType          : nonBlocking
    polling iterations : 0
sigFpe : Enabling floating point exception trapping (FOAM_SIGFPE).
fileModificationChecking : Monitoring run-time modified files using timeStampMaster (fileModificationSkew 10)
allowSystemOperations : Allowing user-supplied system call operations

// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //
Create time

Create mesh for time = 0

[4] 
[4] 
[4] --> FOAM FATAL ERROR: 
[4] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[4] 
[4]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[4]     in file db/Time/findInstance.C at line [5] 
[5] 
[5] --> FOAM FATAL ERROR: 
[5] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[5] 
[5]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[5]     in file db/Time/findInstance.C at line 202.
[5] 
FOAM parallel run exiting
[5] 
[8] 
[8] 
[8] --> FOAM FATAL ERROR: 
[8] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[8] 
[8]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[8]     in file db/Time/findInstance.C at line 202.
[8] 
FOAM parallel run exiting
[8] 
[6] 
[6] 
[6] --> FOAM FATAL ERROR: 
[6] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[6] 
[6]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[6]     in file db/Time/findInstance.C at line 202.
[6] 
FOAM parallel run exiting
[6] 
[9] 
[9] 
[9] --> FOAM FATAL ERROR: 
[9] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[9] 
[9]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[9]     in file db/Time/findInstance.C at line 202.
[9] 
FOAM parallel run exiting
[9] 
[7] 
[7] 
[7] --> FOAM FATAL ERROR: 
[7] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[7] 
[7]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[7]     in file db/Time/findInstance.C at line 202.
[7] 
FOAM parallel run exiting
[7] 
[10] 
[10] 
[10] --> FOAM FATAL ERROR: 
[10] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[10] 
[10]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[10]     in file db/Time/findInstance.C at line 202.
[10] 
FOAM parallel run exiting
[10] 
202.
[4] 
FOAM parallel run exiting
[4] 
[11] 
[11] 
[11] --> FOAM FATAL ERROR: 
[11] Cannot find file "points" in directory "polyMesh" in times 0 down to constant
[11] 
[11]     From function Foam::word Foam::Time::findInstance(const Foam::fileName&, const Foam::word&, Foam::IOobject::readOption, const Foam::word&) const
[11]     in file db/Time/findInstance.C at line 202.
[11] 
FOAM parallel run exiting
[11] 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[ip-10-0-1-120:29931] 11 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-10-0-1-120:29931] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-10-0-1-120:29931] 7 more processes have sent help message help-mpi-api.txt / mpi-abort

When I run the same case on one node with 4 physical CPUs, I don't have any of these issues with snappyHexMesh.

Old   January 23, 2020, 04:16
Default
  #4
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
Do you have the processorXXX folders available on all machines? You can follow the instructions at https://cfd.direct/cloud/aws/cluster/ to set up NFS (steps 4 and 5).
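The reason this matters: in a parallel run, each MPI rank reads its own processorN subdirectory of the case, so after decomposePar all of those directories must be visible on every node, hence the shared NFS mount. A toy sketch of the expected layout (hypothetical case name):

```shell
# Toy illustration: a case decomposed for 12 ranks has processor0..processor11,
# and rank N looks for its mesh under <case>/processorN.
mkdir -p mycase
for i in $(seq 0 11); do mkdir -p "mycase/processor$i/constant/polyMesh"; done
ls -d mycase/processor* | wc -l
```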

Old   February 4, 2020, 14:53
Default
  #5
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
After checking that the processor folders are available on all machines and following the steps from the link provided, the model worked! Thank you so much jherb.

Old   March 10, 2020, 11:34
Default
  #6
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
Hi jherb,


I've tried to do this same thing with a new AWS EC2 c5.18xlarge cluster of 1 master and 1 slave, but today the snappyHexMesh -parallel -overwrite operation is hanging (no meshing is happening).


When I run top on machine 1 (master) I see 100% CPU usage across its 36 CPUs, and likewise on machine 2 (slave): 100% usage across 36 CPUs.


Attached is a screenshot of the message I'm getting. Please help, as I've deleted my old machines that worked, but the new ones don't work and I don't know why.


I'm establishing the cluster using these steps

Code:
Add the .ppk file in pageant
Connect to master EC2 using Putty where Allow agent forwarding is enabled
ssh-add -l to check the added private key
ssh-keygen to generate id_rsa private and id_rsa.pub public key
ssh-add ~/.ssh/id_rsa to add the id_rsa key
ssh-add -l to check private key added
Check the pageant to verify it for two key
ssh-copy-id ubuntu@<Private IP> to copy the id_rsa.pub on slaves
ssh ubuntu@<Private IP> to verify that ssh authentication is enabled from master to slave


Sharing the master instance volume is required just once; run the below commands on the master.
 
sudo sh -c "echo '/home/ubuntu/OpenFOAM  *(rw,sync,no_subtree_check)' >> /etc/exports"
sudo exportfs -ra
sudo service nfs-kernel-server start
 
Mounting the master volume is required on each slave; run the below commands via non-interactive SSH (e.g. by writing a shell script and running it in the terminal).
 
SPIPS="XX.X.X.XX YY.Y.Y.YY ZZ.Z.Z.ZZ"
for IP in $SPIPS ; do ssh $IP 'rm -rf ${HOME}/OpenFOAM/*' ; done
for IP in $SPIPS ; do ssh $IP 'sudo mount MM.M.M.MM:${HOME}/OpenFOAM ${HOME}/OpenFOAM' ; done



for IP in $SPIPS ; do ssh $IP 'ls ${HOME}/OpenFOAM' ; done




-BA
Attached Images
File Type: png AWS_EC2Cluster_SHM_hanging.PNG (39.4 KB, 11 views)

Last edited by bassaad17; March 12, 2020 at 09:11. Reason: added ssh master slave steps

Old   March 12, 2020, 17:07
Default
  #7
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
Have you set up the hostfile correctly? Does a simple test like
Code:
mpirun -np 72 hostname
work?

Old   March 16, 2020, 16:32
Default
  #8
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
Hi jherb,


Downsized my 2 instances to c5.4xlarge (8 CPUs) and ran the line you provided in the PuTTY terminal. This is the message I received (see image attached).


I also tried running the normal command to mesh


Code:
mpirun --hostfile machines -np 16 snappyHexMesh -parallel -overwrite | tee log/snappyHexMesh.log
where 'machines' is defined with the private IP of the master (M) and the private IP of the slave (S):


Code:
MM.M.N.167 cpu=8
SS.S.S.138 cpu=8


-BA
Attached Images
File Type: png AWS_EC2Cluster_mpirun_error.PNG (36.7 KB, 6 views)

Old   March 16, 2020, 16:47
Default
  #9
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
Do you have a hostfile? It should look like this:
Code:
first.private.ip.address slots=8
second.private.ip.address slots=8
And you would start the command with:
Code:
mpirun --hostfile my_hostfile -np 16 hostname

(Your first private IP seems to be 10.0.2.167.) You can check the EC2 management console for the addresses (perhaps you have to change the instance properties to make them visible).

Old   March 16, 2020, 17:02
Default
  #10
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
I did what you suggested, and it does list the 8 CPUs for the master and 8 CPUs for the slave once I type the command provided.
Attached Images
File Type: png AWS_EC2Cluster_mpirun_error2.PNG (22.0 KB, 11 views)

Old   March 17, 2020, 07:20
Default
  #11
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
Okay. Now you could also start snappyHexMesh or an OpenFOAM solver this way, e.g.:
Code:
mpirun -np 72 --hostfile my_hostfile snappyHexMesh -parallel

If the output should be redirected into a file, then:
Code:
mpirun -np 72 --hostfile my_hostfile snappyHexMesh -parallel >log.snappyHexMesh 2>&1

Old   March 17, 2020, 08:30
Default
  #12
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
hi jherb,


The same issue occurs: the snappyHexMesh command you gave hangs during the parallel run across master and slave in my example v1612+ case.


Code:
mpirun -np 16 --hostfile machines snappyHexMesh -parallel -overwrite > log.snappyHexMesh 2>&1

I already tried running SHM on a single processor on the master machine, and it works just fine.


Not sure why the parallel run between master and slave nodes keeps hanging.
Attached Images
File Type: jpg AWS_EC2Cluster_SHM_hanging1.jpg (106.0 KB, 2 views)

Old   March 17, 2020, 13:45
Default
  #13
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
What is the output of snappyHexMesh, i.e. the content of log.snappyHexMesh?

Old   March 17, 2020, 17:13
Default
  #14
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
The output of the SHM log is attached as a screenshot.

The master (private IP .167) and the slave (private IP .138) showed 100% CPU usage across the 8 CPUs on each instance for the SHM command, but no actual meshing iterations occurred.
Attached Images
File Type: png AWS_EC2Cluster_SHM_hanging3.PNG (95.5 KB, 6 views)

Old   March 17, 2020, 17:46
Default
  #15
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
Is this the whole output? Then something is going wrong with the communication.


Again, have you checked any of the solvers? E.g. just try:
Code:
mpirun -np 16 simpleFoam -parallel

Does it complain that there are no case files?


Are you using the correct MPI version, i.e. the one your OpenFOAM installation was built against?


Can you check your system setup with some of the "normal" Open MPI examples/tutorials by compiling them yourself?


I am using the Ubuntu packages of the OpenFOAM Foundation on the Ubuntu AMI provided by Amazon. When I wanted to use Amazon's special Open MPI version to support the EFA interface, I needed to recompile the Pstream part of OpenFOAM.
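For reference, a heavily hedged sketch of what pointing OpenFOAM at Amazon's Open MPI might look like. The variable names (WM_MPLIB, MPI_ARCH_PATH) are standard OpenFOAM configuration variables, but whether this exact combination applies to a given v1612+ build is an assumption to verify against etc/config.sh/mpi:

```shell
# Sketch (assumptions): select the system Open MPI and point it at
# Amazon's installation, then rebuild only the Pstream library.
export WM_MPLIB=SYSTEMOPENMPI
export MPI_ARCH_PATH=/opt/amazon/openmpi
# cd "$WM_PROJECT_DIR/src/Pstream" && ./Allwmake   # inside an OpenFOAM-sourced shell
```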

Old   March 31, 2020, 11:13
Default
  #16
New Member
 
B. Assaad
Join Date: Sep 2014
Posts: 14
Rep Power: 8
bassaad17 is on a distinguished road
Hi jherb,


Yes, this is the whole output - SHM hangs when running with the master & slave instances. When I run SHM in parallel on the master only with 4 CPUs, it works fine.


How can I check that I'm using the correct MPI version for the OF v1612+ that I'm using?


Do you think recompiling Pstream will help?


What I find weird is that this worked a few months ago, and now it doesn't, whatever I do.

Old   April 15, 2020, 17:13
Default
  #17
Senior Member
 
Joachim Herb
Join Date: Sep 2010
Posts: 600
Rep Power: 18
jherb is on a distinguished road
If you have not yet solved the problem, another idea: add the following to the $HOME/.ssh/config file:
Code:
Host *
  StrictHostKeyChecking no

And add the prefix option to the mpirun command. First check which mpirun is used:
Code:
which mpirun
e.g. /opt/amazon/openmpi/bin/mpirun.


Then call mpirun with this additional option:
Code:
mpirun --prefix /opt/amazon/openmpi
