
OpenFOAM on Windows cluster with OpenMPI


September 26, 2014, 00:54   #1
OpenFOAM on Windows cluster with OpenMPI
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
Hi,

I'm not sure if this is an OpenFOAM problem or an Open MPI problem.

I've been trying to form a mini cluster out of a couple of spare workstations.

If I decompose my test job into 2 parts and set slots=1 for each entry in the hostfile, then I can run one process on each workstation and it all works fine.

If I decompose the test job into 4 parts, set slots=2 for each entry in the hostfile, and attempt to run two processes on each workstation, then it fails.
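
For reference, a 4-way decomposition in system/decomposeParDict looks something like this (the method and coefficients below are an illustrative sketch, not necessarily the ones I used):
Code:
numberOfSubdomains 4;

method          simple;

simpleCoeffs
{
    n               (2 2 1);
    delta           0.001;
}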

I can see the correct number of processes start on both workstations, and they're each using 100% of the CPU core they're running on, but the solver doesn't seem to do anything: it doesn't exit, doesn't generate any results, and doesn't generate any error messages.

If I run 3 processes on the head node and 1 on the slave node, this works fine (!?)

I can also run all 4 processes on the slave node without problems.

As I'm not getting any error messages, it's proving difficult to troubleshoot. Any ideas?

Additional Info

I'm using blueCAPE's Windows build of OpenFOAM 2.3 with Open-MPI 1.6.2 on Windows 7 x64.
Both machines can run decomposed cases successfully by themselves.
Paths and installation directories are identical on both machines.
Both machines have a copy of the decomposed case on their local drive.

I'm starting mpirun with the following parameters:
Code:
mpirun -np 4 -hostfile machines -mca btl_tcp_if_exclude sppp -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE icoFoam.exe -parallel

where
MPI_BUFFER_SIZE=20000000
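
For reference, the "machines" hostfile for the failing 4-way case would look something like this (the node names here are placeholders for the real ones):
Code:
# Open-MPI hostfile "machines": one entry per workstation
head-node   slots=2
slave-node  slots=2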

September 28, 2014, 13:16   #2
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Greetings Craig,

Quote:
Originally Posted by craig.tickle View Post
I can see the correct number of processes start on both workstations, and they're each using 100% of the CPU core they're running on, but the solver doesn't seem to do anything: it doesn't exit, doesn't generate any results, and doesn't generate any error messages.
This reminds me of a similar problem, which was due to a firewall issue: the firewall didn't allow proper communication between the processes unless they were acting as a browser or something similar.
Therefore, a few questions:
  1. Did you follow the firewall configuration instructions provided in the blueCFD-Core User Guide?
  2. When you configured the firewall, did you configure it only for the private network, for the public network as well, or vice-versa? You might want to check which network profile Windows assigned to the connection between the two machines.
  3. In addition, have you tried using the other two MS-MPI versions provided in blueCFD-Core?
Best regards,
Bruno

PS1: I'm the one responsible for the development of blueCFD-Core.
PS2: For reference, the other post I remember that is somewhat related to this topic is this one: http://www.cfd-online.com/Forums/ope...ing-wrong.html

September 28, 2014, 17:29   #3
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
Quote:
Originally Posted by wyldckat View Post


  1. Did you follow the firewall configuration instructions provided in the blueCFD-Core User Guide?
Yes


Quote:
Originally Posted by wyldckat View Post
  2. When you configured the firewall, did you configure it only for the private network, for the public network as well, or vice-versa? You might want to check which network profile Windows assigned to the connection between the two machines.
I've set the profile on icoFoam.exe and mpirun.exe to "ALL" and allowed edge traversal. I've also tried turning the firewall off and disabling virus protection.


Note: the test-parallel case works fine; I can launch it from either machine and it gives the expected output. I only seem to have problems if I try to launch more than one process on the slave node.



Side note: if I stop the firewall service, I get errors about connecting to namespaces.


Quote:
Originally Posted by wyldckat View Post
  3. In addition, have you tried using the other two MS-MPI versions provided in blueCFD-Core?


I had a brief play with MS-MPI and got as far as being able to launch processes on the remote node, but I never got the two nodes communicating. It seemed that MS-MPI assumed the existence of a domain controller, which I don't have.
Correct me if I'm wrong, but it seems to me that forming an ad hoc cluster by temporarily joining 2 or 3 workstations together in a workgroup environment isn't really catered for by Microsoft: you jump from multi-core single machines straight to a full-blown cluster, with all the overhead that entails.

Thanks
Craig

October 4, 2014, 12:43   #4
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Hi Craig,

Sorry for taking so long to answer; it was a long week for me, and only today did I finally manage to come to the forum and spend some time here.

Regarding MS-MPI: the domain namespace is only needed for hard-core HPC on Windows. If more than one machine is needed, either you install the HPC Pack (which may or may not exist, since they sometimes offer it separately and other times they don't) or you have to run smpd in debug mode on all nodes... which isn't very efficient. And yes, the HPC Pack will likely need a domain namespace.

As for Open-MPI not working: I believe the problem might be related to the IP-address-to-host-name correspondence, which can be defined in the "hosts" file on each machine: http://en.wikipedia.org/wiki/Hosts_%28file%29
This is because even if you define the "machines" file to use IP addresses, the MPI toolbox might get the bright idea to look up the name for each machine and then try to use that instead of the IP address we gave it. This would explain why the slave machine spends so much time doing nothing: it's possible it's still searching for the IP for a particular host name.
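
For example, the entries on each machine would look something like this (the names and addresses are placeholders; use the real ones for both nodes):
Code:
# C:\Windows\System32\drivers\etc\hosts
192.168.1.60    head-node
192.168.1.61    slave-node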

edit: Have a look into this: http://www.open-mpi.org/faq/?categor...outability-1.3

Best regards,
Bruno

October 6, 2014, 21:09   #5
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
Adding the machine names and IP addresses to the hosts files didn't resolve my issue, but both machines can now ping each other by name.

I also tried adding IPv6 addresses so they can ping each other over IPv6; that didn't help either.

I haven't yet fully digested the information at the second link, but setting --mca btl_base_verbose 30 gives me some additional information.

It appears to hang while contacting the second process on the remote node.
If I run 2 processes on the remote node I get:
Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
< case hangs >

If I run 1 process on the remote node I get:
Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
Case : C:/blueCFD-Core-2.3/ofuser-2.3/cavity2
< case proceeds >

October 7, 2014, 15:09   #6
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Hi Craig,

"on port 4"!? If there wasn't a problem somewhere in the copy-paste procedure, then that's pretty much the real problem here. Ports 1 to 1024 are special system restricted ports, which only administrators and system applications can use.

Let me see if I can find out how to configure which ports Open-MPI uses... OK:
  1. This one is really important and is already partially used by blueCFD-Core's MPI script/batch files: http://www.open-mpi.org/faq/?category=tcp#tcp-selection
    I'm referring to this group of arguments used for running mpirun:
    Code:
    -mca btl_tcp_if_exclude sppp
    In the blueCFD-Core User Guide you can find additional information on this in the section "Open-MPI can't find any Ethernet interfaces".
  2. I don't know if the following also applies to Windows or only to Linux:
  3. Try using the following additional arguments for mpirun:
    Code:
    --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 30000
    These two set the minimum port and the number of ports after that minimum; in this example, it allows only ports 10000 up to 40000 to be used.
Best regards,
Bruno

October 7, 2014, 18:31   #7
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
Quote:
Originally Posted by wyldckat View Post
"on port 4"!? If there wasn't a problem somewhere in the copy-paste procedure, then that's pretty much the real problem here. Ports 1 to 1024 are special, system-restricted ports that only administrators and system applications can use.
Code:
-mca btl_tcp_if_exclude sppp
I am using the above parameter; currently my command line is:

Code:
mpirun --mca btl_base_verbose 30 --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000 --mca btl_tcp_if_exclude sppp -n 4 -hostfile machines -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE -x WM_MPLIB icoFoam.exe -parallel
Quote:
Originally Posted by wyldckat View Post
  3. Try using the following additional arguments for mpirun:
    Code:
    --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 30000
    These two set the minimum port and the number of ports after that minimum; in this example, it allows only ports 10000 up to 40000 to be used.
This is where it gets strange. If I use:

--mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 4135

--mca btl_tcp_port_min_v4 20000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 8270

--mca btl_tcp_port_min_v4 30000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 12405

--mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 16540

but it still doesn't connect.
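
Curiously, each reported port is exactly the configured minimum with its two bytes swapped (10000 is 0x2710; 0x1027 is 4135). That would be consistent with a missed byte-order (htons) conversion somewhere in the Windows TCP code, though this is only an inference from the numbers above, so take it with a grain of salt:
Code:
configured   hex      byte-swapped
10000        0x2710   0x1027 = 4135
20000        0x4E20   0x204E = 8270
30000        0x7530   0x3075 = 12405
40000        0x9C40   0x409C = 16540
Even the earlier "port 4" fits this pattern: 1024 is 0x0400, which byte-swapped is 0x0004 = 4.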


It might be time to take a fresh look at MS-MPI. I assume there were some very good technical reasons for removing support for MPICH, which used to work really well for me.

Thank You
Craig


October 11, 2014, 14:14   #8
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Hi Craig,

That weird issue with the port numbers is sort-of known to be a strange quirk in Open-MPI, at least as far as I could find on the web.

MPICH2 was dropped when developing blueCFD-Core 2.3 because the latest versions of MPICH2 are only available for Linux/Unix/POSIX, while MS-MPI is derived from MPICH2 and has taken over its development for Windows. We made the choice to drop support for the old MPICH2 1.4.1p1 and add MS-MPI 2012, on the principle of it being more recent and allegedly improved for practical use on Windows. I'll see what I can do to bring back some support for MPICH2 1.4.1p1, even if it's as a DIY package + instructions.

Beyond this, I've got a bad feeling that choosing Open-MPI 1.6.2 (because it's the latest major version of Open-MPI supported on Windows) wasn't the best option either. The previous version was 1.4.4, and perhaps it was the better choice. I'll try to have a look into this as well... perhaps designing a similar DIY kit is the simplest and most reasonable choice there too, in case someone wants to extend the mix with Intel-MPI.

Best regards,
Bruno

October 12, 2014, 15:57   #9
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
Hi Bruno,

Thank you for your support; if you could reintroduce support for MPICH2, I'd very much appreciate it.

In the meantime I'll keep working with Open-MPI and MS-MPI as time allows; perhaps I'll have a breakthrough, and if so I'll report back.

Cheers
Craig

October 16, 2014, 23:21   #10
New Member
 
Craig Tickle
Join Date: Jun 2013
Posts: 8
I got it working, but with MS-MPI 2008! I thought I'd already looked at MS-MPI before trying Open-MPI, without success; I suspect I fixed an underlying network issue while troubleshooting Open-MPI.

The trick is to run a batch file on the remote nodes which sets the path and environment strings, maps network drives as required, and then launches smpd in debug mode.

To do that, I have a batch file on the remote nodes:

foamnode.bat
Code:
@echo off
min
call setvars.bat
rem put map statements here to map the work directory to the same <dir>:<path> on all nodes
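rem e.g. (hypothetical share name and drive letter):
rem   net use W: \\head-node\CFDwork /persistent:no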
smpd -d
where min is a little utility to minimize a cmd window; it's handy for stopping the smpd window on the head node from opening directly over the cmd window you're working in.

I then modified the MS-MPI section of gompi:
Code:
:MSMPI2008
:MSMPI2012

rem     if a machines file exists start smpd on each node
rem      %2 has been reserved for the password
rem      script assumes that each node has the same user/password
IF "%MACHINEFILE%" == "" goto :LOCALONLY
set HOSTFILE=-machinefile %MACHINEFILE%
FOR /F "eol=» tokens=1,2 delims= " %%i in (%MACHINEFILE%) do (
  psexec \\%%i -u %%i\%username% -p %2 -d foamnode.bat
)
:LOCALONLY
@echo on
mpiexec -n %x_numprocs% %MPI_ACCESSORY_OPTIONS% %HOSTFILE% %1 -parallel %3 %4 %5 %6 %7 %8 %9
@echo off

IF "%MACHINEFILE%" == "" GOTO :END

rem kill smpd on nodes
FOR /F "eol=» tokens=1,2 delims= " %%i in (%MACHINEFILE%) do (
  taskkill  /s %%i /u %%i\%username% /p %2 /im smpd.exe /f
)

GOTO END
psexec launches the foamnode batch file on all the nodes; once the analysis run is complete, taskkill kills smpd on all the nodes.

The script assumes that the username is stored in %USERNAME% and that the user exists on all the nodes with the same password. I've co-opted the second parameter to pass the password to psexec.
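
Assuming gompi takes the solver as its first argument (as in the mpiexec line above), a run would then be launched with something like this (solver and password are placeholders):
Code:
gompi icoFoam.exe MyPassword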

psexec is a bit of a pain to get working; on Windows 7 you need to create the registry value "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\LocalAccountTokenFilterPolicy", set it to 1, and open the firewall for Remote Services Management (RPC).
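
From an elevated command prompt, that value can be created with the standard reg tool, for example:
Code:
reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" /v LocalAccountTokenFilterPolicy /t REG_DWORD /d 1 /f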

psexec.exe can be found at http://technet.microsoft.com/en-us/s.../bb897553.aspx
min.exe can be found at http://www.paulsadowski.com/wsh/cmdprogs.htm
