September 26, 2014, 00:54 |
OpenFOAM on a Windows cluster with Open-MPI
#1 |
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
Hi,
I'm not sure if this is an OpenFOAM problem or an Open MPI problem. I've been trying to form a mini cluster out of a couple of spare workstations.

If I decompose my test job into 2 and set slots=1 for each entry in the hostfile, then I can run 1 process on each workstation and it all works fine. If I decompose the test job into 4 and set slots=2 for each entry in the hostfile and attempt to run 2 processes on each workstation, then it fails. I can see the correct number of processes start on both workstations, and they're using 100% of the CPU core they're running on, but the solver doesn't seem to do anything: it doesn't exit, doesn't generate any results, and doesn't generate any error messages. If I run 3 processes on the head node and 1 on the slave node, this works fine (!?). I can also run all 4 processes on the slave node without problems. As I'm not getting any error messages, it's proving difficult to troubleshoot. Any ideas?

Additional info:
- I'm using blueCape's Windows build of OpenFOAM 2.3 with Open-MPI 1.6.2 on Windows 7 x64.
- Both machines can run decomposed cases successfully by themselves.
- Paths and installation directories are identical on both machines.
- Both machines have a copy of the decomposed case on their local drive.

I'm starting mpirun with the following parameters:
Code:
mpirun -np 4 -hostfile machines -mca btl_tcp_if_exclude sppp -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE icoFoam.exe -parallel
where MPI_BUFFER_SIZE=20000000
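For reference, a hostfile for the failing four-process layout described above would look something like this (the host names are hypothetical placeholders, not taken from the thread; slots is the value being switched between 1 and 2):

```
# "machines" hostfile for Open-MPI: one line per node.
# slots=1 on each node works; slots=2 on each node hangs.
headnode  slots=2
slavenode slots=2
```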
September 28, 2014, 13:16 |
#2
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128 |
Greetings Craig,
Quote:
Therefore, a few questions:
Bruno

PS1: I'm the one responsible for the development of blueCFD-Core.
PS2: For reference, the other post I remember that is somewhat related to this topic is this one: http://www.cfd-online.com/Forums/ope...ing-wrong.html
September 28, 2014, 17:29 |
#3
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
Quote:
Quote:
Notes: the test-parallel case works fine; I can launch it from either machine and it gives the expected output. I only seem to have problems if I try to launch more than 1 process on the slave node. Side note: if I stop the firewall service, I get errors about connecting to namespaces.
Quote:
I had a brief play with MS-MPI and got as far as being able to launch processes on the remote node, but I never got the two nodes communicating. It seemed that MS-MPI assumed the existence of a domain controller, which I don't have. Correct me if I'm wrong, but it seems to me that forming an ad hoc cluster by temporarily joining 2 to 3 workstations together in a workgroup environment isn't really catered for by Microsoft: you need to go from multi-core single machines to a full-blown cluster, with all the overhead that entails.

Thanks,
Craig
October 4, 2014, 12:43 |
#4 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128 |
Hi Craig,
Sorry for taking so long to answer, but it was a long week for me and only today did I finally manage to come to the forum and spend some time here.

Regarding MS-MPI: the domain namespace is only needed for hard-core HPC in Windows. If more than one machine is needed, either you need to install the HPC pack (which may or may not exist, since they sometimes offer it separately and other times they don't), or you have to run smpd in debug mode on all nodes... which isn't very efficient. And yes, the HPC pack will likely need a domain namespace.

As for Open-MPI not working: I believe the problem might be related to the IP-address-to-host-name correspondence, which can be defined in the "hosts" file on each machine: http://en.wikipedia.org/wiki/Hosts_%28file%29

This is because even if you define the "machines" file to use IP addresses, the MPI toolbox might get the bright idea to look up the name for each machine and then try to use that, instead of the IP address we gave it. This would explain why the slave machine spends so much time doing nothing... it's possible it's still searching for the IP for a particular host name.

edit: Have a look into this: http://www.open-mpi.org/faq/?categor...outability-1.3

Best regards,
Bruno
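To make the hosts-file suggestion concrete, here is a minimal sketch of the entries that would be added to C:\Windows\System32\drivers\etc\hosts on both machines. The host names and the .60 address are hypothetical; only the 192.168.1.61 address appears later in this thread:

```
# IP address      host name
192.168.1.60      headnode
192.168.1.61      slavenode
```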
Last edited by wyldckat; October 4, 2014 at 12:44. Reason: see "edit:" |
October 6, 2014, 21:09 |
#5 |
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
Adding the machine names and IP addresses to the hosts files didn't resolve my issue, but both machines can now ping each other by name.
I also tried adding IPv6 addresses so they can ping each other over IPv6; that didn't help either.

I haven't yet fully digested the information at the second link, but setting --mca btl_base_verbose 30 gives me some additional information: it appears to hang while contacting the second process on the remote node.

If I run 2 processes on the remote node I get:
Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
< case hangs >
If I run 1 process on the remote node I get:
Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
Case : C:/blueCFD-Core-2.3/ofuser-2.3/cavity2
< case proceeds >
October 7, 2014, 15:09 |
#6 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128 |
Hi Craig,

"on port 4"!? If there wasn't a problem somewhere in the copy-paste procedure, then that's pretty much the real problem here: ports 1 to 1024 are special system-restricted ports, which only administrators and system applications can use.

Let me see if I can find out how to configure which ports Open-MPI should use... OK:
Bruno
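Judging by the flags that ended up being used later in this thread, the TCP port range used by Open-MPI's TCP byte transfer layer is controlled by the btl_tcp_port_min_v4 and btl_tcp_port_range_v4 MCA parameters. A sketch with illustrative values (the rest of the command line is elided):

```
rem illustrative: restrict the TCP BTL to ports 40000 and above
mpirun --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000 ...
```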
October 7, 2014, 18:31 |
#7
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
Quote:
Hi Craig, "on port 4"!? If there wasn't a problem somewhere in the copy-paste procedure, then that's pretty much the real problem here. Ports 1 to 1024 are special system-restricted ports, which only administrators and system applications can use.
Code:
-mca btl_tcp_if_exclude sppp
Code:
mpirun --mca btl_base_verbose 30 --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000 --mca btl_tcp_if_exclude sppp -n 4 machines -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE -x WM_MPLIB icofoam.exe -parallel
Quote:
If I use --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 4135.
If I use --mca btl_tcp_port_min_v4 20000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 8270.
If I use --mca btl_tcp_port_min_v4 30000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 12405.
If I use --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 16540.
But it still doesn't connect.

It might be time to take a fresh look at MS-MPI. I assume there were some very good technical reasons for removing support for MPICH, which used to work really well for me.

Thank you,
Craig
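One hedged observation, not stated anywhere in the thread itself: each reported port number is exactly the configured minimum port with its two bytes swapped, which would suggest the verbose log is printing ports in network byte order rather than host byte order (i.e. a display quirk, and not necessarily the port actually being tried). A quick check of that pattern:

```python
def byteswap16(port):
    """Swap the two bytes of a 16-bit port number, i.e. convert between
    host byte order and network byte order on a little-endian machine."""
    return ((port & 0xFF) << 8) | (port >> 8)

# Each configured btl_tcp_port_min_v4 value from the posts above,
# byte-swapped, reproduces the port printed in the verbose log.
pairs = [(10000, 4135), (20000, 8270), (30000, 12405), (40000, 16540)]
for configured, reported in pairs:
    assert byteswap16(configured) == reported

# The original mysterious "port 4" also fits: 1024 byte-swapped is 4,
# and 1024 is the boundary of the system-restricted port range.
assert byteswap16(1024) == 4
```

If that pattern holds, the restricted-port worry may be a red herring and the hang would have some other cause.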
October 11, 2014, 14:14 |
#8 |
Retired Super Moderator
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128 |
Hi Craig,
That weird issue with the port numbers is sort-of known to be a strange issue in Open-MPI, at least as far as I could find on the web.

MPICH2 was dropped when developing blueCFD-Core 2.3 because the latest versions of MPICH2 are only available for Linux/Unix/POSIX, while MS-MPI is derived from MPICH2 and has taken over its development for Windows. We had to make the choice of dropping support for the old MPICH2 1.4.1p1 and adding MS-MPI 2012, on the principle of it being more recent and allegedly improved for practical use on Windows. I'll see what I can do to bring back some support for MPICH2 1.4.1p1, even if it's as a DIY package plus instructions.

Beyond this, I've got a bad feeling that choosing Open-MPI 1.6.2 (because it's the latest major version of Open-MPI supported for Windows) wasn't the best option either. The previous version was 1.4.4, and perhaps it was the better choice. I'll try to have a look into this as well... perhaps designing a similar DIY kit is the simplest and most reasonable choice there too, in case someone wants to extend the ability to add Intel-MPI to the mix.

Best regards,
Bruno
Last edited by wyldckat; October 11, 2014 at 14:14. Reason: typo |
October 12, 2014, 15:57 |
#9 |
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
Hi Bruno,
Thank you for your support; if you could reintroduce support for MPICH2, I'd very much appreciate it. In the meantime I'll keep working with Open-MPI and MS-MPI as time allows; perhaps I'll have a breakthrough, and if so I'll report back.

Cheers,
Craig
October 16, 2014, 23:21 |
#10 |
New Member
Craig Tickle
Join Date: Jun 2013
Posts: 8
Rep Power: 12 |
I got it working! But with MS-MPI 2008. I thought I'd looked at MS-MPI before trying Open-MPI without success; I suspect I fixed an underlying network issue while troubleshooting Open-MPI.
The trick is to run a batch file on the remote nodes which sets the path and environment strings, maps network drives as required, and then launches smpd in debug mode. To do that I have a batch file, foamnode.bat, on the remote nodes:
Code:
@echo off
min
call setvars.bat
rem put map statements here to map the work directory to the same <dir>:<path> for all nodes
smpd -d
I then modified the MS-MPI section of gompi:
Code:
:MSMPI2008
:MSMPI2012
rem if a machines file exists, start smpd on each node
rem %2 has been reserved for the password
rem this script assumes that each node has the same user/password
IF "%MACHINEFILE%" == "" goto :LOCALONLY
set HOSTFILE=-machinefile %MACHINEFILE%
FOR /F "eol=» tokens=1,2 delims= " %%i in (%MACHINEFILE%) do (
    psexec \\%%i -u %%i\%username% -p %2 -d foamnode.bat
)
:LOCALONLY
@echo on
mpiexec -n %x_numprocs% %MPI_ACCESSORY_OPTIONS% %HOSTFILE% %1 -parallel %3 %4 %5 %6 %7 %8 %9
@echo off
IF "%MACHINEFILE%" == "" GOTO :END
rem kill smpd on the nodes
FOR /F "eol=» tokens=1,2 delims= " %%i in (%MACHINEFILE%) do (
    taskkill /s %%i /u %%i\%username% /p %2 /im smpd.exe /f
)
:END
The script assumes that the username is stored in %USERNAME% and that the user exists on all the nodes with the same password; I've co-opted the second parameter to pass the password to psexec.

psexec is a bit of a pain to get working: on Windows 7 you need to create the registry value "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\LocalAccountTokenFilterPolicy", set it to 1, and open the firewall for Remote Service Management (RPC).

psexec.exe can be found at http://technet.microsoft.com/en-us/s.../bb897553.aspx
min.exe can be found at http://www.paulsadowski.com/wsh/cmdprogs.htm
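For completeness, a sketch of the remaining pieces. The machines file is whatever the FOR /F loops above parse, i.e. one node name per line (the names here are hypothetical), and the registry tweak just described can also be applied with a reg command rather than by hand; both are assumptions based on the description above rather than something posted in the thread:

```
rem machines file: one node name per line, e.g.
rem   slavenode1
rem   slavenode2

rem Registry value that allows psexec to connect with a local
rem (non-domain) administrator account on Windows 7:
reg add HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System ^
    /v LocalAccountTokenFilterPolicy /t REG_DWORD /d 1 /f
```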