OpenFOAM on Windows cluster with OpenMPI
Hi,
I'm not sure if this is an OpenFOAM problem or an Open MPI problem. I've been trying to form a mini cluster out of a couple of spare workstations.

If I decompose my test job into 2 and set slots=1 for each entry in the hostfile, I can run 1 process on each workstation and it all works fine. If I decompose the test job into 4 and set slots=2 for each entry in the hostfile and attempt to run 2 processes on each workstation, it fails. I can see the correct number of processes start on both workstations, and they're using 100% of the CPU core they're running on, but the solver doesn't seem to do anything: it doesn't exit, doesn't generate any results and doesn't generate any error messages. If I run 3 processes on the head node and 1 on the slave node, this works fine (!?). I can also run all 4 processes on the slave node without problems. As I'm not getting any error messages, it's proving difficult to troubleshoot. Any ideas?

Additional info:
- I'm using blueCape's Windows build of OpenFOAM 2.3 with OpenMPI-1.6.2 on Windows 7 x64.
- Both machines can run decomposed cases successfully by themselves.
- Paths and installation directories are identical on both machines.
- Both machines have a copy of the decomposed case on their local drive.

I'm starting mpirun with the following parameters:

Code:
mpirun -np 4 -hostfile machines -mca btl_tcp_if_exclude sppp -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE icoFoam.exe -parallel

where MPI_BUFFER_SIZE=20000000.
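For reference, the "machines" hostfile for the failing 2x2 layout looks something like this (the host names here are placeholders, not my real machine names):

```
# "machines" hostfile: one line per node
# slots=1 on each node works; slots=2 on each node hangs
rigel  slots=2
slave1 slots=2
```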
Greetings Craig,
Therefore, a few questions:
Bruno

PS1: I'm the one responsible for the development of blueCFD-Core.
PS2: For reference, the other post I remember that is somewhat related to this topic is this one: http://www.cfd-online.com/Forums/ope...ing-wrong.html
Notes: the test-parallel case works fine; I can launch it from either machine and it gives the expected output. I only seem to have problems if I try to launch more than 1 process on the slave node. Side note: if I stop the firewall service, I get errors about connecting to namespaces.
I had a brief play with MS-MPI and got as far as being able to launch processes on the remote node, but I never got the two nodes communicating. It seemed that MS-MPI assumed the existence of a domain controller, which I don't have. Correct me if I'm wrong, but it seems to me that forming an ad-hoc cluster by temporarily joining 2 to 3 workstations together in a workgroup environment isn't really catered for by Microsoft: you go straight from multi-core single machines to a full-blown cluster, with all the overhead that entails.

Thanks,
Craig
Hi Craig,
Sorry for taking so long to answer, but it was a long week for me and only today did I finally manage to come to the forum and spend some time here.

Regarding MS-MPI: the domain namespace is only needed for hard-core HPC in Windows. If more than one machine is needed, either you need to install the HPC pack (which may or may not exist, since they sometimes offer it separately and other times they don't :rolleyes:) or you have to run smpd in debug mode on all nodes... which isn't very efficient :( And yes, the HPC pack will likely need a domain namespace :(

As for Open-MPI not working: I believe the problem might be related to the IP-address-to-host-name correspondence, which can be defined in the "hosts" file on each machine: http://en.wikipedia.org/wiki/Hosts_%28file%29 - this is because even if you define the "machines" file to use IP addresses, the MPI toolbox might get the bright idea to look up the name for each machine and then try to use that instead of the IP address we gave it. This would explain why the slave machine spends so much time doing nothing... it's possible it's still searching for the IP for a particular host name.

edit: Have a look into this: http://www.open-mpi.org/faq/?categor...outability-1.3

Best regards,
Bruno
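For example, the hosts file on each machine (C:\Windows\System32\drivers\etc\hosts on Windows) would gain one line per node; the names and addresses below are placeholders to adapt to your own network:

```
# pin each node's host name to its LAN address so that
# name lookups agree with the "machines" hostfile
192.168.1.60  rigel
192.168.1.61  slave1
```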
Adding the machine names and IP addresses to the hosts files didn't resolve my issue, but both machines can now ping each other by name.
I also tried adding IPv6 addresses so they can ping each other over IPv6; that didn't help either. I haven't yet fully digested the information at the second link, but setting --mca btl_base_verbose 30 gives me some additional information: it appears to hang while contacting the second process on the remote node.

If I run 2 processes on the remote node I get:

Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
< case hangs >

If I run 1 process on the remote node I get:

Code:
< bunch of information >
[Rigel:07792] btl: tcp: attempting to connect() to address 192.168.1.61 on port 4
Case : C:/blueCFD-Core-2.3/ofuser-2.3/cavity2
< case proceeds >
Hi Craig,

"on port 4"!? If there wasn't a problem somewhere in the copy-paste procedure, then that's pretty much the real problem here. Ports 1 to 1024 are special system-restricted ports, which only administrators and system applications can use. Let me see if I can find out how to configure which ports are to be used by Open-MPI... OK: keep the interface exclusion you already have:

Code:
-mca btl_tcp_if_exclude sppp

and add options to force a higher port range, e.g.:

Code:
mpirun --mca btl_base_verbose 30 --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000 --mca btl_tcp_if_exclude sppp -n 4 -hostfile machines -x HOME -x PATH -x USERNAME -x WM_PROJECT_DIR -x WM_PROJECT_INST_DIR -x WM_OPTIONS -x FOAM_LIBBIN -x FOAM_APPBIN -x FOAM_USER_APPBIN -x MPI_BUFFER_SIZE -x WM_MPLIB icoFoam.exe -parallel

Best regards,
Bruno
If I use:

- --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 4135
- --mca btl_tcp_port_min_v4 20000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 8270
- --mca btl_tcp_port_min_v4 30000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 12405
- --mca btl_tcp_port_min_v4 40000 --mca btl_tcp_port_range_v4 30000, it attempts to connect on port 16540

but it still doesn't connect. It might be time to take a fresh look at MS-MPI. I assume there were some very good technical reasons for removing support for MPICH2, which used to work really well for me.

Thank you,
Craig
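Playing with those numbers: each reported port is exactly the configured minimum with its two bytes swapped (10000 = 0x2710 comes out as 0x1027 = 4135), so this smells like a byte-order (htons/ntohs) slip somewhere in the Windows build rather than a firewall problem. That's only my speculation, but the arithmetic checks out:

```shell
# Swap the two bytes of each configured minimum port and compare with
# the ports reported in the verbose output
for min in 10000 20000 30000 40000; do
  swapped=$(( ((min & 0xff) << 8) | (min >> 8) ))
  printf '%d -> %d\n' "$min" "$swapped"
done
# prints 4135, 8270, 12405 and 16540 -- the exact ports observed above
```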
Hi Craig,
That weird issue with the port numbers is sort-of known to be a strange issue in Open-MPI, at least as far as I could find on the web.

MPICH2 was dropped when developing blueCFD-Core 2.3 because the latest versions of MPICH2 are only available for Linux/Unix/POSIX, while MS-MPI is derived from MPICH2 and has taken over its development for Windows. We had to make the choice of dropping support for an old MPICH2 1.4.1p1 and adding MS-MPI 2012, on the principle of it being more recent and allegedly improved for practical use on Windows. I'll see what I can do to bring back some support for MPICH2 1.4.1p1, even if it's as a DIY package+instructions ;)

Beyond this, I've got a bad feeling that choosing Open-MPI 1.6.2 (because it's the latest major version of Open-MPI supported on Windows) wasn't the best option either. The previous version was 1.4.4, and perhaps it was the better choice. I'll try to have a look into this as well... perhaps designing a similar DIY kit is the simplest option here too, and possibly the most reasonable one, in case someone wants to extend it to add Intel-MPI to the mix :D

Best regards,
Bruno
Hi Bruno,
Thank you for your support. If you could reintroduce support for MPICH2, I'd very much appreciate it. In the meantime I'll keep working with OpenMPI and MS-MPI as time allows; perhaps I'll have a breakthrough, and if so I'll report back.

Cheers,
Craig
I got it working! But with MS-MPI 2008. I thought I'd looked at MS-MPI before trying OpenMPI, without success; I suspect I fixed an underlying network issue while troubleshooting OpenMPI.
The trick is to run a batch file on the remote nodes which sets the path and environment strings, maps network drives as required, and then launches smpd in debug mode. To do that I have a batch file on the remote nodes, foamnode.bat:

Code:
@echo off

I then modified the msmpi section of gompi:

Code:
:MSMPI2008

The script assumes that the username is stored in %USERNAME% and that the user exists on all the nodes with the same password; I've co-opted the second parameter to pass the password to psexec.

psexec is a bit of a pain to get working: on Windows 7 you need to create the registry value "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\LocalAccountTokenFilterPolicy", set it to 1, and open the firewall for Remote Service Management (RPC).

psexec.exe can be found at http://technet.microsoft.com/en-us/s.../bb897553.aspx
min.exe can be found at http://www.paulsadowski.com/wsh/cmdprogs.htm
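Since only the first lines of foamnode.bat and the gompi change survive above, here is a purely hypothetical sketch of what such a remote-node batch file could look like, based only on the description (set environment, map a drive, start smpd in debug mode) plus the registry tweak scripted with the standard reg.exe tool; every path, share name and drive letter below is a guess:

```
@echo off
rem Hypothetical sketch of foamnode.bat -- paths, share names and
rem drive letters are guesses, not the original contents.

rem One-off (elevated): allow psexec to use local accounts on a
rem Windows 7 workgroup machine, as described above
reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" /v LocalAccountTokenFilterPolicy /t REG_DWORD /d 1 /f

rem Make the solver and MPI binaries visible, map the shared case drive
set PATH=C:\blueCFD-Core-2.3\msmpi\bin;%PATH%
net use Z: \\headnode\cases

rem Finally, run the MS-MPI process manager in debug mode in the foreground
smpd -d
```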