Decomposed case on 2 cores (not working)
I am working on the interFoam/laminar/damBreak case. The generated mesh has 955000 cells. To run in parallel, the mesh is decomposed using the METIS method.
When decomposed and run on 4 cores (quad-core Xeon E5620), it works perfectly fine. On changing the decomposition to 2 cores, the system hangs after some time of running and displays the following error: "mpirun noticed that process rank 0 with PID 3758 on node exited on signal 11 (Segmentation fault)." Please suggest. Thanks |
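For reference, the decomposition method described above is chosen in the case's system/decomposeParDict; a minimal sketch of the relevant entries for the 2-core METIS run (values illustrative):

```
numberOfSubdomains 2;

method metis;
```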
did you delete all the previous processor* folders before decomposing again?
|
Oh yes, I deleted the processor* directories. Any other clues?
|
Hi, sometimes changing the decomposition method fixes the problem; try the simple method.
Best. |
Thanks for your response. You are right, changing the decomposition method to simple fixes this problem. But my requirement is to reduce the number of boundary faces shared between the cores, so that I can reduce the communication cost. This is the reason I shifted from simple to METIS.
Do you think the problem is related to the MPI buffer size? |
Greetings to all!
@pkr: Wow, you've been asking about this in a few places! Like Santiago implied, the problem is related to how the cells are distributed through the sub-domains. And this isn't a new issue; it has been happening for quite a while now. One case that I followed for a bit led to this bug report: http://www.cfd-online.com/Forums/ope...-parallel.html - if you follow up the story on that thread and its related link, you'll learn about an issue with cyclic patches and so on. Nonetheless, if your case is just a high-resolution damBreak case, without anything in particular added - like cyclic patches, wedges and so on - then the problem should be related to a few cells that are split up between processors when they should be kept together.
Also, if it's just the damBreak case, AFAIK decomposing with METIS will not give a smaller interface than the simple or hierarchical methods. The proof is the face count reported by decomposePar, which always shows METIS producing something like 50% more faces interfacing between domains than the simple or hierarchical methods. My experience with high-resolution versions of the damBreak and cavity cases, in attempts to benchmark OpenFOAM on a multi-core machine, has led me to conclude that the simple and hierarchical methods are more than enough, and in fact better, for situations like these, where the meshes are so simple. METIS and Scotch are for more complex meshes, with no clear indication of where the best places to split the mesh are.
Now, if you still want to use METIS, then also try Scotch, which is usually available with the latest versions of OpenFOAM. It's conceptually similar to METIS, but has a far more permissive software license. It will likely distribute the cells between sub-domains differently; with luck, the wrong cells won't end up apart from each other.
Also, if you run the following command in the tutorials and applications folders of OpenFOAM, you can find out a bit more about decomposition options from other dictionaries: Code:
find . -name "decomposeParDict"
Bruno
PS: By the way, another conclusion was that on a single machine with multiple cores, over-scheduling the processors sometimes yields more processing power. One such case was a modified 3D cavity using the icoFoam solver and about 1 million cells, where on my AMD 1055T x6 cores, 16 sub-domains led to a rather better run time than 4 or 6 sub-domains! But still, I have yet to achieve linear speed-up or anything near it :( even from a CPU computation power point of view (i.e., 6x the power with a 6-core machine, no matter how many sub-domains). |
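The simple method compared above takes an explicit split pattern in system/decomposeParDict; a sketch for a 4-way split (the n and delta values here are illustrative):

```
numberOfSubdomains 4;

method simple;

simpleCoeffs
{
    n       (2 2 1);
    delta   0.001;
}
```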
Hey Bruno, thanks for the explanation. I have a related problem, working with interFoam and METIS too. We have a parallel facility with a server and diskless nodes which read the OS through the net via NFS. When I use METIS and run, for example, a) 2 threads, each one on one core of the server, things go well. Then, if I do the same on b) a node (server and nodes have 8 cores), the problem is decomposed correctly but only one core takes the load, and the run is very slow.
Another case: c) launching from the server, but sending one thread to node1 and the other one to node2 - correct decomposition, balanced load, all OK. Finally, d) launching from the server, sending two threads to the same node - same problem as b). It is very weird; it sounds like the nodes don't like multi-core processing with OpenFOAM. Regards. |
Hi Santiago,
Yes, I know, it's really weird! Here's another data point I picked up from this forum, a draft report by Gijsbert Wierink: Installation of OpenFOAM on the Rosa cluster. If you look at Figure 1 in that document, you'll see the case can't speed up unless it's unleashed onto more than one machine! I've replicated the case used, and the timings with my AMD 1055T x6 are roughly the same. It was that information that led me to try over-scheduling 16 processes onto the 6 processors, which achieved rather better performance than using only 6 processes. Basically, the timings reported in that draft indicate a lousy speed-up of almost 4 times on an 8-core machine (4 cores per socket, dual-socket machine, if I'm not mistaken), but when 16 and 32 cores (3-4 nodes) are used, the speed-up is 10 and 20 times! Above that, it saturates due to the cell/core count dropping too far below the 50k cells/core estimate. With this information, along with the information in the report "OpenFOAM Performance Benchmark and Profiling" and the estimated minimum limit of 50k cells/core, my deductions are:
edit: I forgot to mention, if my memory isn't failing me, that here in the forum there is some limited information about configuring the shared memory defined by the kernel, which can play a rather important role in local runs, but I've never actually been successful in properly tuning these parameters. Best regards, Bruno |
Thanks Bruno. I am working on your suggestions.
I am also trying to get the parallel case running across machines. To test the parallel solution, I followed the steps mentioned in another post: http://www.cfd-online.com/Forums/ope...tml#post256927. If I run the parallel case with 2 processes on a single machine, the parallelTest utility works fine: Quote:
On the other hand, if I split the processing across 2 machines, then the system hangs after "create time": Quote:
P.S. OpenFOAM version 1.6 is used on both machines. |
Hi pkr,
Wow, I love it when the parallelTest application does some crazy time travel and flushes the buffer in a crazy order :D As for the second run:
On the other hand, if at least one of them is true, then you should check how the firewall is configured on those two machines. The other possibility is that the naming convention for the IP addresses isn't being respected on both machines. For example, if the first machine has defined in "/etc/hosts" that:
My usual trick to try to isolate these cases is to:
Best regards and good luck! Bruno |
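To illustrate the host-naming point above: both machines should resolve each other's names to real network addresses, and neither should map its own hostname to the loopback address. A hypothetical /etc/hosts, consistent on both machines (names and IPs invented for illustration):

```
127.0.0.1     localhost
192.168.1.11  machine1
192.168.1.12  machine2
```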
Bruno, some comments,
Quote:
Regards. |
Thanks for your response Bruno. I tried your suggestions, but still no progress in solving the problem.
Quote:
Quote:
Quote:
Quote:
Quote:
Apart from this, I tried a simple Open MPI program, which works fine. The code and output are as follows: Quote:
Quote:
mpirun -np 2 parallelTest ==> works
mpirun --hostfile machines -np 2 parallelTest ==> not working
Quote:
Do you think it might be a problem with the version I am using? I am currently working with OpenFOAM 1.6. Shall I move to OpenFOAM 1.6.x? Please suggest some other things I can check. Thanks. |
Hi Bruno,
Another query: please comment on the process I am following to execute parallelTest across the machines.
1. machine1 as master and machine2 as slave.
2. On machine1, change system/decomposeParDict for 2 processes.
3. Execute decomposePar on machine1, which creates two directories, processor0 and processor1.
4. Create a machines file on machine1 containing machine1 and machine2 as entries.
5. Copy the processor0 and processor1 directories from machine1 to machine2. (Directory: OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak)
6. Launch "foamJob -p -s parallelTest" on machine1.
After following these steps, the output gets stuck at "create time" as follows: Quote:
Please comment on whether I am following the right process for executing the parallelTest application across the machines. |
Hi Bruno,
Yet another query: it seems that the problem might be due to the setting of some environment variables on the slave. Any suggestions? The OpenFOAM project directory is visible on the slave side:
rphull@fire3:~$ echo $WM_PROJECT_DIR
/home/rphull/OpenFOAM/OpenFOAM-1.6
1. The case when the complete path of the executable is not specified: Quote:
2. The case when the complete executable path is specified: Quote:
From the second case, it looks like the machine tried to launch the application but failed, as it was not able to figure out the path for the shared object (libinterfaceProperties.so in this case). Any suggestions to fix this? |
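One plausible cause of the missing shared object described above is that the OpenFOAM environment is only sourced for interactive shells, so LD_LIBRARY_PATH is empty in the non-interactive shell that mpirun spawns on the slave. A sketch of the line to keep near the top of ~/.bashrc on every machine, before any early exit for non-interactive shells (install path assumed from the posts above):

```
. /home/rphull/OpenFOAM/OpenFOAM-1.6/etc/bashrc
```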
Hi pkr,
Mmm, that's a lot of testing you've been doing :) OK, let's see if I don't forget anything:
Code:
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
These two tests can help you isolate where the problem is, since they only launch one instance of parallelTest on the remote machine, without the need for explicit communication between processes via MPI. OK, I hope I didn't forget anything. Best regards and good luck! Bruno |
Hi Bruno,
Quote:
Quote:
Quote:
Does putting "-parallel" make it run in a master-slave framework? Quote:
Both cases work fine on machine1. When I try the same on machine2, the following case fails:
rphull@fire3:~/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/interFoam/laminar/damBreak$ mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
exec: 128: parallelTest: not found
Is this the root cause of the problem? Any suggestions to fix this? |
Hi pkr,
Quote:
Quote:
When using NFS or sshfs, this is sort of done automatically for you, except that the real files should only reside on a single machine, instead of being physically replicated on both machines. Quote:
Quote:
OK, the first thing that comes to my mind is that "parallelTest" is only available on one of the machines. To confirm this, run on both machines: Code:
which parallelTest
Now, when I try to think more deeply about this, I get the feeling that there is something else slightly different on one of the machines, but I can't put my finger on it... it feels like it's either the Linux distribution version that isn't identical... or something about bash not working the same exact way. Perhaps it's how "~/.bashrc" is defined on both machines... check if there are any big differences between the two files. Any changes to the "PATH" and "LD_LIBRARY_PATH" variables inside "~/.bashrc" which differ in some particular way can lead to very different working environments! The other possibility would be how ssh is configured on both machines... Best regards, Bruno |
Thanks Bruno.
Quote:
mpirun -np 1 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host machine2 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -case $PWD
But still the system hangs when I try to run the Open MPI parallelTest with the -parallel keyword on 2 machines :( Quote:
Quote:
|
Hi pkr,
Quote:
Another test you can try, which is launching a parallel case to work solely on the remote machine: Code:
mpirun -np 2 -host machine1 /home/rphull/OpenFOAM/OpenFOAM-1.6.x/bin/foamExec parallelTest -parallel -case $PWD
It feels like we are really close to getting this to work, and yet it seems so far... OK, another test for trying to isolate what on digital earth is going on - run this using the two machines, from/to either one of them, and also only locally and only remotely: Code:
foamJob -s -p bash -c export
This is like the last resort, which should help verify what the environment looks like on the remote machine, when using mpirun to launch the process remotely. The things to keep an eye out for are:
Right now I'm too tired to figure out any more tests and/or possibilities. Best regards and good luck! Bruno |
Thanks for your response. I am listing all the commands in this post. I still have to try the commands for checking differences in the remote machine configuration. I will get back to you soon on that.
Quote:
Commands to launch:
1. Without -parallel but with a machine file:
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
====> Works fine from both machines
2. With -parallel but without any machine file:
mpirun -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
====> Works fine from both machines
3. With -parallel and with a machine file:
mpirun -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
====> Does not work from either machine
With foamJob, I try the following:
foamJob -p -s parallelTest ==> This works when the machines file is not present in the current directory; otherwise it fails.
All of the following commands work fine from both machines:
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -case $PWD
mpirun -np 1 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 1 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest
mpirun -np 2 -host fire3 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
mpirun -np 2 -host fire2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec parallelTest -parallel
When running with -parallel across the machines, once in a while I see the following error message. Have you seen it before? Quote:
I also tried debugging with gdb. Here is the call stack where the program gets stuck when running with -parallel across the machines: Quote:
|
Hi Bruno,
Thanks for all your help. I figured out that the problem is because of the way Open MPI behaves: "Unless otherwise specified, Open MPI will greedily use all TCP networks that it can find and try to connect to all peers upon demand (i.e., Open MPI does not open sockets to all of its MPI peers during MPI_INIT -- see this FAQ entry for more details). Hence, if you want MPI jobs to not use specific TCP networks -- or not use any TCP networks at all -- then you need to tell Open MPI." When using MPI_Reduce, Open MPI was trying to establish a TCP connection through a different interface. The problem is solved if the following command is used:
mpirun --mca btl_tcp_if_exclude lo,virbr0 -hostfile machines -np 2 /home/rphull/OpenFOAM/OpenFOAM-1.6/bin/foamExec interFoam -parallel
The above command restricts MPI from using the listed interfaces (lo and virbr0 in this case). |
Hi pkr,
:eek: Congratulations! It would have taken me a while longer to suspect that this would be the case :( I still only had the notion that something was fishy about the network or how things were connecting to each other. And thank you for sharing the solution! I guess this is one of those details that either comes from experience or from a dedicated MPI course! :) Best regards, Bruno |
Hi Bruno,
I would like to use Scotch partitioning instead of METIS. Please suggest a way to enable Scotch partitioning in OpenFOAM 1.6. Thanks. |
Hi Pkr,
I just tested to confirm it... it's as simple as editing the file "system/decomposeParDict" and setting the method like this: Code:
method scotch; Bruno |
Bruno,
I have a couple of questions on decomposition. On running the decomposePar utility, the mesh gets decomposed into processor* directories. Each processor directory contains the points, cells and neighbours that the process works upon. As with any OpenFOAM solver, we are trying to solve the equation Ax = b. So I have the following queries:
1. Where in the code do the processor* directories get converted into the matrix A and the vectors x and b?
2. Each processor computes the matrix-vector product Ax on its own set of data. After each multiplication, does the processor exchange boundary elements with its neighbours and somehow update the matrix-vector product to rectify its own result? How does this update work?
3. How is working on small matrices by each processor equivalent to working on the complete matrix? I mean, none of the processors gets the complete matrix-vector product at any time.
Let me know if you have any ideas on this. Thanks |
Hi Pkr,
Sadly, I don't know how OpenFOAM does Ax = b in parallel. But if I needed to find out, I would start looking for which solvers/functions/methods call upon the methods made available in "libPstream.so"! The source code for Pstream is in the folder "$WM_PROJECT_DIR/src/Pstream/mpi". It has the methods used for communicating between processors. If you trace back who calls these methods, you should be able to figure out how OpenFOAM does the matrix operations! The other keywords to keep an eye out for are the classes/methods related to the pre-conditioners and the respective matrix solvers that we usually define in fvSolution. Good luck! Bruno |
Bruno, Thanks for your response. Another query:
If the mesh is already decomposed among 4 processors: after decomposition, is there a way for one of the processors to transfer some of its cells/points/faces to another processor at runtime? I am looking for some kind of join and split operation. Thanks |
Quote:
1. The processor directories only contain sub-domains of the mesh after partitioning. Parallelisation is actually done at the lduMatrix level.
2. If you really want to see the exchange process in action, take a look at the lduMatrix class: ($FOAM_SRC)/OpenFOAM/matrices/lduMatrix/lduMatrix/lduMatrixUpdateMatrixInterfaces.C
There are (init/update)MatrixInterfaces member functions that do what you're looking for.
3. You're right - each processor does only its bit of the matrix multiply. The rest is handled through information exchange across processor boundaries. Most Krylov solvers require some form of global dot-product reduction, but besides that, this is essentially the meat-and-potatoes of it. Hope this helps. |
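The row-wise split and interface exchange described above can be mimicked with a toy distributed matrix-vector product in plain Python. This is a simplified model, not the lduMatrix implementation: two pretend processors each own a block of rows, multiply by their locally owned columns, and then add the contribution of off-processor columns using the values that would arrive from the neighbour via MPI.

```python
# Toy model of a distributed matrix-vector product y = A @ x,
# illustrating the interface exchange described above.
# NOT OpenFOAM code: a simplified sketch with two "processors".

A = [
    [ 4.0, -1.0, -1.0,  0.0],
    [-1.0,  4.0,  0.0, -1.0],
    [-1.0,  0.0,  4.0, -1.0],
    [ 0.0, -1.0, -1.0,  4.0],
]
x = [1.0, 2.0, 3.0, 4.0]

# Ownership: proc 0 owns rows/cols {0, 1}, proc 1 owns {2, 3}.
owned = [[0, 1], [2, 3]]

def local_matvec(proc):
    """Each processor multiplies only by the columns it owns."""
    rows = owned[proc]
    return {i: sum(A[i][j] * x[j] for j in rows) for i in rows}

def add_interface(proc, y):
    """Add contributions of off-processor columns, using neighbour
    values that would arrive via MPI in the real code."""
    other = owned[1 - proc]          # "received" neighbour values
    for i in owned[proc]:
        for j in other:
            y[i] += A[i][j] * x[j]   # interface coefficient * received value
    return y

# Run both "processors" and gather the distributed result.
y = {}
for proc in (0, 1):
    y.update(add_interface(proc, local_matvec(proc)))

# Compare against the plain serial product.
full = [sum(A[i][j] * x[j] for j in range(4)) for i in range(4)]
assert all(abs(y[i] - full[i]) < 1e-12 for i in range(4))
print("distributed matvec matches serial matvec")
```

Note that the real updateInterfaceMatrix code subtracts the interface term instead of adding it, because of the sign convention in which lduMatrix stores its off-diagonal coefficients; the structure of the computation is the same.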
Thanks for your response Sandeep.
It seems that each process solves its own part of Ax = b, where A, x and b are formed from the sub-domain of the mesh in the processor* directories. Can you explain further what you mean by parallelisation at the lduMatrix level?
Talking about the exchange of messages across the interface, I don't understand why the following operation is performed:

void processorFvPatchField<scalar>::updateInterfaceMatrix(...) const
{
    scalarField pnf
    (
        procPatch_.compressedReceive<scalar>(commsType, this->size())()
    );

    const unallocLabelList& faceCells = patch().faceCells();

    forAll(faceCells, facei)
    {
        result[faceCells[facei]] -= coeffs[facei]*pnf[facei];
    }
}

What's the need for subtracting coeffs[]*pnf[] from the matrix-vector product result[]? How does it fix the result after being partitioned? Again, how does splitting a big mesh into smaller meshes, where each smaller mesh is solved for Ax = b, give the global solution for the complete large mesh? Thanks. |
Hi pkr,
Quote:
Best regards, Bruno |
Quote:
This is my guess: let's say you split the main matrix into two parts, A = A_local + A_parallel. Now A_parallel has the matrix coefficients for neighbours on other processors. To solve Ax = b, you would have to write it like this:
A_local x = b - A_parallel . x_old
This is why that code is subtracting the product of the coefficients with the boundary values. This is just a guess, though.
---------
Another guess is that the matrix is stored as
Ap phi_p = Sum( Al phi_l ) + b
In this case, for the matrix-vector product you would have to subtract rather than add. |
Some basic doubts
Dear Bruno,
I have some basic doubts about the working of the interFoam/damBreak case:
1. What difference does it make if the following changes are made in controlDict:
start time: 0, end time: 1
changed to
start time: 0, end time: 0.25
I understand that it reduces the simulation time, but what happens to the quality/correctness of the results?
2. What equation is interFoam/damBreak actually solving? From the code in pEqn.H: Code:
{
3. How does the solving of some equation narrow down to a PCG solver, which solves an equation of the form Ax = b? Looking forward to hearing from you. Thanks. |
Hi Pkr,
My experience in CFD is really limited, so I ask arjun and deepsterblue, or any other experienced OpenFOAM user, to fill in the gaps of what I can't answer ;) Quote:
This is different from, for example, icoFoam and simpleFoam, which are stationary solvers (again, not sure of the terminology); in those, the time steps are related to the number of iterations made, with no temporal relevance. In interFoam, what affects the quality are other parameters, such as maxCo, maxAlphaCo, maxDeltaT, "writeControl adjustableRunTime" and deltaT. For more information about these, and the physics/equations behind this, read the OpenFOAM User Guide section about the damBreak tutorial: http://www.openfoam.com/docs/user/damBreak.php or, more specifically, Time step control. As for the other questions, I don't know enough. Again, I suggest you read the damBreak section in the user guide; also check the fvSchemes file for the list of equations possibly used and the methods used to solve them. Also, check the rest of the code that interFoam uses, because it might still be using pieces of code from other header files that you didn't look into ;) Best regards, Bruno |