[PyFoam] multiprocessing

#1 - ashkan (Member) - December 25, 2017, 03:20
Hi All,
I would like to run a series of OpenFOAM simulations using PyFoam concurrently, i.e. running multiple simulations in parallel.

I created a Python script using PyFoam and joblib. The script works perfectly when each simulation itself runs in serial. However, when I want to run each simulation in parallel, the script still works but the simulations get very slow.

For example, I have 64 cores and I want to use 2 processors for each simulation, so basically 32 parallel simulations running concurrently.

I checked the CPU usage: when I run the script in parallel, some of the processors show very low usage, which does not happen when running in serial.

Here is my script; I would highly appreciate any comments/hints to resolve the issue. I was wondering if it has something to do with the LAMMachine settings (maybe specifying node numbers)?

Many thanks in advance
Ashkan

Code:
import sys,os
import numpy as np
from os import path
from PyFoam.Execution.UtilityRunner import UtilityRunner
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Infrastructure.ClusterJob import SolverJob
from PyFoam.RunDictionary.SolutionFile import SolutionFile
from PyFoam.RunDictionary.SolutionDirectory import SolutionDirectory
from PyFoam.RunDictionary.ParsedParameterFile import ParsedParameterFile
from PyFoam.Execution.ParallelExecution import LAMMachine
from PyFoam.Basics.DataStructures import Vector
from PyFoam.Error import error

from joblib import Parallel, delayed
import multiprocessing
############################################################################################
num_cores = multiprocessing.cpu_count()   #number of cores on machine

OFcpus  = 2        #number of CPUs for each simulation
num_cases_paral = int(num_cores/OFcpus)
print(num_cases_paral)

solver="simpleFoam"        
OrigCase="BaseCase"
curpath=os.path.abspath('.')

nu      = 1.e-6;
Leng    = 8.0;

U0 = 0.;  Ue = 2.0;  dU = .1
NumUin = int((Ue - U0)/dU) + 1
CurrVel = np.linspace(U0,Ue,NumUin,endpoint=True) 

roughness = np.array([5e-6, 4e-5, 2e-3, 4e-2])
############################################################################################

#---------------------defining the function for cloning and running--------------------

def Run_Case(iu,ir) :
    flowVelocity  = CurrVel[iu];
    ks            = roughness[ir];
    
    #-------Estimating initial values----------
    flowRe  = flowVelocity*Leng/nu
    TurbInt = 0.168*pow(flowRe,-1./8.)
    
    turbulentKE      = (3./2.)*pow((flowVelocity*TurbInt),2.)
    turbulentEpsilon = pow(0.09,3./4.)*pow(turbulentKE,3./2.)/(0.07*Leng);
    turbulentOmega   = np.sqrt(turbulentKE)/(0.07*Leng);
    
    #-------Creating new case directory----------
    NewCase="CurrProfile"+"Uc_"+str(flowVelocity)+"Ks_"+str(ks)
    
    orig=SolutionDirectory(OrigCase,archive=None,paraviewLink=False)
    
    
    case=orig.cloneCase(NewCase)
    dire=SolutionDirectory(NewCase,archive=None,paraviewLink=False)
    

    #-------Modifying initial conditions----------
    velFile=ParsedParameterFile(path.join(dire.initialDir(),"U"))
    velFile["internalField"].setUniform(Vector(flowVelocity,0,0))
    velFile["boundaryField"]["inlet"]["average"]=Vector(flowVelocity,0,0)
    velFile.writeFile ()
    
    pressFile=ParsedParameterFile(path.join(dire.initialDir(),"p"))
    pressFile["internalField"].setUniform(0)
    pressFile.writeFile ()
    
    kFile=ParsedParameterFile(path.join(dire.initialDir(),"k"))
    kFile["internalField"].setUniform(turbulentKE)
    kFile["boundaryField"]["inlet"]["average"]=turbulentKE
    kFile.writeFile ()
    
    omegaFile=ParsedParameterFile(path.join(dire.initialDir(),"omega"))
    omegaFile["internalField"].setUniform(turbulentOmega)
    omegaFile["boundaryField"]["inlet"]["average"]=turbulentOmega
    omegaFile.writeFile ()
    
    epsilonFile=ParsedParameterFile(path.join(dire.initialDir(),"epsilon"))
    epsilonFile["internalField"].setUniform(turbulentEpsilon)
    epsilonFile["boundaryField"]["inlet"]["average"]=turbulentEpsilon
    epsilonFile.writeFile ()
    
    nutFile=ParsedParameterFile(path.join(dire.initialDir(),"nut"))
    nutFile["boundaryField"]["bottom"]["Ks"].setUniform(ks)
    nutFile.writeFile ()

    #-------Meshing----------
    os.system('m4 '+NewCase+'/system/blockMeshDict.m4'+' > '+NewCase+'/system/blockMeshDict')
    blockRun = BasicRunner(argv=["blockMesh","-case",NewCase],silent=True,logname="Blocky")
    print("Running blockMesh")
    blockRun.start()
    if not blockRun.runOK() :
        error("There was a problem with blockMesh in Case ",NewCase)

    #-------Decompose the case----------

    decompRun = UtilityRunner(argv=["decomposePar","-force","-case",NewCase],silent=True)
    print("Decomposing the case")
    decompRun.start()
    if not decompRun.runOK() :
        error("There was a problem with decomposePar in Case ",NewCase)
    
    #--------Run the simulation in parallel--------------
    machine = LAMMachine(nr=OFcpus)
    print("Running case",NewCase)
    theRun = BasicRunner(argv=[solver,"-case",NewCase], silent=True, lam=machine)
    theRun.start()
    if not theRun.runOK() :
        error("There was a problem running "+solver+" in Case ",NewCase)


#------------------------Running the Function in parallel-----------------------

Parallel(n_jobs=num_cases_paral)(delayed(Run_Case)(i, j) for j in range(len(roughness)) for i in range(1,len(CurrVel)))

#2 - Taataa (Taher Chegini, Senior Member) - December 25, 2017, 17:51
OpenFOAM ships with a lot of utilities that let you do this kind of scripting using only bash, so you don't have to rely on third-party tools such as PyFoam. I would suggest exploring those options as well, because if you later want to use a cluster, setting up all this third-party software to work with OF is a lot of unnecessary and painful work!

Back to your question: do you have 64 physical cores, or are you counting hyper-threaded (virtual) cores? You can check by running lscpu and multiplying Core(s) per socket by Socket(s). For example, on my laptop that gives 2 x 1 = 2, while CPU(s) reports 4. OF only uses the physical cores effectively, and if you try to use the virtual cores the performance may drop noticeably.

Another question: do you have enough memory for all these cases to run simultaneously? That could be another bottleneck. If I remember correctly, a rule of thumb in OF is that you need about 1 GB of memory per 1 million cells.
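The same check can be done from Python (a minimal sketch, assuming the third-party psutil package is installed):

Code:
import psutil   # third-party package, assumed installed

logical  = psutil.cpu_count(logical=True)    # counts hyper-threaded CPUs too
physical = psutil.cpu_count(logical=False)   # physical cores only
print(physical, "physical cores,", logical, "logical CPUs")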

#3 - ashkan (Member) - December 25, 2017, 19:32
Thanks for your comments, Taataa.
I personally believe PyFoam is a very useful tool for interacting with OpenFOAM, particularly when you need to run many simulations of the same problem under different conditions. I am still learning it, though.

Regarding OF only using physical cores, are you certain? I did a test a while ago with and without hyper-threading on my laptop, and the simulations were slightly faster with 8 cores (hyper-threading) than with only 4 physical cores. So I think OF does use virtual cores as well. But I might be wrong!

I also think the problem might be the physical/virtual core combination, but I believe it is Python's handling of cores rather than OF's. That's why I thought maybe I need to define the node list, but I have no idea how to do that here.

Thanks again for your comments
Ashkan

#4 - Taataa (Taher Chegini, Senior Member) - December 25, 2017, 23:43
Yes, I am sure. I asked Chris, one of the OF developers, and he confirmed it. It depends on the case, but it's better to use the physical ones.

Regarding PyFoam, I am not dismissing its benefits, but you can do everything it does with bash and the OF utilities, so why not exploit them? I usually use Python scripts for data analysis only.

Anyhow, cores and memory are usually the bottlenecks in situations like this, where you don't have to worry about IO communication.

#5 - gschaider (Bernhard Gschaider, Assistant Moderator) - December 31, 2017, 06:04
Quote:
Originally Posted by ashkan
Regarding OF only using physical cores, are you certain? I did a test a while ago with and without hyper-threading on my laptop, and the simulations were slightly faster with 8 cores (hyper-threading) than with only 4 physical cores. So I think OF does use virtual cores as well. But I might be wrong!

OF does not make any assumptions about the nature of the cores. It just tells MPI "start N instances", assumes that MPI knows what it is doing, and MPI usually hands the assignment of instances to cores over to the OS. You can start 200 instances on your 4-core machine, and MPI and the OS will try to make sense of the situation as best they can (which won't be very effective). In your case this means that the 4 additional instances use parts of the processor that are currently not used by the other 4 (hence the small speedup).

Quote:
Originally Posted by ashkan
I also think the problem might be the physical/virtual core combination, but I believe it is Python's handling of cores rather than OF's. That's why I thought maybe I need to define the node list, but I have no idea how to do that here.

PyFoam does no core handling at all. It just starts mpirun with the appropriate parameters, usually only the "-n" parameter, which is determined by the "--procnr" or "--autosense-parallel" option (more control is possible by specifying a machine file with the "--machinefile" option). Assigning instances to cores/machines is done by MPI.
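For example, a single parallel run started this way (the options are the ones named above; the case name is just a placeholder):

Code:
pyFoamRunner.py --procnr=2 simpleFoam -case someCase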

#6 - ashkan (Member) - December 31, 2017, 20:52
Quote:
Originally Posted by gschaider
OF does not make any assumptions about the nature of the cores. It just tells MPI "start N instances", assumes that MPI knows what it is doing, and MPI usually hands the assignment of instances to cores over to the OS. You can start 200 instances on your 4-core machine, and MPI and the OS will try to make sense of the situation as best they can (which won't be very effective). In your case this means that the 4 additional instances use parts of the processor that are currently not used by the other 4 (hence the small speedup).

Thanks Bernhard. That absolutely makes sense and is what I was thinking as well.

Quote:
Originally Posted by gschaider
PyFoam does no core handling at all. It just starts mpirun with the appropriate parameters, usually only the "-n" parameter, which is determined by the "--procnr" or "--autosense-parallel" option (more control is possible by specifying a machine file with the "--machinefile" option). Assigning instances to cores/machines is done by MPI.

Thanks for the comment, but I still don't understand why the speed-up drops so significantly when I run multiple instances (each being a parallel OF run) from my PyFoam script using Python's joblib. Does what I am hoping to achieve make sense at all?

#7 - Taataa (Taher Chegini, Senior Member) - January 1, 2018, 18:14
Quote:
Originally Posted by gschaider
OF does not make any assumptions about the nature of the cores. It just tells MPI "start N instances", assumes that MPI knows what it is doing, and MPI usually hands the assignment of instances to cores over to the OS.

OF is pure MPI (no OpenMP), therefore the recommended setting for OMP_NUM_THREADS is 1, meaning no hyper-threading: one job per CPU, physical cores only. If you want to utilize hyper-threading you have to implement OpenMP directives in the code, specifically in the linear solvers, which has partially been done before, for example this one and this one.

#8 - gschaider (Bernhard Gschaider, Assistant Moderator) - January 2, 2018, 07:16
Quote:
Originally Posted by ashkan
Thanks Bernhard. That absolutely makes sense and is what I was thinking as well.

Thanks for the comment, but I still don't understand why the speed-up drops so significantly when I run multiple instances (each being a parallel OF run) from my PyFoam script using Python's joblib. Does what I am hoping to achieve make sense at all?

If I interpret your code correctly, you're filling all the cores (including the hyper-threading "cores") with OF instances, since cpu_count() (at least on my machine) reports hyper-threaded "cores" in that number as well. This leads to the behaviour discussed above. Make sure that num_cores corresponds to the number of physical cores, and check with top that the expected number of OF solver instances is running (not more than the number of cores).
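A minimal sketch of that fix for the script in post #1, assuming the third-party psutil package is available:

Code:
import psutil   # assumed installed; not part of the original script

num_cores = psutil.cpu_count(logical=False)   # physical cores only
num_cases_paral = num_cores // OFcpus         # concurrent cases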

#9 - gschaider (Bernhard Gschaider, Assistant Moderator) - January 2, 2018, 07:18
Quote:
Originally Posted by Taataa
OF is pure MPI (no OpenMP), therefore the recommended setting for OMP_NUM_THREADS is 1, meaning no hyper-threading: one job per CPU, physical cores only. If you want to utilize hyper-threading you have to implement OpenMP directives in the code, specifically in the linear solvers, which has partially been done before, for example this one and this one.

If he asks MPI to start more processes than there are physical cores, then this won't help. The processes will simply take turns using the CPU.

#10 - ashkan (Member) - January 2, 2018, 22:01
Quote:
Originally Posted by gschaider
If I interpret your code correctly, you're filling all the cores (including the hyper-threading "cores") with OF instances, since cpu_count() (at least on my machine) reports hyper-threaded "cores" in that number as well. This leads to the behaviour discussed above. Make sure that num_cores corresponds to the number of physical cores, and check with top that the expected number of OF solver instances is running (not more than the number of cores).

I think I have figured out the issue. Basically, the Python parallelization decomposes the tasks and assigns each task (each case/instance of an OF simulation) to a single processor. Then, because of the "mpirun" inside each task, each task needs more processors but only has one. This is why, when I use top and look at the CPU percentages, not all the expected cores are at 100%; only the cores used in the Python parallel loop are at 100%, hence the significant drop in performance.

I am trying to see if it is possible to assign multiple processors to each task in Python so that the mpirun call has sufficient processors. Any comments are highly appreciated.

Many thanks again for the comments.
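For anyone wanting to verify this, a Linux-only sketch using only the standard library: print the CPU set a joblib worker is allowed to run on at the top of Run_Case:

Code:
import os

# If this prints a single CPU number, the MPI ranks started inside this
# worker are all competing for that one core (Linux-only API).
print("worker may run on CPUs:", os.sched_getaffinity(0))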

#11 - gschaider (Bernhard Gschaider, Assistant Moderator) - January 3, 2018, 19:23
Quote:
Originally Posted by ashkan
I think I have figured out the issue. Basically, the Python parallelization decomposes the tasks and assigns each task (each case/instance of an OF simulation) to a single processor. Then, because of the "mpirun" inside each task, each task needs more processors but only has one.

I am trying to see if it is possible to assign multiple processors to each task in Python so that the mpirun call has sufficient processors. Any comments are highly appreciated.

So some cores are not utilized because processes are "pinned" to certain cores?
Try running multiple runs (with pyFoamRunner) without your script and see if the same thing happens.

#12 - ashkan (Member) - January 3, 2018, 23:41
Quote:
Originally Posted by gschaider
So some cores are not utilized because processes are "pinned" to certain cores?
Try running multiple runs (with pyFoamRunner) without your script and see if the same thing happens.

I did manage to resolve the issue by having Python assign multiple processors to each case. Here is the link to the discussion:

https://stackoverflow.com/questions/...parallel-cases

I have also attached the corrected script.

I have also run two instances simultaneously, each with 2 processors, using PyFoamRunner.py directly rather than through my script. With the revised script the processor workload is now fine (4 cores at 100% CPU usage), but interestingly, PyFoamRunner.py is only slightly faster than my Python script approach.

The log files of the PyFoam and Python script runs are also attached for comparison.

Any comments are highly appreciated.
Attached Files
File Type: zip results.zip (51.2 KB, 12 views)
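The corrected script itself is only attached, so the following is a hedged sketch (not necessarily the attached script's exact approach, and Linux-only) of one common fix for this symptom, reusing names from the script in post #1: clear the CPU affinity each joblib worker inherits before launching mpirun.

Code:
import os

def Run_Case(iu, ir):
    # Hypothetical sketch: widen this worker's inherited CPU affinity so
    # the MPI ranks launched below may be scheduled on any core
    # (os.sched_setaffinity is a Linux-only API).
    os.sched_setaffinity(0, range(os.cpu_count()))
    # ... case cloning, setup and solver run, unchanged from post #1 ...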

#13 - gschaider (Bernhard Gschaider, Assistant Moderator) - January 4, 2018, 10:12
Quote:
Originally Posted by ashkan
I did manage to resolve the issue by having Python assign multiple processors to each case. [...] With the revised script the processor workload is now fine (4 cores at 100% CPU usage), but interestingly, PyFoamRunner.py is only slightly faster than my Python script approach.

I haven't had time to look at the attached files yet.

Just one remark: if you have more than one layer above the OS (in your case: MPI, PyFoam, and the multiprocessing library), then processor pinning is a bad idea. Let the OS do the assignment (by the way, PyFoam doesn't mess with the pinning); it is quite good at it.
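In the same spirit, if the pinning comes from MPI itself, Open MPI can be told not to bind ranks at all. A hedged sketch (the --bind-to option is Open MPI specific; solver, OFcpus and NewCase are the names from the script in post #1), launching the solver directly instead of through LAMMachine:

Code:
import os, subprocess

# Run the decomposed case with rank binding disabled, leaving scheduling
# to the OS (sketch only; error handling omitted).
with open(os.path.join(NewCase, "log." + solver), "w") as log:
    subprocess.run(["mpirun", "--bind-to", "none", "-np", str(OFcpus),
                    solver, "-case", NewCase, "-parallel"],
                   stdout=log, check=True)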

Tags
parallel, pyfoam





