Parallelization and Processor Interaction

kalyangoparaju · January 17, 2012, 22:13

Friends,

I was wondering how exactly parallelization works in CFD.

I understand that the domain is decomposed and each decomposed section is allotted to a particular processor. It is obvious that the simulation of the downstream domain cannot proceed without the simulation results from the upstream domain. Does that mean that the other processors remain silent till the iteration happens in the first processor and take that information and then proceed?

I know that is not what happens, but I am not really sure what exactly happens! Can someone explain what exactly happens?

Docfreezzzz · January 18, 2012, 11:37

Generally speaking the domain is decomposed but a small overlap is kept on both processors on each interface. A timestep is taken and the updates are synced via MPI, etc. and the solution continues. Thus, each process knows what its neighboring solutions look like on the interfaces. This is only possible b/c the Navier-Stokes equations are hyperbolic (local stencils)... if they were parabolic the updating is more complicated. This is exactly what happens in an explicit solver but implicit methods have to perform updates during the linear system solve to update the deltaQ values or the solution will not converge.
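To make the overlap idea concrete, here is a minimal sketch (not taken from any poster's code). It assumes a 1D linear advection equation, a first-order upwind scheme, and two "ranks" simulated as plain NumPy arrays; the names `explicit_step`, `run_single` and `run_decomposed` are made up for illustration. In a real solver the single halo copy would be an MPI send/receive.

```python
import numpy as np

def explicit_step(u, c):
    """One first-order upwind step for u_t + a*u_x = 0 with a > 0.
    c is the CFL number a*dt/dx. u[0] is not updated: it is either
    the inflow boundary value or a halo cell owned by the neighbor."""
    new = u.copy()
    new[1:] = u[1:] - c * (u[1:] - u[:-1])
    return new

def run_single(u0, c, steps):
    """Reference: the whole domain on one process."""
    u = u0.copy()
    for _ in range(steps):
        u = explicit_step(u, c)
    return u

def run_decomposed(u0, c, steps):
    """Same problem split into two subdomains with a one-cell overlap."""
    mid = len(u0) // 2
    left = u0[:mid].copy()        # owns cells 0 .. mid-1
    right = u0[mid - 1:].copy()   # right[0] is a halo copy of cell mid-1
    for _ in range(steps):
        right[0] = left[-1]       # halo exchange (MPI in a real code)
        # Both subdomains now advance the same time step independently.
        left = explicit_step(left, c)
        right = explicit_step(right, c)
    return np.concatenate([left, right[1:]])
```

Under these assumptions the decomposed run reproduces the single-domain result, because each step only needs the neighbor's values from the previous time level. Neither subdomain waits for the other to finish the simulation; they only synchronize once per step. A central or higher order stencil would need halo cells on both sides.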

cfdnewbie · January 18, 2012, 11:42

Quote:

Originally Posted by Docfreezzzz

This is only possible b/c the Navier-Stokes equations are hyperbolic (local stencils)... if they were parabolic the...

Just a quick note: The Navier-Stokes equations include the viscous term, so they are not purely hyperbolic. I assume you meant the Euler equations (inviscid Navier-Stokes), which are indeed purely hyperbolic!

Cheers!

Docfreezzzz · January 18, 2012, 13:15

Yes. You are correct that the Navier-Stokes equations aren't purely hyperbolic but my statement above still stands. The problem can be and is still handled in a localized way. Especially when considering the approximations made via RANS equations, the viscous terms not driving the solution, and generally a few Newton iterations if solving time accurate (transient) flows, we aren't committing an enormous error by treating them as approximately local.

cfdnewbie · January 18, 2012, 13:33

I agree with you, and didn't doubt your statement.
Just as a tangent to that: If you use diffusive Riemann solvers for the viscous terms, there's no difference in terms of parallelization for hyperbolic or parabolic terms.

Martin Hegedus · January 18, 2012, 14:54

I thought steady state incompressible inviscid flow was elliptic?

And pressure waves go upstream. So I'm confused about the original post in regards to downstream and upstream. Are we talking purely supersonic flow? If that is the case then a space marching Euler method can be used.

Yes, compressible flow is hyperbolic, i.e. the eigenvalues are real, in theory. However, I thought in practice it depends on the speed of the waves and how they are bouncing around. If you put two bodies close together (or a lot of interference), things may get a little stiff. The grid needs to get fine or the problem is easier to solve with central differencing. Is this not true?

cfdnewbie · January 18, 2012, 15:11

Quote:

Originally Posted by Martin Hegedus

I thought steady state incompressible inviscid flow was elliptic?

Yes, you are right, it is because the pressure waves are not resolved in time, just propagated at infinite speed.

Quote:

And pressure waves go upstream. So I'm confused about the original post in regards to downstream and upstream. Are we talking purely supersonic flow? If that is the case then a space marching Euler method can be used.

Yes, I also think that's what the guy was asking about...the equations become "ODE-like" in space, and you can just integrate them along the direction of the characteristics...

Quote:

Yes, compressible flow is hyperbolic, i.e. the eigenvalues are real, in theory.

Sorry to be picky here. The eigenvalues of the convective Jacobian are real, so that part is indeed hyperbolic. If you have a viscous contribution, that adds a parabolic term. That's probably what you meant, just trying to avoid confusion!

Quote:

However, I thought in practice it depends on the speed of the waves and how they are bouncing around. If you put two bodies close together (or a lot of interference), things may get a little stiff. The grid needs to get fine or the problem is easier to solve with central differencing. Is this not true?

Not sure what you meant by that... if you put two bodies close together, are you referring to the boundary layer / viscosity dominated layer between them? Or are you talking about an inviscid flow, so something like a shock train in a duct?

Cheers!

Martin Hegedus · January 18, 2012, 15:39

Yes, I was talking about pure Euler. Viscous terms are not hyperbolic.

In regards to the last part, I was referring to inviscid flow. Solid boundaries are modeled by reflecting the waves. When two solid boundary conditions "see" each other there is the opportunity for a lot of reflections to occur and information is passed between them very rapidly. From what I understand this can cause an issue (i.e. stability) with flux splitting methods. I don't know the ins and outs of it though.

cfdnewbie · January 19, 2012, 05:18

Yeah, I can see that "too many waves, too little resolution" might throw any Riemann solver off track... Especially with the upwind bias that convective discretizations usually have, right? So if there are two waves crossing in a single cell, the lower part of the face might be "upwind", while the upper half might be "downwind". I can see how that would be a problem!
Only solution I can think of is the one you mentioned: more resolution or higher order schemes on the same grid. (Yeah, I like them, I admit.)

Cheers!

Martin Hegedus · January 19, 2012, 11:04

In regards to flux splitting, it would be interesting if someone would try it... I've not seen it addressed in papers, but that does not mean it does not exist.

In regards to the original topic, there are two types of parallelization: parallelization between machines and parallelization between processors. Between machines one needs to use domain decomposition. Between processors one can either use domain decomposition or multiple threads on one big domain.

Two types of solution methodologies are implicit and explicit. Explicit methods only rely on information from the previous time step. Explicit methods are just one big loop where each point is updated individually. They lend themselves to domain decomposition and GPUs. Implicit methods rely on surrounding information from the current time step and require a matrix inversion. Because of this, they are more challenging for decomposition and GPUs to solve. From what I understand, matrix inversion does not lend itself to large scale parallelization, such as GPUs.

In all cases, processors do not wait nor do they remain silent. In general, each domain is calculated separately with boundary conditions being set from values from the previous iteration. In other words, the boundary values are lagged. This is not an issue with explicit methods since that's what they do. But, it does introduce an error (and instability) with implicit methods.

Well, those are the methodologies I'm familiar with. There probably are others.
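The effect of lagged boundary values on an implicit step can be sketched as follows. This is an illustration only, assuming a backward Euler step for 1D diffusion with zero Dirichlet ends and two subdomains coupled block-Jacobi style; `implicit_matrix` and `block_jacobi_step` are hypothetical helper names, not anything from the solvers discussed here.

```python
import numpy as np

def implicit_matrix(n, r):
    """Backward Euler matrix I + r*T for 1D diffusion,
    with T = tridiag(-1, 2, -1) and r = nu*dt/dx**2."""
    A = (1.0 + 2.0 * r) * np.eye(n)
    A -= r * np.eye(n, k=1)
    A -= r * np.eye(n, k=-1)
    return A

def block_jacobi_step(A, b, guess, mid, sweeps):
    """Solve A x = b on two subdomains [0:mid] and [mid:].
    Each sweep uses the neighbor's values from the previous sweep,
    i.e. the interface values are lagged."""
    x = guess.copy()
    for _ in range(sweeps):
        old = x.copy()
        # Move the lagged neighbor contribution to the right-hand side.
        rhs_left = b[:mid] - A[:mid, mid:] @ old[mid:]
        rhs_right = b[mid:] - A[mid:, :mid] @ old[:mid]
        x[:mid] = np.linalg.solve(A[:mid, :mid], rhs_left)
        x[mid:] = np.linalg.solve(A[mid:, mid:], rhs_right)
    return x
```

With a large time step (say `r = 5`), a single sweep started from the old solution differs noticeably from the global solve, which is the lagging error described above. Repeating the sweeps, which amounts to exchanging interface values during the linear solve, drives the result toward the global answer for this diagonally dominant model problem.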

cfdnewbie · January 19, 2012, 12:53

Quote:

Originally Posted by Martin Hegedus

Two types of solution methodologies are implicit and explicit. Explicit methods only rely on information from the previous time step. Explicit methods are just one big loop where each point is updated individually. They lend themselves to domain decomposition and GPUs. Implicit methods rely on surrounding information from the current time step and require a matrix inversion. Because of this, they are more challenging for decomposition and GPUs to solve. From what I understand, matrix inversion does not lend itself to large scale parallelization, such as GPUs.

Just a remark along those lines:
For explicit methods, the limiting factor (in terms of parallelization efficiency) is the message passing latency (i.e. the communication time), while for implicit methods, the limiting factor is often the RAM available, since they need to fit a large matrix (or parts of it) in memory. Explicit methods are generally easier to parallelize and scale better than implicit ones, which require more sophisticated strategies.

With regard to this, there's something of a paradigm shift in supercomputing: the trend is not towards higher clock frequencies, but towards many more cores with less RAM per core. The idea is to have O(10⁵ - 10⁶) simple, not too fast procs, each with only a little RAM. Since implicit methods are in desperate need of RAM, it will be a real challenge to get these methods to scale well on the next generation of supercomputers.

I realize that this is not of relevance for (most) engineering applications, but I find it interesting!

Cheers!

Docfreezzzz · January 19, 2012, 15:19

A bit of a note here. Implicit methods are generally what "big" CFD codes as well as research codes utilize. The time step is limited only by the physics you wish to capture with implicit methods while explicit methods are severely limited and thus take a very long time to converge. Implicit methods are nearly always cheaper in the long run. Also, implicit methods have been scaled satisfactorily to well over 100,000 cores. This is not a trivial task but it is well within the grasp of modern CFD practitioners.

On a side note, I have a research code which is implicit and scales well up to several hundred processors (separate machines). This is not beyond the realm of possibility and in fact, is the norm.

Martin Hegedus · January 19, 2012, 15:33

Is the code implicit structured or implicit unstructured?

I'm familiar with structured implicit codes and they do scale up well. Structured implicit solvers, when they use a factorization scheme, also don't have high memory overhead. Not sure about non factorized schemes. And I'm not sure about the ins and outs of parallelizing unstructured implicit methods.

Docfreezzzz · January 19, 2012, 15:37

It is implicit unstructured with multi-polyhedral element type capability. I do store the entire left hand side matrix structure using a CRS approach. It is expensive but the only way to do it if you don't want to be sitting around watching the solver wasting machine time all day. Explicit methods are just not robust enough to do the large scale simulations we are interested in.
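For readers unfamiliar with CRS (compressed row storage, also called CSR), here is a minimal sketch of the layout and of the matrix-vector product that iterative linear solvers spend most of their time in. It uses scalar entries and hypothetical helper names (`dense_to_crs`, `crs_matvec`); production CFD codes often store a small dense block per cell pair instead, which is an assumption here and not something stated in this thread.

```python
import numpy as np

def dense_to_crs(A):
    """Convert a dense 2D array to CRS arrays (values, col_idx, row_ptr).
    Row i occupies values[row_ptr[i]:row_ptr[i+1]]."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return (np.array(values, dtype=float),
            np.array(col_idx, dtype=int),
            np.array(row_ptr, dtype=int))

def crs_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x touching only the stored nonzeros."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y
```

On an unstructured mesh each row holds one entry per cell in the stencil, so the memory cost grows with the number of cell-to-cell connections rather than with the square of the cell count.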

cfdnewbie · January 19, 2012, 15:38

Quote:

Originally Posted by Docfreezzzz

A bit of a note here. Implicit methods are generally what "big" CFD codes as well as research codes utilize. The time step is limited only by the physics you wish to capture

Yes, many codes are indeed implicit, and it's a true art to get it right (and efficient). However, as you pointed out, you are limited by the physics you wish to capture. In turbulence research, the physical time steps can indeed become very small, and that's where explicit time stepping shines. In fact, at a recent AIAA workshop on high order methods, the explicit time stepping codes beat the implicit ones in terms of speed. I guess whether explicit or implicit is the way to go depends completely on the physics you are interested in, plus the hardware you have available. In my area of research, it is close to 50 - 50 between explicit and implicit.

Quote:

with implicit methods while explicit methods are severely limited and thus take a very long time to converge.

to steady state? of course, you can't beat implicit to steady state, no question!

Quote:

Implicit methods are nearly always cheaper in the long run.

From my experience, no, but that depends again on the physics, the programming tricks and the hardware...

Quote:

Also, implicit methods have been scaled satisfactorily to well over 100,000 cores.

wow, ok, I would like to see that...are we talking strong or weak scaling? and what percentage? weak scaling is not that difficult to achieve, but if you can show me a strong scaling of an implicit code on over 100k procs, I'd rest my case

Quote:

On a side note, I have a research code which is implicit and scales well up to several hundred processors (separate machines). This is not beyond the realm of possibility and in fact, is the norm.

again, strong or weak scaling? and hundred is respectable, but I was talking more about 100k

Interesting discussion, folks

Seems a few of us tend to hijack threads lately

Cheers!

cfdnewbie · January 19, 2012, 15:40

Quote:

Originally Posted by Docfreezzzz

Explicit methods are just not robust enough to do the large scale simulations we are interested in.

Could you describe the type of physics you are interested in a little bit? You are the first one I hear talking about explicit not being robust, so I'm curious to find out. In my community, people tend to see implicit as fickle and explicit as robust....

Docfreezzzz · January 19, 2012, 15:52

By robust I mean robust at reasonable time steps. For instance, when modeling turbomachinery or combustion chambers, we are interested in the temperature loading of the walls or the aero-elastic effects on the blades, etc. Modeling extremely small time scales is superfluous even in unsteady cases. Explicit methods require very small time steps, and with complicated physics the time step size is driven to a ridiculously low value at which the physics are not interesting to us.

Implicit methods, on the other hand, allow us to make decisions based on the physics we are trying to capture almost independent of numerical stability concerns. Hence I call them robust. Also, please note that explicit methods do not allow us to have time accurate BCs, whereas implicit methods allow us to place each step in a Newton loop and therefore keep the BCs time accurate as well.

I'll definitely give you that if your time scale of interest is already below the explicit stability limit then it makes no sense to run an implicit method. We are almost never in this region, even with combustion modeling.
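The stability limit in question can be seen in a small experiment. The sketch below assumes the 1D diffusion equation with zero Dirichlet ends, where forward Euler is stable only for r = nu*dt/dx**2 <= 0.5 while backward Euler has no such limit; `diffuse` is a hypothetical helper, and real solvers face analogous (CFL-type) limits for the convective terms.

```python
import numpy as np

def diffuse(u0, r, steps, implicit):
    """March u_t = nu*u_xx for a number of steps with
    r = nu*dt/dx**2 and zero Dirichlet boundary values."""
    n = len(u0)
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    u = u0.copy()
    for _ in range(steps):
        if implicit:
            # Backward Euler: solve (I + r*T) u_new = u_old.
            u = np.linalg.solve(np.eye(n) + r * T, u)
        else:
            # Forward Euler: u_new = u_old - r*T u_old.
            u = u - r * (T @ u)
    return u

x = np.linspace(0.0, 1.0, 34)[1:-1]
u0 = np.where(np.abs(x - 0.5) < 0.1, 1.0, 0.0)  # box-shaped initial profile

blown_up = diffuse(u0, 0.6, 100, implicit=False)  # just above the limit
stable = diffuse(u0, 0.6, 100, implicit=True)     # same step, implicit
```

Just above the limit the explicit result grows without bound, while the implicit one stays bounded at the same step size. The implicit step size is then a choice about accuracy, which is the sense of "robust" used above.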

I can't give you specific examples because most of the codes I'm referring to are not for public release and you wouldn't recognize the names anyway. However, they do exhibit good scaling.

As far as strong vs. weak... Strong scaling is nearly impossible to show from 1 out to 100,000 procs. We can't even load the case on one machine. Also, if we could load it on one machine, the whole case would be in cache by the time we reached 100,000 procs and that is hardly a fair comparison.

So, my code does show strong scaling within these limits. That is, small enough to load on a single node out to the point where we get superlinear speedup due to cache effects. I'd definitely say that it is more than plausible if you can take care of the I/O at that level.
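For reference, the two efficiencies being argued about can be written down as below. This is a sketch using the usual textbook definitions with a baseline run on `p_base` processors rather than on one; the function names and all timings are made up for illustration.

```python
def strong_efficiency(t_base, p_base, t_p, p):
    """Strong scaling: total problem size is fixed.
    Ideal wall time on p procs is t_base * p_base / p."""
    return (t_base * p_base) / (t_p * p)

def weak_efficiency(t_base, t_p):
    """Weak scaling: work per processor is fixed.
    Ideal wall time stays equal to t_base."""
    return t_base / t_p

# Hypothetical timings in seconds, not measurements from any real code.
eff_strong = strong_efficiency(t_base=100.0, p_base=8, t_p=14.0, p=64)
eff_super = strong_efficiency(t_base=100.0, p_base=8, t_p=11.0, p=64)
eff_weak = weak_efficiency(t_base=10.0, t_p=12.5)
```

A value above 1, as in `eff_super`, is the superlinear case mentioned above, where the per-processor working set starts to fit in cache.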

cfdnewbie · January 19, 2012, 16:04

Quote:

Originally Posted by Docfreezzzz

By robust I mean robust at reasonable time steps. For instance, when modeling turbomachinery or combustion chambers, we are interested in the temperature loading of the walls or the aero-elastic effects on the blades, etc. Modeling extremely small time scales is superfluous even in unsteady cases. Explicit methods require very small time steps, and with complicated physics the time step size is driven to a ridiculously low value at which the physics are not interesting to us.

Yes, ok, I understand. If your small time scales don't really matter because they do not govern the physics, then you are way better off with implicit methods.
I just always notice that people doing implicit solvers spend way more time optimising their time integrator than doing simulations or analyzing physics. At least that's my impression, maybe I'm wrong... It just seems like the whole implicitness brings with it so many parameters to optimize that it is hardly worth doing, and that it is kind of arbitrary how you set your limits. For example, I recently overheard two very well known professors from well known US institutions argue about whether 10E-8 or 10E-6 should be set as the convergence criterion for their implicit solver and whether one solution was "correct" and the other one was not... That's what really makes me shake my head when it comes to implicit... but I can see that it does make sense in certain situations.

Quote:

Implicit methods, on the other hand, allow us to make decisions based on the physics we are trying to capture almost independent of numerical stability concerns. Hence I call them robust. Also, please note that explicit methods do not allow us to have time accurate BCs, whereas implicit methods allow us to place each step in a Newton loop and therefore keep the BCs time accurate as well.

Oh, there are indeed explicit integrators with time accurate BCs as well! Off the top of my head I know of an RK3 method, but a while ago I also read a paper about an extension to higher order!

Quote:

We are almost never in this region, even with combustion modeling.

yes, I can see that my view on the topic is probably limited to a turbulent world

Quote:

As far as strong vs. weak... Strong scaling is nearly impossible to show from 1 out to 100,000 procs. We can't even load the case on one machine. Also, if we could load it on one machine, the whole case would be in cache by the time we reached 100,000 procs and that is hardly a fair comparison.

I have seen explicit codes scale from 1 to 125k with about 88%... that's really impressive. I have no idea how implicit codes would compare to that, though...

Cheers!

Martin Hegedus · January 19, 2012, 16:05

In general I do agree with what you're saying, but I didn't understand this part.

Quote:

Originally Posted by Docfreezzzz

Also, please note that explicit methods do not allow us to have time accurate BCs, whereas implicit methods allow us to place each step in a Newton loop and therefore keep the BCs time accurate as well.

As the local CFL number for an implicit solver gets much below one it becomes explicit, i.e. the off diagonal terms go to zero, or so I thought. I think that an explicit scheme could be used within a Newton loop. Not that one would want to, but that is a different story.
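The small-CFL limit can be written out in a few lines. This is a sketch assuming a semi-discrete system dQ/dt + R(Q) = 0 and a linearized backward Euler step; the symbols R, ΔQ and Δt are notation chosen here, not taken from a specific code.

```latex
% Backward Euler, linearized about Q^n:
%   (Q^{n+1} - Q^n)/\Delta t + R(Q^{n+1}) = 0,
%   R(Q^{n+1}) \approx R(Q^n) + \frac{\partial R}{\partial Q}\,\Delta Q
\left( \frac{1}{\Delta t} I + \frac{\partial R}{\partial Q} \right) \Delta Q = -R(Q^n),
\qquad \Delta Q = Q^{n+1} - Q^n .

% As \Delta t \to 0 the diagonal term I/\Delta t dominates the Jacobian,
% whose off-diagonal entries carry the coupling between neighboring cells:
\Delta Q \approx -\Delta t \, R(Q^n) \quad \text{(forward Euler)} .
```

In this form the off-diagonal entries do not vanish in absolute terms; they become negligible relative to the 1/Δt diagonal, which amounts to the same thing once the system is scaled by Δt.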

Martin Hegedus · January 19, 2012, 16:12

Oh, the implicit solvers I'm aware of lag the boundary conditions. Also, for domain decomposition, aren't the fringe boundaries usually lagged?
