
"HPC is dying, and MPI is killing it"



April 14, 2015, 15:07   #1
Praveen. C (Super Moderator)
That is the title of an article by Jonathan Dursi

http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/

Since MPI is widely used in CFD, I want to know what CFD people think about this.

The article mentions an HPC programming language called Chapel, being developed by Cray. Has anybody here used it to write CFD codes?

April 14, 2015, 19:56   #2
Michael Prinkey (Senior Member)
I think there is a coding crisis coming in HPC, but I'm not sure the article provides any solution for real-world CFD applications. Certainly, SIMD parallelism, multi-core parallelism, and multi-system parallelism all have to be addressed to make efficient use of current- and future-generation CPUs and clusters. And the process of tweaking all of them, with the nonlinear interactions among them, makes this more difficult than ever. These are months-long or even years-long tasks for very smart people.

It is therefore no surprise that many production codes are not well optimized at all. But they work, and we use them. Refactoring (or recoding from scratch) these existing codes can be very difficult and expensive, because extracting good SIMD/many-core efficiency may require wholesale changes to data structures... look at the AoS vs SoA vs AoSoA discussions around optimization for Intel MIC processors as an example. Those kinds of changes can be made in your 1000-line solver over a long weekend. But for an existing 100k-line code, that represents a huge time investment, both to code the changes and to debug/regression-test and then finally tune them.
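To make the AoS-vs-SoA point concrete, here is a minimal sketch in C (my own illustration, not taken from any of the codes discussed; all names are made up) of the two layouts for a cell-centred flow state, and why the SoA loop is the one that vectorizes cleanly:

Code:
/* AoS vs SoA: a minimal, hypothetical illustration for a cell-centred state. */
#include <stddef.h>

#define NCELLS 1024

/* AoS: one struct per cell; fields of neighbouring cells are interleaved in memory. */
typedef struct {
    double rho, u, v, w, p;
} CellAoS;

/* SoA: one contiguous array per field; unit-stride access for vector loads. */
typedef struct {
    double rho[NCELLS];
    double u[NCELLS], v[NCELLS], w[NCELLS];
    double p[NCELLS];
} FlowSoA;

/* The same update written both ways: the AoS loop accesses p with stride 5,
   while the SoA loop is a plain unit-stride loop the compiler can vectorize. */
void scale_pressure_aos(CellAoS *cells, double factor)
{
    for (size_t i = 0; i < NCELLS; ++i)
        cells[i].p *= factor;
}

void scale_pressure_soa(FlowSoA *f, double factor)
{
    for (size_t i = 0; i < NCELLS; ++i)
        f->p[i] *= factor;
}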

The situation even calls into question the fundamental algorithms and approximation schemes that we use. For example, Discontinuous Galerkin methods may make a lot of sense in many-core/SIMD implementations because they give a block of calculations that reuses data and is local in ways that map well to vector instructions. Blocking, sub-solving, and tiered Krylov methods all represent attempts to make local chunks of computation that are still coupled yet independent enough to leverage the available CPU parallelism. I look over these lists of techniques and I don't see how this gets distilled down into a cleaner, smarter, more expressive parallel coding scheme without leaving a lot by the wayside.

April 15, 2015, 17:10   #3
Paolo Lampitella (sbaffini, Senior Member)
My opinion is that the point of view in the article is wrong on several levels:

1) The topic is MPI vs. PGAS languages for parallel processing. When I think about parallel computing, my mind goes to speed and scalability. There is not a single quantitative, verifiable statement in the article comparing the two on this matter.

2) There is nothing wrong with PGAS languages (I would probably use Fortran coarrays if they were fully implemented in most compilers), but saying that MPI is holding back parallel programming is like saying that assembly is holding back the spread of other programming languages. It simply makes no sense. Maybe there just was no seriously supported PGAS language until yesterday.

3) As a simple programmer, I have more or less accepted what the MPI Forum decided. I have no reason to think those people would waste their time developing MPI while believing it to be inefficient. Moreover, it simply fits my way of thinking about parallelism.

4) If you want to go to 10k cores or higher, you had better know every damn bit of your application. There are cases where MPI_Alltoall is less efficient than programmer-side implementations (e.g., on some architectures, the crystal router of Fox et al., 1988); a minimal sketch of this idea follows the list. I think nobody doing serious programming for performance should rely on external libraries or higher-level languages that then turn out to be the performance bottleneck. Again, it makes no sense. Have you ever seen an F1 team win a GP with a non-proprietary engine?

5) I can write far better shared-memory code in MPI than in OpenMP, without even thinking about shared memory. Who's low level here?

6) The overall argument is no stronger than the typical arguments in C++-vs-Fortran discussions. If you don't know Fortran, don't use Fortran, but don't bother other people about it. If you don't like MPI, don't use MPI.

7) Are we really taking seriously a discussion about parallelism in which the examples are written in Python?
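To give point 4 some substance, here is a minimal sketch in C (my own illustration, under the simplifying assumption of one double exchanged per rank pair; it is not the crystal router itself) of a programmer-side all-to-all built from pairwise MPI_Sendrecv calls, the kind of thing one can tune per architecture instead of calling MPI_Alltoall:

Code:
/* Hand-rolled pairwise all-to-all: a hypothetical illustration. */
#include <mpi.h>
#include <stdlib.h>

/* sendbuf/recvbuf hold one double per rank. */
static void pairwise_alltoall(const double *sendbuf, double *recvbuf,
                              MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    recvbuf[rank] = sendbuf[rank];              /* local contribution, no message */

    for (int step = 1; step < size; ++step) {
        int dst = (rank + step) % size;         /* partner I send to this step    */
        int src = (rank - step + size) % size;  /* partner I receive from         */
        MPI_Sendrecv(&sendbuf[dst], 1, MPI_DOUBLE, dst, 0,
                     &recvbuf[src], 1, MPI_DOUBLE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *send = malloc(size * sizeof(double));
    double *recv = malloc(size * sizeof(double));
    for (int i = 0; i < size; ++i) send[i] = rank * 100.0 + i;

    pairwise_alltoall(send, recv, MPI_COMM_WORLD);

    free(send); free(recv);
    MPI_Finalize();
    return 0;
}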

April 15, 2015, 20:55   #4
Michael Prinkey (Senior Member)
I largely agree with you. I lost interest too when I saw no scaling plots in the article...and I too love Python, but... 8)

However, there is a point to be made about the increasing level of effort associated with HPC CFD coding--and HPC coding in general. There was a nice article about VASP and the intrinsic algorithmic limitations it is facing, and it raised the question of whether future generations of CPUs will be able to run VASP faster than the current generation. Of course, VASP is not CFD, and they may be hitting limitations sooner than we are, but the point is valid. Their present is likely our near future.

https://www.nsc.liu.se/~pla/blog/2014/07/18/peakvasp/

Quote (originally posted by sbaffini):
5) I can write far better shared-memory code in MPI than in OpenMP, without even thinking about shared memory. Who's low level here?
I certainly understand your point here, but, in my opinion, three things are starting to advocate for OpenMP. First, OpenMP 4.0 implements the simd pragma, which (we hope) paves the way to expressing vectorization without coding to SSE/AVX intrinsics or relying on the compiler to divine vectorization opportunities (a short sketch follows at the end of this post). I think this gets more and more important as vector registers get wider.

Second, as per-system core counts approach 10^2 (not even MIC, just Broadwell and beyond), the total number of cores in an MPI run could commonly reach 10k in near-generation clusters. And, as you point out, MPI does not always make the best tradeoffs as core counts grow that large--robustness, start-up times, and global collective/gather/scatter performance all start to suffer.

Finally, dealing with that many thousands of MPI processes makes a real mess for distributed I/O. Run a few 1000-core OpenFOAM jobs and see how long it takes to run your filesystem out of inodes. OpenMP (hold your nose if you must) helps alleviate a lot of these problems.
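A minimal sketch of the OpenMP 4.0 simd construct mentioned above (my own illustration; the kernel is just an example):

Code:
/* The pragma asks the compiler to vectorize the loop without SSE/AVX intrinsics.
   Compile with an OpenMP-4.0-aware compiler, e.g. gcc -O2 -fopenmp-simd. */
#include <stddef.h>

void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}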

April 16, 2015, 15:51   #5
Paolo Lampitella (sbaffini, Senior Member)
I also agree with you, but when I said that I was not considering 'shared memory' with MPI, I was not actually saying that this is how things should be done (still, my OpenMP skills are so limited that naive MPI implementations are enough to outperform them). I usually feel more comfortable creating my own communication patterns (rings, Cartesian topologies, minimum spanning trees, etc.) instead of always relying on those of the MPI implementation (a minimal sketch of such a pattern follows below). I have never tried optimizing for groups of processes sharing a common node (because, as said, trivial MPI already outperforms my basic OpenMP, and because I have never met a node with more than 16 processors), but that should not be too complicated if properly designed.
Hence, I think, when PGAS languages show their potential I will simply switch to them and remember the brave MPI times with some nostalgia (the way people who used punch cards remember their old days).
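For illustration, a minimal sketch in C (my own, not from any production code) of the kind of explicit pattern I mean: a periodic 1-D Cartesian communicator (a ring) with a halo exchange between left and right neighbours:

Code:
/* Ring/Cartesian halo exchange: a hypothetical illustration. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1]     = { size };
    int periodic[1] = { 1 };                 /* wrap around: a ring */
    MPI_Comm ring;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periodic, 0, &ring);

    int left, right;
    MPI_Cart_shift(ring, 0, 1, &left, &right);

    double my_value = 1.0, from_left, from_right;

    /* Send to the right while receiving from the left, and vice versa. */
    MPI_Sendrecv(&my_value, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 ring, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&my_value, 1, MPI_DOUBLE, left, 1,
                 &from_right, 1, MPI_DOUBLE, right, 1,
                 ring, MPI_STATUS_IGNORE);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}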

However, while I am not a huge fan of OpenFOAM, I really appreciate Nek5000 (https://nek5000.mcs.anl.gov), which, among other things, clearly shows how appropriate programming can reach very high levels of scalability. They do not use shared-memory protocols but have their own communication routines (which, however, are unaware of any 'sharing'). We are talking about 1 million processes. Not exactly a small number, and, as far as I can tell, well above the average size of the typical jobs running on the top machines available today (and probably for at least the next 10 years).

Obviously, vectorization will be the route for the next step (nodes with large core counts) but, if I may ask (sarcasm mode off, this is a real question for which I am interested in more opinions), is the old story:

'parallel is better with commodity hardware (instead of vector machines), which promotes portability across multiple systems, does not require specialized programming, etc.'

no longer true? Even the PlayStation is now based on commodity processors (after they realized that developing for the Cell processor was harder and more expensive).

I understand the energy-efficiency issue, but how many people out there are now investing their programming skills in the Phi coprocessor or whatever? Is the final cost of this effort really affordable for the community (and its investors), considering the number of commercial applications that can actually use such large systems (none that I am aware of)? I mean, even at the research level, very few people can actually produce code to use the one BG/Q machine here in Italy. Most people simply use someone else's codes. Large homogeneous systems also allow, in my opinion, better overall usability.

Also, as an additional question, has GPU computing really paid back the programming effort invested in it all around the world? An answer to this could provide some hints for the future.

As a final note, reliability is also one of the major topics: with 1 million processes or more, your simulation is likely to fail every 5 minutes or so. But I don't see how this is going to change, at least radically, with large shared-memory systems. When 512 processors per node are the norm, there will be someone trying to run on 32 million processes, and so on.

I have never even thought seriously about this but, maybe, an approach deserving at least some consideration would be programming a code to be fault tolerant (with some sort of dynamic repartitioning) and/or investing in overall system reliability (which, as said, sooner or later is going to be challenged again). I don't really feel comfortable with the solution 'there is a reliability barrier, let's "Pareto front" over it'. At the very least, I would like to know now whether there really is an insurmountable barrier (so that I can better reschedule my parallel programming plans).
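As a minimal sketch (my own illustration; file names and checkpoint interval are made up, and dynamic repartitioning is not shown) of the simplest form of application-level fault tolerance, i.e., each rank periodically writing its own partition so a failed run can restart from the last checkpoint instead of from scratch:

Code:
/* Periodic per-rank checkpointing: a hypothetical illustration. */
#include <mpi.h>
#include <stdio.h>

#define CHECKPOINT_EVERY 100   /* illustrative interval, in time steps */

static void write_checkpoint(int rank, int step, const double *u, int n)
{
    char fname[64];
    snprintf(fname, sizeof fname, "chk_step%06d_rank%05d.bin", step, rank);
    FILE *f = fopen(fname, "wb");
    if (f) {
        fwrite(u, sizeof(double), (size_t)n, f);
        fclose(f);
    }
}

void time_loop(MPI_Comm comm, double *u, int n, int nsteps)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int step = 1; step <= nsteps; ++step) {
        /* ... advance the solution u by one time step ... */

        if (step % CHECKPOINT_EVERY == 0) {
            MPI_Barrier(comm);  /* loosely synchronize ranks so all checkpoints refer to the same step */
            write_checkpoint(rank, step, u, n);
        }
    }
}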

April 19, 2015, 12:33   #6
Bruno Santos (wyldckat, Retired Super Moderator)
Greetings to all!

I actually wrote about this here on the forum a few days before this thread started, when trying to answer a few questions about using Hadoop for CFD... Hadoop being a pretty big platform, mostly written in Java, developed pretty much for HPC in general.

The first detail is that the author of the original blog post wrote an additional post in reply to many of the comments on the original: http://dursi.ca/objections-continued/

Now, quoting some of the more relevant details from my answer the other day on the other thread:
Quote (originally posted by wyldckat):
In addition, those blog posts refer to Chapel, a programming language that has already found its way here on the forum: http://www.cfd-online.com/Forums/mai...languages.html

[...]
But as the blog post argues, with today's CPUs and how things have evolved, this language overhead might no longer be what is stopping us; it's actually how long things take to code. In fact, there are already optimization strategies embedded into these languages that we are unlikely to be able to reproduce in C/C++/Fortran without considerable effort (or at least without searching for the right library).

[...]
Using C++ and other languages to connect to Hadoop is also possible, but after a quick search, it seems to require some investigation into what should really be used as the base library for making the connection; MapReduce-MPI, Hadoop Pipes and MR4C (Google's implementation) are just a few names among the dozens that already exist.

Then there are also complete alternatives to any of the above, such as:

[...]

All of this is just to say that using Hadoop as a building block for CFD applications is something that might perhaps happen 3-5 years from now, or it might be used in the back office of cloud services that provide CFD software as an online service, without us even knowing about it.

As for the other details already mentioned on this thread:
  • Using GPGPU for CFD with Python is something that Lorena Barba has been doing for quite some time now. If you Google for:
    Code:
    Lorena Barba Python CFD Cuda
    you'll find why her research group has been using this.
  • Coding specifically for GPUs may also be heading toward a moot point, because of AMD's ongoing development of its HSA (Heterogeneous System Architecture) compiler features, which pretty much allow a mixed hardware structure built into the CPU, such as a powerful GPU integrated alongside the CPU cores, both accessing the same RAM. In other words, the integrated GPU acts as a really big, parallel vector processing unit. Which, by the way, is pretty much what the PlayStation 4 and the latest Xbox generation consoles are using! Although they only have 8 GB of RAM...
    HSA also aims to contemplate other options, such as having an ARM CPU instead of an x86 CPU, alongside something else, such as a RISC unit or the like.
    As for the compiler features... I haven't checked them myself, but it seems they are also oriented toward having a single piece of code with more than one compiled version, so that the same code runs on all HSA-supported hardware.
  • As I mentioned in the quote, namely this thread: http://www.cfd-online.com/Forums/mai...languages.html - on that thread you can check for yourselves whether Chapel and other similar programming languages are worth it, by re-implementing the example case given there in Python and/or hard-core C/C++ with MPI or anything else.
Best regards,
Bruno

April 20, 2015, 16:17   #7
Paolo Lampitella (sbaffini, Senior Member)
I took some time to read and reflect. I have to say that, honestly, I still don't see the point. Most of my criticisms can be summed up in a few points which, I understand, are just a personal view:

1) It is worth remembering that I have nothing against PGAS. But maybe we work in different fields with different targets. I would likely program a matrix-matrix multiply from scratch if it were necessary for performance (a minimal sketch of what I mean follows the list), even a different one for each of the most probable cases, if necessary. I would switch to PGAS if it were necessary for performance. Because we are talking about High Performance Computing.

2) The moment you forget about the parallel data layout and exchanges, you start producing bad code. And if you can't forget about them anyway, then I can't see the point of so much discussion.

3) The whole discussion still seems too pretentious. I have not yet seen any clear advantage in using PGAS languages, except that they might be easier to use. This sounds like 'you should use C++ instead of Fortran', which simply reflects unfamiliarity with programming (where you just use whatever fits your needs).

4) I am aware of a lot of groups working on GPUs. Actually, almost all the CFD groups I have heard of have worked on this, which means that several research grants have been spent on it (over the last 10 years or so). I was thus wondering whether such effort has produced anything more than a mere 2-3x speedup for practical cases... I mean, is it worth buying a GPU specifically for computing, or only if I am also going to use it in a different way (for visualization)?

5) If you followed the reasoning promoted in the articles, then OpenFOAM should be the last CFD code ever produced by mankind. Now, imagine mankind stopping all literary production after Dante Alighieri, or all painting after Leonardo (just examples, no Italianism intended). The result would have been illiteracy, which I fight more than anything else.
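And, to be concrete about point 1, a minimal sketch in C (my own illustration; the block size is a placeholder to be tuned per architecture) of the kind of hand-written, cache-blocked matrix-matrix multiply I have in mind:

Code:
/* Cache-blocked C = C + A*B for square n-by-n matrices: a hypothetical illustration. */
#include <stddef.h>

#define BS 64  /* block size, to be tuned per cache hierarchy */

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                for (size_t i = ii; i < min_sz(ii + BS, n); ++i)
                    for (size_t k = kk; k < min_sz(kk + BS, n); ++k) {
                        double aik = A[i * n + k];
                        /* Innermost loop is unit-stride in both B and C. */
                        for (size_t j = jj; j < min_sz(jj + BS, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}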
