CFD Online Discussion Forums - Intel's MPI and performance tools in OpenFOAM

Hi,

I'm working as an applications engineer at Intel and was involved in running OpenFOAM on Intel platforms, in particular in testing the replacement of Open MPI with Intel's own MPI product.

I saw OpenFOAM (simpleFoam) running over 30% faster with Intel's MPI than with Open MPI, on crucial benchmarks of an important end user of OpenFOAM and Intel cooperation partner.

Linking Intel's MPI this way also made Intel's performance analysis tools usable with OpenFOAM.

Is this interesting for you? Have you ever thought about a model for linking OpenFOAM with commercial MPI libraries?

I would appreciate your interest in this story ...

Hello Hans,

It was interesting to note that Intel's MPI runs faster than the Open MPI implementation in OpenFOAM. I would be keen to check for the same performance improvement on Opteron based platforms. It would be great if you could help me get a version of Intel's MPI for OpenFOAM 1.4.1.

Hans, when you say 30% faster, I would like to know what basis you use for that comparison. Do you mean to imply that the parallel speedup for, say, a 2-CPU job is 30% higher when using Intel's MPI as opposed to Open MPI? If it's just the solution time, then I don't think there is anything surprising there, as Intel's compilers/libraries are optimized to work exceptionally well on Intel platforms (which by the way are rarely used in any of the clusters at my university). On a related note, what exactly is your stand on using multi-core systems (dual/quad/octa) for parallel CFD computing? Is Intel aware that memory bandwidth is the real bottleneck when switching to multi-core systems?

Hello,

to respond to the comments of a a saha, Srinath:

- Intel MPI is a commercial product, so it is not as easy to get as Open MPI. That's why I was wondering whether the OpenFOAM creators could think of a model for building against other MPIs that users might have a license for.

- The over 30% comes >>just<< from the MPI. It compares 2 runs on exactly the same cluster (here an Intel 4-core cluster with 16 nodes, that is 64 processors): Open MPI vs. Intel MPI. All timings except the MPI coincide; only the MPI makes the difference.

Hello Hans,

I am not sure I understand your answer. When you compare the speedup (for example see [1]) I wish to know if Intel's MPI gives a 30% increase when compared to Open MPI.

Secondly, you mention that you ran the tests on 16 nodes (each node featuring a quad core CPU). How did you assign the processes:

i) Did you schedule one MPI process per node, so that each process would then communicate through an interconnect? If so, which interconnect (Gigabit Ethernet, InfiniBand, Quadrics etc.)?

ii) Did you schedule MPI processes by filling each node and then moving to the next one? In this case, for 2 and 4 processes, all program instances would run on the same quad-core node. Only when moving to 6 or 8 processes would the interconnect be used.

iii) What case did you run (as in how big)? How much RES memory did the serial run consume? How many time steps did you run the case for?

[1] http://www.cfd-online.com/OpenFOAM_D...es/1/4626.html
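
The two placement strategies asked about in i) and ii) can be illustrated with a small model. This is only a sketch under stated assumptions: the function names `place_ranks` and `uses_interconnect` are hypothetical, and a real MPI launcher decides placement through its own options and host files rather than through code like this.

```python
def place_ranks(n_ranks, n_nodes, cores_per_node, policy):
    """Return a list where entry r is the node index hosting MPI rank r.

    policy "by_node": round-robin over nodes (one rank per node first).
    policy "by_slot": fill all cores of a node before using the next one.
    """
    if n_ranks > n_nodes * cores_per_node:
        raise ValueError("more ranks than available cores")
    if policy == "by_node":
        return [r % n_nodes for r in range(n_ranks)]
    if policy == "by_slot":
        return [r // cores_per_node for r in range(n_ranks)]
    raise ValueError(f"unknown policy: {policy}")


def uses_interconnect(placement):
    """True if the ranks are spread over more than one node."""
    return len(set(placement)) > 1


# Cluster shape taken from the thread: 16 nodes with 4 cores each.
for n in (2, 4, 8):
    for policy in ("by_node", "by_slot"):
        nodes = place_ranks(n, 16, 4, policy)
        print(n, policy, nodes, "interconnect:", uses_interconnect(nodes))
```

With the fill-first policy, runs with up to 4 processes stay inside one node and never touch the network, which is why the placement matters when comparing MPI libraries.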

I have run my parallel cases on both Xeon and Opteron based machines. With Open MPI, the parallel performance on Opteron based machines is far superior to that on Xeon based machines.

Hi,

the comparisons are as follows:

I ran simpleFoam 2 times; in both runs >>everything<< coincides:

- test case, compilers, compiler flags, run command, the cluster to run on, mapping of processes to the parallel nodes of the cluster

>>except<<

- I link Intel's MPI in the first run, Open MPI in the second

Then the first run, just by managing the message passing better, gives 30% better runtime (not speedup), e.g. 230 s instead of 300 s wall clock time.

This happens throughout all different styles of mapping, be it 1 process per node, 2, or 4.

I'm afraid I cannot disclose details of the test case as it's confidential.
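
For readers comparing the figures quoted above, a short calculation shows what "230 s instead of 300 s" means under two common definitions. This is just arithmetic on the two example timings from the post; the function name `compare_runtimes` is a hypothetical helper, not part of any tool mentioned in the thread.

```python
def compare_runtimes(t_slow, t_fast):
    """Return (runtime reduction, throughput gain) as fractions.

    runtime reduction: share of wall clock time saved per run.
    throughput gain:   how many more runs fit into the same time.
    """
    reduction = (t_slow - t_fast) / t_slow
    throughput_gain = t_slow / t_fast - 1.0
    return reduction, throughput_gain


# Example wall clock times quoted in the post (seconds).
reduction, gain = compare_runtimes(300.0, 230.0)
print(f"runtime reduction: {reduction:.1%}")
print(f"throughput gain:   {gain:.1%}")
```

With these two numbers the wall clock time drops by about 23.3%, while the number of runs per unit time rises by about 30.4%, so "30% faster" matches the throughput reading.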

Thank you Hans. As I mentioned before, if there is no improvement in the speedup, I doubt anyone will be that interested. Throw in a few more processors and you will always get better runtimes if the speedup stays close to linear. Besides, to my knowledge the Intel architecture is rarely used in clusters anymore. We had a very prominent Xeon based cluster at our university, but that was like 3-4 years ago. Now everything has been changed to AMD (HyperTransport technology) and/or IBM POWER (far superior memory bandwidth, generous L3 cache etc.).

Srinath, I'm sorry but I don't agree with you.
If you can save 30% of computational time just by changing the MPI library, it's a huge improvement, even if the parallel speedup doesn't change. Saving 30% of the time means running 30% more simulations in the same time, and in the end it means saving the money for buying and maintaining 30% more hardware to get the same performance.

About the cluster, it's not true that Intel CPUs are not used anymore, trust me!

@Hans: what if you have a proprietary network (like Myrinet or Quadrics, instead of Gigabit Ethernet)?

Francesco

Hi,

Of course it would be nice if somebody who already has a license for Intel's MPI could get their licensed software to work with (in this case) OpenFOAM. Since OpenFOAM is under the GNU license, what keeps Intel from supplying an interface between their proprietary MPI and OpenFOAM in source code or binary form (like the NVIDIA drivers for Linux)?

One point I didn't get: was the 30% walltime reduction for an Ethernet interconnect only, or is it also valid for an InfiniBand interconnect? If it is only for Ethernet, there is also a promising (also up to 30% walltime reduction) "free" alternative: GAMMA, which is already implemented/supported by OpenFOAM.

@Srinath We did some benchmarking with Intel and Opteron systems, and across different cases the Intel systems are mostly as fast or even faster.

Jens

Francesco, I respect your opinion. My observations are merely based on experience and discussions with people who have been working on High Performance Computing (HPC) for quite some time. AMD still rules over Intel when you factor in the price and power consumption and compare the performance. IBM still rules over both AMD and Intel when it comes to processors suited for HPC. The price of IBM servers is of course exorbitant.

I'm no AMD fanboy. I still respect Intel and their processors, but only when it comes to desktop use. In fact I chose Intel over AMD when I bought my Fujitsu notebook, simply because Intel supports free software 3D graphics drivers. Nevertheless I will elaborate on the reason for my skepticism. Without mentioning the size of the test case, there really is no reason to get excited over a 30% improvement. Intel is famous for posting benchmarks of commercial CFD codes (e.g. Fluent) and claiming superiority over AMD. However, their benchmarks are based on relatively small test cases. Increase the size of the problem and Intel struggles to match the performance that AMD can deliver (thanks to its HyperTransport technology, which reduces the FSB bottleneck by providing separate paths to memory and to all other PC components through the motherboard chipset). This is also why Intel processors have a larger L2 cache, in an attempt to offset the loss in performance that comes from having to use the Front Side Bus (which manages both memory and I/O communications) every time. I chose to believe neither Intel nor AMD when it came to benchmarking. I did all the tests myself (some of which, those I could get around to summarizing, I posted in this forum). In the majority of the parallel tests I saw that AMD gave much better results. Let me see Intel give me a better speedup at a reasonable price and I will gladly recommend it.

As regards Intel MPI, I will admit outright that I am biased towards open solutions. The very fact that Intel releases its very own compiler and MPI libraries is clear evidence that its processors have some performance pathways that are not documented, so that gcc and other free alternatives cannot exploit them. In other words, Intel (like any other company, including AMD) wants to get additional revenue by promoting use of its compilers etc. Nothing surprising there, eh. I'm sure there are folks who love to get the best of both worlds (free and commercial) as long as it benefits them. But then again this is the consequence of practical choices, isn't it. I choose open solutions not just because they are free but also because they promote growth and productivity more than commercial alternatives :-)

All,

interesting discussion. In terms of judging the 30%, I agree with Francesco del Citto: what counts is the run time and the number of simulations you can run per unit of time. Speedup is a largely overestimated (and also abused) measure. I can easily make an application 2x slower but improve its parallel speedup ....

I also understand the scepticism. Let me assure you that the measurements were done with significant test cases, but sorry, no more details are possible.

As to the interconnect used for those 30%: it was InfiniBand ...
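
The remark about making an application slower while improving its speedup can be reproduced with a toy model. This is a minimal sketch under loudly stated assumptions: the runtime model (perfectly divisible compute plus a fixed communication cost in parallel runs) and all numbers are hypothetical and are not measurements from this thread.

```python
def runtime(serial_compute, comm_cost, p):
    """Toy wall clock model (assumption): compute divides perfectly over
    p processes; a fixed communication cost appears only when p > 1."""
    return serial_compute / p + (comm_cost if p > 1 else 0.0)


def speedup(serial_compute, comm_cost, p):
    """Parallel speedup: serial runtime divided by runtime on p processes."""
    return runtime(serial_compute, comm_cost, 1) / runtime(serial_compute, comm_cost, p)


P = 16        # number of processes (hypothetical)
COMM = 5.0    # fixed communication cost in seconds (hypothetical)

for label, compute in (("original", 100.0), ("2x slower", 200.0)):
    t = runtime(compute, COMM, P)
    s = speedup(compute, COMM, P)
    print(f"{label:>10}: runtime {t:6.2f} s, speedup {s:5.2f}")
```

In this model the code with doubled compute cost shows the higher speedup on 16 processes yet still takes longer to finish, because the unchanged communication cost becomes a smaller share of its runtime.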

Hans,

thank you for this statement. IMHO run time on a given number of CPUs is the only measure that counts. It goes with speedup hand in hand anyway.

Talking about Intel MPI: I suppose you will have difficulties introducing this into an open source community...
But what about a contribution, let's say an "Open-Foam-Intel-MPI-Special-Edition"....that would be nice, right? ;-))

Kind regards
Christian

Christian,

thanks for the comment.
Well, it could be that OpenFOAM's install scripts provide a branch for linking other MPIs: if a user has the library, it works; if not, it doesn't, and they continue to use Open MPI. Actually pretty easy. Thanks to the nice encapsulation in libPstream.so, this could leave 99.9% of the code and compilation untouched.

That's true. I compiled OpenFOAM on a proprietary network using the vendor's MPI library. It's really very easy!
And the performance improvement was fantastic in my case, especially with a small number of cells per CPU.

Just a small suggestion.

There are various simple test cases in the literature, with all details needed to reproduce them. For example, a simple but computationally intensive test case is a direct numerical simulation in a channel flow, with predetermined flow conditions and solver settings.

This kind of test case is easily scalable, adaptable to high computational resources, and not covered by secrecy agreements.

This would allow Intel to make results public, with detailed information and specific hints on how to get them, increasing its credibility.

With kind regards,
Alberto

Just some links to test cases:

- Ercoftac database: http://cfd.mace.manchester.ac.uk/cgi-bin/cfddb/ezdb.cgi?ercdb+search+retrieve+&& &*%%%%dm=Line

- iCFD database (cases with detailed results too) http://cfd.cineca.it/cfd