Meet OpenHyperFLOW2D

(includes serial and parallel MPI, OpenMP, and CUDA versions)

(Click the image for an animation of the transient solution)

For more than 4 involved cores, the dependence of the efficiency on the number of cores is fairly steadily linear and can be approximated by

Code:

E(n) = E0 - k * n; (1)

Code:

Sp(n) = E(n) * n; (2)

Code:

Sp'(n) = E0 - 2 * k * n = 0; (3)

n = E0 / (2 * k); (4)

For this particular case, we have the following approximation coefficients:

E0 = 0.762 - Parallelization efficiency for a single node (4 cores)

k = 0.0019737 - Rate of decline of the parallelization efficiency with the number of cores.

Substituting these coefficients into formula (4), we find that the speedup stops growing at ~193 cores.

Let us reconstruct, on the basis of (2), the dependence of the speedup on the number of cores and compare it with the experimentally obtained dependence.

Seems about right?

How universal the coefficients of this efficiency fit are, and how they depend on the size and nature of the computational domain as well as on the hardware, can be judged only after further tests on other cases.

This simple estimation model of the solver's scalability makes it possible to assess and allocate computing resources more accurately when planning a series of numerical experiments.

The scalability of the solver is significantly improved by using the non-blocking MPI calls Isend() and Irecv() for the halo exchanges. This allows data transfer between subdomains to proceed truly in parallel.
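A minimal sketch of what the non-blocking variant of the halo exchange might look like, following the buffer and tag names of the blocking version shown later in the post (this is an assumed illustration, not the solver's actual code, and requires an MPI environment):

Code:

```cpp
// Sketch of a non-blocking halo exchange. All four transfers are
// started first and completed together, so neighbor exchanges overlap
// instead of serializing along the chain of subdomains.
MPI::Request req[4];
int n_req = 0;

if (rank < last_rank) {
    // Send tail / receive head to/from the next subdomain
    req[n_req++] = MPI::COMM_WORLD.Isend(TailSendBuffer, TailSize,
                                         MPI::BYTE, rank + 1, tag_Tail);
    req[n_req++] = MPI::COMM_WORLD.Irecv(HeadRecvBuffer, HeadSize,
                                         MPI::BYTE, rank + 1, tag_Head);
}
if (rank > 0) {
    // Receive tail / send head from/to the previous subdomain
    req[n_req++] = MPI::COMM_WORLD.Irecv(RecvTailBuffer, TailSize,
                                         MPI::BYTE, rank - 1, tag_Tail);
    req[n_req++] = MPI::COMM_WORLD.Isend(SendHeadBuffer, HeadSize,
                                         MPI::BYTE, rank - 1, tag_Head);
}
// Complete all outstanding transfers before using the halo data.
MPI::Request::Waitall(n_req, req);
```

Between starting the transfers and the Waitall(), the solver can in principle also update the interior nodes that do not depend on halo data, hiding part of the communication cost.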

The following figure shows the so-called "parallelization efficiency", calculated as the ratio of the real to the ideal speedup, as a function of the number of involved cores.

Analyzing the figures above, we see that although the net performance of the solver increases, the "parallelization efficiency" decreases.

to be continued...

The computational domain is divided into several subdomains along one (usually the longest) dimension. The number of subdomains usually corresponds to the number of involved cores. The main data exchanges take place between neighboring subdomains. The size of the data transmitted between adjacent subdomains in the general case depends on many factors (the number of equations in the system, the size of the computational domain in the other dimensions, the order of the numerical scheme). However, this size does not depend on the number of subdomains into which the computational domain is divided, and it remains constant for a single subdomain regardless of its size along the partitioned dimension.

Consider a single subdomain in more detail. In the case of 1D decomposition, data is exchanged between the "head" and "tail" of neighboring subdomains (aka "halo exchange"). The only exceptions are the first and last subdomains, the first of which has only the "head" and the last only the "tail" (see Fig. 1).

Fig. 1

Code:

// --- Halo exchange ---
if (rank < last_rank) {
    // Send tail
    MPI::COMM_WORLD.Send(TailSendBuffer, TailSize, MPI::BYTE, rank+1, tag_Tail);
    // Receive head
    MPI::COMM_WORLD.Recv(HeadRecvBuffer, HeadSize, MPI::BYTE, rank+1, tag_Head);
}
if (rank > 0) {
    // Receive tail
    MPI::COMM_WORLD.Recv(RecvTailBuffer, TailSize, MPI::BYTE, rank-1, tag_Tail);
    // Send head
    MPI::COMM_WORLD.Send(SendHeadBuffer, HeadSize, MPI::BYTE, rank-1, tag_Head);
}

Thus, the data transfer operation is the "non-parallelizable" piece of code that corresponds to the serial fraction in Amdahl's law. Moreover, as the number of subdomains grows, the share of this part of the code increases for each subdomain. This fact is well illustrated by the experimental curves below.

to be continued...

The figure shows the dependence of the speedup of parallel code on the number of processor cores for different fractions of time spent in the non-parallelized part. This diagram interested me because, just before coming across it, I had tested the scalability of one of my old 2D research parallel solvers.

We call it T-DEEPS2D (Transient, Density-based Explicit Effective Parallel Solver for 2D compressible NS). For the test, the problem of simulating the Von Karman vortex street near a 2D cylinder on a uniform mesh of 500x500 nodes was used.

This solver uses various types of balanced 1D domain decomposition with halo exchanges. The measurements were performed on two clusters with different numbers of nodes and different types of processors. The executable code was produced by the same compiler (Intel C++ 10.1), but for different processors. The solver uses non-blocking MPI calls (there are also versions of the solver with blocking MPI calls and an OpenMP version, but more on that later). Both clusters use InfiniBand for the interconnect. So, my attention was attracted by the speedup figures in the graphs. According to my tests, it turned out that on 100 cores my parallel solver performs much better than the 95% curve. I simply took the formula of Amdahl's law and, knowing the real speedup and the number of cores, built the dependence of the serial fraction on the number of cores. See the results below:

It turns out that the serial fraction is not constant but depends on the number of used cores/subdomains ... and tends to zero :)

For other solvers, the picture is likely to be different.

Does anyone have any ideas about this?