CFD Online Discussion Forums


AnhDL May 10, 2020 06:37

No difference in calculation speed between serial and OpenMP code?
 
1 Attachment(s)
Dear experts,

I wrote my own Fortran code for multiphase flow problems. My code was originally in serial form. I submitted the batch job file to use 20 CPUs for my code. The calculation time was much faster than the ifort command only. Here is the batch job file:
#!/bin/bash
#SBATCH --job-name=testOMP
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --exclusive
#SBATCH --time=0-20:00:00
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
time ./2D-test


Then I modified the serial code to OpenMP parallel using !$omp parallel do / !$omp end parallel do. However, there was no difference in calculation speed between the serial and parallel code. The way I modified it is shown in the attached picture (part of the code).

Please give advice. Did I modify the code correctly?
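
(Since the attachment is not reproduced here, a minimal sketch of this kind of parallelized loop nest is shown below; the array names phi, rhs, dt and the bounds nx, ny are made up for illustration, and only the outer j loop is split across threads.)
Code:

!$omp parallel do private(i)
do j = 1, ny                   ! outer loop distributed across threads
   do i = 1, nx                ! inner loop runs serially within each thread
      phi(i, j) = phi(i, j) + dt * rhs(i, j)
   end do
end do
!$omp end parallel do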

flotus1 May 12, 2020 17:12

That's not a whole lot of information. And I don't know what to make of this statement:
Quote:

The calculation time was much faster than the ifort command only
What's the point of comparing compile time to run time? And it immediately raises the question: is your job large enough to scale across 20 threads?

There is not too much to work with and, to be honest, I don't have a whole lot of first-hand experience with OpenMP for this type of structured grid. Here is a more general checklist for "my OpenMP code doesn't scale", in no particular order and definitely incomplete.
  • Did you remember to compile with an OpenMP flag? (See the first sketch after this list.)
  • Is your case large enough to scale across N cores?
  • Are you measuring run time in a meaningful way? Using "time" alone on the whole code, you might be measuring a lot of serial setup time. Instrument your code with timing measurements and/or use profilers (see the first sketch after this list).
  • Are you sure the subroutine you parallelized is the run-time bottleneck? Again, instrument your code with timing measurements and/or use profilers.
  • Do you have a load balancing problem? A static schedule for the OpenMP loop with a smaller chunk size can help (see the second sketch after this list).
  • Is NUMA getting in your way? Setting the loop schedule to static and applying proper "first touch" initialization can help, along with binding the OpenMP threads to individual cores. The latter is always a good idea anyway (see the second sketch after this list).
  • Did you produce race conditions or some other nasty bugs that hinder parallelization? Intel provides handy tools like Inspector. Use them!
  • Still not sure what is holding your code back? Intel VTune Amplifier. Last time I checked, both Inspector and VTune were free to download and use.
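
First sketch, on the compile flag and timing points. This is a self-contained toy example with made-up names; -qopenmp is the Intel compiler flag, gfortran uses -fopenmp.
Code:

! compile with:  ifort -qopenmp -O2 timing_demo.f90 -o timing_demo
! without the OpenMP flag, the !$omp directives are silently ignored
program timing_demo
   use omp_lib
   implicit none
   integer, parameter :: nx = 279, ny = 70
   real(8) :: phi(nx, ny), t0, t1
   integer :: i, j
   phi = 0.0d0
   t0 = omp_get_wtime()          ! wall-clock time, not CPU time
   !$omp parallel do private(i)
   do j = 1, ny
      do i = 1, nx
         phi(i, j) = phi(i, j) + 1.0d0
      end do
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()
   print *, 'loop took', t1 - t0, 's on', omp_get_max_threads(), 'threads'
end program timing_demo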
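
Second sketch, on the scheduling and NUMA points, using the same made-up names: initialize the data in a loop with the same structure and schedule as the compute loop, so each thread first touches the memory it will later work on, and pin the threads via environment variables in the batch job file.
Code:

! first-touch initialization: same loop order and schedule as the compute loop
! (for load imbalance, a smaller chunk can help, e.g. schedule(static, 4),
!  as long as compute and initialization loops use the same schedule)
!$omp parallel do schedule(static) private(i)
do j = 1, ny
   do i = 1, nx
      phi(i, j) = 0.0d0
   end do
end do
!$omp end parallel do

! thread pinning, set in the batch job file before running (bash syntax):
!   export OMP_PLACES=cores
!   export OMP_PROC_BIND=close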

Further reading: https://moodle.rrze.uni-erlangen.de/...iew.php?id=310
Yes, that's the material for a full 3-day course. Turning a serial code into an OpenMP-parallel one is the easy part. Getting decent performance and scaling is the harder part.

pcosta May 15, 2020 13:15

You are only parallelizing the outer loop. You would need to have something like:
Code:

!$OMP DO COLLAPSE(2)
to parallelize the two loops... I don't think that will solve your problem, but you never know.
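
(Applied to the loop sketch above, with the same made-up names, the full directive pair would read something like the following; the collapse clause merges the two loops into a single parallel iteration space.)
Code:

!$omp parallel do collapse(2)
do j = 1, ny
   do i = 1, nx
      phi(i, j) = phi(i, j) + dt * rhs(i, j)
   end do
end do
!$omp end parallel do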

AnhDL May 15, 2020 20:00

Quote:

Originally Posted by pcosta (Post 770675)
You are only parallelizing the outer loop. You would need to have something like:
Code:

!$OMP DO COLLAPSE(2)
to parallelize the two loops... I don't think that will solve your problem, but you never know.

Pcosta,

I tried COLLAPSE(2) and COLLAPSE(3); the result was the same.
I tried to optimize the serial code structure, and the calculation is a bit faster now.
I still don't know how to improve the parallel run. Maybe the parallel calculation cannot show an improvement for such a small number of grid points (279 x 70), or maybe I made a wrong batch job for OpenMP.

sbaffini May 16, 2020 04:17

I strongly suggest avoiding testing on a system with a job scheduler that might be shared with others... at least not for this kind of test, where you don't know if everything works correctly (even if you might be alone on a given node).

These days no PC, or even cell phone for that matter, has a single processor anymore, so you have plenty of options when it comes to testing.

Only go to larger machines when you are absolutely sure that everything works on the smaller ones and, of course, when you know how to use their job schedulers.

In this specific case, however, the problem looks far too small to achieve any relevant speedup in parallel. This would be true with MPI, and it is even more so with OpenMP.

If you are absolutely sure that the only problem is the size of your task, just throw something bigger at it, say 10 or 100 times bigger.

