CFD Online Discussion Forums - mpirun, best parameters

mpirun, best parameters

Hello,

I am running a test on a 64-core machine (4 x 16 cores), so all the cores are in a single machine. My mesh is partitioned to run on 8 cores (8 partitions), so I can run 8 jobs at the same time, something like:

mpirun -np 8 myApp test1
mpirun -np 8 myApp test2
.
.
.
mpirun -np 8 myApp test8

My problem is that performance drops (the runs slow down) when I run more than 4 jobs at the same time.

Any idea what the best parameters are to make mpirun perform better on localhost?

Best regards,

Pablo

Pablo, I'm really glad that you are running this test, and I think there are a lot of us who would love to see your results, if you are able to share them with us. Can you give more detail on the machine? 4X16 sounds like a 4-socket Opteron, but it would help to know more. As you may have seen from previous discussions on this forum, there are some reservations about machines like this (many-core SMP). One concern is the amount of memory bandwidth that is required. We are seeing with some (most, all?) of the CFD codes that the performance is constrained more by memory speed than by clock speed or even number of cores. Secondly, if these are Bulldozer Opterons, there are only half as many floating point processors as there are cores, so it may not be strange if the performance drops off once you get to 50% loading. Having said that, I think the reason that AMD made this decision is precisely because memory is more of a bottleneck than processing capability.

In terms of answering the actual question that you asked, take a look at processor and memory affinity as discussed in this page: http://www.open-mpi.org/faq/?categor...paffinity-defs

It seems very likely that your system could be slowing down as threads and data are shifted around between the different physical CPUs and memory banks. I would love to know if setting the affinity has an effect on the performance.

The machine is 4 x AMD Opteron(TM) 16 Core 6274, 2.2 GHz, 16 MB, 115 W, 32 nm.
The machine is busy right now; tomorrow I will try affinity options like
--cpu-set etc ...

Please let us know how it works out. There are some people on this forum who are very interested in these results, because this machine on paper looks like a cost-effective way of getting 64 parallel processes. However, some, like myself, have had very unsatisfactory experiences in the past with computers using more than two sockets per system, and it would be very valuable to know if setting the affinity appropriately helps.

If the floating point unit really is shared by 2 cores, there is not much more to say. How can I find out? It is the 6274 model.

It is indeed AMD's design philosophy with the latest family of "Bulldozer" CPUs to work with core "modules", where each module consists of two integer processing cores and a single floating point unit. See http://en.wikipedia.org/wiki/Bulldoz...rchitecture%29 and http://en.wikipedia.org/wiki/File:AM...ore_CPU%29.PNG .

It is not quite as simple as saying that there is just a single floating point processor for every two cores, because of the way that the instruction pipelines work. In practice, it appears that modern floating point processors can process data much more quickly than the data can be provided by memory, so one can see why doing it this way may be a good idea. However, some benchmark results have been disappointing, and there is some uncertainty about whether this is really fundamentally due to the "shared" FPU being a bad idea, or if it is a case of the compilers and (to a lesser degree) operating systems still catching up to the way the new hardware works. There are benchmarks (see SpecFPrate at spec.org) that indicate that these CPUs do in fact scale very well for parallel floating point processing, but it should be noted that the SPEC benchmark codes can be specially compiled for the hardware. The good SPEC results have been obtained with AMD's Open64 compiler, which I think is perhaps not widely used to compile most commercial codes.
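As a rough check of the "50% loading" point, the arithmetic for this 4 x 16-core machine can be sketched in Python. It assumes the two-cores-per-module layout described above and, as a simplification, that the OS happens to place one process per module before doubling up. Neither that scheduler behavior nor the variable names come from the thread.

```python
# Rough arithmetic for the module layout described above.
# Assumption: 2 integer cores share 1 floating point unit (one "module").
SOCKETS = 4
CORES_PER_SOCKET = 16
CORES_PER_MODULE = 2
PROCESSES_PER_JOB = 8  # the -np 8 jobs from the first post

total_cores = SOCKETS * CORES_PER_SOCKET
total_fpus = total_cores // CORES_PER_MODULE

# Number of 8-process jobs that fit before two processes must share an FPU,
# assuming (optimistically) one process per module until modules run out.
jobs_without_fpu_sharing = total_fpus // PROCESSES_PER_JOB

print(f"{total_cores} cores, {total_fpus} FPUs")
print(f"{jobs_without_fpu_sharing} jobs of {PROCESSES_PER_JOB} processes before FPUs are shared")
```

Under these assumptions the limit falls at 4 concurrent jobs, the same point at which the slowdown was reported, although memory bandwidth limits could produce a similar threshold.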

Well, with -O3 -mprefer-avx128 -ftree-vectorize I got a 15 to 20% improvement in speed.

I am using gcc 4.6 but, as always, if I use more than 50% of the cores the performance drops drastically. No doubt a shared FPU is not a good idea for numerical methods.

I will still try compiling with Open64, which I have never used before.

I was also thinking of compiling in single precision, hoping to take load off the FPU. What do you think?

Well, what I think is that I would really like to see what results you get from this experiment! It may actually be an idea to talk to AMD about it. It has been their decision to change to this rather unusual design, and I think they have a responsibility to support the HPC community with advice about how to get the best out of it.

How to run concurrent MPI jobs within a node or set of nodes

Hello CFD community.
I was contacted by this CFD community to try to answer this thread. I titled my response with the solution to the problem.
The response applies not only to parallel CFD applications but to any MPI application.
Most MPIs do not apply process/thread affinity by default. You must explicitly pass the appropriate flag to the mpirun/mpiexec command to enable specific affinity.
openMPI is an exception, though: it applies affinity by default.
In any case, whether you use the whole machine or part of it for your MPI runs, it is good practice to enforce affinity.
OpenMPI will by default associate MPI ranks 0,1,2.. with core ids 0,1,2..
How the cores are labeled depends on how the BIOS of a given system "discovers" the processors and their cores. You can use the hwloc toolset, which also ships with openMPI, to find where each core is physically located, i.e. on which numanode and socket. For AMD processors (6100 / 6200 series), 2 numanodes go into the same package or socket. The lstopo tool will show you all the numanodes and all the cores of each socket.
Although I will focus on AMD processors, the same applies to any other processor technology. Say I take a 16-core Interlagos (6200 series) processor. As I was saying, some BIOSes will number the cores of a 16-core Interlagos processor as 0,1,2..7 (first numanode), 8,9,..14,15 (second numanode). Then on the second socket come 16,17,..22,23 (third numanode) and 24,25,..30,31 (fourth numanode). If the system has 4 sockets, it will go on up to core ids ..62,63 (eighth numanode). On Linux you can also use the numactl --hardware command to see the mapping of core ids to numanode ids.
See below the output for a 2P system with Interlagos processors with this order of core ids.
# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16382 MB
node 0 free: 15100 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 14975 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15451 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15239 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

Let's call this the "natural" labeling of core ids.
Other BIOS developers may decide to apply a "round robin" labeling that cycles through the numanodes.

So it would look like:

node 0 cpus: 0 4 8 12 16 20 24 28
node 1 cpus: 1 5 9 13 17 21 25 29
node 2 cpus: 2 6 10 14 18 22 26 30
node 3 cpus: 3 7 11 15 19 23 27 31
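To tell which of the two labelings a machine uses, the "node N cpus:" lines of numactl --hardware can be parsed and the spacing between core ids inspected. The following is a minimal Python sketch under that assumption; the function names and the "other" fallback are illustrative and not part of numactl or any MPI.

```python
import re


def parse_node_cpus(text):
    """Return {numanode id: sorted core ids} from `numactl --hardware` text."""
    node_cpus = {}
    for line in text.splitlines():
        match = re.match(r"\s*node\s+(\d+)\s+cpus:\s*(.*)", line)
        if match:
            cpus = sorted(int(c) for c in match.group(2).split())
            node_cpus[int(match.group(1))] = cpus
    return node_cpus


def classify_labeling(node_cpus):
    """Name the labeling by the spacing between core ids inside each numanode."""
    strides = {b - a for cpus in node_cpus.values() for a, b in zip(cpus, cpus[1:])}
    if strides == {1}:
        return "natural"
    if strides == {len(node_cpus)}:
        return "round robin"
    return "other"


# The two listings shown above (size/free lines are ignored by the parser).
NATURAL = """\
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16382 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 2 cpus: 16 17 18 19 20 21 22 23
node 3 cpus: 24 25 26 27 28 29 30 31
"""

ROUND_ROBIN = """\
node 0 cpus: 0 4 8 12 16 20 24 28
node 1 cpus: 1 5 9 13 17 21 25 29
node 2 cpus: 2 6 10 14 18 22 26 30
node 3 cpus: 3 7 11 15 19 23 27 31
"""

for name, text in (("first listing", NATURAL), ("second listing", ROUND_ROBIN)):
    print(name, "->", classify_labeling(parse_node_cpus(text)))
```

On a live system the text could instead be captured with subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout.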

It is important to understand that the memory controller associated with each numanode has a limited capability of X GB/s (i.e. memory bandwidth). Without going into a competitive analysis of systems, which would divert our attention, I would say that CFD applications are by nature fairly, if not extremely, memory bandwidth bound. For any multicore processor, 3 of the 4, 6 or 8 cores within the numanode are sufficient to saturate the memory controller for such applications. This is important to understand because it must be factored into the performance analysis when using affinity to control where you place the MPI processes.

Now let's assume your system has, like this one, a "natural" labeling.
When you use -cpu_bind on hpmpi or the defaults on openmpi, an 8-process MPI job (i.e. -np 8) will use core ids 0,1,..7.

It will place the MPI job on those first 8 core ids, which happen to be on the first numanode of the 2-socket Interlagos-based system. That will use only the memory controller of the first numanode and deliver a performance Y (inverse of elapsed time, or jobs per day) proportional to the memory controller's performance X.

If the labeling were "round robin", you would still run on core ids 0,1...6,7, but they would physically be spread over all the numanodes 0,1,2 and 3, using 2 cores per numanode. Therefore all the memory controllers would be used and the 8-process MPI job would enjoy the memory throughput of all of them, delivering a performance of 4Y (e.g. jobs per day), proportional to 4X (4*X memory bandwidth).
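The difference between the two cases can be checked by counting which numanodes, and hence which memory controllers, core ids 0..7 land on. This is a small Python sketch assuming the 4-numanode, 8-cores-per-numanode layout of the example system; the helper names are made up for illustration, and the "about nY" figure follows the simplified model above rather than any measurement.

```python
def node_of_core(core_id, labeling, n_nodes=4, cores_per_node=8):
    """Numanode that owns a core id under the two labelings described above."""
    if labeling == "natural":
        return core_id // cores_per_node
    if labeling == "round robin":
        return core_id % n_nodes
    raise ValueError(f"unknown labeling: {labeling}")


def controllers_used(core_ids, labeling):
    """Sorted list of numanodes (memory controllers) touched by a set of cores."""
    return sorted({node_of_core(core, labeling) for core in core_ids})


job_cores = range(8)  # default placement of an -np 8 job: core ids 0..7
for labeling in ("natural", "round robin"):
    nodes = controllers_used(job_cores, labeling)
    print(f"{labeling}: numanodes {nodes} -> about {len(nodes)}Y")
```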

Now we want to add a second MPI job, since the machine has a total of 32 cores and only 8 are in use. So we can have 3 more jobs, each running with 8 MPI processes.

But you must first understand where the MPI runtime is placing the processes.
As I mentioned earlier, when you apply affinity, the default mechanism is to put them on 0,1,2,..6,7. So if the first job, job0, is already running on core ids 0,1,..6,7 and you add jobs 1, 2 and 3, they will all be placed on the same physical core ids 0,1,2..6,7. Those cores are then overloaded with 4 workloads, and the OS has to give each job 1/4 of their performance.

That is certainly not what you want. So you need better control of the binding of MPI processes, specifying per MPI job which core ids its processes get bound to.
What you want to do then is to place each job on its own socket, or go to a finer granularity, at the numanode level.
So I would associate MPI job 0 with numanode 0, MPI job 1 with numanode 1, and so on.
You can go finer still by telling the binding command which core ids you want. This is actually better, since binding per socket or numanode still allows an MPI process to bounce to another core id as long as it is in the same numanode, losing what it had in cache at that instant because the other core holds different data in its cache.
So core binding is very precise, but you have to know where the set of cores you want to use for each MPI job is physically located.
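Before launching anything, a placement plan can be checked for cores that would receive more than one MPI process. The sketch below compares the default placement (every job on core ids 0..7) with one numanode per job under natural labeling; the plan dictionaries and function names are illustrative assumptions.

```python
from collections import Counter


def core_load(placements):
    """Count how many MPI processes land on each core id."""
    load = Counter()
    for core_ids in placements.values():
        load.update(core_ids)
    return load


def oversubscribed(placements):
    """Return {core id: process count} for cores holding more than one process."""
    return {core: n for core, n in core_load(placements).items() if n > 1}


# Default placement: all four jobs end up on core ids 0..7.
default_plan = {f"job{j}": list(range(8)) for j in range(4)}

# One numanode per job, natural labeling: job j gets core ids 8j..8j+7.
per_node_plan = {f"job{j}": list(range(8 * j, 8 * j + 8)) for j in range(4)}

print("default plan:", oversubscribed(default_plan))
print("per-numanode plan:", oversubscribed(per_node_plan))
```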

For the performance and productivity of your experiments to make sense, you need to be careful with "round robin" labeling, since you will have to explicitly tell the first MPI job to use core ids 0,4,8,12,16,20,24 and 28, while on natural labeling it will be 0,1,2,3,4,5,6,7.

Depending on the MPI implementation, you will be asked to pass CPU_MAP=the list of core ids (that is for hpmpi), while openmpi relies on the hwloc tool, where you can say something like -bind-to-core -cpu-set=the list of core ids.

I personally have little capacity for remembering things, so I use the same approach for all MPIs regardless of their binding interface.

For that I use the application file and the numactl command, which work on all systems and MPIs.
So I would do something like the following:
"mpirun --appfile appfile_job0" for the first job
"mpirun --appfile appfile_job1" for the second job

and so on until job3,

and the contents of appfile_job0 would be:
-np 8 numactl --cpunodebind=0 ./path_to_application arguments

if I did the binding at the granularity of the numanode.
But usually I would bind at the core id level, so for natural labeling I would write:
cat appfile_job0
-np 1 numactl --physcpubind=0 ./path_to_application arguments
-np 1 numactl --physcpubind=1 ./path_to_application
-np 1 numactl --physcpubind=2 ./path_to_application
-np 1 numactl --physcpubind=3 ./path_to_application
-np 1 numactl --physcpubind=4 ./path_to_application
-np 1 numactl --physcpubind=5 ./path_to_application
-np 1 numactl --physcpubind=6 ./path_to_application
-np 1 numactl --physcpubind=7 ./path_to_application

For round robin labeling
cat appfile_job0_round_robin
-np 1 numactl --physcpubind=0 ./path_to_application arguments
-np 1 numactl --physcpubind=4 ./path_to_application
-np 1 numactl --physcpubind=8 ./path_to_application
-np 1 numactl --physcpubind=12 ./path_to_application
-np 1 numactl --physcpubind=16 ./path_to_application
-np 1 numactl --physcpubind=20 ./path_to_application
-np 1 numactl --physcpubind=24 ./path_to_application
-np 1 numactl --physcpubind=28 ./path_to_application

for the last job, job3 running on numanode 3, the application file on natural labeling would look like
cat appfile_job3
-np 1 numactl --physcpubind=24 ./path_to_application arguments
-np 1 numactl --physcpubind=25 ./path_to_application
-np 1 numactl --physcpubind=26 ./path_to_application
-np 1 numactl --physcpubind=27 ./path_to_application
-np 1 numactl --physcpubind=28 ./path_to_application
-np 1 numactl --physcpubind=29 ./path_to_application
-np 1 numactl --physcpubind=30 ./path_to_application
-np 1 numactl --physcpubind=31 ./path_to_application
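Writing these files by hand is error-prone, so the lines can be generated. The following Python sketch assumes the 4-numanode, 8-cores-per-numanode layout used in this example and, unlike the listings above, repeats the arguments on every line; the function names are hypothetical, and ./path_to_application and arguments are the same placeholders as above.

```python
def cores_for_node(node, labeling, n_nodes=4, cores_per_node=8):
    """Core ids that belong to one numanode under the given labeling."""
    if labeling == "natural":
        start = node * cores_per_node
        return list(range(start, start + cores_per_node))
    if labeling == "round robin":
        return list(range(node, n_nodes * cores_per_node, n_nodes))
    raise ValueError(f"unknown labeling: {labeling}")


def appfile_lines(core_ids, app="./path_to_application", args="arguments"):
    """One appfile line per MPI process, each pinned to a single core id."""
    return [f"-np 1 numactl --physcpubind={core} {app} {args}" for core in core_ids]


def write_appfile(path, lines):
    """Write the lines to an application file, one per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Print the four natural-labeling appfiles instead of writing them.
    for job in range(4):
        print(f"# appfile_job{job}")
        print("\n".join(appfile_lines(cores_for_node(job, "natural"))))
```

Calling write_appfile("appfile_job0", appfile_lines(cores_for_node(0, "round robin"))) would produce the round robin variant for the first job.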

For HPMPI that's it, but openMPI binds by default, so its default binding will conflict with the binding in the application file. Therefore you must disable the default binding by passing --bind-to-none to the mpirun command.

So for openMPI you would run each job as:
mpirun --bind-to-none --appfile appfile_job0
mpirun --bind-to-none --appfile appfile_job1
mpirun --bind-to-none --appfile appfile_job2
mpirun --bind-to-none --appfile appfile_job3

Then you can run them concurrently, either by submitting them through a job scheduler or manually, for example:
mpirun --bind-to-none --appfile appfile_job0 > job0.log &
mpirun --bind-to-none --appfile appfile_job1 > job1.log &
mpirun --bind-to-none --appfile appfile_job2 > job2.log &
mpirun --bind-to-none --appfile appfile_job3 > job3.log &
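The same concurrent launch can be scripted, which also makes it easy to wait for all jobs and collect their exit codes. This is a Python sketch assuming mpirun is on the PATH and the appfile_jobN files already exist; build_command and launch_all are illustrative names. The --bind-to-none spelling is the one used in this post; some Open MPI versions may expect a different spelling, so check mpirun --help on your installation.

```python
import subprocess


def build_command(job, bind_to_none=True):
    """mpirun command line for one job, as a list of arguments."""
    cmd = ["mpirun"]
    if bind_to_none:
        cmd.append("--bind-to-none")
    cmd += ["--appfile", f"appfile_job{job}"]
    return cmd


def launch_all(n_jobs):
    """Start all jobs at once, log stdout to jobN.log, return their exit codes."""
    running = []
    for job in range(n_jobs):
        log = open(f"job{job}.log", "w")
        running.append((subprocess.Popen(build_command(job), stdout=log), log))
    exit_codes = []
    for proc, log in running:
        exit_codes.append(proc.wait())
        log.close()
    return exit_codes


if __name__ == "__main__":
    # Only print the commands here; call launch_all(4) to actually run them.
    for job in range(4):
        print(" ".join(build_command(job)))
```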

The elapsed time, or performance, should be the same for each run if you are running the same problem. In this way you guarantee that numanode 0 is used only by job0, delivering performance Y. The other jobs 1, 2 and 3 should deliver the same performance Y, making your total productivity 4Y.

One last thing: if by mistake you use the natural-labeling application file on a system with round robin labeling, each numanode will be pounded by 4 jobs rather than 1 job per numanode, but since you are still placing each MPI process on specific core ids in a properly distributed and balanced way, you will get the same total productivity. If you do a scaling analysis on round robin labeling, however, you will find that when only the first job is running, it gets the power of all the memory controllers and delivers excellent performance, around 4Y. When you add the second job, each memory controller has to share its capability between the two concurrent jobs, so very likely each of those 2 jobs will run at only 2Y. When you add 2 more jobs, each will run at only 1Y, the total productivity still being 4Y. If you did the benchmarking experiment on natural labeling, then when you run only 1 job you get only 1Y, since it uses a single memory controller. Then you add the second job on a different resource, i.e. a different memory controller, so it won't affect the performance of job0. And so on until job3 on memory controller 3. So with all 4 jobs running, your total productivity is still 4Y, i.e. 1Y per job.
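The scaling behavior described here can be reproduced with a toy model in Python. It assumes, as a simplification of the reasoning above, that each numanode's memory controller delivers Y in total and splits it equally among the MPI processes bound to that numanode, and that job j always uses core ids 8j..8j+7 (the natural-labeling application files). The model and its function names are illustrative and not a measurement.

```python
from collections import Counter


def node_of_core(core_id, labeling, n_nodes=4, cores_per_node=8):
    """Numanode that owns a core id under the two labelings shown earlier."""
    if labeling == "natural":
        return core_id // cores_per_node
    if labeling == "round robin":
        return core_id % n_nodes
    raise ValueError(f"unknown labeling: {labeling}")


def per_job_performance(n_jobs, labeling, procs_per_job=8):
    """Performance of each job in units of Y under the toy sharing model."""
    jobs = [range(j * procs_per_job, (j + 1) * procs_per_job) for j in range(n_jobs)]
    # How many MPI processes share each numanode's memory controller.
    procs_on_node = Counter(node_of_core(c, labeling) for job in jobs for c in job)
    return [
        sum(1.0 / procs_on_node[node_of_core(c, labeling)] for c in job)
        for job in jobs
    ]


for labeling in ("natural", "round robin"):
    for n_jobs in (1, 2, 4):
        perf = per_job_performance(n_jobs, labeling)
        print(f"{labeling}, {n_jobs} job(s): per job {perf[0]:.2f}Y, total {sum(perf):.2f}Y")
```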

Hopefully this helps the CFD community better understand how this works and how to exploit NUMA systems.

Ah, if you want to extend the use of the application file to cluster runs with concurrent jobs per machine, you just need to prepend -host machinename before -np 1 on each line of the application file.
The application file in fact allows you to do MPMD, i.e. different executables with different arguments, placed anywhere you want, talking to each other.
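For the cluster case, the prefix can be added programmatically to existing application file lines. This is a short Python sketch; add_host is a made-up helper and node01 is a hypothetical machine name.

```python
def add_host(appfile_lines, hostname):
    """Prefix every appfile line with -host <hostname>."""
    return [f"-host {hostname} {line}" for line in appfile_lines]


# Core-level appfile lines for job0 on natural labeling, as shown earlier.
job0 = [
    f"-np 1 numactl --physcpubind={core} ./path_to_application arguments"
    for core in range(8)
]

job0_on_node01 = add_host(job0, "node01")  # "node01" is a placeholder name
print("\n".join(job0_on_node01))
```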

Best regards,
Joshua Mora.

Thank you very much Joshua,

I will try it; if it works, it will be great news.

Pablo

Hello Joshua,

I followed your instructions, and now I can run 8 jobs at the same time (8 cores per job) with every job taking the same time to finish, so without the dramatic loss of speed (a great step).

But every job needs double the computational time compared to running only one job (8 cores/job).

Method 1:
1 second of simulation needs 1000 seconds of computational time for a single job using:
mpirun -np 8 myApp
I guess that it does not run into trouble with memory.

Method 2:
1 second of simulation needs 2000 seconds of computational time per job using:
mpirun -np 8 numactl --cpunodebind=0 myApp
mpirun -np 8 numactl --cpunodebind=1 myApp
.......
......
.......
mpirun -np 8 numactl --cpunodebind=7 myApp
I guess that it saturates the memory.

Method 1 performs very poorly with more than 1 job; with more than 4 jobs the speed drops drastically.

Method 2 is slower, but every job takes the same time, with no loss of performance even with all cores working.
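To put the two methods side by side, the total wall time for a batch of cases can be compared. The Python sketch below uses only the timings quoted above and assumes 8 test cases of 1 simulated second each, with Method 1 running them one at a time (because concurrent runs were reported to degrade) and Method 2 running all 8 at once. The function name is illustrative.

```python
def wall_time_for_cases(n_cases, seconds_per_case, concurrent_jobs):
    """Wall time to finish n_cases when concurrent_jobs run at once, in waves."""
    waves = -(-n_cases // concurrent_jobs)  # ceiling division
    return waves * seconds_per_case


# Method 1: one unbound job at a time, 1000 s per simulated second.
method1 = wall_time_for_cases(8, 1000, concurrent_jobs=1)

# Method 2: eight numactl-bound jobs at once, 2000 s per simulated second each.
method2 = wall_time_for_cases(8, 2000, concurrent_jobs=8)

print(f"Method 1: {method1} s, Method 2: {method2} s, ratio {method1 / method2:.1f}")
```

Under these assumptions each bound job is twice as slow, but the batch as a whole finishes sooner because all eight jobs run at the same time.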

Any idea how to improve the AMD machine a little bit more?

Pablo

Pablo, do you have the specifications for the memory? It seems that most of the Opteron 6200-family motherboards will allow you to use up to 1600 MHz memory, but DDR3 ECC 1600 is not actually that common. So if the system is using 1333 MHz memory, there would be some performance to be gained by upgrading to the faster memory.

Charles, the machine is:
4 x AMD Opteron(TM) 16 Core 6274, 2.2 GHz, 16 MB, 115 W, 32 nm
16 x 4GB DDR3 1333 ECC REG
Is it possible to find out how I can upgrade it to 1600?

Pablo, I think you can just replace the DDR3 1333 ECC memory with DDR3 1600 ECC, and check that the BIOS actually implements the 1600 MHz speed. As far as I know (but check the documentation ....) if the modules are not bigger than 4 GB each, the memory does not need to be of the "Registered" variety. It still needs to be ECC though. This is an example of the type: http://www.crucial.com/store/partspe...=CT51272BD160B If what people are saying about the memory bandwidth limitation is true, this might get you close to a 20% improvement.

Not all boards support 1600 MHz. Check with your board vendor or system integrator. But it is true that 1600 MHz will lift the performance of CFD apps by about 9-10%.
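The "close to 20%" estimate corresponds to the ratio of the nominal memory speeds, which is an upper bound on what a purely bandwidth-bound code could gain. A one-line check in Python (the function name is illustrative):

```python
def nominal_bandwidth_gain(old_mhz, new_mhz):
    """Fractional increase in nominal memory transfer rate."""
    return new_mhz / old_mhz - 1.0


gain = nominal_bandwidth_gain(1333, 1600)
print(f"nominal gain from 1333 to 1600: {gain:.1%}")
```

The 9-10% figure quoted above for CFD apps is about half of this nominal ratio.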

Pablo, did you forget to add the --bind-to-none flag for each MPI job?
I'd also like to mention that the analysis assumed all the work happens between the CPUs and memory, with no other resources in use. You could, for instance, be making heavy use of storage for reading and writing (such as checkpointing). In that case, several resources involved in that file I/O activity are shared among all the MPI jobs. You may want to have, for instance, 1 HD per MPI job.
Without seeing your workload I cannot guess further at what is going on or see how resources are being used, so I cannot tell whether there is true contention somewhere.

Hi Joshua,

For the tests, I disabled everything that was writing to the hard disk; it only writes when the calculation finishes.

I only used mpirun -np 8 numactl --cpunodebind=0 myApp. I also tried writing an application file with core ids and --bind-to-none as in your explanation, and the times are really similar.

Pablo

Outstanding tutorial and a great primer for utilizing numactl's other options such as cpunodebind. This was a huge help. Thanks.