
hardware for cfd

#1 - November 1, 2007, 23:48 - max (Guest)
I've run a CFD application on a single computer consisting of a motherboard plus daughterboard containing 8 dual-core Opteron processors (16 cores total), running Linux Fedora Core 4 or 5. Each of the eight processors has its own memory bank (NUMA architecture) with plenty of memory (<40% used). Open MPI is used to spread the computation over N cores. The application and case I'm trying to run scale up very nearly linearly for everyone I've talked to, but on the system above the speedup is something like:

1 processor ~ 1X;

4 processors ~ 2.3X (2.3 times faster than 1 processor);

8 processors ~ 3.3X;

16 processors ~ system crashes, must be restarted.

Everyone else claims to get almost 8X for 8 processors (even 16X for 16 processors) using the same software and the same input, just a different system/hardware. I think these other people are using separate computers, each containing 1 or 2 single-core processors, with MPI communication between them over 100g Ethernet or faster.

Does anyone know why the system I've described above would scale up so poorly? Any ideas as to what the bottleneck might be? When I run top it always shows all processes using 99+% CPU, so I thought that meant it was scaling well, but when I look at the runtimes it's clear that something is wrong.

Thanks

#2 - November 2, 2007, 13:07 - TG (Guest)
I obviously don't know what CFD code you are using, but the odds are very good that you are memory-bandwidth limited. There is no way that 16 cores in one machine is as good as 16 cores in 4 or 8 separate machines. If you do any I/O, then all your cores are trying to write through a single I/O bus to the disk. If you do much MPI-style communication between cores, it's all going over the same shared memory path, which is also in contention to supply memory to your cores. Finally, while NUMA machines allow any core to grab any memory, the fastest memory is the memory directly attached to each core. If your program or MPI does nothing to keep memory local, your runtime will go up. You have too many cores in one machine.
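
One rough way to see whether the box is bandwidth-bound rather than compute-bound is to run several copies of a STREAM-style triad and watch whether the aggregate bandwidth keeps growing with the number of workers. The sketch below is only an illustration (Python with NumPy and multiprocessing; the array size and worker counts are arbitrary choices, not anything from this thread):

[code]
# Rough sketch: does aggregate memory bandwidth keep scaling as workers are added?
import time
import numpy as np
from multiprocessing import Process, Queue

N = 10_000_000        # elements per array (~80 MB each, far larger than cache)
REPEATS = 10

def triad(q):
    b = np.random.rand(N)
    c = np.random.rand(N)
    t0 = time.time()
    for _ in range(REPEATS):
        a = b + 2.5 * c                      # STREAM-style triad: read b, read c, write a
    dt = time.time() - t0
    q.put(3 * 8 * N * REPEATS / dt / 1e9)    # GB/s seen by this worker

if __name__ == "__main__":
    for nworkers in (1, 2, 4, 8):
        q = Queue()
        procs = [Process(target=triad, args=(q,)) for _ in range(nworkers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        total = sum(q.get() for _ in range(nworkers))
        print(f"{nworkers} workers: {total:.1f} GB/s aggregate")
[/code]

If the aggregate GB/s flattens out well before 8 workers, the cores are starving for memory bandwidth, which would be consistent with the speedups reported above.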

#3 - November 2, 2007, 16:02 - Ananda Himansu (Guest)
Firstly, your computation should be partitioned into nearly equal amounts, so that the load on any core is nearly equal to that on any other.

Secondly, as TG suggests, you may have less memory bandwidth between any two cores than in other multi-rack / multi-box systems. Computation scales as the subdomain volume while inter-subdomain communication scales as the shared surface area between neighboring subdomains. Optimally, the ratio of computation to communication ought to be scaled (by selection) as the ratio of arithmetic operation speed (MFLOPS) to communication bandwidth per core (MB/s). With a fixed total communication bandwidth, if with 2 cores you are already limited by the total bandwidth, then when you try to scale to 4 or 8 cores by repartitioning a fixed-size problem, you are hit with a double whammy: (a) the communication bandwidth available per core goes down by a factor of 2 or 4, and (b) the ratio of required communication to computation also goes up by a factor that depends additionally on the dimensionality of the simulation.
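
To make the surface-to-volume argument concrete, here is a rough sketch in Python (purely illustrative, assuming an idealized cubic mesh split into equal cubic subdomains, which no real partitioner produces exactly):

[code]
# Communication-to-computation ratio for a fixed-size problem cut into p parts,
# assuming cubic subdomains with one layer of face cells exchanged per neighbor.
def comm_to_comp_ratio(n_cells_total, n_parts):
    cells_per_part = n_cells_total / n_parts
    edge = cells_per_part ** (1.0 / 3.0)   # cells along one edge of a subdomain
    halo_faces = 6 * edge ** 2             # surface cells exchanged with neighbors
    return halo_faces / cells_per_part

for p in (1, 2, 4, 8, 16):                 # example core counts
    print(p, round(comm_to_comp_ratio(4_500_000, p), 4))
[/code]

For a fixed problem size the ratio grows roughly as the cube root of the number of parts, and on top of that the bandwidth available per core shrinks, which is the double whammy described above.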

If you have control over the source code of your application, you can check the linear scalability of your machine by (a) cutting back on your overall problem size (total number of mesh cells) so that when partitioned into 16, the required communication is just less than total available bandwidth, and (b) perhaps introducing redundant computational work in between communication events so that even when partitioned into 16 parts, the progress is paced by the computation rather than the communication. By testing like this, you can find the limits when the simulation starts to scale poorly.
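
The kind of check described above can also be sketched as a toy MPI program: pad each step with a tunable amount of redundant local work and see whether near-linear scaling reappears once computation dominates. A minimal sketch, assuming mpi4py and purely illustrative sizes (this is not the poster's solver):

[code]
# Toy strong-scaling probe: redundant local work vs. one communication event per step.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.random.rand(50_000)   # stand-in for a deliberately small subdomain
redundant_passes = 200           # knob: raise this until computation dominates

t0 = MPI.Wtime()
for step in range(100):
    for _ in range(redundant_passes):
        local = 0.999 * local + 0.001                   # filler arithmetic, no communication
    checksum = comm.allreduce(local.sum(), op=MPI.SUM)  # stand-in communication event
elapsed = MPI.Wtime() - t0

if rank == 0:
    print(f"{comm.Get_size()} ranks: {elapsed:.2f} s (checksum {checksum:.3e})")
[/code]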

#4 - November 2, 2007, 16:33 - Charles (Guest)
Another question: how big is your problem? It needs to be reasonably big to really benefit from a lot of processor cores; for a lot of the commercial codes the suggestion is that you gain very little when partitions drop below about 200,000 cells. Obviously Ananda has a much better understanding of the real situation, but there are some rough rules of thumb about grid size to consider.

#5 - November 2, 2007, 19:45 - max (Guest)
Thanks a lot for the inputs! Here is some additional info:

The speedup I quoted for 1, 4, and 8 processors was measured after removing the daughterboard, so at that point there were 4 processors (8 cores) total. Each processor has its own corresponding MMU with memory, and every MMU has a direct communication line to every other MMU (4*3/2 = 6 pathways). Based on that, I was thinking that memory requests that miss the local memory bank wouldn't hurt too badly: there are 6 different pathways between the 4 memory units plus 4 pathways from each memory unit to its CPU. The OS does have NUMA support and should allocate memory locally as much as possible. I checked the NUMA miss rate (how often a CPU tries to access non-local memory) and it was pretty low, so I think that's OK.

I realize that the two cores on each processor have to compete for memory over the same pathway, but the reason I wasn't considering this too much is that I thought others used dual cores for CFD as well. Is that true?

In CFD computing, is the cache hit rate generally lower than in other applications (more need to go out to RAM, hence a bigger penalty for dual cores)?

Thanks

#6 - November 2, 2007, 19:48 - max (Guest)
And I forgot to mention that there are over 4.5 million cells, so for 8 CPUs it's well over 500K cells per process.

#7 - November 5, 2007, 16:26 - Ananda Himansu (Guest)
Sounds like your NUMA is performing well with localization and with a highly-connected network topology within the box. Also, at over 500K cells per process, the ratio of volume of computation to area of communication is much higher than where you should see the parallel efficiency penalty from communication. In my tests of several years ago on SGI Origin cluster systems, I found that the parallel efficiency began to drop below 90% (speedup was less than linear) when the number of cells per process dipped below about 3000, which is much lower than what you have. Of course, those tests involved inter-processor communication using an explicit time-marching method. If your time-marching is implicit and you parallelize exactly (i.e., not using Schwarz-type subdomain decomposition/iteration of the globally implicit technique), then your communication volume could be significantly more than mine. Also, those SGI Origin systems had relatively slower processors and relatively high-bandwidth network connections. But I would still say off-hand that for such a problem size and for that many cells per process, your computations should scale almost linearly.

One possibility is that the communications between processes have not been scheduled in an orderly fashion, and you may be causing MPI to do a lot of buffering because Processor A sends a message to Processor B, but the latter is not ready for it and is instead waiting for a message from Processor C. This excessive buffering would look like near-deadlock and could cause page faults (swaps to disk) if you are operating near the limit of physically available memory. I used a map-coloring routine to schedule the interprocess communication in stages, so that at each stage disjoint pairs of processors communicate with each other. If that is too much trouble to implement, you could try serializing the communication: designate a master communicating processor (say, Processor 0) acting as a hub, and relay all messages among the other processors through it. This would be the slowest way of communicating, but also the simplest, and with 500K cells per process, the communication should still be negligible compared to the computation. I am assuming, of course, that you are exchanging only surface data at inter-subdomain boundaries, and are not attempting to share the volume (entire subdomain) data between neighboring subdomains.
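
For illustration, here is a minimal sketch of staged, pairwise halo exchange in the spirit of the map-coloring idea (mpi4py with dummy buffers, ranks assumed to form a 1-D chain; this is not the poster's code, and a real mesh would need more stages/colors):

[code]
# Pair ranks off disjointly in two stages so no process buffers an unexpected message.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

halo = np.full(1000, float(rank))   # stand-in for surface data to exchange
recv = np.empty_like(halo)

# Stage 0 pairs (0,1), (2,3), ...; stage 1 pairs (1,2), (3,4), ...
for stage in (0, 1):
    partner = rank + 1 if rank % 2 == stage else rank - 1
    if 0 <= partner < size:
        comm.Sendrecv(sendbuf=halo, dest=partner,
                      recvbuf=recv, source=partner)
[/code]

Sendrecv couples each send with its matching receive, so neither side has to buffer a message it is not ready for, and the two stages keep the communicating pairs disjoint.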

#8 - November 9, 2007, 08:26 - Tim. (Guest)
Hi,

I looked into buying an 8-way (16-core) Opteron system for CFD simulation. When we looked into the system's performance (we evaluated systems from a few vendors; I will not go into details for obvious reasons) we noted that the scalability of the system was *less than we had hoped for*. It appears that the 8 processors cannot all talk directly to each other: data passed between two processors may have to pass through a third processor en route, which creates bandwidth problems and bottlenecks, resulting in poor performance.

I discussed scalability with other 8-way system owners and they noted that stability was also an issue. I understand that changing compiler flags and Linux kernel settings can help but, as I decided not to buy the system, I don't have the details.

Sorry this isn't much help...

Tim.

#9 - November 10, 2007, 12:58 - max (Guest)
Tim,

Do you recall whether the same-motherboard scalability was worse than going through a network? That's what's happening in our case.

With the daughterboard detached, all 4 remaining nodes (8 cores) have a direct HyperTransport link to one another, so I don't think we are suffering from data having to go through additional processors. There's a utility called "numastat", and one of the things it reports is how often memory is requested from a particular node and not found there (forcing the request to go to another node); that actually looks pretty good: only around 1 in 1000 accesses, if I remember correctly. So it's almost always either finding memory on the local node or making a single hop to another node.
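
For reference, the numastat counters mentioned above can be boiled down to a single off-node rate with a short script; this is only an illustrative sketch that parses the default numastat output on a NUMA Linux box:

[code]
# Rough sketch: compute an overall NUMA miss rate from numastat's counters.
import subprocess

out = subprocess.run(["numastat"], capture_output=True, text=True).stdout
totals = {}
for line in out.splitlines():
    fields = line.split()
    if fields and fields[0] in ("numa_hit", "numa_miss"):
        totals[fields[0]] = sum(int(v) for v in fields[1:])

hits, misses = totals.get("numa_hit", 0), totals.get("numa_miss", 0)
if hits + misses:
    print(f"off-node allocation rate: {misses / (hits + misses):.5f}")
[/code]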

Trying to use the daughterboard (with its additional 4 nodes) runs into exactly the problem you mentioned: one node NM on the motherboard has a link to one node ND on the daughterboard, so when a given node on the motherboard wants data it may need to go through NM, then ND, and then on to the desired node on the daughterboard.
