Optimal memory configuration
It is my impression that many CFD hardware solutions are memory bandwidth limited. Thus, it ought to be an important task to find the highest-performing memory configuration for the workstation or cluster nodes in question.
I am in the process of specifying a 32-core mini-cluster based on the Xeon E5-2667 v2, which has 4 memory channels running at up to 1866 MHz. The article below mentions the type of DIMM (UDIMM, RDIMM or LRDIMM) and the number of ranks on each DIMM as important parameters:
I understand that it is vital to utilize all 4 memory channels for each CPU, as 2 channels will only have approx. half the transfer rate. But the importance of number of ranks and DIMM type is new to me.
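As a rough illustration of why the channel count matters, here is a minimal sketch of the theoretical peak bandwidth, assuming a 64-bit (8-byte) DDR3 channel and 1866 million transfers per second. The function name is made up for this example, and real sustained bandwidth will be lower than these peak numbers.

```python
# Theoretical peak DDR3 bandwidth = channels x transfer rate x bytes per transfer.
# Assumption: each DDR3 channel is 64 bits (8 bytes) wide.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


for channels in (2, 4):
    print(f"{channels} channels: {peak_bandwidth_gbs(channels, 1866):.1f} GB/s")
```

With these assumptions, 4 channels give about 59.7 GB/s per CPU and 2 channels about 29.9 GB/s, which matches the "approx. half" statement above.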
The article above basically points out the following guidelines:
1) The more ranks on each DIMM, the more memory can be accessed in parallel.
2) Quad-rank DIMMs should be avoided as they cannot run at full speed. Thus, dual-rank DIMMs give the highest performance.
3) The bus supports up to 8 ranks per channel.
4) The bus supports up to 3 banks of DIMMs per channel. However, only 2 banks can be used at full speed.
5) Use UDIMMs as they are faster than RDIMMs and LRDIMMs.
Use 2 DIMMs of dual rank per memory channel – that is 8 DIMMs per CPU.
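The guidelines above can be sketched as a small checker. This is only an illustration of the rules as summarized in this post, not something taken from the article itself; the function names and constants are hypothetical, and the 4 channels per CPU are the E5-2667 v2 figure mentioned earlier.

```python
# Limits as summarized in the guidelines above (assumptions, not vendor data).
MAX_RANKS_PER_CHANNEL = 8   # guideline 3
MAX_FULL_SPEED_DPC = 2      # guideline 4: only 2 DIMMs per channel at full speed
CHANNELS_PER_CPU = 4        # Xeon E5-2667 v2


def check_config(dimms_per_channel, ranks_per_dimm):
    """Return a list of guideline violations (an empty list means none)."""
    problems = []
    if ranks_per_dimm >= 4:
        problems.append("quad-rank DIMMs cannot run at full speed")
    if dimms_per_channel > MAX_FULL_SPEED_DPC:
        problems.append("more than 2 DIMMs per channel cannot run at full speed")
    if dimms_per_channel * ranks_per_dimm > MAX_RANKS_PER_CHANNEL:
        problems.append("more than 8 ranks per channel")
    return problems


def dimms_per_cpu(dimms_per_channel):
    """Number of DIMMs needed to populate every channel of one CPU."""
    return dimms_per_channel * CHANNELS_PER_CPU


print(check_config(2, 2), dimms_per_cpu(2))
```

For 2 dual-rank DIMMs per channel the checker reports no violations and 8 DIMMs per CPU, in line with the conclusion above.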
Does anyone have CFD performance data comparing different memory configurations covering the number of ranks and DIMM types? Would be great to know the impact on performance by these parameters.
Good question, I would be interested in the answer as well.
Currently I am running a cluster configuration with 6-core i7s and 4x8 GB of RAM. Perhaps it would be better to run 6x6 GB.
I haven't thought about it greatly but input is appreciated from anyone with experience.
I found this very valuable document answering all my questions:
1) Populate all memory channels. Memory bandwidth is almost proportional to the ratio of populated channels (Figure 31).
2) UDIMMs have only 2% less latency than RDIMMs (Figure 10).
3) UDIMMs and RDIMMs have equal effective bandwidth (Figure 11).
4) For RDIMMs, it has no impact on latency whether there are 1 or 2 DIMMs per memory channel (DPC) (Figures 36 and 37).
5) Single-rank RDIMMs have 11% less bandwidth than dual-rank (at 1 DPC) (Figure 37).
Use Dual Rank UDIMMs or RDIMMs, 1-2 DIMMs in each memory channel.
I have been told F1 teams will buy a Xeon cluster, but only run CFD tasks using half the cores. They are limited in the computing power and wind tunnel time they are allowed to use, and get 20% better performance per core hour used (probably some benchmark they have to adhere to). The increase in memory bandwidth per core is said to be the reason for the 20% increase over a fully used cluster half the size.
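To put a number on "memory bandwidth per core", here is a back-of-the-envelope sketch. The F1 hardware is unknown, so this borrows the 8-core E5-2667 v2 from the first post and its theoretical peak of roughly 59.7 GB/s per CPU (4 channels x 1866 MT/s x 8 bytes); both are assumptions for illustration only.

```python
# Assumed figures for illustration: one 8-core CPU with ~59.7 GB/s peak bandwidth.
CPU_PEAK_BANDWIDTH_GBS = 59.7
CORES_PER_CPU = 8


def bandwidth_per_core_gbs(cpu_bandwidth_gbs, active_cores):
    """Peak bandwidth available to each active core, in GB/s."""
    return cpu_bandwidth_gbs / active_cores


all_cores = bandwidth_per_core_gbs(CPU_PEAK_BANDWIDTH_GBS, CORES_PER_CPU)
half_cores = bandwidth_per_core_gbs(CPU_PEAK_BANDWIDTH_GBS, CORES_PER_CPU // 2)
print(f"all cores: {all_cores:.1f} GB/s per core")
print(f"half the cores: {half_cores:.1f} GB/s per core")
```

Running only half the cores doubles the peak bandwidth available to each active core. The claimed gain per core hour is only 20%, which would suggest the solver is partly, not entirely, bandwidth bound.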