
single i7 MUCH faster than dual xeon E5-2650 v3 !!!

Old   December 14, 2014, 06:24
  #1
Retired Super Moderator
 
Bruno Santos
Join Date: Mar 2009
Location: Lisbon, Portugal
Posts: 10,975
Blog Entries: 45
Rep Power: 128
Greetings to all!

In this post I'll try to answer the questions posed by acasas and by Chris Lee:

@acasas:
Quote:
Originally Posted by acasas View Post
But yet, can you believe, the company kept claiming I was wrong? It is a very important server producer and workstation company from USA, but I won't tell their name. They claim that for their applications, big data storage, labs, etc, this is not an important issue. Are they right??
For the average range of applications, they are somewhat correct: the performance difference is in the range of 1-10%, depending on the application. The problem is that CFD requires a very optimized (or at least very good) system, not an average one. For CFD, memory access is critical, and 2 vs 4 channels can mean a difference in the range of 10 to 30% in performance, depending on the case.

Quote:
Originally Posted by acasas View Post
Anyway, I just wanted to ask 2 more things. This motherboard has 16 memory module slots. Do I need to fill ALL of them (16), or will 8 (4 per processor) be enough? Of course I'll populate them as in the motherboard specification, and yes, they are ECC DDR4.
Each processor/socket has 4 memory channels, which implies that a minimum of 4 modules/slots should be occupied per socket; above that, it should be a multiple of 4. So 8 modules (4 per processor) are enough to use all of the channels on both sockets.
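As a quick way to check a planned module layout against the rule above, here is a minimal sketch in Python. The function name and the list-of-counts input are only for illustration; the 4 channels per socket is the figure given in this post.

```python
CHANNELS_PER_SOCKET = 4  # figure from the post for these CPUs


def is_balanced_population(dimms_per_socket, channels=CHANNELS_PER_SOCKET):
    """Return True if every socket holds a non-zero multiple of `channels` modules."""
    return all(n >= channels and n % channels == 0 for n in dimms_per_socket)


# Dual-socket board with 16 slots, as in the question.
print(is_balanced_population([4, 4]))  # 8 modules, 4 per socket -> True
print(is_balanced_population([8, 8]))  # all 16 slots filled -> True
print(is_balanced_population([8, 0]))  # 8 modules, all on one socket -> False
```

This only checks the counts; which physical slots to use still has to follow the motherboard manual.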

Quote:
Originally Posted by acasas View Post
The 2nd question is related to disks and storage layout. I have 3 SSDs of 250 GB each: one for the system and software, and the other 2 in RAID 0 mode for the working and scratch folders. Is it a good configuration for best performance?
Seems OK. It depends on how frequently you need to write the data to disk and how big your cases are for each time/iteration snapshot. It might make more sense to have smaller SSDs for RAID 0 and to have a 2-4TB hard-drive for off-loading data after writing to SSD is complete. But again, it strongly depends on your workflow and on how often and how large the files are.

-----------------------
@Chris Lee:
Quote:
Originally Posted by Chris Lee View Post
The system I am shopping has options to go up to 12 cores. There are, for example, options on the configuration which include
E5-1660 V2 , and
i7-4960X, and
E5-2680 V2 .
We might not yet have the technology to make a car compact itself into a briefcase, but at least computers are getting there. Perhaps the cartoon "The Jetsons" was actually referring to teleworking...

Quote:
Originally Posted by Chris Lee View Post
For RAM, being a laptop, this system uses DDR3, with the best option being 32 GB (4 x 8G) 204-pin "quad channel" memory.
Mmm... if you might go up to 30 GB for a case, you might eventually see a need to go even further to 64GB of RAM... but I guess that if you ever need that, you will use a cluster or a server to do the mesh and calculations.

Quote:
Originally Posted by Chris Lee View Post
Now I don't understand well the architecture of how the RAM channels and CPU communicate, but I think you need to have at least 4 DIMM slots filled to get 4-channel functionality out of the RAM.
Yes.

Quote:
Originally Posted by Chris Lee View Post
The question is, am I not spending my $ efficiently if I go a number of cores greater than the number of channels in the memory? (If so, why would anyone ever go with more than 4 cores?)
I did a bit of lengthy mathematics on this topic yesterday: http://www.cfd-online.com/Forums/har...tml#post523825 - post #10

The essential concept is that more cores will each run slower, but each will also be responsible for crunching less RAM. Then you have to take into account the total available memory bandwidth. Beyond that, it starts depending on the complexity of your case... which is to say that in some crazy situations, overscheduling a 12 core machine with 18-36 processes might give slightly faster results, because of an alignment in memory accesses.

Quote:
Originally Posted by Chris Lee View Post
I'm guessing that as long as you have 4 DIMM slots filled (for any of these single physical CPUs) there is no bottleneck being made as in the example above with two physical CPUs. Is that right?
The idea is that each socket should use 4 DIMMs for itself. In your case, you only have 1 socket.

Quote:
Originally Posted by Chris Lee View Post
I was going to get a 10 core system (or 12 core, if I can find the budget for it) but I want to make sure I'm not throwing money away if I get more than 4 cores.
As I mentioned a bit above about the mathematics I did yesterday, it really depends. For example, if you search online for:
Code:
OpenFOAM xeon benchmark
I guess it's quicker to give the link I'm thinking of: http://www.anandtech.com/show/8423/i...l-ep-cores-/19
There you might find that a system with 12 cores @ 2.5GHz that costs roughly 1000 USD gives a better bang-for-your-buck than 8 cores @ 3.9 GHz that cost 2000 USD (not sure of the exact values). The 8 core system gives the optimum balance of RAM bandwidth and core efficiency, but the 12 core system costs a lot less and uses a lot less electrical power, while running at only 76% of the CPU compute performance of the 8 core system.
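To put the bang-for-your-buck comparison into numbers, here is a minimal Python sketch. It uses only the rough figures quoted above (1000 USD at 76% performance vs 2000 USD at 100%), which are approximate, so treat the result as an illustration and not as a benchmark.

```python
def perf_per_kusd(relative_perf, price_usd):
    """Relative compute performance obtained per 1000 USD spent."""
    return relative_perf / (price_usd / 1000.0)


# Rough figures from the post; the exact prices are not certain.
twelve_core = perf_per_kusd(0.76, 1000)  # 12 cores @ 2.5 GHz
eight_core = perf_per_kusd(1.00, 2000)   # 8 cores @ 3.9 GHz

print(f"12-core vs 8-core performance per USD: {twelve_core / eight_core:.2f}x")
# prints: 12-core vs 8-core performance per USD: 1.52x
```

This ignores electrical power and serial performance, which matter as well.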

In such a case, you might want to weigh in an additional and very important factor: how fast do you want your meshes to be generated, if they can only be generated in serial mode, not in parallel?

Quote:
Originally Posted by Chris Lee View Post
Note, I'm assuming the E5-2680 v2 is a "single CPU" with 10 cores, and so I would still have 4 channels of RAM available to all 10 cores, or in terms similar to yours above, I would still have the full 59.7 GB/s max memory bandwidth.
Yes, and at 32 GB of total RAM, that would equate to 3.2 GB per core at roughly 5.97 GB/s access speed per core.

For comparison, the i7-4960X with 6 cores would be using 32 GB, with 5.33 GB per core at roughly 9.95 GB/s.
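The per-core figures above are simply the totals divided evenly by the number of cores. A minimal Python sketch of that arithmetic, assuming (as the numbers in this post do) 32 GB of RAM and the same 59.7 GB/s maximum memory bandwidth for both CPUs:

```python
def per_core_share(total_ram_gb, total_bandwidth_gbs, cores):
    """Split total RAM and memory bandwidth evenly across the cores."""
    return total_ram_gb / cores, total_bandwidth_gbs / cores


for name, cores in [("E5-2680 v2", 10), ("i7-4960X", 6)]:
    ram, bandwidth = per_core_share(32, 59.7, cores)
    print(f"{name}: {ram:.2f} GB per core at {bandwidth:.2f} GB/s per core")
# prints:
# E5-2680 v2: 3.20 GB per core at 5.97 GB/s per core
# i7-4960X: 5.33 GB per core at 9.95 GB/s per core
```

An even split is an idealisation; real memory traffic depends on the case and on how the processes are scheduled.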

Now that I look more closely at the 3 CPUs you proposed for comparison, the only major differences are:
  1. How much maximum RAM do you really want to use?
  2. Are you willing to pay the extra cost for ECC memory? This can give you greater peace of mind when running CFD cases, but it will make a bigger hole in the wallet as well.
For 32GB of RAM, from these 3, I would vote for the i7-4960X, which could potentially be overclocked in situations where you need a little bit more performance and are willing to spend more electricity to achieve it... although on a laptop, this isn't easily achieved, and OC is a bit risky (namely it takes some time to master). Either way, it gives you roughly the same performance as the other 2 CPUs and you save a lot of money. Just make sure you keep your workplace clean and have your laptop's fans and heat-sinks cleaned once a year, to ensure that it's always properly cooled.



Quote:
Originally Posted by Chris Lee View Post
As a side question, with regard to the limiting factor in time to solution, I guess what I don't really know is how much time in the solution is spent with the cpu cores cranking away on the equations, vs updating the information in the RAM, . . . but I'll suppose for the time being that my CFD problem will be memory bandwidth limited. If you've got some rules of thumb on how to figure where the overall bottleneck is, i'd be most grateful.
Already mentioned in this post. Nonetheless, the primary rule of thumb is that it can strongly depend on the kind of simulations you need to perform. Some cases are easily parallelised, others aren't.
And don't forget about the time it takes to generate the mesh when using a CPU that has more cores but a lower top speed when running on a single core.


-------------------
@acasas:
Quote:
Originally Posted by acasas View Post
Hey Chris! You see? It was not bad hijacking your thread even by mistake. Now you can ask interesting things in mine and I dont mind ;-)
You might not mind, but others might and probably will. It's considerably hard to discuss two or more different topics in the same thread without losing track of who is asking and who is being answered. The only reason why I (as a moderator) haven't moved the posts in question is that it didn't seem to be a complete hijack and the details were still somewhat related.

Best regards,
Bruno

Old   January 13, 2015, 12:58
  #2
Member
 
 
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 13
Guys, check out Erik's benchmark thread:

http://www.cfd-online.com/Forums/har...quad-xeon.html

Old   January 15, 2015, 04:58
  #3
Member
 
 
Antonio Casas
Join Date: May 2013
Location: world
Posts: 85
Rep Power: 13
Hi guys, I came up with some results for what from now on I would like to call "Erik's Benchmark", which you can find at http://www.cfd-online.com/Forums/har...quad-xeon.html

Model:
Geometry: 1m x 1m x 5m long duct
Mesh: 100 x 100 x 500 "cubes" all 1x1x1cm (5M cells)
Flow: Default Water enters @ 10m/s at 300K, goes out other side at 0Pa. Walls are 400K.
High Resolution Turbulence and advection
Everything else default.
Double Precision: ON
20 iterations (you must reduce your convergence criteria or it will converge in fewer iterations)


I ran "Erik's Benchmark" on a single i7 3820 and on a dual Xeon E5-2650 v3, both under Windows 7 Pro 64-bit.
On the i7 3820 @ 3.6 GHz with DDR3 SDRAM PC3-12800 @ 800 MHz, with 4 real cores and 8 threads, and with affinity fully set, it took 1598 sec wall time.
On the dual Xeon E5-2650 v3, with 20 real cores, no hyper-threading, overclocking on, and DDR4-2133 RAM (1066 MHz), it took 533 sec wall time.
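To compare the two wall times, a minimal Python sketch of the speedup arithmetic follows. The "speedup relative to the core ratio" is just one rough way to normalise the result; the two machines differ in architecture, clock and memory, so it is not a true parallel efficiency.

```python
def speedup(t_reference_s, t_new_s):
    """How many times faster the new run is than the reference run."""
    return t_reference_s / t_new_s


i7_time_s, i7_cores = 1598, 4       # i7 3820, wall time from this post
xeon_time_s, xeon_cores = 533, 20   # dual E5-2650 v3, wall time from this post

s = speedup(i7_time_s, xeon_time_s)
core_ratio = xeon_cores / i7_cores

print(f"speedup: {s:.2f}x with {core_ratio:.0f}x the cores")
print(f"speedup relative to core ratio: {s / core_ratio:.2f}")
# prints:
# speedup: 3.00x with 5x the cores
# speedup relative to core ratio: 0.60
```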

On the dual Xeon, for other core counts, affinity was not set automatically, so the run times wouldn't be useful for this benchmark comparison. In some cases the computer made almost no progress until I manually set the affinity for every single core in the Task Manager for the "solver-pcmpi.exe" tasks.
If any of you would like me to run "Erik's Benchmark" on my dual Xeon with any core count other than 20 and post the results here, could you please explain how to set the affinity in advance, before running the test? Is there any way to program or define the affinity for solver-pcmpi.exe in advance?
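One possible approach, offered only as a sketch and not tested with this solver: the Windows "start" command accepts an "/affinity" option with a hexadecimal mask, one bit per logical core. The Python snippet below only builds that mask and the command line. The launcher name "my_solver_launcher.bat" is hypothetical, and whether the solver-pcmpi.exe processes actually keep the mask of the launcher that starts them is an assumption that would need checking. Note also that this restricts the whole job to a set of cores; it does not pin each process to its own core.

```python
def affinity_mask(cores):
    """Build a bitmask with one bit set per 0-based logical core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def start_command(cores, program):
    """Compose a Windows 'start /affinity' command line (mask in hexadecimal)."""
    return f"start /affinity {affinity_mask(cores):X} {program}"


# Hypothetical launcher script; restrict the job to logical cores 0-7.
print(start_command(range(8), "my_solver_launcher.bat"))
# prints: start /affinity FF my_solver_launcher.bat
```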

Thanks a lot