
One thread on two cores?

April 15, 2013, 10:15
New Member
Krzysztof Jagiełło
Join Date: Apr 2013
Location: Warsaw/Poland
Posts: 2
Rep Power: 0
kris_jag is on a distinguished road
Hi all,

We recently replaced our four self-built i7-980X 3.33 GHz computers with two Dell PowerEdge R820 servers (each with four Xeon E5-4650 2.70 GHz CPUs). The old i7 computers ran Windows Server 2008 R2 Standard; the R820s run Windows Server 2008 R2 Enterprise. MS HPC is installed on the R820s, but it is not used yet: we run the Fluent computations over Remote Desktop (RDP).

Unfortunately, computation times on the new servers are rather poor. According to the CFP2006 Rates benchmarks from SPEC.ORG (parallel floating-point calculations), the R820 should be significantly faster than the old computers (table.gif).
I know that they are only benchmarks, but...
At first our R820s were slower than the two-year-old i7s. After some BIOS and operating system tuning by our hardware provider they are more or less the same as the old machines (still a little slower), instead of 50-100% faster...

That was only background. The main reason for this post is a question: does anyone know why one thread would be computed by two cores?

One Fluent calculation runs as 4 fl_mpi1400.exe processes (processes.gif).

Each process has 7 threads, but only one of them is "power demanding" (threads.gif).
A processor load of 3.14% corresponds to 100% load on one core (32 cores in the system).
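
As a quick check of that percentage, here is a minimal sketch of the arithmetic. The function name `single_core_share` is only illustrative. One fully loaded core out of 32 works out to 3.125% of the total CPU, which is in line with the roughly 3.14% shown in the process properties.

```python
def single_core_share(total_cores: int) -> float:
    """Percentage of total CPU that one fully loaded core represents."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return 100.0 / total_cores


# 32 cores in one R820 (4 sockets x 8 cores, Hyper-Threading disabled)
print(f"{single_core_share(32):.3f}%")  # prints 3.125%
```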

But for these four processes/threads eight cores are used, at an average of 50% usage each (graphs.gif).

Why is one thread computed by two cores (at least it looks like that)? Does anybody know?

I suppose that could be a reason (or one of the reasons) for the poor performance.

Just one comment: Hyper-Threading is disabled on both the old and the new computers.

Thanks for any help/advice.
Attached Images
File Type: gif table.gif (10.6 KB, 19 views)
File Type: gif processes.gif (3.1 KB, 16 views)
File Type: gif threads.gif (6.0 KB, 20 views)
File Type: gif graphs.gif (13.5 KB, 16 views)

April 15, 2013, 12:58
Senior Member
Join Date: Mar 2009
Location: Austin, TX
Posts: 147
Rep Power: 11
kyle is on a distinguished road
You are seeing about what I would expect. Even though the 980X came out over three years ago, it is still pretty close to the fastest processor you can buy.

The bottleneck for CFD on unstructured meshes is typically random memory access speed, and your Dell machine has several things working against it in this area...

Both the i7 980X and the E5-4650 are capable of running one memory channel per two cores, but you are using a quad socket machine. That means you have to have 16 memory channels to feed your four CPUs as effectively as the three channel 980X. Your monster 4 socket motherboard might only be capable of running 8 memory channels, or you might not have 16 sticks of memory installed. If I had to guess, your new cluster is running with 8 channels per machine, or 16 total, vs 12 total channels of memory on the old cluster.
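
The channel counting above can be sketched as follows. This is only an illustration: the `Cluster` class and its field names are made up for the example, and the 8-channels-per-machine entry is the guess from this post, not a measured configuration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    machines: int
    cores_per_machine: int
    channels_per_machine: int  # populated memory channels per machine

    @property
    def total_channels(self) -> int:
        return self.machines * self.channels_per_machine

    @property
    def cores_per_channel(self) -> float:
        return self.cores_per_machine / self.channels_per_machine


clusters = [
    # i7-980X: 6 cores, 3 memory channels
    Cluster("old: 4 x i7-980X", 4, 6, 3),
    # R820: 4 sockets x 8 cores; 8 channels is a guess, not a measurement
    Cluster("new: 2 x R820, guessed 8 channels each", 2, 32, 8),
    # Same servers with all 16 channels populated
    Cluster("new: 2 x R820, 16 channels each", 2, 32, 16),
]

for c in clusters:
    print(f"{c.name}: {c.total_channels} channels total, "
          f"{c.cores_per_channel:.0f} cores per channel")
```

Under the guessed configuration each channel on the new servers feeds four cores instead of two, even though the total channel count is higher.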

But why is the 12 faster than the 16? Two more reasons. You likely have faster memory in the old cluster. The 980X can run overclocked, low latency memory, whereas the Xeon machine is likely using much slower registered ECC memory at a lower frequency. Multi-socket systems are also bad for random memory access. Each socket is directly attached to its own group of memory channels, but for the other 3/4 of the system's memory it must ask another processor to relay the data. The memory for the calculations each processor is handling may not be directly accessible by that processor, and this necessarily adds latency to memory access.
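
To put a rough number on the multi-socket penalty, here is a minimal sketch. It assumes the data is spread evenly over all sockets with no NUMA-aware placement, which is a pessimistic case, and the latency values of 100 ns and 200 ns are placeholders chosen for illustration, not measurements of either machine.

```python
def remote_fraction(sockets: int) -> float:
    """Share of memory that is NOT attached to a given socket,
    assuming memory is split evenly across sockets."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return (sockets - 1) / sockets


def mean_latency_ns(sockets: int, local_ns: float, remote_ns: float) -> float:
    """Average access latency if accesses hit all memory uniformly."""
    f = remote_fraction(sockets)
    return (1.0 - f) * local_ns + f * remote_ns


print(remote_fraction(4))  # 0.75, i.e. the "other 3/4" of memory
# Placeholder latencies: 100 ns local, 200 ns via another socket
print(mean_latency_ns(4, local_ns=100.0, remote_ns=200.0))
```

On a single-socket machine the remote fraction is zero, so every access pays only the local latency.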

Bottom line is consumer grade "gamer" hardware is significantly faster per dollar for CFD than the big $10,000 machines that Dell and HP want to sell you. CFD puts very different demands on a machine than most applications. Most applications are not memory access bound, so servers are not built to maximize memory access speed. For my startup I built a 15 node, 60 CPU core cluster of i7 machines for $12,000, and it calculates faster than a $100,000 cluster that Dell would spec for you.

Edit - And to answer your question about one thread appearing to be shared by two cores, this does slow down the calculations. What is likely happening is the thread is jumping around to different cores, which you obviously don't want. In Linux you can lock a thread to a specific core, but I am not sure how to do it in Windows. While this will help, I would not expect to see huge gains.
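
On the pinning point: Windows expresses processor affinity as a bitmask with one bit per logical processor. A minimal sketch of building such a mask follows. The helper name `affinity_mask` and the choice of cores 0, 8, 16 and 24 (one per socket, assuming cores are numbered socket by socket) are illustrative assumptions, not something taken from this thread.

```python
def affinity_mask(cores):
    """Build a processor affinity bitmask: bit n set means core n is allowed."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


# Hypothetical layout: one core on each of the four sockets
mask = affinity_mask([0, 8, 16, 24])
print(format(mask, "X"))  # prints 1010101
```

A hexadecimal value of this form is what `start /affinity` in cmd.exe accepts. Whether and how it can be applied to the fl_mpi processes started by Fluent's own launcher is not covered here.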

April 16, 2013, 11:01
New Member
Krzysztof Jagiełło
Join Date: Apr 2013
Location: Warsaw/Poland
Posts: 2
Rep Power: 0
kris_jag is on a distinguished road
Thanks Kyle for the explanation. I was not aware that memory access speed is so important; I thought the CPU was definitely the most important factor. We will have to consider our future computer upgrades more carefully...

But one thing still intrigues me: why are four processes/threads (whose properties indicate that each is computed at 100% of one core, i.e. 3.14% of total CPU) actually computed by eight cores (average 50% load on each)?

April 16, 2013, 20:57
Senior Member
Join Date: Feb 2011
Location: Earth (Land portion)
Posts: 696
Rep Power: 13
evcelica is on a distinguished road
My computer does the same thing: when I use 4 cores, it puts a partial load on all six cores, adding up to 66% of the CPU, rather than 4 cores at 100% and 2 cores sitting idle. I wouldn't worry about it; I'm sure it is supposed to act this way.
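
A toy model shows how both observations in this thread (4 busy threads on 6 cores at about 66%, and 4 busy threads on 8 cores at 50%) can come from threads moving between cores. It assumes the scheduler simply rotates the busy threads over the cores one time slice at a time, which is a deliberate simplification and not a description of how Windows actually schedules. The function name `average_core_load` is made up for the example.

```python
def average_core_load(threads: int, cores: int, slices: int):
    """Average load per core when `threads` always-busy threads are
    shifted by one core on every time slice (toy round-robin model)."""
    if threads > cores:
        raise ValueError("model assumes at most one busy thread per core")
    busy = [0] * cores
    for t in range(slices):
        for k in range(threads):
            busy[(t + k) % cores] += 1  # thread k runs on this core now
    return [b / slices for b in busy]


print(average_core_load(4, 8, 8))  # every core at 0.5
print([round(x, 3) for x in average_core_load(4, 6, 6)])  # every core at 0.667
```

In this model each thread still gets 100% of one core's worth of time; the load is just spread over more cores in the graphs.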