CFD Online Discussion Forums

murx February 7, 2013 10:11

Intel Core-i7 Hyperthreading and CFX
 
1 Attachment(s)
Hey,

I've read several discussions about Hyperthreading (HT) in Intel Core-i7 processors and its impact on CFX performance. Since I purchased a new i7 and have enough licenses available, I ran a benchmark. I thought the results could be useful for others, so here they are.

The diagram below shows the relative speed on the y-axis. Relative speed is the inverse of the CFD solver wall clock time for the CFX benchmark case, normalized by the value for the i3770 with HT on and 1 process. The irregular trend is probably a result of rounding in the reported CFD solver wall clock time, which is only given to two significant figures.
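As a rough illustration of how the relative speed is formed, and why timings reported to two significant figures can make the curve look irregular, here is a minimal Python sketch. The wall clock times are made-up placeholders, not the measured benchmark values, and `round_sig` is just a helper invented for this example.

```python
from math import floor, log10

def round_sig(x, sig=2):
    """Round x to `sig` significant figures (mimics a coarsely reported time)."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

def relative_speed(times, reference):
    """Relative speed = reference time / time, so the reference run scores 1.0."""
    return {n: reference / t for n, t in times.items()}

# Hypothetical wall clock times in seconds, keyed by number of processes.
exact_times = {1: 1000.0, 2: 540.0, 3: 402.0, 4: 396.0}
reported_times = {n: round_sig(t) for n, t in exact_times.items()}

exact = relative_speed(exact_times, exact_times[1])
reported = relative_speed(reported_times, reported_times[1])

for n in sorted(exact):
    print(n, round(exact[n], 3), round(reported[n], 3))
```

With these placeholder numbers the runs with 3 and 4 processes are reported with the same time, so the plotted curve would show a flat step even though the underlying times differ.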

Bottom line is:

If HT is disabled, the maximum speed is achieved with 4 processes. This configuration is just as fast as with HT on and 6 processes. So you cannot improve the speed by disabling HT but you can save licenses.

Daniel C February 7, 2013 11:05

You do not have to disable HT at all. Just assign the real cores to the threads in the Task Manager. They can be identified by their number, namely core 0, 2, 4 and 6.

But I think this topic belongs to the hardware forum.

evcelica February 7, 2013 22:58

And how would you assign the real cores to the threads in the Task Manager?

Daniel C February 8, 2013 02:20

Quote:

Originally Posted by evcelica (Post 406646)
And how would you assign the real cores to the threads in the Task Manager?

This is quite easy. Go to the Task Manager and click on the Processes tab. Look for the ANSYS CFX solver process (or processes, if solving in parallel mode) and right-click on it. Select the Set Affinity command (I don't know the exact translation because I am using Windows 7 in German) and tick the desired core for each process.

You can separate the real cores from the hyperthreading cores if you shut down all of your applications and let the computer go idle. Then go to the Performance tab and click on "Resource Monitor...". Each hyperthreading core is marked as "Parked". Usually the cores with an even number (including 0) are the real cores.

It is a common mistake on a dual-core Intel CPU to end up using just one real core and its hyperthreading unit for two threads.
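For anyone who would rather not tick the boxes by hand for every run, the same selection can be written as an affinity bitmask. The Windows `start` command accepts such a mask in hexadecimal via `/affinity`. The sketch below only builds the mask. It assumes, as described above, that the even-numbered logical CPUs are the real cores, which should be checked on your own machine, and the solver command is a placeholder. Whether the solver's worker processes keep this affinity is also an assumption to verify.

```python
def affinity_mask(logical_cpus):
    """Return an integer bitmask with one bit set per selected logical CPU."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask

def even_cpus(n_logical):
    """Logical CPUs 0, 2, 4, ... (assumed here to be one per physical core)."""
    return list(range(0, n_logical, 2))

# Quad-core i7 with HT on: 8 logical CPUs, select 0, 2, 4 and 6.
mask = affinity_mask(even_cpus(8))
hex_mask = format(mask, "X")  # `start /affinity` expects the mask in hex

# Placeholder command line; substitute your own solver invocation.
command = f"start /affinity {hex_mask} <your solver command>"
print(command)
```

For 8 logical CPUs the mask has bits 0, 2, 4 and 6 set, i.e. binary 01010101.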

murx February 8, 2013 02:45

I see... good to know!
But I guess you have to do this again every time you start a new simulation, right? So this might be an option for someone running long simulations. For me it unfortunately is not, since I have to run a lot of simulations that each take only about 5-10 minutes.

ghorrocks February 8, 2013 04:30

In my experience assigning processes to cores just makes it go slower. Best to let the OS sort that out. But benchmark it and find out for yourself, things may have changed.

But if you have a quad core processor, don't be fooled by the hyperthreading virtual cores. You can only run at 4 processes with any reasonable efficiency, and in fact even the last physical core is of marginal benefit. But again, this is different for different machines so benchmark it on your machine and work out what works best on your system.

murx February 8, 2013 04:40

Quote:

Originally Posted by ghorrocks (Post 406684)
But if you have a quad core processor, don't be fooled by the hyperthreading virtual cores. You can only run at 4 processes with any reasonable efficiency, and in fact even the last physical core is of marginal benefit.

My experience tells me that you do get a significant speedup by increasing the number of processes beyond the number of physical cores when hyperthreading is active.
On the Ivy Bridge i7 that I used for this benchmark, the maximum performance was at 6 processes (see the chart attached to my first post), and on a Sandy Bridge i7 that I used some time ago, the maximum was at 7 processes.

ghorrocks February 8, 2013 04:47

This is where you have to factor in your specific case. In most commercial applications the additional speed from the last few processes is not worth the cost of the parallel license. In fact the license cost is many times the hardware cost. So the performance per $ puts the optimum at considerably fewer processes. In fact the optimum is often a single process (or maybe two) per machine and a fast network connection to run distributed parallel.

But for academic applications they often have lots of licenses (as they get them cheaply) so then it makes sense to run whatever is the fastest, even if the last few processes are not really adding much.
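To make the performance-per-$ argument concrete, here is a minimal sketch. All numbers (speedups, hardware cost, license cost) are hypothetical placeholders, and the cost model, one machine plus one license per process, is a simplification that may not match any real licensing scheme.

```python
def best_process_count(speedup, hardware_cost, license_cost_per_process):
    """Return (n, speed_per_cost) maximising speedup divided by total cost."""
    def value(n):
        return speedup[n] / (hardware_cost + n * license_cost_per_process)
    best = max(speedup, key=value)
    return best, value(best)

# Hypothetical speedups relative to one process on a quad-core machine.
speedup = {1: 1.0, 2: 1.8, 3: 2.3, 4: 2.6}

# Commercial case: each parallel license costs several times the hardware.
n_commercial, _ = best_process_count(speedup, hardware_cost=2000.0,
                                     license_cost_per_process=10000.0)

# Academic case: licenses are nearly free, so raw speed dominates.
n_academic, _ = best_process_count(speedup, hardware_cost=2000.0,
                                   license_cost_per_process=10.0)

print("commercial optimum:", n_commercial, "academic optimum:", n_academic)
```

With these placeholder figures the expensive-license case favours a single process per machine, while the cheap-license case favours using all four cores.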

Daniel C February 8, 2013 05:53

Quote:

Originally Posted by ghorrocks (Post 406684)
In my experience assigning processes to cores just makes it go slower. Best to let the OS sort that out. But benchmark it and find out for yourself, things may have changed...

If you don't assign the cores to the threads, you will find that e.g. Windows 7 will utilize just one core with its corresponding hyperthreading unit for two threads. That is a waste of one core. I have experienced it myself.

Moreover, it is inconvenient for me to disable hyperthreading each time I want to simulate a case, since I want to benefit from the additional performance that hyperthreading offers when I am working with other applications like PowerPoint.

ghorrocks February 8, 2013 06:03

I see. I have done exactly the same test years ago with earlier generations of hyperthreading and CFX and found that assigning processes to cores slows things down. Just goes to show you have to test these things for yourself on your system, as some systems behave totally differently to others.

Daniel C February 9, 2013 04:46

Regarding the additional speed I totally agree with ghorrocks and it really depends on the system.

I have an Ivy Bridge i3770 too, but I don't benefit from two additional threads compared to only four threads on my quad core. Rather, my simulation slows down if I use more threads than there are physical cores.

Nevertheless in the Internet I found:

>> Hyper-Threading is now called Simultaneous multithreading or SMT. Customers are recommended to leave SMT enabled on their systems but not over-subscribe physical cores for parallel simulations. While some improvement is possible, the extra performance from the virtual threads is not cost-effective and incommensurate with the additional license costs (which are per process).

Basically, if a section of the CPU core is not being used it tries to run a second task on these sections. For example, if one process only needs to do floating point operations while another only needs to do integer operations they can run both concurrently. For FLUENT, there is no consequence to performance if it is turned off. If SMT is on, and you run 16x (instead of 8x; assuming dual cpu quad-core nodes), you can get an additional 20% or so (compared to 8x) improvement. This is not recommended since you only get 20% more for 2x licenses (license is per process). In this scenario, leave SMT on and run 8 way. This is the recommended approach <<

This comes from

http://www.simutechgroup.com/Technic...e-support.html
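The "20% more for 2x licenses" arithmetic in the quote can be checked with a few lines of Python. This is only a sketch of that one calculation; the 1.0 baseline and the assumption of exactly one license per process are taken from the quote, and the function name is made up for the example.

```python
def speed_per_license(relative_speed, processes):
    """Relative speed obtained per license, assuming one license per process."""
    return relative_speed / processes

# Figures from the quoted text: 8 processes as baseline, 16 processes ~20% faster.
base = speed_per_license(1.0, 8)
oversubscribed = speed_per_license(1.2, 16)

ratio = oversubscribed / base  # value per license relative to the 8-way run
print(f"Each license delivers {ratio:.0%} of what it does in the 8-way run")
```

So under the quoted figures each license in the 16-way run is worth roughly 0.6 of a license in the 8-way run.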

Shawn_A March 9, 2013 11:16

I've done the same tests myself. CFX code has been pretty well optimized for parallel use. Hyperthreading, as you've said, attempts to make use of idle CPU resources by utilizing an independent front end of each physical core to prepare data for the shared math unit of each physical core, but, due to the parallel efficiency of the CFX solver, there is basically no idle CPU time.

I've found that if you have, for example, a quad-core CPU with 4 physical cores and 8 logical cores with hyperthreading enabled, there will be a VERY VERY small, if any, performance difference between running a simulation with 4 cores and hyperthreading off and with 8 cores and hyperthreading enabled. Also, you will get VERY non-linear speedup from additional cores with hyperthreading enabled.

As Glenn said, if your simulations are not limited by the available licenses, you can leave hyperthreading enabled, but you MUST run your job with ALL the cores available on your system. If you have a limited number of licenses, then DISABLE HYPERTHREADING. Also, if you want to run MULTIPLE simultaneous jobs, DISABLE HYPERTHREADING.

oj.bulmer March 15, 2013 06:13

1 Attachment(s)
I have done this benchmarking of CFX and Fluent, although not with hyperthreading but on physical cores. I thought it might be useful. The cores are physical cores of a cluster that has 4 "boxes", each with a quad-core Intel i7-2600 processor and 16 GB RAM, connected to each other by InfiniBand SDR 4X using RDMA at 10 Gbit/s (latency ~5 microseconds).

The mesh was roughly 4 million and the physics, the same for both codes, was: porous region, second order discretization schemes, 2-equation models. Attached is a snapshot of the results. It is evident that Fluent is not only faster for the same physics and computational resources, but also a lot more efficient in leveraging multicore processing. Nearly twice as efficient :cool:

Agreed, CFX, being a coupled solver, reaches convergence faster (in a smaller number of iterations). Yet the difference is large.

OJ

ghorrocks March 16, 2013 06:06

These results are strange. CFX usually parallelises very well when properly set up. I do not think this result is typical, and I suspect something is wrong with your benchmark.

oj.bulmer March 16, 2013 08:56

The time study was done on one of the models. I have observed the same trend in numerous other models - with smaller and bigger meshes than the one used for this study. Typically, 2000 iterations of Fluent used to be finished within say 2-3 hours. CFX did close to 1000 (+/- 200) iterations in the same timeframe.

Is it fair to say that the porous jump in Fluent and porous interface in CFX have different computational requirements?

OJ

ghorrocks March 17, 2013 05:42

Your other results are as expected:
* CFX takes much longer per iteration than Fluent (this is because the coupled solver in CFX is much more complex than Fluent's default SIMPLE based solver)
* CFX converges faster than Fluent (again, due to CFX's coupled solver).

So your comment that CFX does half as many iterations as Fluent in the same time is as expected.

The comment I am surprised about is your comment that the parallel speedup factor is much lower for CFX than it is for Fluent. They should both be similar, and for hardware with few bottlenecks should be close to ideal speedup. If you are reporting CFX is off ideal scaling then I suspect either your benchmark is dodgy or the result is throttled by your hardware somehow.

oj.bulmer March 17, 2013 10:20

Quote:

So your comment that CFX does half as many iterations as Fluent in the same time is as expected.
Agreed, I should have limited the comparison to each code's efficiency in leveraging more cores rather than the iteration counts, which would be obvious as you stated. The additional bits of information do distract from the message here.

Quote:

The comment I am surprised about is your comment that the parallel speedup factor is much lower for CFX than it is for Fluent
Well, the increase in speed when the processing power was quadrupled (4 to 16 cores) in my exercise was 3.5 times, which is actually more than in murx's exercise, 2.6 times, when he quadrupled the processing power (1 to 4 cores). Now I know that the relationship is not exactly linear, and that the speedup curve is steeper at lower core counts. So although I didn't benchmark 1 to 4 cores, I suspect the speedup may exceed 3.5 in that range for my case. By that logic, isn't murx's speedup factor smaller than you'd think?

Or, is it that FLUENT's pace compared to CFX is surprising?
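To put the two figures side by side, parallel efficiency can be computed as the measured speedup divided by the factor by which the core count was increased. A small sketch using only the numbers quoted in this thread (the function name is made up for the example):

```python
def parallel_efficiency(speedup, resource_factor):
    """Fraction of ideal scaling achieved: measured speedup / ideal speedup."""
    return speedup / resource_factor

# Figures quoted in this thread.
cluster = parallel_efficiency(3.5, 16 / 4)    # 4 -> 16 cores across four machines
single_cpu = parallel_efficiency(2.6, 4 / 1)  # 1 -> 4 cores on one quad-core CPU

print(f"cluster: {cluster:.1%}, single CPU: {single_cpu:.1%}")
```

Note that the two cases scale in different ways (adding machines versus adding cores on one CPU), so the efficiencies are not necessarily measuring the same bottleneck.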

OJ

ghorrocks March 17, 2013 18:17

Doing a 4-way simulation on a single quad core CPU will end up with a speedup factor around 2.5. This is due to memory bottlenecks on the CPU and motherboard and has little to do with the software. You will find that running 4 totally independent processes simultaneously on a quad core CPU gives about 2.5 times the throughput of a single process.

Be aware of the new Intel technology, I forget its name, where it runs at a higher CPU clock speed when running single core versus multi core. This can distort speedup benchmarks.

To get speedups in the range of 3.5 and higher out of an ideal factor of 4, you need to remove the CPU/motherboard memory bottleneck. An easy (but expensive) way of doing this is to run 4 machines, each using a single core of its CPU. Note you will also need a reasonable network for this to work. Under this setup I would expect both CFX and Fluent to have speedup efficiencies of 95% for the simulation size you have here.
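The clock-speed effect mentioned above can be estimated roughly. The sketch below rescales a measured speedup to what it would be if the single-process baseline had run at the same clock as the multi-process run. It assumes run time scales inversely with clock frequency, which is optimistic for a memory-bound solver, and the clock values are hypothetical placeholders rather than measured ones.

```python
def clock_corrected_speedup(measured_speedup, single_core_ghz, all_core_ghz):
    """Speedup the run would show if the 1-process baseline ran at the all-core clock.

    Assumes run time scales inversely with clock frequency.
    """
    return measured_speedup * single_core_ghz / all_core_ghz

# Hypothetical clocks: 3.9 GHz with one busy core, 3.7 GHz with all four busy.
corrected = clock_corrected_speedup(2.5, single_core_ghz=3.9, all_core_ghz=3.7)
print(f"measured 2.5x, clock-corrected {corrected:.2f}x")
```

The correction is small next to the memory bottleneck described above, but it pushes in the same direction: the uncorrected benchmark understates how well the code itself scales.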

