CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Hardware (https://www.cfd-online.com/Forums/hardware/)
-   -   AMD Ryzen Threadripper 1920X vs. Intel Core i7 7820X (https://www.cfd-online.com/Forums/hardware/194831-amd-ryzen-threadripper-1920x-vs-intel-core-i7-7820x.html)

bennn October 25, 2017 05:22

AMD Ryzen Threadripper 1920X vs. Intel Core i7 7820X
 
Hi all,

After all the talk about these two new CPU families, I had the opportunity to build two new stations, one with each.

AMD Ryzen Threadripper 1920X, 3.5 GHz, 12 cores, 24 threads, 658.25€ in France
http://www.amd.com/fr/products/cpu/a...adripper-1920x

Intel Core i7 7820X, 3.6 GHz, 8 cores, 16 threads, 541.58€ in France
https://www.intel.fr/content/www/fr/.../i7-7820x.html

The motherboard for AMD is 38 euros more expensive, the cooling 30 euros more, and the power supply needs to be bigger, so 15 euros more. Let's assume an overall cost of 658.25 + 38 + 30 + 15 = 741.25€ for AMD.

Both CPUs were tested with hyperthreading enabled.

They have the exact same memory fitted:
Corsair Vengeance LPX PC memory - DDR4 - 32 GB kit (4x 8 GB) - 3200 MHz - CL16
The memory was more than enough for all cases tested.

And the exact same drives. No overclocking was used.

The OpenFOAM results are:

Motorbike, simpleFoam (OF v5.0), on 6 cores

AMD : ExecutionTime = 153.31 s ClockTime = 155 s
Intel : ExecutionTime = 148.6 s ClockTime = 155 s

DTCHull, interDyMFoam (OF v1706), on 8 cores
AMD : ExecutionTime = 56577.9 s ClockTime = 56665 s
Intel : ExecutionTime = 52854.7 s ClockTime = 52888 s

If you compute a "euros × ClockTime / core" index (platform price in euros times the DTCHull ClockTime in seconds, divided by the number of physical cores), you get:

AMD : 3500244
Intel : 3580385
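
To spell out the arithmetic, a quick check with bc, using the DTCHull ClockTime values and the platform prices from above:

Code:

$ echo "741.25 * 56665 / 12" | bc   # AMD:   price x ClockTime / cores
3500244
$ echo "541.58 * 52888 / 8" | bc    # Intel
3580385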

So it is very close, but AMD is still a good choice.

I'd like to add that AMD temperature sensing was messy, with lm-sensors not reading it. But after managing to see the temperatures during the runs: Intel reached 70°C while AMD stayed at only 50°C.

flotus1 October 25, 2017 05:41

Thanks for sharing your results.
However, I am not quite convinced by your metric. So far, the Intel chip (let alone the platform) costs less and is faster. I would be more interested in a comparison running with the maximum number of physical cores available.
Which exact memory are you using? Did both cases fit in the memory?

bennn October 25, 2017 08:14

Well, my understanding is that, thinking in non-hyperthreaded terms, AMD can do one and a half DTCHull cases in 56,000 s, while Intel can do one in 52,000 s. Compared to the price paid, I think AMD is at least as efficient.

Oh, and by the way, the motherboard is 38 euros more expensive for AMD now. I should indeed add that.

I've updated my initial post with answers regarding the DIMMs.

I'm open to any feedback or tests that you think make sense.

flotus1 October 25, 2017 08:29

Quote:

Originally Posted by bennn (Post 669135)
AMD can do one and a half DTCHull cases in 56,000 s, while Intel can do one in 52,000 s. Compared to the price paid, I think AMD is at least as efficient.

:confused::confused::confused:
Because it still has 4 cores left idling? That seems like quite a daring extrapolation. Go ahead and try it, you might be surprised. CFD performance usually does not scale linearly with the number of cores. That's why I would be more interested in a comparison with the full number of physical cores: 12 for AMD, 8 for Intel.

bennn October 25, 2017 08:35

You understand, though, that I can't just increase the number of parallel domains for only one chip, otherwise the results are biased, right?

Is it OK for you if I concurrently launch 2 copies of the same motorbike case on 8 cores each on Intel, and 3 on AMD?

flotus1 October 25, 2017 08:45

Biased in which sense? Higher communication overhead due to a larger number of smaller domains? That is exactly why I always prefer a smaller number of faster cores over a larger number of slower cores.
Running several cases concurrently, the results will also be "biased" due to a lack of total memory bandwidth. Plus, you need 50% more memory in total if you want to run 50% more cases simultaneously, which increases the hardware cost.
When I need a result, I am interested in how fast my computer can provide it. Avoiding biases caused by parallel efficiencies <100% is usually the least of my worries and sounds more like cherry-picking to me.

lac October 26, 2017 10:52

I'm also interested in some results for these chips with some specific settings:
1. Hyperthreading turned off.
2. All cores are used on both cpus, but for only one job/CPU.
3. Run the parallel processes with affinity set (mpirun -np (number of cores) -bind-to hwthread); see the sketch below.
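
For what it's worth, a sketch of those settings on the two machines in question (assuming Linux and Open MPI; flag spellings differ slightly between MPI implementations):

Code:

# verify HT is off (should report 1):
$ lscpu | grep -i 'thread(s) per core'
# one job per machine, all physical cores, affinity set:
$ mpirun -np 12 --bind-to hwthread simpleFoam -parallel   # Threadripper 1920X
$ mpirun -np 8 --bind-to hwthread simpleFoam -parallel    # Core i7 7820X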

As I have read on this forum many times, and experienced myself too, hyperthreading is useless for CFD most of the time.
I think all cores should be used if possible. Of course it will be biased in some way, but you won't buy hardware with 12 cores to leave 4 idling.
Lastly, affinity will most likely help the AMD CPU, as due to its architecture it acts like multiple CPUs (considering the higher-latency communication between the different CCXes).
Also, I don't know if the different available instruction sets (AVX2 vs. AVX-512) influence the results, but it's possible that they do.

bennn October 26, 2017 11:57

Hi all, latest tests:

motorBike on all cores:
AMD : 113s
Intel : 135s

And now this is counter-intuitive to me: using --bind-to hwthread actually makes the computation take twice as long on AMD and 1.5× as long on Intel. Using --bind-to none solves the issue, and is the way to go for running several single-threaded jobs.

RobertB October 27, 2017 06:25

Perhaps a stupid question, but since you appear to have hyperthreading on: did you lock the processes to only the physical cores?

If it is half as fast, it looks like you might have locked to both the physical and the hyperthreaded cores and left half the physical cores unused.

IIRC (and I may not), you need to lock to every other core: 0, 2, ...

We always found core locking worked better on the Xeons, admittedly dual-processor systems, where a thread being pushed to another core would cause a major loss in cache efficiency.
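
If it helps, one way to check which logical CPUs are HT siblings, plus a binding that avoids them (a sketch, assuming Linux and Open MPI):

Code:

# show which logical CPUs share a physical core:
$ lscpu --extended=CPU,CORE
# "core" binding gives each rank a whole physical core (both of its
# hwthreads), so two ranks never end up as HT siblings:
$ mpirun -np 8 --bind-to core simpleFoam -parallel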

JBeilke October 30, 2017 08:17

Hi Benoit,

we ran the Motorbike case on a Xeon E5-1650 v3 (a 6-core processor) with hyperthreading turned off, on 6 cores, and got:

ExecutionTime = 167.03 s ClockTime = 169 s

How does this compare to your machines, with HT disabled?

Thanks
Jörn

lac October 30, 2017 10:44

Quote:

Originally Posted by RobertB (Post 669393)
Perhaps a stupid question, but since you appear to have hyperthreading on: did you lock the processes to only the physical cores?

You can try to run it with -bind-to core if HT was turned on. It would explain why you saw this slowdown.
On my WS the results (ClockTime, Motorbike case, OF v5) are:
73 s (with -bind-to hwthread)
110 s (without it)
The machine is:
Dual Xeon E5-2673 v3 (all-core turbo 2.7 GHz, 12 cores/CPU)
8x 8 GB single-rank DIMMs
HT off

bennn November 2, 2017 03:28

OK, so the results with HT off are exactly the same. With HT on, running with 8 or 16 processes on the Intel chip and 12 or 24 on the AMD chip all give the same results as well.

No improvement with any bind-to setting for now.

Testing multiple single CPU jobs in the next few days.

Simbelmynë November 24, 2017 10:51

Quote:

Originally Posted by lac (Post 669740)
You can try to run it with -bind-to core if HT was turned on. It would explain why you saw this slowdown.
On my WS the results (ClockTime, Motorbike case, OF v5) are:
73 s (with -bind-to hwthread)
110 s (without it)
The machine is:
Dual Xeon E5-2673 v3 (all-core turbo 2.7 GHz, 12 cores/CPU)
8x 8 GB single-rank DIMMs
HT off

Just curious: when you make comparisons using different decompositions of the motorbike case, how do you know that you are decomposing the domain similarly? Or is this just an indication of the performance of -bind-to hwthread?

lac November 27, 2017 07:33

I have used the same, default hierarchical decomposition (with n = (6 4 1)) with the same number of domains. So yes, it shows the 'performance' of process binding.
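
For reference, the relevant part of that dictionary looks roughly like this (a sketch of the tutorial's hierarchical setup scaled to 24 domains; the FoamFile header is omitted):

Code:

$ cat system/decomposeParDict
numberOfSubdomains 24;

method          hierarchical;

hierarchicalCoeffs
{
    n           (6 4 1);
    delta       0.001;
    order       xyz;
}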

Simbelmynë November 27, 2017 08:25

So do you time the simpleFoam execution, or everything in the Allrun script?

Using 14 threads on a 7940X (HT enabled), with decomposition (7 2 1), I have done some benchmarks.

Assuming you time only the simpleFoam run:

Code:

$ time mpirun -np 14 -bind-to none simpleFoam -parallel
Gives a real time of 117s.

Code:

$ time mpirun -np 14 -bind-to hwthread simpleFoam -parallel
Yields a real time of 150s.

A simple
Code:

$ time ./Allrun
Results in 157s of real time. (this is without -bind-to hwthread)

lac November 27, 2017 08:36

If you use -bind-to hwthread with HT turned on, I guess processes will be bound to the 'real' and the 'HT' cores alike. So it may be better to bind to cores. I only timed the simpleFoam execution, by the way.
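
For example, something like this (a sketch, Open MPI syntax):

Code:

# one rank per physical core, leaving the HT siblings free:
$ time mpirun -np 14 --bind-to core simpleFoam -parallel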

The_Sle January 23, 2018 22:07

Hi and thanks for this and other similar conversations, buying kit can be a pain without some information beforehand, and this forum eases that pain quite significantly :D

I'd like to add the overclocking capabilities of Skylake-X to this conversation. I recently purchased a 7820X and am running OpenFOAM with it quite successfully. My chip (like pretty much all of them) will run 4.5 GHz on all cores on air cooling with ease. This is of course true (with some limitations) of the i9 chips as well, and the results improve beyond their AMD counterparts.

With 32 GB of 3200 MHz memory, I can run the simpleFoam part of the motorBike tutorial in 121 seconds on 8 threads, which in my mind makes Skylake-X look better value than Threadripper, for OF use at least, when considering the disparity in motherboard and cooling costs.

Cheers

JBeilke January 24, 2018 01:01

Thanks for sharing the results. We usually used 6 cores for this benchmark, so that makes the results easier to compare.

It would be interesting to see some results from the Epyc for this benchmark.

Simbelmynë January 24, 2018 03:37

Thank you for sharing the OC results. Was it with Allrun or with just the solver?

The_Sle January 24, 2018 14:57

6 cores run in 134 seconds.

Both results are for just the solver, with

Code:

time mpirun -bind-to none -np 6 simpleFoam -parallel

Simbelmynë January 25, 2018 02:54

Nice, so you managed about a 12% speedup compared to the stock frequency.

Did you OC the memory as well, or did you run it at 3200?

Also, did you delid the CPU, or was it not necessary?

Did you change the voltage?

A fair comparison vs the Threadripper is to run the Threadripper @ 3.9 GHz, since that is a fair mark to achieve with OC (perhaps 4 GHz with some luck).

I have never ordered from Silicon Lottery, but they guarantee a certain OC potential from their offerings:

https://siliconlottery.com/collectio...intel-i7-7820x

Finally, since AMD is releasing Zen+ very soon, we might see a speedup of about 10% at stock operation in the AMD line before the summer.

Now, if only the DDR prices would go down.... :D

The_Sle January 25, 2018 13:46

The memory is a HyperX 3000 MHz CL15 kit, with XMP enabled and then a 3200 MHz overclock. It takes that speed no problem, which is nice considering it's about 50€ cheaper than "true" 3200 MHz kits :D

CPU delidding is not necessary (for the 8-core model at least) unless you are aiming for 5 GHz+. My Noctua D15 can't keep up after 4.7 GHz, but a proper AIO water cooler should be able to run 4.8, maybe 4.9 on a good chip. However, the system stability demands for CFD use are much higher than for pretty much any other use, which is why mileage varies greatly.

My chip runs stable at 4.5 GHz @ 1.15 V, or 4.7 GHz @ 1.25 V, at which point the temperatures become the issue before I can even test system stability properly: I can't run simpleFoam at all because the CPU immediately shuts down due to thermals :D

And regarding Zen+, the TR models of that architecture won't arrive for a while. And Intel will release the next generation of X299 CPUs late this year (or early next year, with the usual delays :p), keeping the competition hot. It's great that AMD can make proper CPUs again, as Intel can't milk its customers so badly now :rolleyes:

I'd say that right now the Skylake-X CPUs are excellent value for what they are, IF overclocked. TR is a better deal if you can find motherboards and memory at reasonable prices and don't want to mess with overclocking.

JBeilke January 26, 2018 09:43

Thanks for your work.

So we only get about a 25% improvement on 6 cores compared to the Xeon E5-1650 v3, and for that we need to overclock the machine :confused:

I still hope that someone from the Epyc camp might run this benchmark.

flotus1 January 26, 2018 11:32

I hear you. What I have done so far is install the latest OpenFOAM docker image and copy the motorbike files to a run directory. From here on I need specific instructions :confused:

Simbelmynë January 26, 2018 15:37

I can try to explain (without actually having access to any Linux box to verify my memory :p). You have all the information in the Allrun file. In order to run the test without the meshing etc., you can simply comment out the solver line and then execute Allrun. This will set up the case for the simpleFoam solver.

In the system folder, edit the file that ends with ".6" and set the correct number of processes (the default is 6). You also need to decide how to partition the domain (with 32 cores, you might go with 4 4 2, or 8 2 2). Save, and then also make a copy of the file with the ".6" ending removed (this is needed because simpleFoam looks for a file without the ".6" ending if you do not execute it from the Allrun script). The preparation steps are sketched below.
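
Roughly, as shell commands (a sketch; file names as in the v5 motorBike tutorial):

Code:

$ cd motorBike
# in Allrun, comment out the line that runs the solver ($(getApplication)):
$ ${EDITOR:-vi} Allrun
# set numberOfSubdomains and the n partitioning:
$ ${EDITOR:-vi} system/decomposeParDict.6
# simpleFoam looks for the un-suffixed name when run outside Allrun:
$ cp system/decomposeParDict.6 system/decomposeParDict
$ ./Allrun   # meshes and decomposes the case, but skips the solver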

After that you can simply run

Code:

time mpirun -np 32 simpleFoam -parallel
Update:

OK so I logged into one of my machines to look at my setup.

1. Edit decomposeParDict.6 and change the number of subdomains. Make sure that the product of the entries in "n" equals the number of subdomains.
2. Simply add "time" in front of the call to the solver. The call will then be:
Code:

time runParallel $decompDict $(getApplication)
(The original method is perhaps better if you wish to test bind-to core or bind-to hwthread.)

flotus1 January 26, 2018 17:10

OK thanks, maybe I got it right.
On 32 threads simpleFoam takes 74s
With bind-to core: 71s
With bind-to hwthread: 71s

A large portion of the time seems to be I/O. Load balancing might be far from optimal, and I have no idea if the case is large enough to scale to 32 cores.

JBeilke January 27, 2018 07:37

Dear Alex,

thanks so much for this. Could you please also run the 6-core variant? I'm still looking for a machine which might be suitable for my ccm+ simulations :-)

Best regards
Jörn

flotus1 January 27, 2018 09:18

All right, a fresh start with 6 threads:

no core binding: 153s
binding to the first 6 cores of one CPU: 196s
distributing across one CPU (2+2+1+1 threads per NUMA node): 166s
distributing across both CPUs (1 thread per NUMA node, two left idle): 159s

Run times may vary by up to 10s when repeating runs. A SATA SSD is used to store the data.

With no core binding option, the case ran on cores 1, 12, 16, 21, 24, 26.
The NUMA nodes are located on cores 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-28 and 29-32, if we assume the first core is number 1 instead of 0.
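
For comparison, that kind of placement can also be requested explicitly (a sketch using Open MPI's mapping options; their spelling and availability depend on the Open MPI version, and lscpu numbers cores from 0):

Code:

# one rank per NUMA node, round-robin, each bound to a core:
$ mpirun -np 6 --map-by numa --bind-to core simpleFoam -parallel
# or pin to an explicit core list:
$ mpirun -np 6 --cpu-set 0,4,8,12,16,20 --bind-to core simpleFoam -parallel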

So it may not come as a surprise that Epyc is not the best choice for low core count CFD simulations ;)

Edit: I forgot to mention that I changed one line in controlDict, "startFrom startTime;", in order to perform the same iterations for all re-runs. I hope this was appropriate.

JBeilke January 28, 2018 13:22

Many thanks Alex,

the 6-core variant is roughly in line with the other results posted across various threads: nearly the same speed as the Threadripper.

This seems to be OK, since you are using neither the fastest RAM nor the fastest CPU.

But your result with 32 cores (71-73 sec) is a bit disappointing. I would expect something around 40 seconds (30 sec for linear scaling).

So we might have a little problem there. One option might be that the case does not scale very well (decomposition method or computational overhead), or that the machine is already saturated at a lower core count than 32.

To check this you can try two tests:

  1. run several instances of the 6-core variant at the same time (see the sketch below), or
  2. check the speedup for 12 / 16 / 24 / ... cores and see where the efficiency drops.
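
A sketch for option 1, assuming the case has been copied into directories run1 ... run5, each already meshed and decomposed for 6 ranks:

Code:

$ for i in 1 2 3 4 5; do
>     ( cd run$i && mpirun -np 6 simpleFoam -parallel > log.simpleFoam 2>&1 ) &
> done
$ wait   # then compare each log's ExecutionTime with the single-instance baseline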

Best regards
Jörn

flotus1 January 28, 2018 14:13

I am pretty sure the poor scaling here is particular to this example. As stated earlier, there is quite some I/O overhead, load balancing might be far from ideal due to my non-existent OpenFOAM experience, and the case itself might be too small.
Running several instances seems to be the easiest way for me to get around all of these issues. I will give it a shot... maybe tomorrow. Is it OK if I use 4 threads per instance and then run 1-8 instances?
Edit: to be honest, though, I would expect pretty much linear scaling with this method, unless the I/O saturates my SATA SSD at some point.

JBeilke January 28, 2018 14:54

Alex, there is no need for urgent action :-)

I would stay with 6 cores and go to a maximum of 5 instances. We know from the theory of discrete simulations that whenever we try to use a finite resource up to 100%, the queue in front of it grows to infinity. So at something like 80% utilization we get the best throughput.

Unfortunately, many managers have never heard of this basic principle of resource planning and always try to use workers and machines up to 100%.

