
OpenFOAM benchmarks on various hardware

Old   November 20, 2024, 12:37
Default
  #821
New Member
 
DS
Join Date: Jan 2022
Posts: 15
Rep Power: 5
Crowdion is on a distinguished road
Asus ROG STRIX G713PV, Ryzen 9 7945HX, 2 x 48 GB DDR5-5600,
OpenFOAM v2406 (precompiled), Ubuntu 24.04.1, Motorbike_bench_template.tar.gz (default settings)

Wall clock times (real):
# cores | Meshing (s) | Solver (s)
----------------------------------
  1     |    399      |   513
  2     |    278      |   278
  4     |    177      |   167
  6     |    141      |   143
  8     |    120      |   142
 12     |    112      |   139
 16     |    108      |   139
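
For anyone wanting to reproduce a core-count sweep like the one above: the utilities named below (foamDictionary, decomposePar, snappyHexMesh, simpleFoam) are the standard OpenFOAM ones, but the script itself is only a rough sketch. The benchmark template's own run scripts handle the full sequence (blockMesh, surface features, resetting the case between runs) and should be used for numbers comparable with this thread.

Code:
# Rough, illustrative sweep driver (see note above). Assumes an OpenFOAM
# environment is sourced and the benchmark case is already prepared
# (background mesh built, system/decomposeParDict present).
import subprocess, time

def timed(cmd):
    t0 = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - t0

for n in (1, 2, 4, 6, 8, 12, 16):
    # set the decomposition size, then decompose, mesh and solve in parallel
    subprocess.run(f"foamDictionary -entry numberOfSubdomains -set {n} "
                   "system/decomposeParDict", shell=True, check=True)
    timed("decomposePar -force")
    mesh  = timed(f"mpirun -np {n} snappyHexMesh -overwrite -parallel")
    solve = timed(f"mpirun -np {n} simpleFoam -parallel")
    print(f"{n:2d} cores | mesh {mesh:6.0f} s | solve {solve:6.0f} s")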
Crowdion is offline   Reply With Quote

Old   December 21, 2024, 01:45
Default
  #822
New Member
 
Join Date: Nov 2016
Posts: 16
Rep Power: 10
nmc1988 is on a distinguished road
Quote:
Originally Posted by gumersindu View Post
Hi all,

I finally modified the motorbike tutorial to match the same configuration as in the benchmark from the original post. I've attached the modified code which worked for v2312.

These are the results I got for this PC config: HP Z840 | 2 x Intel Xeon E5-2690 v4 (14 cores, 2.6 GHz, 35 MB L3) | 8 x 16 GB DDR4 ECC | 1 TB HDD | Ubuntu 24.04 LTS

Code:
cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      1403.79         1098.68        
2      949.89          551.16         
4      495.73          246.11         
6      361.35          163.72         
8      293.58          128.46         
12     244.06          99.28          
16     229.99          84.12          
20     186.59          78.14          
24     183.3           74.44          
28     177.25          72.7
This is still a good investment for a budget under 1000 USD. I have a similar PC and have been thinking about replacing it, but I still haven't found a better performance/price option.
nmc1988 is offline   Reply With Quote

Old   December 26, 2024, 00:18
Talking What's the deal with Apple Silicon fanboyism for CFD?
  #823
Senior Member
 
Sultan Islam
Join Date: Dec 2015
Location: Canada
Posts: 145
Rep Power: 11
EternalSeekerX is on a distinguished road
It seems a lot of people (on Reddit) and some here praise Apple Silicon for CFD because of its memory bandwidth. However, I have only ever seen OpenFOAM numbers, none from Fluent (built-in solvers and/or custom solvers with UDFs) or STAR-CCM+. I can use the OpenFOAM numbers (I use OpenFOAM too) to guesstimate the performance, but I'm not seeing what the craze over Apple Silicon is about. I looked into older posts in this thread and saw that someone with both an M1 (Pro? Max?) and a 13900K reported 81 seconds on Apple Silicon, while the 13900K beat it with a run time in the high 70s. Most EPYCs here beat Apple Silicon; however, I'm not seeing any numbers for the latest AMD desktop or laptop parts (most newer ones should easily beat or come close to the 13900K).

I believe I also saw a post with results from a non-Apple ARM system, which seems to show that ARM does not do as well in floating point as x86_64. Is this true?

I know I'd need to stick with x86_64 for CAD/pre-processing, but is ARM really good for CFD at this point, or is it all just fluff?
EternalSeekerX is offline   Reply With Quote

Old   December 28, 2024, 19:42
Talking Follow-up
  #824
Senior Member
 
Sultan Islam
Join Date: Dec 2015
Location: Canada
Posts: 145
Rep Power: 11
EternalSeekerX is on a distinguished road
I was seriously looking into Apple Silicon. I use OpenFOAM, and my prof also has a COMSOL license. However, I haven't seen any numbers for Fluent or STAR-CCM+ running on Apple Silicon (I know it would need a Linux VM with box64 to work). I was also looking for any numbers on CAD running in a VM, but found none.

One thing to note: I feel Apple Silicon is also overhyped (for CFD at least). Here is a good video about it: https://youtu.be/fdvzQAWXU7A?feature=shared

It seems the reported bandwidth is mostly available to the GPU; the CPU gets less. This shouldn't matter as much for LLM and AI tasks, but it's interesting.

For OpenFOAM results here, I've seen some desktop CPUs still beat it (even older Intel ones). There isn't much on laptop CPUs here.


Here is a COMSOL comparison:
https://forum.level1techs.com/t/cfd-...256/185?page=5

It seems that with memory tuning and PBO, x86_64 chips are still faster in CFD (both in the pure CFD benchmark and the coupled EM one).

It also seems that ARM might not be as strong in double-precision floating point, and it lacks AVX-512. I can't argue with the efficiency of ARM chips, though.

I'm curious how well the upcoming AMD Strix Point APUs will handle this; they could actually make for a good x86_64 CFD laptop.

Also, a lot of older NVIDIA server cards have superior FP64 performance and decent memory bandwidth too, so I guess I can't really understand the Apple Silicon hype for CFD.

That said, I'm still tempted by the M4 mini for everything else, haha.
EternalSeekerX is offline   Reply With Quote

Old   December 30, 2024, 05:40
Default
  #825
New Member
 
Join Date: Dec 2024
Posts: 1
Rep Power: 0
WolSch is on a distinguished road
Nothing new, just to complete the picture. Running with OF10.
OS: Ubuntu 24.04 (needed to compile OF10 for that)
Board: Gigabyte MZ73-LM0-000
CPU: 2 x EPYC 9634 84-core processors (168 cores total), 384 MB L3 and 290 W TDP each
Mem: 24 x 16 GB DDR5-4800
BIOS settings at default (except NPS4 for NUMA) and no additional mpirun options such as binding or ranking, meaning SMT = ON and not the maximum-performance settings. The latter (max. performance) actually seems to be a disadvantage for longer runs lasting hours or days; the plain BIOS default is faster and less noisy (air cooling). Maybe thermal throttling, I don't know, although the BMC says everything is fine thermally (I guess that's material for a different thread).
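
For readers unfamiliar with the binding/ranking options mentioned above, a purely illustrative helper is shown below; the flags are standard Open MPI options, the helper itself is hypothetical, and none of this was used for the runs that follow.

Code:
# Illustrative only: the kind of Open MPI binding/ranking flags referred to
# above (not used for the runs below). mpirun_cmd is a hypothetical helper.
def mpirun_cmd(nprocs, solver="simpleFoam", bind=False):
    flags = "--bind-to core --map-by numa --rank-by core " if bind else ""
    return f"mpirun -np {nprocs} {flags}{solver} -parallel"

print(mpirun_cmd(96, bind=True))
# mpirun -np 96 --bind-to core --map-by numa --rank-by core simpleFoam -parallel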


# cores Wall time (s):
------------------------
1 640.831
6 76.3422
12 37.8541
24 19.9972
32 16.1443
48 11.6251
64 9.70698
72 9.21079
96 8.28556
120 8.4395
144 8.13012
168 8.16584
Crowdion likes this.
WolSch is offline   Reply With Quote

Old   January 16, 2025, 11:19
Default
  #826
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Quote:
Originally Posted by EternalSeekerX View Post
It seems a lot of people (on Reddit) and some here praise Apple Silicon for CFD because of its memory bandwidth. However, I have only ever seen OpenFOAM numbers, none from Fluent (built-in solvers and/or custom solvers with UDFs) or STAR-CCM+. I can use the OpenFOAM numbers (I use OpenFOAM too) to guesstimate the performance, but I'm not seeing what the craze over Apple Silicon is about. I looked into older posts in this thread and saw that someone with both an M1 (Pro? Max?) and a 13900K reported 81 seconds on Apple Silicon, while the 13900K beat it with a run time in the high 70s. Most EPYCs here beat Apple Silicon; however, I'm not seeing any numbers for the latest AMD desktop or laptop parts (most newer ones should easily beat or come close to the 13900K).

I believe I also saw a post with results from a non-Apple ARM system, which seems to show that ARM does not do as well in floating point as x86_64. Is this true?

I know I'd need to stick with x86_64 for CAD/pre-processing, but is ARM really good for CFD at this point, or is it all just fluff?
Not sure about the "hype" for Apple Silicon CFD. The numbers are available in this thread.

I would say that MacBooks are S-tier for CFD while still maintaining the traits that are important in a laptop. While there is no report on the M4 Max here, we have an M3 Max laptop doing the benchmark in 63 seconds - running on battery. I doubt the x86_64 camp has any similar offering. If you need something with wide compatibility, though, then macOS is the lowest tier.

Unfortunately, old Macs retain their value too well. Otherwise I am pretty sure I would have a few Mac Studio M1 Ultras (about 40 s on this benchmark) sitting on my desk with a Thunderbolt interconnect. Dead silent and each drawing approximately 100 W at full load. The fans on my 13900K ramp up just from doing a regular system update. As soon as I get some spare time, that space heater is going into the server room.

My 2c
Simbelmynė is offline   Reply With Quote

Old   January 26, 2025, 04:00
Default New Ryzen AI
  #827
New Member
 
Kaissar Nabbout
Join Date: Feb 2022
Posts: 2
Rep Power: 0
kaissar is on a distinguished road
I am not an expert in this area, so I would like to ask anyone who knows this better to help me understand.

I have seen that the new AMD AI chips will have something like 256 GB/s of memory bandwidth, which is pretty good for an x86 consumer product that can natively run Linux (I know Apple offers more, but it is not native Linux and it is also extremely pricey in my opinion).

Since OpenFOAM simulations scale directly with memory bandwidth (according to all the results I have seen so far), I would like to know whether the bandwidth they announce is really what we can get in our simulations, or whether there is something behind it that lowers the performance. I am asking also because everybody announces these chips with LPDDR5 memory, and according to my research those memories have a narrower bus; I don't know whether this affects performance. Besides that, I got suspicious about this bandwidth figure because it would mean a consumer product basically has more bandwidth than a single EPYC 7003 CPU.

Thanks a lot in advance to anyone who can help me understand this better.
kaissar is offline   Reply With Quote

Old   January 26, 2025, 04:20
Default
  #828
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Quote:
Originally Posted by kaissar View Post
I am not an expert in this area, so I would like to ask anyone who knows this better to help me understand.

I have seen that the new AMD AI chips will have something like 256 GB/s of memory bandwidth, which is pretty good for an x86 consumer product that can natively run Linux (I know Apple offers more, but it is not native Linux and it is also extremely pricey in my opinion).

Since OpenFOAM simulations scale directly with memory bandwidth (according to all the results I have seen so far), I would like to know whether the bandwidth they announce is really what we can get in our simulations, or whether there is something behind it that lowers the performance. I am asking also because everybody announces these chips with LPDDR5 memory, and according to my research those memories have a narrower bus; I don't know whether this affects performance. Besides that, I got suspicious about this bandwidth figure because it would mean a consumer product basically has more bandwidth than a single EPYC 7003 CPU.

Thanks a lot in advance to anyone who can help me understand this better.
I would not purchase anything "new" without it being tested first. NVIDIA has similar systems and they do look interesting. Picking from the benchmark list would be much lower risk imho. Used/refurbished systems are by far the best bang for the buck.
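
As a rough sanity check on the headline figure: assuming the 256 GB/s comes from LPDDR5X-8000 on a 256-bit bus, and taking an EPYC 7003 with 8 channels of DDR4-3200 for comparison, the peak numbers do work out. What fraction of that a CFD solver actually sustains is another matter.

Code:
# Peak (theoretical) DRAM bandwidth from transfer rate and bus width.
# Sustained, CPU-visible bandwidth in a solver is always lower.
def peak_gb_s(mt_per_s, bus_bits):
    return mt_per_s * 1e6 * bus_bits / 8 / 1e9

print(peak_gb_s(8000, 256))     # LPDDR5X-8000, 256-bit bus  -> 256.0 GB/s
print(peak_gb_s(3200, 8 * 64))  # EPYC 7003, 8 ch DDR4-3200  -> 204.8 GB/s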
Simbelmynė is offline   Reply With Quote

Old   January 29, 2025, 07:32
Default
  #829
New Member
 
Kevin Nolan
Join Date: Nov 2012
Posts: 18
Rep Power: 14
Kolan is on a distinguished road
Apple Mac mini with M4 Pro, 12 cores (4E + 8P), 48 GB of RAM, on macOS 15.3.

OpenFOAM v2412 compiled natively.

Code:
cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      407.23          334.47         
2      292.06          196.36         
4      181.5           101.12         
6      135.4           79.75          
8      128.51          62.74          
12     162.66          95.73
Kolan is offline   Reply With Quote

Old   January 29, 2025, 08:07
Default
  #830
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Quote:
Originally Posted by Kolan View Post
Apple Mac mini with M4 Pro, 12 cores (4E + 8P), 48 GB of RAM, on macOS 15.3.

OpenFOAM v2412 compiled natively.

Code:
cores  MeshTime(s)     RunTime(s)     
-----------------------------------
1      407.23          334.47         
2      292.06          196.36         
4      181.5           101.12         
6      135.4           79.75          
8      128.51          62.74          
12     162.66          95.73
So roughly 85% of the M3 Max (12 P-core version), despite the bandwidth gap of 409 vs 273 GB/s. Perhaps Apple has improved how much of the total memory bandwidth the CPU can address?

How are the thermals and noise when you run it for an extended period of time? The case size seems silly at this point; I would rather have a low-noise fan and a larger heat sink while retaining the old form factor. Perhaps it is still a non-issue, though?
Simbelmynė is offline   Reply With Quote

Old   January 29, 2025, 08:08
Default
  #831
New Member
 
Kevin Nolan
Join Date: Nov 2012
Posts: 18
Rep Power: 14
Kolan is on a distinguished road
It's very quiet; my 8-bay NAS next to me hums louder.
Kolan is offline   Reply With Quote

Old   January 29, 2025, 08:24
Default
  #832
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Nice to hear! Do you know the power draw when you run on the 8 performance cores? (I use iStats for monitoring; not sure how accurate it is, though, as it uses the internal sensors.)

Now, if we can get some results on how well Thunderbolt works as an interconnect between these, then there is a clear upgrade path - just purchase another unit.
bigphil and Kolan like this.
Simbelmynė is offline   Reply With Quote

Old   January 29, 2025, 09:37
Default
  #833
Senior Member
 
andy
Join Date: May 2009
Posts: 349
Rep Power: 19
andy_ is on a distinguished road
Interesting result showing a parallel efficiency above 50% with 8 cores. There's no increase in efficiency from cache effects, as one would expect from a server with a better memory system, but nor is there the rapid collapse in parallel efficiency with core count that one usually sees with consumer hardware.

Code:
Apple M4 Mac mini M4 Pro 12 cores (4e 8p)
Cores  Time     Efficiency     
-----------------------------
1      334.47      1.0
2      196.36      0.85
4      101.12      0.83
8       62.74      0.67
In comparison with earlier

Code:
Apple M4 Mac mini base model 4P+6E
Cores  Time     Efficiency     
-----------------------------
1      315.54     1.0
2      191.29     0.82
4      118.64     0.66
8      111.61     0.35
Code:
Apple Macbook Pro with M1 Max
Cores  Time    Efficiency     
-----------------------------
1     433.18     1.0
2     240.02     0.90
4     135.12     0.80
8      85.57     0.63
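
For reference, the efficiency column in these tables is the usual Eff(N) = T(1) / (N * T(N)); a minimal sketch reproducing the M4 Pro column from the raw wall times above:

Code:
# Parallel efficiency Eff(N) = T(1) / (N * T(N)), using the M4 Pro wall
# times quoted above.
times = {1: 334.47, 2: 196.36, 4: 101.12, 8: 62.74}
t1 = times[1]
for n, t in times.items():
    print(f"{n} cores: eff = {t1 / (n * t):.2f}")  # 1.00, 0.85, 0.83, 0.67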
Let's hope results like this continue to put pressure on the price of current server chips, bringing them down from their grossly inflated post-COVID values to something more in line with pre-COVID prices. The supply of reasonably priced second-hand dual-Xeon workstations isn't going to last forever.
andy_ is offline   Reply With Quote

Old   January 29, 2025, 10:06
Default
  #834
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Interesting comparison. Perhaps the 8-core base M4 Mac mini result is not so representative, though, as it only has 4 performance cores; running on E-cores naturally tanks the scaling. I wonder how much power draw the 4 E-cores add for the minimal gain from 4 P-cores (118 s) to 4P + 4E cores (111 s).
Simbelmynė is offline   Reply With Quote

Old   February 4, 2025, 01:47
Default
  #835
Member
 
Yan
Join Date: Dec 2013
Location: Milano
Posts: 48
Rep Power: 13
aparangement is on a distinguished road
Send a message via Skype™ to aparangement
Quote:
Originally Posted by Simbelmynė View Post
So roughly 85% of the M3 Max (12 P-core version), despite the bandwidth gap of 409 vs 273 GB/s. Perhaps Apple has improved how much of the total memory bandwidth the CPU can address?

How are the thermals and noise when you run it for an extended period of time? The case size seems silly at this point; I would rather have a low-noise fan and a larger heat sink while retaining the old form factor. Perhaps it is still a non-issue, though?
I guess the M4 has a larger L3 (whatever Apple calls it); this could have a major impact on realistic bandwidth.

The M3 also has a larger L3 compared with the M2.
aparangement is offline   Reply With Quote

Old   February 4, 2025, 05:40
Default
  #836
Senior Member
 
Simbelmynė's Avatar
 
Join Date: May 2012
Posts: 563
Rep Power: 16
Simbelmynė is on a distinguished road
Not sure about the cache. The L2 did not change between the M2 and M3, and, though I'm not certain, it seems the M4 also uses a 16 MB shared L2.

I cannot find a source on the CPU-addressable memory bandwidth across generations, but I remember the M3 having a higher figure than the M1/M2.

https://www.anandtech.com/show/21387...ts-on-ipad-pro
Simbelmynė is offline   Reply With Quote

Old   February 4, 2025, 21:19
Default
  #837
Member
 
Yan
Join Date: Dec 2013
Location: Milano
Posts: 48
Rep Power: 13
aparangement is on a distinguished road
Send a message via Skype™ to aparangement
It's quite possible that the L3 (the SLC?) of the M3 is larger than that of the M2 and M1.
https://forums.macrumors.com/threads.../post-32845968

The L3 sometimes plays an important role in realistic memory bandwidth; at least this is true for AMD's Zen processors.
I have no idea about the hardware design, but all the benchmark results here seem to agree with this.

Hopefully the M4 enlarges the L3 again.

Quote:
Originally Posted by Simbelmynė View Post
Not sure about the cache. The L2 did not change between the M2 and M3, and, though I'm not certain, it seems the M4 also uses a 16 MB shared L2.

I cannot find a source on the CPU-addressable memory bandwidth across generations, but I remember the M3 having a higher figure than the M1/M2.

https://www.anandtech.com/show/21387...ts-on-ipad-pro
aparangement is offline   Reply With Quote

Old   February 5, 2025, 06:35
Default
  #838
Senior Member
 
andy
Join Date: May 2009
Posts: 349
Rep Power: 19
andy_ is on a distinguished road
Quote:
Originally Posted by aparangement View Post
The L3 sometimes plays an important role in realistic memory bandwidth; at least this is true for AMD's Zen processors.
I have no idea about the hardware design, but all the benchmark results here seem to agree with this.
Not sure I understand the comments about cache effects. This is a result taken from a few pages earlier showing typical cache effects for this benchmark.

Code:
Cores Time   Eff.
  1  546.46  1.00
  4  110.53  1.26
  8   51.49  1.32
 16   27.53  1.24
 32   15.38  1.11
 64    8.67  0.98
128    6.49  0.65
192    6.43  0.44
As the size of the problem on an individual core is reduced, more of the problem fits in cache and the computation is performed quicker, unless the speed of getting data from other processors is the bottleneck. In the example, the increase in parallel efficiency above 1.0 indicates that the benefit of the caches outweighs the delays due to memory transfers up to about 64 processors. The parallel efficiency will change with problem size, but it is an important piece of information if you wish to turn problems around efficiently on a parallel machine. Another is where the efficiency starts to plummet below 50% and it is no longer worthwhile adding processors to a job; this varies with shared vs distributed memory.
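
A small sketch of that reasoning applied to the table above, flagging the superlinear range and the point where adding cores stops being worthwhile (small differences from the quoted efficiency column can come from the precision of the reported times):

Code:
# Flag superlinear points (eff > 1, cache benefit dominating) and points
# below the ~50% "still worthwhile" threshold discussed above.
times = {1: 546.46, 4: 110.53, 8: 51.49, 16: 27.53,
         32: 15.38, 64: 8.67, 128: 6.49, 192: 6.43}
t1 = times[1]
for n, t in times.items():
    eff = t1 / (n * t)
    note = "superlinear" if eff > 1.0 else ("below 50%" if eff < 0.5 else "")
    print(f"{n:3d} cores  eff = {eff:.2f}  {note}")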

In the Apple examples the efficiency drops steadily, indicating that moving data between processors is the limiting factor. It is unusual, though, in that the memory restriction is already significant with 2 cores but then grows only modestly at 4 and 8 cores, rather than rapidly once memory bandwidth becomes insufficient. It would be interesting to know what they are doing.
andy_ is offline   Reply With Quote

Old   February 6, 2025, 13:26
Default 2 x EPYC 9684X
  #839
New Member
 
Join Date: Feb 2025
Location: Germany
Posts: 1
Rep Power: 0
Nolcera is on a distinguished road
Server: HP ProLiant DL385 Gen11 with 2 x AMD EPYC 9684X (2 x 96 = 192 physical cores, 1152 MB L3 cache per CPU), 24 x 64 GB RAM at 4800 MHz.

Code:
# cores   snappyHexMesh (s)   simpleFoam (s)
---------------------------------------------
   192         95.092               7.14
   160         68.389               7.23
   128         61.932               7.54
    96         56.130               8.09
    64         55.193               9.45
    56         62.801              10.62
    48         56.863              12.01
    40         56.618              13.84
    32         58.931              16.50
    28         57.512              18.33
    24         60.457              20.70
    20         65.944              24.24
    16         73.347              29.12
    12         83.844              37.81
     8        103.656              56.16
     4        179.401             110.04
     1        480.004             526.93
Same configuration as post # 780 (other machine).

Software configuration
Ubuntu 24.04 LTS
OpenFOAM v2412
Base case from "OpenFOAM benchmarks on various hardware", post #504, without any changes except the processor counts.
Crowdion likes this.

Last edited by Nolcera; February 7, 2025 at 02:13. Reason: Software configuration added
Nolcera is offline   Reply With Quote

Old   February 16, 2025, 00:23
Default
  #840
Member
 
Yan
Join Date: Dec 2013
Location: Milano
Posts: 48
Rep Power: 13
aparangement is on a distinguished road
Send a message via Skype™ to aparangement
I have no idea why the cache (at least the L3) plays such an important role here; maybe, as you said, it is related to MPI overhead.

But this was especially true in the Zen 2 era: a lot of posts back then in this thread show that the 256 MB L3 parts (e.g. EPYC 7532) are about 30% faster than their 128 MB L3 variants when the other specs are nearly identical.

The -X models with even larger L3 should be faster still, but typically the price curve for the -X models is quite steep.

Quote:
Originally Posted by andy_ View Post
Not sure I understand the comments about cache effects. This is a result taken from a few pages earlier showing typical cache effects for this benchmark.

Code:
Cores Time   Eff.
  1  546.46  1.00
  4  110.53  1.26
  8   51.49  1.32
 16   27.53  1.24
 32   15.38  1.11
 64    8.67  0.98
128    6.49  0.65
192    6.43  0.44
As the size of the problem on an individual core is reduced, more of the problem fits in cache and the computation is performed quicker, unless the speed of getting data from other processors is the bottleneck. In the example, the increase in parallel efficiency above 1.0 indicates that the benefit of the caches outweighs the delays due to memory transfers up to about 64 processors. The parallel efficiency will change with problem size, but it is an important piece of information if you wish to turn problems around efficiently on a parallel machine. Another is where the efficiency starts to plummet below 50% and it is no longer worthwhile adding processors to a job; this varies with shared vs distributed memory.

In the Apple examples the efficiency drops steadily, indicating that moving data between processors is the limiting factor. It is unusual, though, in that the memory restriction is already significant with 2 cores but then grows only modestly at 4 and 8 cores, rather than rapidly once memory bandwidth becomes insufficient. It would be interesting to know what they are doing.
aparangement is offline   Reply With Quote




