OpenFOAM on Apple M1 Render Farm?

May 20, 2021, 20:03   #1
xuegy (Member)
I came across an idea: run OpenFOAM on a render farm of M1 Mac Minis.
Each 16 GB model with a 10GbE port costs $999, or $899 with the education discount.
In Geekbench 5, the M1 outperforms a 12-core E5-2697 v2, so raw CPU performance shouldn't be a concern.
Pros:
1. For every $1,000 you get 68 GB/s of memory bandwidth from LPDDR4X-4266 (quick arithmetic check below).
2. 192 KB L1i, 128 KB L1d, and 12 MB of L2: that's huge.
3. Super low power consumption: 25 W.
4. Possible to do optimizations based on its unified memory, e.g. solving fvMatrix on the GPU directly.
Cons:
1. Small RAM per machine.
2. High RAM latency, ~100 ns (vs. ~50 ns on Intel/AMD).
3. No InfiniBand. There are two interconnect options: a 10GbE switch, or Thunderbolt 3 (40 Gb/s), though that might get expensive as well.
Or maybe it's too early and I should wait until the M2 brings HBM.
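(Quick arithmetic check on the 68 GB/s figure, assuming the M1's usual 128-bit LPDDR4X interface: 4266 MT/s x 16 bytes per transfer ≈ 68.3 GB/s.)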

May 21, 2021, 00:10   #2
Interesting Idea!
wkernkamp (Will Kernkamp, Senior Member)
For a small cluster, 10 Gb/s Ethernet should be fine; you don't need InfiniBand. Before committing to multiple units, run the OpenFOAM benchmarks on various hardware benchmark on a single node. OpenFOAM scaling is very close to linear with cluster nodes, so you can do your test for just $1,000.


Good luck and post results!


Will

May 21, 2021, 03:37   #3
flotus1 (Alex, Super Moderator)
Yeah, I would not be too optimistic about that.
If you want to go that route, definitely try to get your hands on one of these machines first, and see how that performs.
Here are my thoughts/questions:
Quote:
1. For each $1000 you get 68 GB/s of memory bandwidth from LPDDR4X-4266.
Do you? I can't tell at first glance whether that memory operates in dual-channel mode.
And for $1,000 I can spec out some rather powerful PCs from regular parts; the kind of parts that can be replaced individually if they break, instead of swapping the whole machine, and with the freedom to use whatever you want: more memory? ECC support? InfiniBand? Regular GPUs?...

Quote:
2. 192 KB L1i, 128 KB L1d, 12MB L2, that's huge.
Compared to what? It's an entirely different CPU architecture from what Intel and AMD offer in x86 space, so comparing specs like that is rather pointless. But if we really want to: 12 MB of L2 may sound huge compared to what current-gen mainstream CPUs have, until you realize that this is the last-level cache and its size should rather be compared to the L3 of familiar CPUs. Plus the fact that one level of the cache hierarchy is missing. But again, it's a different architecture, so that doesn't have to mean anything.

Quote:
3. 25W super low power consumption.
Which will lead to TDP throttling, especially if you want to leverage GPU compute in addition to the CPU. And possibly thermal throttling as well, since the cooling solution is designed by Apple.
Quote:
4. Possible to do optimizations based on its unified memory, e.g. solving fvMatrix on GPU directly.
Good luck!

May 21, 2021, 14:35   #4
xuegy (Member)
Quote:
Originally Posted by flotus1
Yeah, I would not be too optimistic about that.
If you want to go that route, definitely try to get your hands on one of these machines first, and see how that performs.
I've ordered an M1 Mac Mini and will test the performance on a single machine (if I can successfully compile OpenFOAM on it).
ANSYS users don't have a choice, but for OpenFOAM users ARM64 is not far away.

May 21, 2021, 14:36   #5
flotus1 (Alex, Super Moderator)
Please let us know how it turns out

May 21, 2021, 14:51   #6
xuegy (Member)
Quote:
Originally Posted by flotus1
Which will lead to TDP throttling, especially if you want to leverage GPU compute in addition to the CPU. And possibly thermal throttling as well, since the cooling solution is designed by Apple.
Good luck!
I've seen a benchmark result using both CPU and GPU: roughly 15 W for the CPU and 10 W for the GPU. So yes, throttling could be an issue.
Not sure if PETSc GPU support is available on the M1. It says OpenCL, but who knows.

May 23, 2021, 03:55   #7
flotus1 (Alex, Super Moderator)
Forgot to mention one thing about GPU compute...
One of the reasons GPU computing is a thing for CFD is the dedicated memory on those GPUs, with pretty high memory bandwidth: on the order of several hundred GB/s, and more on high-end models. The GPU in question here doesn't have that; instead it has to share memory capacity and bandwidth with the CPU. Apple calls it "unified" memory, but it's just good old shared memory without fixed allocation. That's not a pro, it's another potential bottleneck.

May 28, 2021, 01:08   #8
xuegy (Member)
Quote:
Originally Posted by flotus1
Please let us know how it turns out
Here's the result:

Compiled OF-v2012 natively for ARM64 in 36 minutes using all 4+4 cores. This part is blazing fast.

Then I ran the motorBike case. It took 155 s to run 500 iterations on the 4 big cores. Using all 8 cores it takes 168 s, so the little cores are really weak. During the run I couldn't hear the fan spinning at all.

Tried my 2018 MBP (i5-8259U, 4 cores): 232 s, and the fan was almost taking off.

Also tried my dual E5-2667 v2 hackintosh workstation with 8 channels of DDR3-1333:
4 cores: 232 s
8 cores: 158 s
16 cores: 135 s

Overall, the M1's floating-point performance is not proportionally as strong as the rest of the chip. However, I didn't enable SIMD optimizations (no idea how to enable NEON on Apple Clang), so there may be some untapped potential. I also haven't tried GPU acceleration yet.
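
For what it's worth: on AArch64, Advanced SIMD (NEON) is part of the base ISA, so Apple Clang should already emit NEON instructions for arm64 targets without any extra flag. A quick standalone sketch (not OpenFOAM code, and the file name is just an example) to verify:
Code:
// Checks that the arm64 compiler targets NEON and that the double-precision
// intrinsics work. Build with: clang++ -O2 neon_check.cpp -o neon_check
#include <arm_neon.h>
#include <cstdio>

int main()
{
#if defined(__ARM_NEON)
    std::printf("__ARM_NEON is defined: Advanced SIMD is enabled by default\n");
#endif
    // y = a*x + y on four doubles, two lanes (float64x2_t) at a time
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {0.5, 0.5, 0.5, 0.5};
    float64x2_t a = vdupq_n_f64(2.0);
    for (int i = 0; i < 4; i += 2)
    {
        float64x2_t vx = vld1q_f64(&x[i]);
        float64x2_t vy = vld1q_f64(&y[i]);
        vy = vfmaq_f64(vy, a, vx);   // vy += a*vx (fused multiply-add)
        vst1q_f64(&y[i], vy);
    }
    std::printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);   // 2.5 4.5 6.5 8.5
    return 0;
}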

And I know this test case is too small to scale well on my workstation. Is there any benchmark case that uses ~10 GB of RAM?

May 28, 2021, 02:31   #9
flotus1 (Alex, Super Moderator)
There is always the OpenFOAM benchmarks on various hardware thread.
It also lets you compare your results to tons of other setups, and it has proven to be large enough for good scaling at much higher core counts.

May 29, 2021, 01:03   #10
xuegy (Member)
The results are here:
OpenFOAM benchmarks on various hardware
The build is still buggy, and there is probably room for further improvement.

October 4, 2021, 05:39   #11
Miguel Hernandez (Member)
Quote:
Originally Posted by xuegy
[benchmark results quoted from post #8 above]
I've been trying to install OpenFOAM 9 on my MacBook Pro M1 for a while without any success. I've used the Docker image, but it seems to be emulated (x86), so it is very, very slow...

Can you provide some tips on how you managed to install OpenFOAM natively on the new M1 SoC?

It would be very useful...

Thanks in advance.

October 7, 2021, 10:12   #12
xuegy (Member)
You'll need to remove the sigFpe part (it seems Apple silicon doesn't support this feature?), and then it should compile.
https://github.com/mrklein/openfoam-...ment-850090915
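A minimal sketch of what's going on there (not OpenFOAM's actual sigFpe.C): the existing __APPLE__ code path presumably relies on the x86 SSE exception-mask intrinsics, which don't exist on arm64, so the floating-point-trap setup has to be compiled out on Apple silicon, roughly like this:
Code:
// Rough illustration only. On x86 macOS, FP exceptions can be unmasked via
// the SSE control register (xmmintrin.h); arm64 macOS has neither that
// register nor feenableexcept(), so the trapping code is simply disabled.
#include <cstdio>

int main()
{
#if defined(__APPLE__) && defined(__x86_64__)
    // x86 macOS path, e.g.:
    // _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK()
    //     & ~(_MM_MASK_DIV_ZERO | _MM_MASK_INVALID | _MM_MASK_OVERFLOW));
    std::printf("x86 macOS: SSE exception-mask path available\n");
#elif defined(__APPLE__) && defined(__arm64__)
    std::printf("arm64 macOS: sigFpe trapping left disabled\n");
#else
    std::printf("other platform\n");
#endif
    return 0;
}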

October 8, 2021, 14:20   #13
Miguel Hernandez (Member)
Quote:
Originally Posted by xuegy
You'll need to remove the sigFpe part (it seems Apple silicon doesn't support this feature?), and then it should compile.
https://github.com/mrklein/openfoam-...ment-850090915


Thanks xuegy... I tried what you suggested, but it doesn't work for me...

October 9, 2021, 06:39   #14
Miguel Hernandez (Member)
Quote:
Originally Posted by xuegy
You'll need to remove the sigFpe part (it seems Apple silicon doesn't support this feature?), and then it should compile.
https://github.com/mrklein/openfoam-...ment-850090915
Could you please provide some more details?

I've edited the sigFpe.C file and removed all the __APPLE__ sections; I found them in some #if blocks...
And what does removing -ftrapping-math mean?

Thanks in advance,
regards
