
HPC Fluent Capabilities

September 11, 2022, 04:15   #1
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello,
I am currently running a DES/LES with Fluent on an HPC cluster, and I would like to know from your experience what the best compromise is between the number of nodes and cores.
The mesh has 70M cells, and on each node I can access a maximum of 128 CPUs. I noticed that the simulation is faster on 64 cores than on 96 cores, which makes me a bit doubtful, since I have heard of people running on many more cores (1000+) whose simulation performance was clearly boosted.
There is surely some technical information I am missing, and I would really appreciate your help.
Thanks a lot

Marco

September 11, 2022, 07:07   #2
Lorenzo Galieti (LoGaL), Senior Member, Join Date: Mar 2018, Posts: 373
How big is your mesh (number of elements)? Basically, the increase in computational speed is a trade-off between distributing the mesh elements across multiple cores and the additional overhead of communication between the cores (because, to solve the equations in one mesh element, you sometimes need access to information held by an element belonging to another core).

Depending on the size of your mesh and the type of equations being solved, there is an optimum number of cores.
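As a toy illustration (the constants below are invented for the sketch, nothing measured), you can model the time per iteration as a compute term that shrinks with the core count plus a communication term that grows with it, and look for the minimum:

# Toy model: per-iteration time = compute (shrinks with cores) + communication
# (grows with cores). t_cell and t_comm are made-up constants, for illustration only.
def time_per_iteration(n_cores, n_cells, t_cell=1e-6, t_comm=2e-4):
    compute = n_cells * t_cell / n_cores   # work split across cores
    comm = t_comm * n_cores                # overhead from exchanging partition data
    return compute + comm

# The core count minimising the sum is the "optimum number of cores":
best = min(range(1, 2049), key=lambda n: time_per_iteration(n, 70_000_000))
print(best)   # ~592 with these made-up constants

The actual constants depend on your hardware and solver settings, which is why the optimum has to be found by testing.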

September 11, 2022, 07:17   #3
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Here is an example of the partitioning output when I use 1 node and 64 CPUs:

Multicore processors detected. Processor affinity set!

Reading from FILENAME in NODE0 mode ...
Reading mesh ...
Warning: reading 96 partition grid onto 64 compute node machine.
will auto partition.
69963316 cells, 1 cell zone ...
69963316 tetrahedral cells, zone id: 3
142010839 faces, 5 face zones ...
4080098 triangular wall faces, zone id: 1
137842425 triangular interior faces, zone id: 2
51624 triangular pressure-inlet faces, zone id: 6
16140 triangular pressure-outlet faces, zone id: 7
20552 triangular pressure-outlet faces, zone id: 8
12766611 nodes, 1 node zone ...
Done.

When I open the same case with 96 CPUs, the simulation is slower. I also tried increasing the number of nodes, but the simulation is still not faster than this. I need to reduce the computation time and use the nodes/cores I have in the best possible way, but I do not understand whether this depends on Fluent or on the cluster where I am running.
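For reference, the per-core load from the cell count in the log above (just the arithmetic):

# Cells per compute process at a few core counts (cell count from the log above):
cells = 69_963_316
for n in (64, 96, 128):
    print(n, "cores ->", round(cells / n), "cells per core")
# 64 -> 1093177, 96 -> 728785, 128 -> 546588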

Thanks a lot for any help you can provide
Let me know if more information is needed.

September 11, 2022, 14:11   #4
Lorenzo Galieti (LoGaL), Senior Member, Join Date: Mar 2018, Posts: 373
So each node has 64 cores in total? I am not really knowledgeable about HPC cluster architecture, but what I do know is that as soon as you use more than one node there is additional communication overhead between the nodes, because two different nodes physically sit in two different places. This may be the thing...

But perhaps somebody a bit more expert than me will reply.

September 11, 2022, 17:18   #5
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Each node has up to 128 cores.
I understand your point, but even if I use only 1 node, using 96/128 cores is slower than 64/128. How can I speed up my simulation if there is no positive effect from increasing either the cores or the nodes?

September 11, 2022, 21:11   #6
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
Have you tried 128?

Most likely you have 64x2=128 hardware threads per node after hyperthreading. This means the node is fully loaded at 64 and at 128 threads. Any number between 64 and 128 will slow down your calculation (63 is okay, 62 is okay, 65 is not okay) because you now have tasks in the queue that are just waiting for the others to finish before they can execute. If you spread the job across additional nodes, make sure you load each node the same way: take 64 or 128 from each node, or fewer than 64 from each node, but not the intermediate numbers. A general rule of thumb, and major etiquette when running anything on a shared HPC cluster: fully load the nodes you use or get off the cluster.
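The rule is simple enough to write down as a check (assuming nodes with 64 physical cores and 128 hardware threads, which is what yours look like):

# Sanity check for per-node thread counts on nodes assumed to have
# 64 physical cores and 128 hardware threads.
PHYSICAL, THREADS = 64, 128

def load_ok(per_node):
    # At or below the physical core count is fine, and loading every hardware
    # thread is fine; anything strictly in between oversubscribes some cores.
    return per_node <= PHYSICAL or per_node == THREADS

for n in (62, 63, 64, 65, 96, 128):
    print(n, "->", "ok" if load_ok(n) else "avoid")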

September 12, 2022, 03:40   #7
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello LuckyTran,

thanks a lot for your tips! I will try to fully load a node (128/128) when I have the possibility... unfortunately I cannot run at full capacity at the moment, since other users are using Fluent and we do not have many parallel licences.

Marco

September 12, 2022, 05:44   #8
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
In the meantime, you can check what kind of CPU your nodes are running and look up its specs. Then you know right away whether they are hyperthreaded or not. It takes only a few minutes of googling, won't bother anyone, and doesn't need any Fluent licenses.
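If you can get a shell on a node, you can also ask the machine directly. A minimal sketch, assuming the psutil package is available (on Linux, lscpu reports the same thing):

# Compare physical cores with hardware threads to detect hyperthreading/SMT.
import psutil

physical = psutil.cpu_count(logical=False)   # physical cores
threads = psutil.cpu_count(logical=True)     # hardware threads
print(f"{physical} physical cores, {threads} threads ->",
      "hyperthreading/SMT is ON" if threads > physical else "SMT is OFF")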

September 12, 2022, 06:11   #9
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hyper-threading is enabled: the number of threads is twice the number of cores, i.e. 64 CPU cores vs 128 threads.
So you were completely right, thanks a lot!
I have one more doubt. Since we have only 188 Fluent parallel licences, and supposing I will soon have 128 cores available, is it better to use
1 - two nodes, each of them half-loaded (64/128), or
2 - one node fully loaded (128/128)?

I know that economically speaking (and by the rule of thumb you mentioned) option 1 is not good, but in your opinion is there any performance gain from using more than one node in this case?

I would say no gain, since having two nodes would imply more time spent on inter-node communication. Am I right?
Thanks for your help
Marco

September 12, 2022, 11:45   #10
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
Regarding option 2: one node fully loaded at 128 is the same as one node fully loaded at 64. Better not to hold 64 extra Fluent licenses hostage if that's all you have.

Communication overhead with multiple nodes is rarely worse than 10% on any respectable cluster that uses high-speed links. If your inter-node comms are worse than 10%, you should honestly return the cluster to the store, or find yourself a lawyer and sue somebody. Going from one to two nodes, you can expect a 1.4-2.0x speedup (i.e. 70% down to 50% of the calculation time), depending on whether you get linear N-scaling in the best case or sqrt(N) scaling in a highly coupled case. Given your cell density, I would expect you to still be within the linear range, since the less-than-linear scaling usually doesn't kick in until you get to relatively low cell counts per partition (fewer than 100,000 cells per thread). All of this is predicated on you being able to fully load your nodes and having them to yourself (or rather, Fluent having them).
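Back-of-the-envelope with your numbers (plain arithmetic, nothing Fluent-specific; the 100k cells/thread floor is the rule of thumb above, not a law):

import math

cells = 69_963_316      # mesh size from your log
threads = 2 * 128       # two fully loaded nodes
per_thread = cells / threads
print(round(per_thread), "cells per thread")   # ~273294, well above the ~100k floor

# Above the floor, expect near-linear scaling; below it, closer to sqrt(N).
speedup = 2.0 if per_thread > 100_000 else math.sqrt(2.0)
print(f"expected speedup going from 1 to 2 nodes: ~{speedup:.2f}x")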

Two nodes at 64 each would be fully loaded and would likely get you the speedup that you want. The issue is: is your cluster's job scheduler set up so that you can lock down two nodes for yourself? Can you convince all your buddies not to run on the unused cores? Etc. How doable this approach is depends on your cluster's queue setup. You can also run 2x63, all the way down to 2x48, and still get some kind of speedup in the sqrt(N) case. If your calculation scales linearly, you can maybe even go as low as 2x36. Again, it's bad etiquette, and it depends on how selfish you want to be and how time-critical your LES calculation is. There's your local equilibrium; I leave it up to you to figure out the Nash equilibrium that satisfies everyone else's needs.

September 12, 2022, 17:14   #11
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello LuckyTran,

thanks a lot for your really nice reply!
Actually, I can have exclusive access to each node I am simulating on, and I can also choose the number of nodes. So, if I take literally all the prescriptions you gave me, I could even run with 4x36 (being very selfish, but I know everyone running Fluent on our cluster, so we could organize...) and see what happens.
Joking aside, I will first try 2x36 and 2x64 to check whether the scaling is linear or not, and verify your suggestions. If I am really in a hurry later on, and the scaling is linear, I will go for 4x36.
Oh yeah, do you have some references for the 2x36 and 2x48 you mentioned? I would love to read more about this CPU scaling performance.

Thanks

September 12, 2022, 23:33   #12
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
There's lots of anecdotal evidence of N-scaling or sqrt(N) scaling; you can even find it in product brochures and in publications from benchmarking workshops. You don't find this information in scientific articles because it's too practical. The parallel performance really depends on the problem at hand, the models you use, a little bit on the hardware, etc. The best way to find out what your scaling is, is to measure it yourself, on your problem. Then you know it, for your problem. All it takes is a stopwatch.
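Turning the stopwatch readings into scaling numbers is trivial (the times below are placeholders, not measurements):

# Wall time per iteration measured at a few core counts (placeholder values).
timings = {64: 10.0, 128: 5.4, 256: 3.1}   # cores -> seconds per iteration

base_cores, base_time = min(timings.items())   # smallest run is the baseline
for cores, t in sorted(timings.items()):
    speedup = base_time / t
    efficiency = speedup / (cores / base_cores)
    print(f"{cores:4d} cores: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")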

September 13, 2022, 00:11   #13
Alexander (AlexanderZ), Senior Member, Join Date: Apr 2013, Posts: 2,363
Don't use hyperthreading for Fluent.

September 19, 2022, 05:01   #14
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151
Generally speaking, you can have:

- compute-bound calculations, where the speed of the processor (or, more likely, how fast you can feed the processor with work) is the main limiting factor

- memory-bound calculations, where the speed of the memory (which, in a way, also includes its size, as too little memory typically implies a delay as well) is the main limiting factor

- communication-bound calculations, where the idle time due to parallel communication is the main limiting factor

- IO-bound calculations, where the time spent writing data to files is the main limiting factor

By definition you can have only one main limiting factor, but which one it is depends on the hardware, the code and the case.

My somewhat outdated experience with Fluent, up to version 14, is that it can indeed suffer from any of them but, in practice, for common hardware and cases, you typically end up suffering from memory-bound limitations.

This is not so much a Fluent problem as a general multicore limitation, since multiple cores tend to share resources (memory channels, cache, etc.). Add hyperthreading on top and you also introduce a compute-bound limitation while making the memory-bound one even worse.

From my experience, the formal sweet spot is to use only half of the physical cores available on a node, with the rest left idle. It is formal because, from a practical point of view, it is not meaningful to use just half of a node's cores: that is a matter of efficiency, and you still gain practical convenience from running at full capacity. Just do not expect, at all, double the performance.

If you want to test parallel efficiency in such cases, when it makes sense, it is typically done using the single-node performance as the baseline and reasoning in terms of nodes. What you see is that, while single-node efficiency starts to saturate around half the node's capacity, the node-based efficiency still follows the typical expectations for such codes.
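For instance, a small sketch of the node-based bookkeeping (the times are placeholders, purely to show the arithmetic):

# Node-based parallel efficiency: the baseline is one fully loaded node,
# and scaling is counted in whole nodes (placeholder times, not data).
node_times = {1: 100.0, 2: 52.0, 4: 27.0}   # nodes -> wall time per iteration

t1 = node_times[1]
for nodes, t in sorted(node_times.items()):
    print(f"{nodes} node(s): speedup {t1 / t:.2f}x, efficiency {t1 / (t * nodes):.0%}")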

In conclusion, however, just always use all the allowed cores in the most compact arrangement (i.e., use full nodes) and, if allowed, ask for hyperthreading to be disabled.

In your specific case, it is probable that 64 is the number of physical cores, which becomes 128 with hyperthreading. An additional issue might come from the affinity settings, but that would be weird.
