
HPC Fluent Capabilities

September 11, 2022, 04:15   #1
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello,
I am currently running a DES/LES with Fluent on an HPC cluster, and I would like to know from your experience what the best compromise is between the number of nodes and cores.
The mesh has 70M cells, and on each node I can access a maximum of 128 CPUs. I noticed that the simulation is faster on 64 cores than on 96 cores, which makes me a bit doubtful, since I have heard of people running on many more cores (1000+) whose simulation performance was clearly boosted.
There is surely some technical information I am missing, and I would really appreciate your help.
Thanks a lot

Marco

September 11, 2022, 07:07   #2
Lorenzo Galieti (LoGaL), Senior Member, Join Date: Mar 2018, Posts: 373
How big is your mesh (number of elements)? Basically, the increase in computational speed is a trade-off between distributing the mesh elements across multiple cores and the additional overhead of communication between the cores (because, to solve the equations in one mesh element, you sometimes need access to information held by an element belonging to another core).

Depending on the size of your mesh and the type of equations being solved, there is an optimum number of cores.
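As a toy illustration (the constants below are invented for the sketch, nothing measured), you can model the time per iteration as a compute term that shrinks with the core count plus a communication term that grows with it, and look for the minimum:

# Toy model: per-iteration time = compute (shrinks with cores) + communication
# (grows with cores). t_cell and t_comm are made-up constants, for illustration only.
def time_per_iteration(n_cores, n_cells, t_cell=1e-6, t_comm=2e-4):
    compute = n_cells * t_cell / n_cores   # work split across cores
    comm = t_comm * n_cores                # overhead from exchanging partition data
    return compute + comm

# The core count minimising the sum is the "optimum number of cores":
best = min(range(1, 2049), key=lambda n: time_per_iteration(n, 70_000_000))
print(best)   # ~592 with these made-up constants

The actual constants depend on your hardware and solver settings, which is why the optimum has to be found by testing.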

September 11, 2022, 07:17   #3
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Here is an example of the partitioning output when I use 1 node and 64 CPUs:

Multicore processors detected. Processor affinity set!

Reading from FILENAME in NODE0 mode ...
Reading mesh ...
Warning: reading 96 partition grid onto 64 compute node machine.
will auto partition.
69963316 cells, 1 cell zone ...
69963316 tetrahedral cells, zone id: 3
142010839 faces, 5 face zones ...
4080098 triangular wall faces, zone id: 1
137842425 triangular interior faces, zone id: 2
51624 triangular pressure-inlet faces, zone id: 6
16140 triangular pressure-outlet faces, zone id: 7
20552 triangular pressure-outlet faces, zone id: 8
12766611 nodes, 1 node zone ...
Done.

When I open the same case with 96 CPUs, the simulation is slower. I also tried increasing the number of nodes, but the simulation is still not faster than this. I need to reduce the computation time and use the nodes/cores I have in the best possible way, but I do not understand whether this depends on Fluent or on the cluster where I am running.
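For reference, the per-core load from the cell count in the log above (just the arithmetic):

# Cells per compute process at a few core counts (cell count from the log above):
cells = 69_963_316
for n in (64, 96, 128):
    print(n, "cores ->", round(cells / n), "cells per core")
# 64 -> 1093177, 96 -> 728785, 128 -> 546588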

Thanks a lot for any help you can provide
Let me know if more information is needed.

September 11, 2022, 14:11   #4
Lorenzo Galieti (LoGaL), Senior Member, Join Date: Mar 2018, Posts: 373
So each node has 64 cores in total? I am not really knowledgeable about HPC cluster architecture, but what I do know is that as soon as you use more than one node there is additional communication overhead between the nodes, because two different nodes physically sit in two different places. This may be the thing...

But perhaps somebody a bit more expert than me will reply.

September 11, 2022, 17:18   #5
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Each node has up to 128 cores.
I understand your point, but even if I use only 1 node, using 96/128 cores is slower than 64/128. How can I speed up my simulation if there is no positive effect from increasing either the cores or the nodes?

September 11, 2022, 21:11   #6
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
Have you tried 128?

Most likely you have 64x2=128 hardware threads per node after hyperthreading. This means the node is fully loaded at 64 and at 128 threads. Any number between 64 and 128 will slow down your calculation (63 is okay, 62 is okay, 65 is not okay) because you now have tasks in the queue that are just waiting for the others to finish before they can execute. If you spread the job across additional nodes, make sure you load each node the same way: take 64 or 128 from each node, or fewer than 64 from each node, but not the intermediate numbers. A general rule of thumb, and major etiquette when running anything on a shared HPC cluster: fully load the nodes you use or get off the cluster.
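The rule is simple enough to write down as a check (assuming nodes with 64 physical cores and 128 hardware threads, which is what yours look like):

# Sanity check for per-node thread counts on nodes assumed to have
# 64 physical cores and 128 hardware threads.
PHYSICAL, THREADS = 64, 128

def load_ok(per_node):
    # At or below the physical core count is fine, and loading every hardware
    # thread is fine; anything strictly in between oversubscribes some cores.
    return per_node <= PHYSICAL or per_node == THREADS

for n in (62, 63, 64, 65, 96, 128):
    print(n, "->", "ok" if load_ok(n) else "avoid")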

September 12, 2022, 03:40   #7
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello LuckyTran,

thanks a lot for your tips! I will try to fully load a node (128/128) when I have the possibility... unfortunately I cannot run at full capacity at the moment, since other users are using Fluent and we do not have many parallel licences.

Marco

September 12, 2022, 05:44   #8
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
In the meantime, you can check what kind of CPU your nodes are running and look up its specs. Then you know right away whether they are hyperthreaded or not. It takes only a few minutes of googling, won't bother anyone, and doesn't need any Fluent licenses.
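If you can get a shell on a node, you can also ask the machine directly. A minimal sketch, assuming the psutil package is available (on Linux, lscpu reports the same thing):

# Compare physical cores with hardware threads to detect hyperthreading/SMT.
import psutil

physical = psutil.cpu_count(logical=False)   # physical cores
threads = psutil.cpu_count(logical=True)     # hardware threads
print(f"{physical} physical cores, {threads} threads ->",
      "hyperthreading/SMT is ON" if threads > physical else "SMT is OFF")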

September 12, 2022, 06:11   #9
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hyper-threading is enabled: the number of threads is twice the number of cores, i.e. 64 CPU cores vs 128 threads.
So you were completely right, thanks a lot!
I have one more doubt. Since we have only 188 Fluent parallel licences, and supposing I will soon have 128 cores available, is it better to use
1 - two nodes, each of them half-loaded (64/128), or
2 - one node fully loaded (128/128)?

I know that economically speaking (and by the rule of thumb you mentioned) option 1 is not good, but in your opinion is there any performance gain from using more than one node in this case?

I would say no gain, since having two nodes would imply more time spent on inter-node communication. Am I right?
Thanks for your help
Marco

September 12, 2022, 11:45   #10
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
Regarding option 2: one node fully loaded at 128 is the same as one node fully loaded at 64. Better not to hold 64 extra Fluent licenses hostage if that's all you have.

Communication overhead with multiple nodes is rarely worse than 10% on any respectable cluster that uses high-speed links. If your inter-node comms are worse than 10%, you should honestly return the cluster to the store, or find yourself a lawyer and sue somebody. Going from one to two nodes, you can expect a 1.4-2.0x speedup (i.e. 70% down to 50% of the calculation time), depending on whether you get linear N-scaling in the best case or sqrt(N) scaling in a highly coupled case. Given your cell density, I would expect you to still be within the linear range, since the less-than-linear scaling usually doesn't kick in until you get to relatively low cell counts per partition (fewer than 100,000 cells per thread). All of this is predicated on you being able to fully load your nodes and having them to yourself (or rather, Fluent having them).
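Back-of-the-envelope with your numbers (plain arithmetic, nothing Fluent-specific; the 100k cells/thread floor is the rule of thumb above, not a law):

import math

cells = 69_963_316      # mesh size from your log
threads = 2 * 128       # two fully loaded nodes
per_thread = cells / threads
print(round(per_thread), "cells per thread")   # ~273294, well above the ~100k floor

# Above the floor, expect near-linear scaling; below it, closer to sqrt(N).
speedup = 2.0 if per_thread > 100_000 else math.sqrt(2.0)
print(f"expected speedup going from 1 to 2 nodes: ~{speedup:.2f}x")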

Two nodes at 64 each would be fully loaded and would likely get you the speedup that you want. The issue is: is your cluster's job scheduler set up so that you can lock down two nodes for yourself? Can you convince all your buddies not to run on the unused cores? Etc. How doable this approach is depends on your cluster's queue setup. You can also run 2x63, all the way down to 2x48, and still get some kind of speedup in the sqrt(N) case. If your calculation scales linearly, you can maybe even go as low as 2x36. Again, it's bad etiquette, and it depends on how selfish you want to be and how time-critical your LES calculation is. There's your local equilibrium; I leave it up to you to figure out the Nash equilibrium that satisfies everyone else's needs.

September 12, 2022, 17:14   #11
MC (MarcoC501), Member, Join Date: Apr 2021, Posts: 43
Hello LuckyTran,

thanks a lot for your really nice reply!
Actually, I can have exclusive access to each node I am simulating on, and I can also choose the number of nodes. So, if I take literally all the prescriptions you gave me, I could even run with 4x36 (being very selfish, but I know everyone running Fluent on our cluster, so we could organize...) and see what happens.
Joking aside, I will first try 2x36 and 2x64 to check whether the scaling is linear or not, and verify your suggestions. If I am really in a hurry later on, and the scaling is linear, I will go for 4x36.
Oh yeah, do you have some references for the 2x36 and 2x48 you mentioned? I would love to read more about this CPU scaling performance.

Thanks

September 12, 2022, 23:33   #12
Lucky (LuckyTran), Senior Member, Location: Orlando, FL USA, Join Date: Apr 2011, Posts: 5,674
There's lots of anecdotal evidence of N-scaling or sqrt(N) scaling; you can even find it in product brochures and in publications from benchmarking workshops. You don't find this information in scientific articles because it's too practical. The parallel performance really depends on the problem at hand, the models you use, a little bit on the hardware, etc. The best way to find out what your scaling is, is to measure it yourself, on your problem. Then you know it, for your problem. All it takes is a stopwatch.
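Turning the stopwatch readings into scaling numbers is trivial (the times below are placeholders, not measurements):

# Wall time per iteration measured at a few core counts (placeholder values).
timings = {64: 10.0, 128: 5.4, 256: 3.1}   # cores -> seconds per iteration

base_cores, base_time = min(timings.items())   # smallest run is the baseline
for cores, t in sorted(timings.items()):
    speedup = base_time / t
    efficiency = speedup / (cores / base_cores)
    print(f"{cores:4d} cores: speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")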

September 13, 2022, 00:11   #13
Alexander (AlexanderZ), Senior Member, Join Date: Apr 2013, Posts: 2,363
Don't use hyperthreading for Fluent.

September 19, 2022, 05:01   #14
Paolo Lampitella (sbaffini), Senior Member, Location: Italy, Join Date: Mar 2009, Posts: 2,151
Generally speaking, you can have:

- compute-bound calculations, where the speed of the processor (or, more likely, how fast you can feed the processor with work) is the main limiting factor

- memory-bound calculations, where the speed of the memory (which, in a way, also includes its size, as too little memory typically implies a delay as well) is the main limiting factor

- communication-bound calculations, where the idle time due to parallel communication is the main limiting factor

- IO-bound calculations, where the time spent writing data to files is the main limiting factor

By definition you can have only one main limiting factor, but which one it is depends on the hardware, the code and the case.

My somewhat outdated experience with Fluent, up to version 14, is that it can indeed suffer from any of them but, in practice, for common hardware and cases, you typically end up suffering from memory-bound limitations.

This is not so much a Fluent problem as a general multicore limitation, since multiple cores tend to share resources (memory channels, cache, etc.). Add hyperthreading on top and you also introduce a compute-bound limitation while making the memory-bound one even worse.

From my experience, the formal sweet spot is to use only half of the physical cores available on a node, with the rest left idle. It is formal because, from a practical point of view, it is not meaningful to use just half of a node's cores: that is a matter of efficiency, and you still gain practical convenience from running at full capacity. Just do not expect, at all, double the performance.

If you want to test parallel efficiency in such cases, when it makes sense, it is typically done using the single-node performance as the baseline and reasoning in terms of nodes. What you see is that, while single-node efficiency starts to saturate around half the node's capacity, the node-based efficiency still follows the typical expectations for such codes.
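For instance, a small sketch of the node-based bookkeeping (the times are placeholders, purely to show the arithmetic):

# Node-based parallel efficiency: the baseline is one fully loaded node,
# and scaling is counted in whole nodes (placeholder times, not data).
node_times = {1: 100.0, 2: 52.0, 4: 27.0}   # nodes -> wall time per iteration

t1 = node_times[1]
for nodes, t in sorted(node_times.items()):
    print(f"{nodes} node(s): speedup {t1 / t:.2f}x, efficiency {t1 / (t * nodes):.0%}")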

In conclusion, however, just always use all the allowed cores in the most compact arrangement (i.e., use full nodes) and, if allowed, ask for hyperthreading to be disabled.

In your specific case, it is probable that 64 is the number of physical cores, which becomes 128 with hyperthreading. An additional issue might come from the affinity settings, but that would be weird.
