A cluster of 2 is almost 5x slower than individual node

#1, June 21, 2020, 10:16
Mars409, New Member (joined May 2020, 27 posts)
I put together a cluster of two Raspberry Pi 4 boards and ran MPPICFoam (nearly stock, except that it does not write out the volume fields) across 8 processors, four on each board. The time step is around 5e-6 s. No NFS. Each board gets its own root directory with its own latestTime, system, and constant subdirectories on its local microSD card.

I then compared this with running on only 4 processors on one board alone.

A lone board took 11 s per time step.

The cluster took 52 s per time step. It makes no difference whether Distributed is yes or no. This happens long before the run reaches the point of writing data to disk.

For the cluster, I am trying to pin down whether the communication link between the boards is what bogs down performance. I list the relevant numbers below and estimate that the data exchanged over the Ethernet costs only about 0.045 s per time step, which is nowhere near the 52 s that the 2-node cluster took.

Can someone shed light on this puzzle? Perhaps the TCP/IP packet size is too small? How do I control this under OpenMPI? Perhaps it's because one of the 4 processors (each board has only 4) has to switch back and forth between MPI and the solver? (I haven't tried 3+3.)
  • Gigabit Ethernet connects the two boards. At, say, 80% utilization, that yields 800 Mb/s.
  • A total of nearly 3,000 faces connect the processors on one board to the processors on the other.
  • 9 values (3 are the components of U) are emitted from each side of each face, with 64 bits/value.
  • U takes 1 iteration to converge, whereas p takes around 10 iterations.
  • Payload per iteration = 3,000 faces x 2 sides/face x 9 values/side x 64 bits/value = 3,456,000 bits, rounded up to ~3,600,000 bits
  • Payload per time step = 10 iterations x 3,600,000 bits/iteration = 36 Mbits
  • Time taken on Ethernet = 36 Mbits / 800 Mbits/s = 0.045 s, far smaller than the 52 s.
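
The estimate in the list above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic using the figures quoted in the post; the function name and parameter names are made up for illustration. Note that the unrounded product gives about 0.043 s, slightly below the 0.045 s obtained from the rounded-up payload.

```python
def ethernet_time_per_step(faces=3000, sides_per_face=2, values_per_side=9,
                           bits_per_value=64, iterations=10,
                           link_bits_per_s=800e6):
    """Return (bits per iteration, bits per time step, seconds on the wire).

    All defaults are the numbers quoted in the post; the 800 Mb/s figure
    is the poster's assumed 80% utilization of Gigabit Ethernet.
    """
    bits_per_iteration = faces * sides_per_face * values_per_side * bits_per_value
    bits_per_step = bits_per_iteration * iterations
    seconds = bits_per_step / link_bits_per_s
    return bits_per_iteration, bits_per_step, seconds


bits_iter, bits_step, seconds = ethernet_time_per_step()
print(f"{bits_iter:,} bits/iteration, "
      f"{bits_step / 1e6:.1f} Mbit/step, "
      f"{seconds:.3f} s on the wire")
# prints: 3,456,000 bits/iteration, 34.6 Mbit/step, 0.043 s on the wire
```

This model counts only payload volume. It says nothing about how many separate messages are exchanged or what each one costs in latency.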

#2, June 23, 2020, 21:59
me3840, Senior Member (joined Nov 2010, Location: USA, 1,232 posts)
It might be useful to run OpenFOAM's profiling functions or another system information gathering application like sar or valgrind.


https://www.openfoam.com/documentati...rofiling.htmla


My initial guess would be some issue with latency going from the pi's SoC > pcie > ethernet controller.
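
One way to sanity-check a latency explanation is to ask how many sequential message exchanges per time step it would take to account for the slowdown. The sketch below is a back-of-the-envelope model, not a measurement: the per-message latencies are assumed values, the function name is hypothetical, and using the single-board 11 s as the compute share is crude (with 8 ranks each rank has less work, so the real compute share is probably smaller).

```python
def sequential_messages_needed(cluster_s, compute_s, latency_s):
    """Number of back-to-back messages that would fill the unexplained time.

    cluster_s : measured time per step on the 2-board cluster (from the thread)
    compute_s : assumed compute share per step (crude proxy, see lead-in)
    latency_s : ASSUMED per-message latency, not measured anywhere in the thread
    """
    return (cluster_s - compute_s) / latency_s


cluster_s = 52.0   # s per time step, 2 boards (from the thread)
compute_s = 11.0   # s per time step, 1 board (from the thread, used as a proxy)

for latency_us in (50, 200, 1000):   # assumed latencies in microseconds
    n = sequential_messages_needed(cluster_s, compute_s, latency_us * 1e-6)
    print(f"latency {latency_us:>4} us -> about {n:,.0f} sequential messages per step")
```

If the counts this produces look implausibly high for one time step, latency alone is unlikely to be the whole story, and profiling becomes the more useful next step.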

#3, June 25, 2020, 13:34
Mars409, New Member (joined May 2020, 27 posts)
Thanks for the tip.

I ran it and got no profiling results.

It turns out I need to set the compile option to 'Prof':
Code:
export WM_COMPILE_OPTION=Prof
I re-compiled OpenFOAM yesterday evening.

I am now running on 4 processors on one board first to see what I get.

Anyway, even running on just one board, the four processors together take about 13 seconds per time step of MPPICFoam, whereas running on 4 threads on my 8-year-old Dell Inspiron laptop (Intel x64 arch, 2 cores) takes only 4 seconds. The 3x worse performance of the ARM SoC on the RPi 4 is shocking.

The next thing I need to do is compile OpenFOAM on the Dell laptop with 'Prof' and re-run mpirun on 4 processors to see where the difference lies.

What is sar? A web search only turns up page after page about the virus.

I tried valgrind and got an error message. A search on it brought up a 2016 discussion thread between the developers (https://bugs.kde.org/show_bug.cgi?id=303877) that seems to indicate they fixed the problem for some specific architectures. Apparently not for armhf.

#4, June 25, 2020, 13:52
me3840, Senior Member (joined Nov 2010, Location: USA, 1,232 posts)
You can read more about sar here:
https://linux.die.net/man/1/sar
https://www.geeksforgeeks.org/sar-co...m-performance/
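
If sar is not installed, the per-interface byte counters it reports can be approximated by sampling /proc/net/dev twice. The following is a minimal sketch, assuming a Linux system and an interface named eth0 (an assumption, check the actual name on the board); the function names are made up for illustration.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:        # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # field 0 is received bytes, field 8 is transmitted bytes
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_mbit_per_s(before, after, interval_s, iface):
    """Return (rx, tx) throughput in Mbit/s between two counter snapshots."""
    rx = (after[iface][0] - before[iface][0]) * 8 / interval_s / 1e6
    tx = (after[iface][1] - before[iface][1]) * 8 / interval_s / 1e6
    return rx, tx


def measure(iface="eth0", interval_s=1.0):
    """Sample /proc/net/dev twice and report throughput for one interface."""
    with open("/proc/net/dev") as f:
        before = parse_net_dev(f.read())
    time.sleep(interval_s)
    with open("/proc/net/dev") as f:
        after = parse_net_dev(f.read())
    return throughput_mbit_per_s(before, after, interval_s, iface)
```

Calling `measure("eth0")` on each board while the solver is running gives a rough idea of how much of the link is actually in use, which can be compared against the 800 Mb/s assumed earlier in the thread.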


It would be interesting to compute the maximum memory bandwidth available to your old laptop and the Pi, and see how that correlates with the performance difference. You could also measure the memory bandwidth while running the application with PCM:
https://github.com/opcm/pcm

I don't know if it can be built for ARM though.
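
As a portable fallback when PCM is not available, a very rough copy-bandwidth figure can be obtained from Python alone. This is a crude sketch, not a substitute for a proper benchmark such as STREAM: it times a single-threaded in-memory copy, and the buffer size and repeat count are arbitrary assumptions.

```python
import time


def copy_bandwidth_gb_per_s(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate memory copy bandwidth in GB/s from a large bytearray copy.

    The buffer should be much larger than the CPU caches so the copy is
    served from main memory. The best (shortest) of several runs is used.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                      # same-length slice assignment, a bulk copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)                # guard against a zero timer reading
    # each copy reads size_bytes and writes size_bytes
    return 2 * size_bytes / best / 1e9
```

Running `copy_bandwidth_gb_per_s()` on both the laptop and the Pi and comparing the ratio with the roughly 3x difference in time per step would give a first hint of whether the solver is memory-bound on the Pi.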