A cluster of 2 is almost 5x slower than an individual node


June 21, 2020, 10:16   #1
Mars409 (New Member)
I put together a cluster of two (2) Raspberry Pi 4 boards and ran MPPICFoam (almost the full solver, save for not writing out the volume fields) across 8 processors, four from each board. The time step is around 5e-6 s. No NFS: each board gets its own root directory, with its own latestTime, system, and constant subdirectories, on its local microSD card.
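For reference, the launch is roughly the sketch below. I assume Open MPI's hostfile mechanism; the hostnames pi1/pi2 are placeholders for my actual boards:
Code:
# hostfile listing both boards, four MPI slots each (hostnames are placeholders)
cat > machines << 'EOF'
pi1 slots=4
pi2 slots=4
EOF

# run the solver on 8 ranks spread across the two boards
mpirun -np 8 --hostfile machines MPPICFoam -parallel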

I then compared this with running on only 4 processors on one board alone.

A lone board took 11 s per time step.

The cluster took 52 s per time step, with no difference between Distributed yes and no. This happens well before the time comes to write data out to disk.

For the cluster, to pin down whether it is the communication link between the boards that bogs down the performance, I estimate below that the amount of data exchanged over the Ethernet per time step costs only about 0.045 s, which is nowhere near the 52 s that the 2-node cluster produced.

Can someone shine a light on this puzzle? Perhaps it's the TCP/IP packet size being too small? How do I control this under Open MPI? (Something like the MCA knobs sketched below?) Or perhaps it's because one of the 4 processors (each board has only 4) has to switch back and forth between MPI and the solver? (I haven't tried 3+3.)
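If it is the packet size, this is the kind of thing I have in mind. The parameter names are standard Open MPI MCA options for the TCP transport, but the values are pure guesses on my part, not recommendations:
Code:
# experiment with the TCP transport's buffers and eager-send threshold
# (values are illustrative guesses)
mpirun -np 8 --hostfile machines \
    --mca btl_tcp_if_include eth0 \
    --mca btl_tcp_sndbuf 4194304 \
    --mca btl_tcp_rcvbuf 4194304 \
    --mca btl_tcp_eager_limit 65536 \
    MPPICFoam -parallel
The nuggets of info behind my 0.045 s estimate: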
  • A Gigabit Ethernet link connects the two boards. At, say, 80% utilization, that yields 800 Mbit/s.
  • A total of nearly 3,000 faces connect the processors on one board to the processors on the other.
  • 9 values (3 of them the components of U) are emitted from each side of each face, at 64 bits/value.
  • U takes 1 iteration to converge, whereas p takes around 10 iterations.
  • Payload per iteration = 3,000 faces x 2 sides/face x 9 values/side x 64 bits/value = 3,456,000 bits ~ 3.6 Mbit
  • Payload per time step = 10 iterations x 3.6 Mbit/iteration ~ 36 Mbit
  • Time taken on the Ethernet = 36 Mbit / 800 Mbit/s ~ 0.045 s, far smaller than the 52 s.
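The same arithmetic as a quick shell check:
Code:
# bits per pressure iteration across the inter-board boundary
echo '3000 * 2 * 9 * 64' | bc        # -> 3456000 bits, ~3.6 Mbit
# bits per time step (10 p iterations)
echo '10 * 3456000' | bc             # -> 34560000 bits, ~36 Mbit
# seconds on the wire at 800 Mbit/s
echo '34560000 / 800000000' | bc -l  # -> 0.0432, i.e. the ~0.045 s above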

June 23, 2020, 21:59   #2
me3840 (Senior Member)
It might be useful to run OpenFOAM's profiling functions, or another system-information-gathering application like sar, or a tool like valgrind.


https://www.openfoam.com/documentati...rofiling.html
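If I remember right, that flavour of OpenFOAM switches its run-time profiling on from system/controlDict. A minimal sketch; treat the entry names as assumptions and check the linked page for your exact version:
Code:
// run-time profiling switch in system/controlDict (OpenFOAM.com line)
profiling
{
    active      true;   // enable the profiling hooks
    sysInfo     true;   // also report basic system information
}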


My initial guess would be some issue with latency on the path from the Pi's SoC > PCIe > Ethernet controller.

June 25, 2020, 13:34   #3
Mars409 (New Member)
Thanks for the tip-off.

I ran it and got no profiling results.

It turns out I need to set the compile option to 'Prof':
Code:
export WM_COMPILE_OPTION=Prof
Re-compiled OpenFOAM yesterday evening.
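(For when results do show up: assuming the Prof option compiles with -pg, as the stock wmake rules suggest, a serial run should leave a gmon.out in the case directory that gprof can digest. A sketch, with the output file name mine:)
Code:
# feed the solver binary and the run's gmon.out to gprof
gprof $(which MPPICFoam) gmon.out > profile.txt
head -n 30 profile.txt   # flat profile: hottest functions first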

Now running on 4 processors on one board first, to see what I will get.

Anyway, even running on just the one board, the four processors together take about 13 seconds per time step of MPPICFoam, whereas running on 4 threads on my 8-year-old Dell Inspiron laptop (Intel x64 arch, 2 cores) takes only 4 seconds. The 3x worse performance of the ARM SoC on the RPi-4 is shocking.

The next thing I need to do is compile OpenFOAM on the Dell laptop with 'Prof' and re-run mpirun with 4 processors to see where the difference lies.

What is sar? A web search only turns up page after page about the virus.

I tried valgrind and got an error message; a search on it brought up a 2016 thread of discussion between the developers (https://bugs.kde.org/show_bug.cgi?id=303877) that seems to indicate they fixed the problem for some specific architectures. Apparently not for armhf.

June 25, 2020, 13:52   #4
me3840 (Senior Member)
You can read more about sar here:
https://linux.die.net/man/1/sar
https://www.geeksforgeeks.org/sar-co...m-performance/
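A few invocations that cover the usual suspects (sar ships with the sysstat package; the trailing number is the sampling interval in seconds):
Code:
# sample once per second while the solver runs
sar -u 1      # CPU utilization, including %iowait
sar -r 1      # memory usage
sar -n DEV 1  # network traffic per interface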


It would be interesting to compute the maximum memory bandwidth available to your old laptop and the Pi, and see how that correlates with the performance difference. You could also measure the memory bandwidth while running the application with PCM:
https://github.com/opcm/pcm
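If it does build, the tool you want from that repo is pcm-memory; a hypothetical invocation (it needs root for access to the CPU counters):
Code:
# sample read/write memory traffic at a 1-second interval
sudo ./pcm-memory 1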

I don't know if it can be built for ARM, though.
