benchmark results

August 24, 2001, 03:55   #1
benchmark results
stefan (Guest)
These are the results of a benchmark I did with STAR v3.15. I used a simple case with approx. 500000 cells (see below).

Machines:

SGI Origin 2000, 8 * R10000, 195 MHz

Linux PC with P3, 1 GHz

Linux PC cluster, 8 * P4, 1.7 GHz, 100 Mbit network

Results for serial run (times as reported by STAR as "ELAPSED TIME" in the .run file):

R10000: 24473 s

P3: 16638 s

P4: 4841 s

This means that (for this case) the P4 is 5 times faster than the R10000 and 3.4 times faster than the P3!

Results for HPC:

R10000:

CPUs   time (s)   Speedup vs. serial
  2     13926          1.76
  4      6887          3.55
  8      3009          8.13

P4:

CPUs   time (s)   Speedup vs. serial
  2      2504          1.93
  4      1332          3.63
  6      1034          4.68
  8       901          5.37   (optimized METIS decomposition failed, used basic)

For the cluster, the problem seems to be too small to get an adequate speedup with more than 4 CPUs. This should be better for a problem with more cells and equations.
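
For reference, the speedup column above is just the serial elapsed time divided by the parallel elapsed time. A small Python sketch (the times are the ones reported above; the efficiency figure, speedup divided by number of CPUs, is added here for illustration and is not in the original tables):

# Speedup and parallel efficiency from the reported "ELAPSED TIME" values (seconds).
serial = {"R10000": 24473, "P4": 4841}
parallel = {
    "R10000": {2: 13926, 4: 6887, 8: 3009},
    "P4": {2: 2504, 4: 1332, 6: 1034, 8: 901},
}

for machine, runs in parallel.items():
    print(machine)
    for ncpu, t in sorted(runs.items()):
        speedup = serial[machine] / t      # e.g. 4841 / 901 = 5.37 for the P4 on 8 CPUs
        efficiency = speedup / ncpu        # fraction of ideal linear scaling
        print(f"  {ncpu} CPUs: speedup {speedup:.2f}, efficiency {efficiency:.0%}")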

As it would be interesting to compare these results with other machines, you are invited to run the benchmark yourself and post the results here.

The case I used can easily be set up using the following commands:

*set ncel 32
*set nvrt ncel + 1
vc3d 0 1 ncel 0 1 ncel 0 1 ncel
*set coff ncel * ncel * ncel
*set voff nvrt * nvrt * nvrt
cset news cran 0 * coff + 1 1 * coff$cgen 3 voff cset,,,vgen 1 1 0 0
cset news cran 2 * coff + 1 3 * coff$cgen 3 voff cset,,,vgen 1 0 0 1
cset news cran 4 * coff + 1 5 * coff$cgen 3 voff cset,,,vgen 1 0 1 0
cset news cran 6 * coff + 1 7 * coff$cgen 3 voff cset,,,vgen 1 1 0 0
cset news cran 8 * coff + 1 9 * coff$cgen 3 voff cset,,,vgen 1 0 0 -1
cset news cran 10 * coff + 1 11 * coff$cgen 3 voff cset,,,vgen 1 0 -1 0
cset news cran 12 * coff + 1 13 * coff$cgen 3 voff cset,,,vgen 1 1 0 0
vmer all
n
vcom all
y
cset all
axis z
view -1 -1 1
plty ehid
cplo
live surf
cset news gran,,0.01
cplo
bzon 1 all
cset news gran 6.99
cplo
bzon 2 all
cset news shell
cdele cset
ccom all
y
dens const 1000
lvis const 1e-3
rdef 1 inlet
0.025
rdef 2 outlet
split 1
iter 10000
geom,,1e-2
prob,,

The run converges in 383 (or 384) iterations (using single precision).
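
As a rough check on the quoted cell count (this relies on an assumption of mine: each of the seven "cgen 3" lines adds two translated copies of the selected block, giving 15 blocks of ncel^3 cells in total):

ncel = 32                        # cells per edge of one block ("*set ncel 32" above)
cells_per_block = ncel ** 3      # 32768 cells in the initial vc3d block
blocks = 1 + 7 * 2               # assumed: 1 original block + 2 copies per "cgen 3" line
print(blocks * cells_per_block)  # 491520, i.e. the "approx. 500000 cells" quoted above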

(I had some difficulties posting this message in a readable form. The tables and commands were concatenated into single strings, and blanks and tabs were deleted, so I had to add the "-"s and the blank lines. How can this be avoided?)

August 24, 2001, 09:33   #2
Re: benchmark results
Jonas Larsson (Guest)
Very interesting results, thanks for sharing them with us!

About how to post preformatted text: take a look at question 3 here.


August 31, 2001, 07:25   #3
Re: benchmark results
Stephan Meyer (Guest)
Here are some results obtained on an IBM RS/6000 SP-SMP with POWER3-II 375 MHz CPUs. All computations have been done on 8-way Nighthawk II nodes.

Serial run: 9400 s elapsed time, 8753 s CPU time

Values and comparisons below are elapsed times.

CPUs   time (s)   Speedup vs. serial
  2      4190          2.24
  4      1924          4.88
  8       930         10.11

September 10, 2001, 10:48   #4
Re: benchmark results
steve (Guest)
Your results are quite interesting and I am glad you are sharing them. I think you drew one possibly invalid conclusion concerning scalability, although, as you say, running a much larger job might help. The P4 cluster uses 100 Mbit Ethernet as its message-passing medium, and that is simply not good enough for 8 machines of that speed. When running in parallel, everything has to be balanced or the slowest component becomes the choke point; in this case it is the Ethernet. There are two relatively cheap things you can do to improve performance (assuming you have not done them already):

1) Put a second Ethernet card in each machine and dedicate it to MPI, while the first handles general network traffic (NFS, FTP, X, ...); see the sketch below.

2) Connect the cards to an Ethernet switch rather than a hub.

If this does not help, it is likely that you need a better (and far more expensive) message-passing medium to get good scaling.
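
To illustrate suggestion 1, a generic sketch (the hostnames and addresses are invented, and the exact hosts/machines file your MPI or STAR-HPC installation reads may differ): give each node a second address on a private subnet served by the dedicated card, and use those names wherever the parallel run expects its node list.

# /etc/hosts on every node: eth0 carries general traffic, eth1 is reserved for MPI
192.168.0.1   node1        # eth0: NFS, FTP, X, ...
192.168.0.2   node2
10.0.0.1      node1-mpi    # eth1: message passing only
10.0.0.2      node2-mpi

# node list for the parallel run: use the "-mpi" names so MPI traffic stays on eth1
node1-mpi
node2-mpi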

Steve
