To Jonas: Linux Cluster
I just saw you in that Fluent newsletter in front of your 150-node cluster.
We have the same here, but with only 3x8 nodes. We didn't increase the number of nodes per cluster because we thought that with a standard Ethernet connection the parallel efficiency might drop considerably. What is your experience with speedup versus the number of nodes used for one job?
Have you considered switching to a faster interconnect such as Gigabit Ethernet or Myrinet? Or even tested one?
One year ago, a fast interconnect cost as much as the cluster nodes themselves, so we decided to stick with standard 100 Mbit Ethernet. But times might have changed in the meantime.
Re: To Jonas: Linux Cluster
The speedup is very dependent on problem size. From my experience you get good speedup with Fluent down to about 50,000 cells per CPU. With more CPUs and fewer cells per CPU, communication overhead starts to become a problem. Scaling of course also depends on which models you run - things like sliding meshes, discrete phase etc. can degrade scaling. Very large problems also often scale a bit worse. To summarize: a job with 50,000 cells doesn't parallelize very well, a job with 500,000 cells runs well on up to 10 CPUs, and a job with 5,000,000 cells runs well on up to about 70 CPUs (scaling is often a bit worse for very large problems).
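The rule of thumb above can be turned into a quick estimate. This is only a minimal sketch, assuming the 50,000 cells per CPU figure quoted in the post; the function name `max_useful_cpus` and its parameters are made up for illustration and are not part of Fluent or any other tool.

```python
def max_useful_cpus(total_cells: int, min_cells_per_cpu: int = 50_000) -> int:
    """Rough upper bound on CPUs before communication overhead dominates.

    min_cells_per_cpu is the rule-of-thumb value from the post (assumption:
    it applies to your case; models such as sliding meshes may need more).
    """
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    # Integer division, but never suggest fewer than one CPU.
    return max(1, total_cells // min_cells_per_cpu)


if __name__ == "__main__":
    for cells in (50_000, 500_000, 1_000_000, 5_000_000):
        print(f"{cells:>9,} cells -> up to about {max_useful_cpus(cells)} CPUs")
```

For 500,000 cells this gives 10 CPUs, matching the figure in the post. For 5,000,000 cells it gives 100, while the post reports about 70 in practice, so the estimate is best read as an upper bound for very large cases.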
I think that these numbers are quite typical for most commercial codes that have mature parallelizations. With our in-house code (an explicit structured Runge-Kutta code, which is easy to parallelize) scaling is much better though - a 1 million cell case runs well on 100 CPUs.
Our cluster is 1.5 years old now and has PIII 1 GHz CPUs. With faster CPUs the scaling problem becomes worse - a faster CPU needs more communication bandwidth to keep it busy. We rarely use more than 50 CPUs for one Fluent simulation. A typical Fluent simulation has about 1 million cells and runs on 15-20 CPUs.
The cluster is used by 15 active CFD users at our CFD department, so running a 150 CPU job on your own requires some diplomatic skills ;-) After we switched to Linux clusters we removed all queue-system and load-balancing software - it was too expensive and created a lot of administrative overhead. With the low cost per CPU for Linux clusters it makes much more sense to simply buy more CPUs when needed, instead of forcing people to wait in a queue. Everyone here is very happy to avoid the hassle of a queue system - it has worked great for our department and average CPU usage has been very high (>70%). For inter-departmental clusters things might be different though. To avoid diplomatic problems we have bought separate clusters for each department that uses CFD.
About faster networks - I haven't checked prices lately, but I think that they are still quite expensive. That a faster network would double the cost per CPU sounds reasonable. We haven't tested any faster networks. However, I have looked at a few benchmarks from others. Fluent and HP have tested Myrinet. The results can be found on Fluent's web site and are interesting, although a couple of years old by now. Scali (see the sponsor list) has also benchmarked Fluent on a cluster with SCI interconnect.
My impression from this is that a non-standard faster network can only be justified if you want to run very large cases (say 10 million cells or more) or if you for some reason want to run small cases extremely fast (convergence in minutes) - which could be needed for automatic optimization routines or similar. For normal cases, where you don't have more than a few million cells and are happy to have a converged solution in a few hours or at worst overnight, standard 100 Mbit Fast Ethernet is the way to go, I think. I also like the concept of using standard off-the-shelf components as much as possible - it makes administration and future upgrades much easier.