Linux Cluster Computing
I am interested in your experiences with Linux-based cluster computing for CFD. Specifically:
1) Is it more cost effective (both long and short term) to purchase a pre-assembled rack or build your own from a series of PCs? From the little bit of looking around I've done, it looks like the initial purchase price is lower for the second approach. This is not what I am interested in. I am asking about total cost of ownership (ease of O.S. and hardware maintenance, ease of use, etc).
2) Who are the vendors for pre-assembled systems?
3) Is it better to have the nodes be "diskless" or near diskless with the disk system being centralized or to have a distributed disk system?
4) What are the pros/cons of dual versus single processors per node?
5) Pros/cons of Intel vs. AMD processors? (I'm not really interested in Alphas)
Thanks for the input.
Re: Linux Cluster Computing
I can't answer all your questions, but for pre-installed Linux systems, you might want to check http://www.linux.org/vendors/systems.html.
As for the distributed disk system, it depends on the physical memory and the problem size. I would suggest a distributed disk system, though it might cost more at the beginning.
Dual or more processors will certainly run faster. But at the same time, with parallel processors, the commercial codes are designed so that one job will take two licenses if you run it on dual...
Re: Linux Cluster Computing
2) You can use any PCs you can get. Installing Linux shouldn't be a problem for any computer support staff.
3) It depends. If you really want to save some money, go for a diskless system. But nowadays even a 27 GB, 7200 rpm hard disk costs only about $150, so why bother?
4) For large-scale parallel computing, dual processors really don't help much, unless you want to do parallel computing on a single PC.
5) No difference between them. They all work great with Linux. Just go for the best price/performance. Right now AMD's CPUs are the best choice.
Re: Linux Cluster Computing
What is most cost effective probably depends on your own background. If you have never used Linux and haven't done any system administration on UNIX, you might want to consider buying a pre-installed rack-based cluster with support included. On the other hand, if you have a fair amount of UNIX experience and like the thought of setting things up yourself, you should definitely buy separate machines and install the cluster yourself. If you haven't installed Linux before, you could first try installing it on a single PC (assuming that you have an old PC that you can try it on) and see if you can get it up and running on your network without any problems. This should give you a clear indication of which way to go. Another factor might be space - have you got a large server room where you can put all your stand-alone nodes if you buy separate machines?
By buying a pre-installed supported system you mainly avoid start-up problems. Once you've got the system up and running it will probably run very well for a long time... the stability of Linux is very impressive. We've had a large cluster of Dell OptiPlex machines running for almost 2 years now (self installed) and stability has been better than we ever expected. All nodes stay up for month after month... and the cluster is very heavily loaded with large CFD jobs 24/7.
There are hundreds of vendors that offer pre-installed clusters. You can buy them from well-known companies like SGI and HP or from Linux giants like VA Linux (www.valinux.com), Penguin Computing (www.penguincomputing.com) and Aspen Systems (www.aspsys.com). You can probably also find a lot of smaller companies that will pre-install everything for you and perhaps charge a bit less than the well-known brands.
You asked about diskless nodes. I would say that it isn't worth the extra trouble to set up diskless nodes. Disks are very cheap and also quite reliable. The advantage of diskless nodes is that there is one less component that can cause trouble. The disadvantage is that they are more difficult to set up, since this isn't the normal Linux setup. You also will not have any fast local disk to swap on - this might be important sometimes. In any case you should probably avoid storing any data on local disks and only have the operating system and swap area there. We have small local disks (6 GB) on each node in our cluster. Aside from one disk that was bad on delivery we haven't had any trouble. It is nice to have "independent nodes" - you can take one node and move it wherever you want and it will still boot and run.
About dual/single processor machines. I think that dual machines with enough memory (at least 1 GB) cost more than two times what you would pay for two equivalent single-CPU machines. Dual-CPU machines might reduce the network load a bit though. However, you will not have "full symmetry" of everything - a cause of trouble. With the first Fluent release dual-CPU machines were actually slower than single-CPU machines due to a communication problem in the Linux distribution. I think that this has been fixed now though. We have single-CPU machines in our cluster and we will buy the same again.
The AMD Athlon is between 10 and 20% faster than a PIII with the same clock frequency on floating-point intensive things like CFD. Hence, AMD is probably the most cost-effective alternative. You can get the 900 MHz Athlons very cheap - I recently looked at the node price of an Athlon-based 950 MHz cluster with 512 MB RAM and Asus A7V motherboards (good) and the price per node was about $1600. The faster Athlons generate a lot of heat though, so I'd be a bit concerned about that - make sure your vendor uses a good CPU cooling arrangement. Make sure you get the VIA chipset and the latest socket-based Athlon Thunderbirds. Also make sure you get ECC memory.
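To put a rough number on the price/performance argument, here is a minimal Python sketch. It takes the $1600 node price and the 10 to 20% speed advantage quoted above and asks what a same-clock PIII node would have to cost to match the Athlon node's price/performance. The function name is made up for illustration, and the calculation assumes that whole-node CFD throughput scales directly with CPU floating-point speed, which ignores memory and network effects.

```python
def break_even_price(athlon_node_price, athlon_speedup):
    """Price at which a same-clock PIII node matches the Athlon node's
    price/performance, given the Athlon's relative speed (e.g. 1.10)."""
    if athlon_speedup <= 0:
        raise ValueError("athlon_speedup must be positive")
    # Equal price/performance: piii_price / 1.0 == athlon_price / speedup
    return athlon_node_price / athlon_speedup


ATHLON_NODE_PRICE = 1600.0  # price per node quoted in the post

for speedup in (1.10, 1.20):  # "between 10 and 20% faster"
    price = break_even_price(ATHLON_NODE_PRICE, speedup)
    print("Athlon %.0f%% faster: PIII node must cost under $%.0f"
          % ((speedup - 1.0) * 100.0, price))
```

Under these assumptions, a comparable PIII node is only competitive if it costs noticeably less than the Athlon node.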
Buying cheap PIII nodes these days is a bit more complicated. The new low-cost 815-based motherboard from Intel doesn't support ECC, I think - not good. If you buy the more expensive RDRAM-based motherboards, memory costs increase dramatically. Your best alternative might still be to buy BX-based motherboards. I don't like the thought of buying new machines with a motherboard that is based on a chipset that is a couple of years old though.
Re: Linux Cluster Computing
Thanks for the response. I do have a great deal of experience as a UNIX user but only a smattering of experience in UNIX (or specifically Linux) system administration. What kind of time commitment are you talking about for learning what I need to install Linux and build an effective cluster (including some kind of cluster monitoring software and queueing system)? To be honest, I am more interested in getting CFD results than playing with the computer.
My question about the diskless node arrangement was mainly from the standpoint of system maintenance. Some of the stuff I've read suggests that with this arrangement you can have a single OS image on a centralized disk and have this automatically copied to the nodes at boot time. This seems like it would minimize OS maintenance. I clearly understand the advantages of having some disk on each machine for /tmp and swap space. So when I say near diskless, this is what I have in mind - small disks on each machine for /tmp and swap and possibly an OS image, but all other disk space centralized. What do you know about this type of arrangement?
You mentioned also that you use centralized disk storage. How does this work? Are you using a RAID system? Is the disk system mounted on each node via NFS or something more exotic? Are there any performance issues associated with this arrangement?
It appears you also think that ECC memory is the way to go. Do you have some negative experience with non-ECC memory that has formed this opinion? On single machines we have typically used non-ECC memory with good success for many years.
Thanks again for the information and I look forward to your responses.
Re: Linux Cluster Computing
You asked about time. I would estimate that it took us about one full week of man-time to get the cluster up and running. This includes learning Linux administration, installing the operating system on all nodes, setting up compilers and disks, installing cluster monitoring software and writing a few admin scripts. After that we haven't done anything for almost 2 years. We had a basic knowledge of UNIX sysadmin when we started. Add a few extra days, or perhaps a week, if you need to learn about NFS, IP numbers etc.
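To give an idea of how small such admin scripts can be, here is a minimal Python sketch of a "which nodes are down?" check. It is not the script used on the cluster described above. It assumes each node accepts TCP connections on a known port (port 22 is only a placeholder default), and the function names are made up for illustration.

```python
import socket


def node_is_up(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def down_nodes(hosts, probe=node_is_up):
    """Return the hosts for which the probe reports failure, in input order."""
    return [host for host in hosts if not probe(host)]
```

Calling down_nodes() with a list of node names (for example hypothetical names like node01, node02, ...) from a cron job on a mother node would give a simple daily health report. The probe argument makes it easy to swap in a different check, such as a ping.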
Having a central operating system image sounds nice in theory, but I doubt that you really gain that much from it unless you intend to do a lot of tweaking and upgrading of your operating system. I'd guess that once you have the operating system installed and running you won't touch it again. We played around a bit with installing Linux over the network but eventually found it easier to just use CDs. With RedHat you can create a "kick-start" floppy which will automate the installation - you just put in the floppy and the CD, turn on the machine and go for a 20 minute coffee break while the OS installs.
About disks - we have two "mother nodes" with a bit of extra memory (1 GB each) and disks. We bought RAID cards for these mother nodes but we couldn't get them to run in Linux (we should have checked this in advance, but our IT dept. ordered the stuff before we had a chance to check Linux compatibility), hence we don't use RAID now. In our next cluster we will certainly get RAID that works in Linux. The disks on the mother nodes are NFS mounted on all cluster nodes and we keep all applications and data on these mother-node disks. We also only have compilers etc. installed on the mother nodes. The cpu-nodes only have plain RedHat installed. We are happy with this setup and will use it again in our next cluster. I don't see any important performance bottlenecks with this kind of mother-node disk setup. Starting an application on many nodes might be a bit faster if you have the application on local disks on each node, but the startup time is not a problem with either our in-house codes or Fluent. Reading in cases/data can be slow though, but there is no way to distribute this storage unless you want to split your case/data files - not very practical.
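With everything living on NFS-mounted mother-node disks, one check worth scripting is whether those disks are actually mounted on a given cpu-node. A minimal Python sketch follows, assuming the standard Linux /proc/mounts layout (device, mount point, filesystem type, options, ...). The host name mother1 and the export paths are hypothetical, not taken from the cluster described above.

```python
def nfs_mounts(mounts_text):
    """Return (source, mount_point) pairs for NFS entries in
    /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each entry has at least: device, mount point, filesystem type.
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append((fields[0], fields[1]))
    return found


# Hypothetical sample; on a real node, read open("/proc/mounts").read().
SAMPLE = """\
/dev/hda1 / ext2 rw 0 0
mother1:/export/apps /apps nfs rw,hard,intr 0 0
mother1:/export/data /data nfs rw,hard,intr 0 0
"""

for source, mount_point in nfs_mounts(SAMPLE):
    print(mount_point, "<-", source)
```

A job start-up script could refuse to launch on a node where the expected mount points are missing from this list.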
My comments concerning ECC really aren't based on problems with non-ECC machines. However, memory-related problems can be very nasty, intermittent and tricky to debug. If you can reduce the risk of these, that is good. ECC usually doesn't cost that much extra either. There must be a reason why all workstations/servers on the market use ECC - if you want to have your machines up and running 24/7 you don't want a memory glitch taking them down every week. On a day-on/night-off system this isn't a big issue. Our cluster, which has ECC memory, has a very impressive uptime history. In almost two years we have only had two single-node lock-ups! This is actually better than all our other HP and SGI workstations and servers!