CFD Online Discussion Forums

CFD Online Discussion Forums (https://www.cfd-online.com/Forums/)
-   Main CFD Forum (https://www.cfd-online.com/Forums/main/)
-   -   Same hardware question =) (https://www.cfd-online.com/Forums/main/8324-same-hardware-question.html)

Mikael Ersson November 9, 2004 03:15

Same hardware question =)
 
Hello,

AMD vs Intel

I'm aware that this subject has been debated over and over again, but recent events in hardware industry have formed some new questions.

I'm thinking about clustering about 8 computers and am at a loss as to what type to choose. Several benchmarks clearly state that the P4 has the advantage over the AMD 64 when it comes to floating-point calculations. I think an equal number of benchmarks on "real" apps contradict this.

So my question is whether any of you have experience with either Prescott- or AMD 64-based systems. CFD benchmarks or the like.

Best Regards

Micke

Oh, I will be using Gbit LAN only.

Two examples of benchmarks (one pro-Intel and the other pro-AMD):

http://www.hardwarezone.com/articles...?cid=2&id=1262

http://www.aceshardware.com/SPECmine...0&o=1&start=20

ag November 9, 2004 10:10

Re: Same hardware question =)
 
The K8 series processor (like the FX-55) has a superior floating-point unit, which is why Intel needs a significantly higher clock speed to be competitive in SPEC. The other thing to bear in mind is that SPEC results can depend on how the benchmark is compiled. While the Pentium FP unit is not as efficient, the use of the SSE2 and SSE3 extensions can provide a performance boost. I have used both types of processors and seen good performance from both.

You may want to consider some other factors, such as power consumption/cooling (AMD has a small advantage here), hardware compatibility (small Intel advantage here), memory access/latency (AMD advantage here), I/O speed (possible Intel advantage here), total cost of ownership, etc. It's not just who has the fastest SPEC score. What will your code be doing? Do you do a lot of intermediate I/O, can you make use of a large 64-bit addressable memory space in lieu of disk I/O, can your code easily sustain optimizations that favor one processor over another?

And as one of my colleagues is fond of saying: using cache memory effectively can have a tremendous impact on execution speed.
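ag's closing point about cache memory can be made concrete with a small sketch (my own example, not from the thread): two loops compute the same sum over a row-major 2D array, but the row-order walk touches memory sequentially and reuses every fetched cache line, while the column-order walk strides a full row's width per access and wastes most of each line. In a compiled CFD code the second pattern is typically several times slower.

```python
# Illustrative only: the access patterns behind cache-friendly code.
N = 1024
a = [[1.0] * N for _ in range(N)]  # stored row by row ("row-major")

def sum_row_order(m):
    """Inner loop over columns: consecutive elements of one row,
    so each cache line fetched from main memory is fully used."""
    s = 0.0
    for row in m:
        for x in row:
            s += x
    return s

def sum_col_order(m):
    """Inner loop over rows: every access jumps a full row's width,
    wasting most of each fetched cache line (slow in compiled code)."""
    s = 0.0
    for j in range(len(m[0])):
        for i in range(len(m)):
            s += m[i][j]
    return s

# Both orders give the same answer; only the memory traffic differs.
assert sum_row_order(a) == sum_col_order(a) == float(N * N)
```

Python's interpreter overhead hides the timing difference at this scale; the point is the access pattern, which dominates in C or Fortran.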

Mikael Ersson November 10, 2004 03:39

Re: Same hardware question =)
 
Hey,

Well, the cluster will not be tied to any specific code; it will be shared between a number of users. Phoenics, Fluent, Flow-3D, ThermoCalc and FOAM are some of the commercial codes that will most likely be used. Also, Linux will be the main OS.

About I/O, I was under the assumption that AMD had the distinct advantage. Seems that, as always, it's in the eye of the beholder. =)

Do you have any comments on chipsets? (For instance, I understand that the K8T800 has some problems with its onboard network adapter. And in the case of Intel chipsets, the 1066 FSB chipset will soon be available for the regular Prescott.)

Best Regards

Micke

ag November 10, 2004 08:57

Re: Same hardware question =)
 
With I/O, a lot of the speed is going to be determined by the chipset, not the CPU, and of course that's where Intel has an advantage. That's what I have seen with regard to "consumer-type" chipsets such as the i875 and K8T800.

If Linux is your choice of OS, I would think that would tend to favor AMD slightly, if only for the more polished 64-bit implementation and the fact that 64-bit Linux is available. As far as chipsets go, I haven't heard anything about the onboard adapter of the K8T800 (that doesn't mean anything though; it could be I just haven't been paying attention), but with good discrete network cards available cheap (and ones that function better than onboard), I would use the discrete ones anyway - they tend to utilize the CPU less.

The 1066 FSB chipset from Intel is supposed to be out soon, but I have read some internet rumors indicating possible glitches with the first batch. I would avoid the first generation of most hardware like the plague anyway, just to let the bugs get worked out (AMD or Intel). In the end, I think if you focus on building a good all-around system (good disks, good chipset, plenty of RAM, good network, and good CPUs) you'll be satisfied either way you go.

Mikael Ersson November 10, 2004 10:50

Re: Same hardware question =)
 
Thanx for your input ag.

So this Friday we'll have a test system based on AMD delivered to us for benchmarking. If it gives us no trouble, I think we'll opt for it. Now it's just a matter of choosing a low-latency switch ;)

Best Regards

Micke

Zeng November 11, 2004 03:32

Re: Same hardware question =)
 
The performance depends on many factors: processor frequency, architecture, memory subsystem (e.g. bandwidth, latency), I/O subsystem, etc. There is no general answer to the question of which one (P4, AMD64 (e.g. Opteron), Xeon, Itanium2) is the best choice. However, for a specific application, there is a best choice. For CPU-bound computation, the P4/Xeon has better performance than the Opteron due to its higher clock frequency. However, memory bandwidth is the bottleneck for many large-scale scientific computations; in other words, the performance is decided by the sustainable memory bandwidth rather than the processor frequency. For memory-bound computation, the Opteron has better performance because its DDR memory controller is integrated directly into the CPU. Compared with the Xeon and Itanium2, it has:

(a) larger bandwidth from memory to the processor core;

(b) reduced access latency;

(c) bandwidth that scales with the number and speed of the processors.

In fact, memory bandwidth is a serious problem for the Xeon.
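Zeng's memory-bandwidth argument can be put in roofline form: a kernel's attainable rate is capped by whichever is smaller, the CPU's peak arithmetic rate or the memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). The peak-GFLOPS and bandwidth figures below are my own rough, era-typical assumptions, not measurements.

```python
# Roofline-style sketch (illustrative numbers, not measured).

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable rate = min(peak arithmetic rate,
    memory bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# STREAM-triad-like kernel a[i] = b[i] + s*c[i]:
# 2 flops per 24 bytes moved (three 8-byte doubles).
triad_intensity = 2 / 24

# Hypothetical chips: a high-clock CPU on a shared front-side bus vs a
# slower-clocked CPU with an on-die memory controller.
p4_like      = attainable_gflops(peak_gflops=6.0, bandwidth_gbs=4.3,
                                 flops_per_byte=triad_intensity)
opteron_like = attainable_gflops(peak_gflops=4.0, bandwidth_gbs=6.4,
                                 flops_per_byte=triad_intensity)

# The memory-bound kernel favours the chip with more bandwidth,
# regardless of its lower clock speed and peak flop rate.
assert opteron_like > p4_like
```

For a low-intensity kernel like this, both chips sit far below their arithmetic peaks; only the bandwidth term matters.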

If you want a cheap option, a P4 cluster is a good choice, and its performance is also good. If you want to use larger memory and OpenMP parallelization (automatic parallelization is possible with the Intel and PGI compilers), I think the Opteron has a better performance/price ratio than the Xeon and Itanium2.

You can refer to benchmarks such as SPEC; however, for a specific application, it is difficult to say which one will give you the best performance without testing in advance. Based on my experience, the AMD Opteron has better performance than the Xeon and Itanium2 in the CFD application field (my experience is limited to PHOENICS and Fluent), and you can refer to the official FLUENT benchmark results: http://www.fluent.com/software/fluen...ch/fullres.htm

By the way, if you want to make a comparison among the Opteron, P4, and Itanium2 (in the near future) for your code, you can contact me.

Zeng


andy November 12, 2004 09:33

Re: Same hardware question =)
 
In order to answer your question you must compare computer systems solving large numerical problems typical of CFD. Consumer-oriented benchmarks in magazines can be a very unreliable guide to relative performance on CFD problems. Perhaps the key thing to recognize is that CFD problems are usually large, with relatively little useful data in memory caches. This is not the case for many game/database/word-processor/etc. test cases. This puts a much greater emphasis on main memory performance rather than in-cache floating-point performance.

I bought a small 20 processor cluster about a year ago and found it difficult to extract meaningful numbers for CFD purposes on G5s, P4s, AMDs and their chipsets and so obtained/performed a few benchmarks. By far the most useful benchmark was the NASA NAS Parallel Benchmark (NPB) suite of typical CFD codes and kernels. Results were highly correlated with memory performance and not much with clock speed, peak in-cache vector performance, etc...

We bought the cheapest P4s (i.e. slowest clockspeed) that had the highest memory performance. We also went for the highest performance chipset (both memory and gigabit ethernet reasons) in the simplest motherboard.

A colleague who bought a smaller AMD-based cluster around the same time (motivated, I believe, by the AMDs being cheaper and having higher in-cache floating-point performance) has recently produced a report comparing their cluster against one or two others, including the one here. Over a range of CFD tests the P4 system was between 2.5 and 6.5 times faster (with most results closer to the 6.5 end). I would estimate the P4 node price was about 25%-50% more than the AMD.
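A rough performance-per-price check on andy's figures (the numbers are from the post; the arithmetic is mine): a 2.5-6.5x speedup at 25%-50% higher node price still leaves the P4 cluster ahead on throughput per unit cost, even under the least favourable pairing.

```python
# Back-of-envelope performance/price from the reported ranges.

def perf_per_price(speedup, price_factor):
    """Relative CFD throughput per unit cost vs. the AMD baseline."""
    return speedup / price_factor

worst = perf_per_price(2.5, 1.50)   # smallest speedup, priciest nodes
best  = perf_per_price(6.5, 1.25)   # largest speedup, cheapest nodes

# Even the worst-case combination beats the AMD baseline (ratio > 1).
assert worst > 1.0 and best > 5.0
```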

Mikael Ersson November 12, 2004 10:02

Re: Same hardware question =)
 
Ok andy. Tnx.

I agree that most benchmarks are quite useless. I have tried to look at benchmarks covering live cases in commercial codes such as Fluent or Star-CD. Some tests, such as Super PI (a standard on most good online tech sites), are also useful.

A comment, though, on your colleague and the benchmark: I think that today's AMD is light years from what it was a year ago. A year ago I would have bought a cheap i875P chipset and a P4C Northwood without hesitation (I still have one at home that outperforms most similar systems). Today I'm not so sure (obviously). The P4 price per node is lower today (unless you go for the top model), but I'm not sure the extra 1 or 2 nodes would be beneficial, especially considering that only Gbit LAN will be used.

Regards

andy November 12, 2004 10:55

Re: Same hardware question =)
 
Perhaps I should have mentioned that CFD performance for minimum price was the only consideration when purchasing the cluster. It cost about 14.5k Euro including switch, cables and racking but excluding air conditioning. I guess it would be a bit cheaper today. For real parallel Fortran CFD codes we get about 10 GFlops performance. This could be improved with effort (Linpack numbers are obviously much larger) but it is unlikely to happen.

We use a gigabit LAN with a cheap non-blocking switch. Finding latency figures for small packets under full load for the switches was very difficult and, remarkably, some of the popular expensive switches turned out to be blocking. There are substantial problems with how the Ethernet packets are handled by the OS kernel and, annoyingly, this is fixable in software. However, despite a few academic projects bound to specific hardware, there is little sign it will get addressed before another commodity interconnect comes along. The hardware should be able to deliver 10us latencies with minimal processor load, but the best we get is about 30us with some processor load. Pretty much all of this could be fixed by changing the way the kernel handles the Ethernet packets.
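andy's latency numbers can be put in a simple transfer-time model (the 1 KB message size is my own illustrative choice): the time to move a message is roughly a fixed per-packet latency plus size over bandwidth, so for small packets the 30 us software path dominates the total.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbps):
    """Latency + size/bandwidth model; bandwidth in MB/s, so
    nbytes / bandwidth_mbps conveniently comes out in microseconds."""
    return latency_us + nbytes / bandwidth_mbps

GBE_MBPS = 125.0   # ~1 Gbit/s payload rate, ignoring protocol overhead

# A small 1 KB message, such as a halo exchange between subdomains:
t_kernel = transfer_time_us(1024, latency_us=30.0, bandwidth_mbps=GBE_MBPS)
t_ideal  = transfer_time_us(1024, latency_us=10.0, bandwidth_mbps=GBE_MBPS)

# With the ~30 us kernel path, latency is ~79% of the total transfer
# time; the 10 us the hardware could deliver would nearly halve it.
assert t_kernel / t_ideal > 2.0
```

The wire time for 1 KB at gigabit rates is only about 8 us, which is why shaving software latency matters more than raw bandwidth for small messages.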

However, if you are interested in the performance/price ratio, all the proprietary interconnects with better performance are far too expensive unless you need >100 processors, at which point they become essential. Testing has indicated we could go comfortably to 64 nodes without interconnect speeds killing overall CFD performance. Beyond this, however, switches start getting expensive and the latency starts to really hurt overall performance.

Adding 4 nodes to our 20 would increase overall performance by close to 24/20 (we have a 24 port switch) and it would help the situation of multiple small jobs. But, going to 25-30 nodes would not make sense because of the need to purchase a new switch. A few extra nodes on a heavily used machine can be more beneficial than fewer quicker nodes.
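The 24/20 arithmetic above can be sketched directly (the 24-port count and near-linear scaling are from the post; the function itself is mine):

```python
def cluster_throughput(nodes, per_node=1.0, max_ports=24):
    """Aggregate throughput assuming near-linear scaling up to the
    switch's port count; beyond that a new switch must be bought."""
    if nodes > max_ports:
        raise ValueError("more nodes than switch ports: new switch needed")
    return nodes * per_node

# Filling the remaining 4 ports of the 24-port switch:
gain = cluster_throughput(24) / cluster_throughput(20)
assert abs(gain - 1.2) < 1e-12   # ~20% more aggregate throughput
```

Going to 25-30 nodes trips the port-count check, which is exactly why the marginal 25th node costs a whole new switch rather than one node's price.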

Mikael Ersson November 12, 2004 11:13

Re: Same hardware question =)
 
"Adding 4 nodes to our 20 would increase overall performance by close to 24/20 (we have a 24 port switch) and it would help the situation of multiple small jobs. But, going to 25-30 nodes would not make sense because of the need to purchase a new switch. A few extra nodes on a heavily used machine can be more beneficial than fewer quicker nodes. "

I agree that this might happen, but then again it all depends on how large the actual simulation is. I have some simple examples in the following thread:

http://www.fluent.com/software/fluen...ench/intro.htm

Although Fluent will be only one of many programs running on the cluster, the benchmarks are quite interesting. =)

Regards

