Apple vs. Intel supercomputer benches! 2304 x Xeon 2.4 on Linux vs. 2112 x G5 2.0 on Mac OS X. Guess who wins? ;)

Eug

Lifer
Mar 11, 2000
This year's supercomputer numbers are out. 2304 Xeons vs. 2112 G5s running the Linpack bench. Guess who wins? The Xeons. ;)

.ps file
Screengrab

Japan's 5120-CPU NEC wins overall of course, by a humungous margin, but what do you expect for a $350 million supercomputer? :p However, I was most interested in the Linux NetworX Xeon 2.4 (#3 in the world last year) vs. the Apple/Virginia Tech G5 2.0 (new entry this year) scores, since both use more or less off-the-shelf components.

The Linux setup gets 7634 Gflops, with 2304 Xeon 2.4 GHz processors.
The Mac OS X setup gets 7417 Gflops, with 2112 G5 2.0 GHz processors.

So the Xeon setup edges out the G5 setup this year for overall score. I wonder why VT couldn't use all 2200 processors they bought though, since that would have put them ahead (7726 Gflops) - better bragging rights. Are the extra 88 processors (44 dual G5 Power Macs) simply backups?

BTW, if you extrapolate the Xeon scores, a G5 2.0 is equivalent to a Xeon 2.54, assuming linear scaling of the Xeon's performance per GHz. (EDITED to correct my calculations.)

EDIT:

See later in thread. Numbers updated. G5 system pulls ahead.
 

Duvie

Elite Member
Feb 5, 2001
What do those computers cost, each?

Edit: So in this one test the G5 2.0 GHz is equal to a Xeon 3.05 GHz??? Why is it, then, that the dual G5 2.0 GHz barely beats or loses to a single 3.2C in the tests I have seen? If that were really the case in most apps, we could all afford Apple computers, because we would only need one processor, right?? It must be that the AltiVec code is optimized for it and does better per chip in multi-CPU configurations. I can believe that!!!
 

Eug

Lifer
Mar 11, 2000
Originally posted by: Duvie
What do those computers cost, each?
The Apple one is $5.2 million.

I dunno what the Xeon one cost, but it uses Supermicro E7500-based dual-Xeon server motherboards. See here. It would be cheap, but it's not clear whether the Apple one is cheaper or not.

Edit: So in this one test the G5 2.0 GHz is equal to a Xeon 3.05 GHz??? Why is it, then, that the dual G5 2.0 GHz barely beats or loses to a single 3.2C in the tests I have seen? If that were really the case in most apps, we could all afford Apple computers, because we would only need one processor, right?? It must be that the AltiVec code is optimized for it and does better per chip in multi-CPU configurations. I can believe that!!!
No, I made a mistake. It's more like a single 2.54 GHz Xeon. The reason you're seeing some dual G5 tests losing to a single 3.2C P4 is that performance varies from test to test.

For instance, in gaming, a single 3.2 P4 beats a dual G5 2.0.
In Linpack, a single 2.54 GHz Xeon roughly equals a single G5 2.0.
In Photoshop, a dual G5 2.0 and a dual Xeon 3.06 trade wins, depending on the test method.
In Renderman, the same two trade wins, depending on the test done.
In Cinema 4D, a dual G5 scores the same as a dual Opteron 2.0, but a dual Xeon 3.06 beats them both.

As for AltiVec, Linpack can't use it, so that's not the reason; Linpack is, I believe, double-precision floating point, and AltiVec only handles single precision. (To anyone, feel free to correct me if I'm wrong, since I don't really know anything about Linpack to be honest.)
 

RaynorWolfcastle

Diamond Member
Feb 8, 2001
Originally posted by: Eug

BTW, if you extrapolate the Xeon scores, a G5 2.0 is equivalent to a Xeon 3.05, assuming linear scaling of the Xeon's performance per GHz.

How do you figure?

7634 Gflops/2304 Xeons = 3.313 Gflops/Xeon
7417 Gflops/2112 G5s = 3.512 Gflops/G5


3.512/3.313 = 1.060 [2.4GHz Xeons]/G5

Assuming linear scaling with clockspeed 2.4 GHz * 1.060 = 2.544 GHz

So we have that for this test a 2.0 G5 is equal to a 2.55 GHz P4... Let me know how you came up with your number
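
In case anyone wants to double-check, here's the same back-of-envelope arithmetic as a quick Python sketch (the Rmax scores and CPU counts come straight from the list; linear clock scaling is the same assumption as above):

```python
# Average delivered Gflops per processor, from the Top500 Rmax scores.
def per_cpu(rmax_gflops, n_cpus):
    return rmax_gflops / n_cpus

xeon = per_cpu(7634, 2304)  # ~3.313 Gflops per 2.4 GHz Xeon
g5 = per_cpu(7417, 2112)    # ~3.512 Gflops per 2.0 GHz G5

# Clock a Xeon would need to match one G5, assuming linear scaling:
print(f"G5 2.0 ~ Xeon {2.4 * g5 / xeon:.2f} GHz")  # -> Xeon 2.54 GHz
```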


 

Eug

Lifer
Mar 11, 2000
Originally posted by: RaynorWolfcastle
Originally posted by: Eug

BTW, if you extrapolate the Xeon scores, a G5 2.0 is equivalent to a Xeon 3.05, assuming linear scaling of the Xeon's performance per GHz.
How do you figure?

7634 Gflops/2304 Xeons = 3.313 Gflops/Xeon
7417 Gflops/2112 G5s = 3.512 Gflops/G5

3.512/3.313 = 1.060 [2.4GHz Xeons]/G5

Assuming linear scaling with clockspeed 2.4 GHz * 1.060 = 2.544 GHz

So we have that for this test a 2.0 G5 is equal to a 2.55 GHz P4... Let me know how you came up with your number
You are right, and I am an idiot. In this test, 1 G5 2.0 = 1 Xeon 2.54. I will correct my original post.

The 2200 G5s = 7726 Gflops calculation still applies though, and that would edge out the NetworX setup. I wonder why VT couldn't use all the 2200 they supposedly bought.
 

zephyrprime

Diamond Member
Feb 18, 2001
Japan's 5120-CPU NEC wins overall of course, by a humungous margin, but what do you expect for a $350 million supercomputer?
Yikes! That's $68 grand per CPU!
 

HokieESM

Senior member
Jun 10, 2002
Eug... just a thought. One of the problems with clusters is that you lose efficiency as you add nodes. I think VT has probably looked at how many nodes they could bring online while maintaining maximum efficiency.

Also, from a few friends, I have heard (keep in mind this is a rumor... but one from a good source) that stability is an issue when bringing up 1100 nodes (of brand-new computers on a relatively new OS) on a new, innovative networking solution (those InfiniBand switches). :) THAT probably has more to do with it than anything.
 

Eug

Lifer
Mar 11, 2000
Originally posted by: HokieESM
Eug... just a thought. One of the problems with clusters is that you lose efficiency as you add nodes. I think VT has probably looked at how many nodes they could bring online while maintaining maximum efficiency.

Also, from a few friends, I have heard (keep in mind this is a rumor... but one from a good source) that stability is an issue when bringing up 1100 nodes (of brand-new computers on a relatively new OS) on a new, innovative networking solution (those InfiniBand switches). :) THAT probably has more to do with it than anything.
The weird part is they got only 821 Gigaflops for their 256 CPU setup. That's only 40% efficiency, but it increased to 44% efficiency with their 2112 CPU setup. :confused:
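
(For the curious, efficiency here means delivered Rmax as a fraction of theoretical peak. A minimal Python sketch, assuming the 8 Gflops theoretical peak per G5 2.0 that comes up later in the thread:)

```python
# Linpack efficiency: delivered Rmax as a fraction of theoretical peak.
# Assumes 8 Gflops theoretical peak per G5 2.0 (see later in the thread).
def efficiency(rmax_gflops, n_cpus, peak_per_cpu=8.0):
    return rmax_gflops / (n_cpus * peak_per_cpu)

print(f"256 CPUs:  {efficiency(821, 256):.1%}")    # -> ~40.1%
print(f"2112 CPUs: {efficiency(7417, 2112):.1%}")  # -> ~43.9%
```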

I'm impressed they got everything up and running so quickly though. They were still assembling the thing in September, but they managed to swap the necessary components in and out, get used to the new Mac OS X, its compilers (IBM xlc, etc.), and newly ported software (like the InfiniBand drivers), and still come up with a respectable score by Oct. 1.

One wonders what we'll see next year. They might overtake NetworX, unless NetworX adds new nodes. However, there are other machines coming online this year that will likely destroy both clusters.
 

HokieESM

Senior member
Jun 10, 2002
Yeah... the whole thing was pretty impressive--the production time, the improvement, etc. But just a note: VT, which was originally planning to let COE grad students and professors start running computational models on the "supercomputer" in November, is now not allowing any researchers to use it until January. The email I received indicates they are going to take the two months to improve stability and efficiency. There might still be better numbers to be had. :) From what I understand, stability is the primary concern--I've "heard" that they had to try the benchmark several times before the cluster could complete it (what do you expect? they first flipped the power switch in late Sept).

Now, my personal feeling is that the cluster isn't going to be used for massive computations like the benchmark. There are hundreds of people who could use "Big Mac"--and it could accommodate them all. I'd love to have 10 nodes at a time. Most scientific codes don't parallelize infinitely (or even at all). The "cluster" solves the I/O and storage problem well--each node is, in effect, a functional computer. Now if they'll only give us access to AT LEAST a 100 Mbps connection to get the data off (my output files are roughly 1 GB). :)

That said, I think we need to work on faster serial computing too... I'd love to see faster and faster single processors (especially since my code doesn't benefit from more than two).
 

Eug

Lifer
Mar 11, 2000
Cool. It will be interesting to see what the next year brings. :cool:
 

Eug

Lifer
Mar 11, 2000
New numbers:

1) NEC Earth Simulator - 35860
2) ASCI Q - 13880
3) HP RX2600 Itanium 2 (1936 CPUs) - 8633
4) Apple/VT (2112 G5s) - 8164
5) Linux NetworX (2304 Xeons) - 7634

These numbers seem more representative. The G5 system pulls ahead of the Xeon system, but the new Itanium 2 system beats them both handily.

Itanium 2 1.5 = 4.46 Gflops per CPU
G5 2.0 = 3.87 Gflops per CPU
Xeon 2.4 = 3.31 Gflops per CPU

These numbers put the G5 2.0 in Xeon 2.8 GHz (and 1.3 GHz Itanium 2) territory.
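
Same arithmetic as before with the updated scores (still just a sketch, assuming linear scaling with clock speed):

```python
# Updated scores: (Rmax in Gflops, CPU count, clock in GHz).
scores = {
    "Itanium 2": (8633, 1936, 1.5),
    "G5":        (8164, 2112, 2.0),
    "Xeon":      (7634, 2304, 2.4),
}
per_cpu = {name: rmax / n for name, (rmax, n, _) in scores.items()}

# Clock each rival would need to match one G5 2.0, scaling linearly:
for name in ("Xeon", "Itanium 2"):
    clock = scores[name][2] * per_cpu["G5"] / per_cpu[name]
    print(f"G5 2.0 ~ {name} {clock:.2f} GHz")
# -> G5 2.0 ~ Xeon 2.80 GHz, G5 2.0 ~ Itanium 2 1.30 GHz
```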
 

Eug

Lifer
Mar 11, 2000
Originally posted by: BingBongWongFooey
While these numbers will no doubt come as a disappointment for Mac zealots who wanted to blow away all the Intel machines, it should still be noted that this is the best price/performance ratio ever achieved on a supercomputer.

http://slashdot.org/article.pl?sid=03/10/22/1727208&mode=thread&tid=126&tid=137&tid=181
Slashdot has the wrong (older) numbers. The Apple/VT system is now at 8.2 Tflops, not 7.4.

That makes it faster than the 2304 Xeon 2.4 system (but slower than the 1936 Itanium 2 1.5 system).
 

dexvx

Diamond Member
Feb 2, 2000
Originally posted by: Eug
New numbers:

1) NEC Earth Simulator - 35860
2) ASCI Q - 13880
3) HP RX2600 Itanium 2 (1936 CPUs) - 8633
4) Apple/VT (2112 G5s) - 8164
5) Linux NetworX (2304 Xeons) - 7634

These numbers seem more representative. The G5 system pulls ahead of the Xeon system, but the new Itanium 2 system beats them both handily.

Itanium 2 1.5 = 4.46 Gflops per CPU
G5 2.0 = 3.87 Gflops per CPU
Xeon 2.4 = 3.31 Gflops per CPU

These numbers put the G5 2.0 in Xeon 2.8 GHz (and 1.3 GHz Itanium 2) territory.
The HP RX2600 uses McKinley Itanium 2s; they operate at 1 GHz, not 1.5 GHz.

Therefore, based on your calculations, a 2 GHz G5 is equivalent to about an 850 MHz McKinley Itanium 2. The Madison Itanium 2 is significantly faster than the McKinley, clock for clock.
 

Eug

Lifer
Mar 11, 2000
Originally posted by: dexvx

The HP RX2600 uses McKinley Itanium 2s; they operate at 1 GHz, not 1.5 GHz.

Therefore, based on your calculations, a 2 GHz G5 is equivalent to about an 850 MHz McKinley Itanium 2. The Madison Itanium 2 is significantly faster than the McKinley, clock for clock.
The screengrab from Dongarra's document & HP's RX2600 page both say 1.5 GHz Itanium 2.

Here is the article from The Register. Maybe you're thinking of the old setup.
 

Eug

Lifer
Mar 11, 2000
Now up to 9.6 Teraflops.

7634 Gflops/2304 Xeons = 3.313 Gflops/Xeon
9555 Gflops/2112 G5s = 4.524 Gflops/G5

4.524/3.313 = 1.366 (2.4GHz Xeons) per G5 2.0

Thus, assuming linear scaling with clockspeed, with these numbers a G5 2.0 is equivalent to a Xeon 3.28 GHz.

EDIT:

Holy crap! It's even beating the mighty Itanium 2 on a per CPU basis.
 

Ardan

Senior member
Mar 9, 2003
I dunno about an 'anti-Apple' thread or anti-anything. The posts look pretty fair, scientific, and unbiased to me :)
I think the stuff you are showing here is amazing, Eug... regarding the G5 2.0. Looks like Apple has a winner on their hands :)
 

LordOfAll

Senior member
Nov 24, 1999
If memory serves, the G5 system doesn't use ECC memory. I have to think that would be a huge problem in a supercomputer.
 

Eug

Lifer
Mar 11, 2000
Originally posted by: LordOfAll
If memory serves, the G5 system doesn't use ECC memory. I have to think that would be a huge problem in a supercomputer.
Yes, it is indeed a concern, but it was one of many variables in the decision (and obviously not a deal-killer for them).

It seems VT had a strict set of criteria, and along with a number of other factors, they made their decision.

1) The hardware had to be 64-bit (even though they wouldn't necessarily be using much 64-bitness at the beginning). This disqualified the Xeon. The options were the Itanium 2, the Opteron, and the G5's PPC 970.
2) It had to be cheap. Apple gave them a KILLER deal. It seems like they got the stuff at cost, or perhaps below cost. The equivalent from Dell, etc. was going to cost roughly twice as much.
3) It had to ship soon. The PPC 970 was a fine chip, but IBM couldn't deliver the hardware soon enough. Apple, on the other hand, could, but without ECC.
4) They seemed to want to score well on Linpack. I think they're aiming for 10 Teraflops. The G5 2.0 has a theoretical peak of 8 Gflops, but it's usually quite inefficient due to its design. The Itanium 2 1.5 has a theoretical peak of 6 Gflops; it's very efficient, but it's only 1.5 GHz and quite expensive. The Opteron 2.0 has a theoretical peak of 4 Gflops and is probably pretty efficient too, but it would have taken a lot of CPUs and cost a lot. (A quick sketch of where those peaks come from is below.)
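
Those theoretical peaks are just clock speed times double-precision flops per cycle. A minimal sketch; the flops-per-cycle figures are my assumptions about each chip's FPU setup, so correct me if they're off:

```python
# Theoretical peak = clock (GHz) x double-precision flops per cycle.
chips = {
    "G5 2.0":        (2.0, 4),  # two FPUs, each capable of a fused multiply-add
    "Itanium 2 1.5": (1.5, 4),  # two FMA units
    "Opteron 2.0":   (2.0, 2),  # one add + one multiply per cycle
}
for name, (ghz, flops_per_cycle) in chips.items():
    print(f"{name}: {ghz * flops_per_cycle:.0f} Gflops peak")
# -> 8, 6, and 4 Gflops, matching the figures above
```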

So after all of this, they decided to live with no ECC and go with the G5. I'm not sure that will be as big a deal as some have guessed, though, since it seems they're not going to be using this as a supercomputer in the same way the NEC Earth Simulator is used. And anyway, if you're worried about a result, you can always run the job twice for verification, which is something you'd probably want to do even if you were running ECC.

As for Linpack, they've finally hit 9555 Gflops, excruciatingly close to 10 Tflops, and they still have a couple of weeks left to optimize, plus the 88 CPUs that weren't used in the benchmarks. (They purchased 2200 CPUs, but the benches so far only include 2112.)

BTW, one thing I thought was kind of interesting: in Linpack, each G5 is running at "only" 56.5% efficiency. However, the theoretical max is so high that each G5 2.0 is delivering 4.5 Gflops in actual benchmarking, already higher than the theoretical max of the Opteron 2.0 (4 Gflops). And this isn't even counting the efficiency loss the Opteron would see in a large cluster like this. (Now, this is not real life, as Linpack is a very specific benchmark, but nonetheless this is going to be a marketing coup for Apple, methinks.)
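
(A quick check of that claim, using the 8 Gflops G5 peak from above:)

```python
# Sustained per-CPU throughput of the G5 vs. the Opteron's theoretical peak.
g5_sustained = 9555 / 2112           # ~4.52 Gflops actually delivered per G5
g5_efficiency = g5_sustained / 8.0   # ~56.5% of its 8 Gflops theoretical peak
print(f"{g5_sustained:.2f} Gflops sustained ({g5_efficiency:.2%}) "
      f"vs. the Opteron 2.0's 4.00 Gflops theoretical peak")
```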
 

BDSM

Senior member
Jun 6, 2001
How about this, then:

linky
Over 10k 2 GHz Opterons... And they claim that performance scales almost perfectly.

Should kick some ass :)

I wonder how long it would take to encode a divx movie on that thing.
 

InlineFive

Diamond Member
Sep 20, 2003
Originally posted by: BDSM
How about this, then:

linky
Over 10k 2 GHz Opterons... And they claim that performance scales almost perfectly.

Should kick some ass :)

I wonder how long it would take to encode a divx movie on that thing.

Somebody in Distributed Computing should assimilate that beast. We would have a powerful lead then. :evil: