VT Supercomputer


Eug

Lifer
Mar 11, 2000
24,158
1,806
126
why would we build a $5 million dollar supercomputer, benchmark it for a month, let users "play" with it for a month, and then take it apart?

...

They dropped the ball in what they promised (cough cough).... and now they're having to "pay" for it. VT is very "complimentary" towards Apple because of their "service". Just don't expect to hear anyone say anything about it. Hopefully it will be up and running this summer without the bugs--I sure could use it.
Quite frankly, I'd be very surprised if VT truly was surprised that they needed the G5 Xserve upgrade. Remember, it was VT who approached Apple, and not the other way around, and Apple actually initially balked at the idea of building a Power Mac cluster. I betcha Apple told them to use G4 Xserves (which would of course have been a bad idea because the G4 Xserve is slow), or just to wait for the G5 Xserves (some of which are specifically built to be clustered).

The G5 Power Mac has no server monitoring functionality, nor does it have ECC. But VT chose it anyway despite having alternatives. Thus I might suspect that the G5 Xserve may have been part of the deal right from the outset as a potential backup if required. VT knew that the Power Mac was the only way to make it onto the list now cheaply, before more of the big guns coming online get onto the list. Plus the price was right, and Apple was bending over backwards to help VT out, it seemed. (Apple knew this would be a PR coup if VT were successful.) I'm sure that VT thought that if the Power Mac cluster worked fine then great, but if it didn't then they'd have the G5 Xserve as the backup. It's interesting to note that early on Dr. V is quoted as saying that VT would be moving to ECC systems in the future.
 

HokieESM

Senior member
Jun 10, 2002
798
0
0
Originally posted by: Eug
why would we build a $5 million dollar supercomputer, benchmark it for a month, let users "play" with it for a month, and then take it apart?

...

They dropped the ball in what they promised (cough cough).... and now they're having to "pay" for it. VT is very "complimentary" towards Apple because of their "service". Just don't expect to hear anyone say anything about it. Hopefully it will be up and running this summer without the bugs--I sure could use it.
Quite frankly, I'd be very surprised if VT truly was surprised that they needed the G5 Xserve upgrade. Remember, it was VT who approached Apple, and not the other way around, and Apple actually initially balked at the idea of building a Power Mac cluster. I betcha Apple told them to use G4 Xserves (which would of course have been a bad idea because the G4 Xserve is slow), or just to wait for the G5 Xserves (some of which are specifically built to be clustered).

The G5 Power Mac has no server monitoring functionality, nor does it have ECC. But VT chose it anyway despite having alternatives. Thus I might suspect that the G5 Xserve may have been part of the deal right from the outset as a potential backup if required. VT knew that the Power Mac was the only way to make it onto the list now cheaply, before more of the big guns coming online get onto the list. Plus the price was right, and Apple was bending over backwards to help VT out, it seemed. (Apple knew this would be a PR coup if VT were successful.) I'm sure that VT thought that if the Power Mac cluster worked fine then great, but if it didn't then they'd have the G5 Xserve as the backup. It's interesting to note that early on Dr. V is quoted as saying that VT would be moving to ECC systems in the future.

I will say that VT WAS surprised by the Xserve upgrade. In fact, some people are very, very, very angry. People have lost grant money over this--and if you don't know, grant money is KING in the COE.... the only thing more important to the university is the football team (which they've proven they would sell their souls for to get one extra game on ESPN). The "private" side of things... let's just say people are VERY angry. If this was intentionally done by VT.... people's heads will roll (like firing of faculty... the cardinal sin of a university). Plus, I'm sure all the students (myself included) who volunteered time and energy to setting up the original one (plugging in cards, RAM, installing software) feel slightly slighted. Also, the potential users who spent time with the "teething problems" are SO anxious to go through it all again when the new one is up in May.

From what I hear, Dr. V had Apple look over some error-correcting software for the cluster (very very very similar to some of the stuff included in Xgrid, as a matter of fact). Apple claimed it would work, no problem! Unfortunately it didn't. The Terascale Team wasn't happy to have to dismantle its "baby".... but it proved necessary. But you're right in saying that VT shares the blame. And they're "sharing the blame" financially, too... but definitely not all of it (because they were promised a few things that weren't delivered, either).

As far as "getting it online to get it benchmarked"... isn't that like "paper-launching" a supercomputer? It's ridiculous. People bashed Intel for "paper launching" the P4EE..... even though it actually worked (for the few who got it). They know that there are a couple of supercomputers coming online that will easily push them back to fifth or sixth.... (because with the new ones, it's just margin of victory, not whether they'll match it).... so they wanted the brass ring. Which, in my opinion, should be taken away until the new one is online.
 

TheLonelyPhoenix

Diamond Member
Feb 15, 2004
5,594
1
0
Originally posted by: HokieESM

A few notes:

Be CAREFUL about what you hear coming specifically from VT or Apple. VERY careful. There have been some very very very nasty legalities here of late. I know for a fact that the transition wasn't solely about the ECC RAM--it was definitely a part, however--BUT error-correction/node-management was the real reason. Let's just say that you won't hear from VT (and DEFINITELY not Apple) why--nor from me... I don't like lawsuits (and I don't know nuts-and-bolts specifics... although I do have some idea). If there was any doubt about this..... why would we build a $5 million dollar supercomputer, benchmark it for a month, let users "play" with it for a month, and then take it apart? Especially when we were specifically told that the Xserve would be ready in January? 1) They wanted it done in 2003... and 2) they thought the G5 would work as well as the Xserve. It didn't. And the "swap" isn't easy.... can you imagine just doing the card installation (the Infiniband cards) on 1100 nodes? And physically carrying them? Much less software installation. No sane person would do that... and although we ARE talking about a university (the home of insane people), they could have waited until January if the ECC RAM was really that big of an issue (and yes, they knew... the supercomputer team has some top-notch compsci people).

Oh, the grad student you're talking about... I know him. :) The quote is right on... but somewhat misleading. The supercomputer IS fast.... IF your code parallelizes. He was using 600 nodes... as opposed to 100 nodes on the Itanium cluster here (it's 200 nodes total). The available computation power on System X is tremendous--if you can use it all. Some people's work (particularly a lot of the quantum/atomic discrete work) parallelizes well--and he could use 600 processors for a single run. It was scary fast. He's very pissed that it's not running right now... understandably. But to give a counterpoint... my thermoviscoplastic code ran 2.3 times faster on an Itanium 2 than on System X. Of course, mine doesn't parallelize well.... and it's VERY sensitive to the amount of cache on the chip.

In any event... don't think I'm criticizing VT about the cluster. It's great. We needed it. I don't think it mattered whether it was Apple, Dell, whitebox, whatever... it was necessary. Apple cut us a great deal... and that was important. They dropped the ball in what they promised (cough cough).... and now they're having to "pay" for it. VT is very "complimentary" towards Apple because of their "service". Just don't expect to hear anyone say anything about it. Hopefully it will be up and running this summer without the bugs--I sure could use it. :) Although, I'm still pissed that McDonald's wouldn't let us call it "Big Mac".

Wow.... intrigue :)

I did get the impression that Apple must have cut them a serious deal on the Xserves, or they probably would have waited. I don't doubt that VT and Apple are keeping certain things quiet, but I also don't think they're lying... I believe him when he said the ECC RAM was a big deal, because it's all about error-correction - the node-management is more of a secondary bonus. Of course, I'm a stupid EE, so what do I know? ;-)