Ex-AMD Engineer explains Bulldozer fiasco


krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Hopefully that guy AMD hired from Lenovo can help turn things around.

I don't think he understands anything discussed here, but does he have to? I think he cares about economic results only, and if he doesn't get results, he makes changes. Might be good for AMD for a change, but what about GF? hehe :)
 

intangir

Member
Jun 13, 2005
113
0
76
According to DrWho (Francois) on XS, the PowerPoint slide claims regarding Bulldozer's architectural features don't match up to reality.

This is relevant when we are attempting to lazy-boy our design fixes: we can't assume the architecture is as it has been presented in PowerPoint.

For the tl;dr crowd - Francois is basically saying he analyzed the functionality of Bulldozer with code designed to tease out the details of the microarchitecture, and the integer cores are in fact decoder-limited; they aren't effectively sharing decoders.

So if AMD wanted to address the IPC issue, it would seem they need to address the decoders.

Hm, interesting. Francois's posts are certainly always... entertaining. But I'm not sure how he concluded that BD is decoder-limited based on measuring the number of instructions retired per cycle. It could be issue-width-limited, execution-resource-limited, or retire-limited.

In other words, slow throughput at the end of the pipeline does not at all imply that the bottleneck (if there is one) is at the beginning! Weren't the integer pipelines supposed to be 2-wide too? They supposedly decoupled the AGUs from the ALUs so they could be used as additional execution units, but the software optimization manual seems to contradict this. We all thought it was just an unprofessional editing job, but it could be that there was truth there. Maybe they intended it to work that way, but it had to be disabled because of unforeseen issues. Or maybe the whole AGU/ALU split was a lie, since we're now distrusting AMD's explicit statements about the microarchitecture. :p
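To make this concrete, here's a minimal sketch (my own toy code, nothing to do with Francois's actual tests) of the kind of directed test we're talking about: time a long block of independent register-to-register adds and divide by cycles. Whatever adds-per-cycle figure comes back tells you the machine's sustained width is capped somewhere, but not whether decode, issue, execute, or retire is the cap:

```c
/* Toy throughput probe - a sketch, not Francois's code.
 * Build: gcc -O2 probe.c (x86 only). rdtsc counts reference cycles,
 * so pin the clock (or use real perf counters) for serious numbers. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

#define REPS 1000000

int main(void)
{
    uint64_t start = __rdtsc();
    for (long i = 0; i < REPS; i++) {
        long a = i, b = i, c = i, d = i;
        /* 8 independent adds per iteration, kept in registers so we
         * stress pipeline width rather than the memory system */
        __asm__ volatile(
            "add $1, %0\n\tadd $1, %1\n\tadd $1, %2\n\tadd $1, %3\n\t"
            "add $2, %0\n\tadd $2, %1\n\tadd $2, %2\n\tadd $2, %3"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d)
            :
            : "cc");
    }
    uint64_t cycles = __rdtsc() - start;
    /* Loop overhead (increment/compare/branch) drags the number down
     * a bit; unroll more for a cleaner figure. */
    printf("~%.2f adds/cycle\n", (8.0 * REPS) / cycles);
    return 0;
}
```

One way to use it: run one copy per module, then one per core. If per-thread throughput drops when both cores of a module run the same loop, the shared front end is implicated; if it doesn't drop, the limit is somewhere private to each core.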

That's my point: blaming synthesized design for lack of performance is not right if the uarch itself is bottlenecked or poorly designed.

Besides, I find it hard to believe that AMD did not slice up the design and stick with hand design on the most critical portions.

Ah, I guess I misread your statement. I thought it was an honest question, not a proof by contradiction with known facts.

Anyway, something delayed Bulldozer from Q2 2011 to Q4. The microarchitectural engineering should have all been done before tapeout (Q2 2010). Most likely, it was process troubles or GF yield issues that accounted for the post-tapeout delays.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I don't think he understands anything discussed here, but does he have to? I think he cares about economic results only, and if he doesn't get results, he makes changes. Might be good for AMD for a change, but what about GF? hehe :)

Sorry, it's not fair to make the comparison, but I can't help thinking of what Stephen Elop has done to Nokia since becoming CEO... and then shuddering at the possibility of Rory being every bit as good an "outsider" CEO at AMD. D:
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
<- clueless about design/layout stuff
Hm, interesting. Francois's posts are certainly always... entertaining. But I'm not sure how he concluded that BD is decoder-limited based on measuring the number of instructions retired per cycle. It could be issue-width-limited, execution-resource-limited, or retire-limited.

In other words, slow throughput at the end of the pipeline does not at all imply that the bottleneck (if there is one) is at the beginning! Weren't the integer pipelines supposed to be 2-wide too? They supposedly decoupled the AGUs from the ALUs so they could be used as additional execution units, but the software optimization manual seems to contradict this. We all thought it was just an unprofessional editing job, but it could be that there was truth there. Maybe they intended it to work that way, but it had to be disabled because of unforeseen issues. Or maybe the whole AGU/ALU split was a lie, since we're now distrusting AMD's explicit statements about the microarchitecture. :p
That's the superficial conclusion I took from his posts.

The PowerPoints say one thing about Bulldozer's architecture, but when you analyze it with the right tools designed for such things (and Intel would have those tools), you find a very different beast is actually in the silicon.

I actually took that as encouraging, because it means the Bulldozer design might actually work as intended if and when it's ever fully implemented. Maybe with Piledriver.

But I'm way out of my element here in speculating on this.
 

RampantAndroid

Diamond Member
Jun 27, 2004
6,591
3
81
<- clueless about design/layout stuff

That's the superficial conclusion I took from his posts.

The PowerPoints say one thing about Bulldozer's architecture, but when you analyze it with the right tools designed for such things (and Intel would have those tools), you find a very different beast is actually in the silicon.

I actually took that as encouraging, because it means the Bulldozer design might actually work as intended if and when it's ever fully implemented. Maybe with Piledriver.

But I'm way out of my element here in speculating on this.

Well, I know nVidia reworked the Fermi chips, but it was not a total rework... I don't think AMD can get a complete rework done and fully tested in the next year - not one as massive as this article suggests is needed.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
For what it's worth, I don't agree with this guy at all, but there are valid reasons I would be biased, and my own personal experience might not apply. I was going to write a reply, but there's a lot of stuff already here to respond to... I might write something up later.

I seem to remember having an in-depth discussion with CTho when this originally came out, where he said that current tools usually produce designs that are more efficient than hand designs, and usually more compact as well. He showed examples and gave the reasoning (since there have been years upon years of circuit design, the tools can apply the many tricks people have found over that time to reduce size, latency, interference, and power). Since he is the only actual chip designer I know, I took his word for it, and still don't think that this argument holds a lot of water.

If I can find the string of comments, I will link them here.

EDIT: I spent about an hour looking for the comments, and I really don't have any more time to spend on that task (I shouldn't have spent that much time in the first place, but I got caught up in it). I remember the conversation, but I am starting to think that it may have been Hardball and not CTho that I had it with. Regardless, most of Intel's designs have been automated for years, with some hand tweaking, and I would expect AMD to do something similar. Hell, Brazos was marketed as the first fully automated design, and it did extremely well for its size and power budget.

Why thank you ;). Yeah, it was me. It might have been in a private message... I think I found a post of yours a long time later and didn't want to resurrect an ancient thread. I'll see if I can find it later. Oh, or maybe that was a question about Zener diodes. Anyway, I'll look.

I seem to remember having an in-depth discussion with CTho when this originally came out, where he said that current tools usually produce designs that are more efficient than hand designs, and usually more compact as well.
Thanks for the post.

I don't completely buy the engineer's claim. With how little we actually know, we might as well claim he's saying that just to get the spotlight on himself. Admit it, everyone wants it. :)

Tech is extremely complicated nowadays. There are so many limitations, and engineers approach the problems from ALL angles. It's not like the 1990s, where you added one "macro" feature and performance skyrocketed. It's always easy to pin the fault on one thing.

Maybe the engineer worked on the part he complains about.

Yeah, I've wondered who this guy was and what his role was.

Thanks for the post.

I don't completely buy the engineer's claim. With how little we actually know, we might as well claim he's saying that just to get the spotlight on himself. Admit it, everyone wants it. :)

Tech is extremely complicated nowadays. There are so many limitations, and engineers approach the problems from ALL angles. It's not like the 1990s, where you added one "macro" feature and performance skyrocketed. It's always easy to pin the fault on one thing.

Maybe the engineer worked on the part he complains about.
I know CTho personally. He is a very competent, extremely capable engineer. I dislike the insinuations you are making about someone you have never met and presumably know very little about. If CTho said it, then I'm sure he believes it based on data that he's collected and his experience.


Patrick Mahoney
Senior Design Engineer
Intel Corp.

* Not speaking for Intel Corp. *
pm, I think Inteluser was referring to the engineer who left AMD years ago. Yeah, he quoted someone mentioning CTho9305, but I don't think that was who he was directing his comments towards (otherwise I'd agree with you; CTho knows what he's talking about).

Thanks :)

Does anyone know if Bobcat was auto-designed or hand-tuned? Because if it was auto-designed, he is dead wrong and just making excuses - blaming the bean counters instead of the engineers.

There are slides somewhere talking about how heavily automated & portable the design is. But there are reasons you can't directly say "Bobcat + automation => good, therefore Bulldozer + automation => good". If I get around to writing another post I'll try to explain.

If you divide 2 billion transistors by 315 mm^2, you get about 6.35 million/mm^2. If you divide 450 million transistors by 75 mm^2, you get about 6 million/mm^2. So Zacate is actually less dense than Zambezi (TSMC vs. GloFo). This means if this "20% bigger" accusation turns out to be remotely true, then it will have some implications for Zacate and presumably Llano as well.

Don't forget to scale for 40nm -> 32nm. Also, cache is very dense; Zambezi has 8MB L2 + 8MB L3; Zacate has 1MB L2. If you can dig up the Hot Chips presentations you might be able to get numbers for just a Bobcat core, and just two Bulldozer cores + one L2 cache, which will be easier to compare (since you eliminate many differences at the SOC level).
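Here's that back-of-the-envelope in code form (a sketch; it assumes ideal (32/40)^2 area scaling, which no real process achieves, and it still ignores the cache mix entirely):

```c
/* Density comparison from the post above, with a naive node-scaling
 * adjustment. Numbers: Zambezi ~2B transistors / 315 mm^2 (32nm GloFo),
 * Zacate ~450M / 75 mm^2 (40nm TSMC). */
#include <stdio.h>

int main(void)
{
    double zambezi = 2000.0 / 315.0;  /* Mtransistor/mm^2, ~6.35 */
    double zacate  =  450.0 /  75.0;  /* Mtransistor/mm^2, ~6.00 */
    double shrink  = (32.0 / 40.0) * (32.0 / 40.0);  /* ideal area factor, 0.64 */

    printf("Zambezi:        %.2f M/mm^2\n", zambezi);
    printf("Zacate:         %.2f M/mm^2\n", zacate);
    printf("Zacate @ 32nm:  %.2f M/mm^2 (ideal shrink)\n", zacate / shrink);
    return 0;
}
```

Even with that crude adjustment the comparison flips (~9.4 vs. ~6.4), which is why you can't read anything about the "20% bigger" claim out of raw die-level densities.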

According to DrWho (Francois) on XS, the PowerPoint slide claims regarding Bulldozer's architectural features don't match up to reality.

http://www.xtremesystems.org/forums...nally-tested&p=4972103&viewfull=1#post4972103

http://www.xtremesystems.org/forums...nally-tested&p=4972367&viewfull=1#post4972367

http://www.xtremesystems.org/forums...nally-tested&p=4972442&viewfull=1#post4972442

This is relevant when we are attempting to lazy-boy our design fixes: we can't assume the architecture is as it has been presented in PowerPoint.

For the tl;dr crowd - Francois is basically saying he analyzed the functionality of Bulldozer with code designed to tease out the details of the microarchitecture, and the integer cores are in fact decoder-limited; they aren't effectively sharing decoders.

So if AMD wanted to address the IPC issue, it would seem they need to address the decoders.
My guess is that there's something wrong with his code, or some subtlety of the architecture that makes it not behave the way he expects (e.g. he's doing something that inserts an extra stall repeatedly). It would be pretty dumb for the AMD compiler people to submit a gcc patch that made gcc assume a 4-wide machine if Bulldozer wasn't 4-wide.

I remember when the 65nm K8s came out, all the reviewers were claiming that the L2 cache latency had gone from 12 cycles to 20, but it had only increased by 2; some other subtlety was throwing off the performance-measuring programs. Accurately characterizing a microarchitecture by writing directed tests is very difficult, especially with more complex microarchitectures.
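For the curious, the usual latency testers boil down to a pointer chase like this (a minimal sketch, not any reviewer's actual code). Get the buffer size or access pattern slightly wrong - say, a stride the prefetcher can follow - and you "measure" latencies that are off by large margins, which is exactly the kind of subtlety that bit those K8 reviews:

```c
/* Toy load-to-use latency probe - a sketch. Build: gcc -O2 chase.c
 * (x86 for __rdtsc). Size `bytes` for the cache level under test and
 * shuffle the chain so hardware prefetchers can't follow it. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE   64               /* assumed cache line size */
#define CHASES 10000000L

int main(void)
{
    size_t bytes = 256 * 1024;  /* e.g. past L1 but well inside a 2MB L2 */
    size_t n = bytes / LINE;
    char *buf = malloc(bytes);
    size_t *order = malloc(n * sizeof *order);

    /* One pointer per cache line, linked in shuffled (Fisher-Yates) order */
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % n] * LINE;

    void **p = (void **)(buf + order[0] * LINE);
    uint64_t start = __rdtsc();
    for (long i = 0; i < CHASES; i++)
        p = *p;                 /* each load depends on the previous one */
    uint64_t cycles = __rdtsc() - start;

    printf("~%.1f cycles/load (p=%p)\n", (double)cycles / CHASES, (void *)p);
    free(order);
    free(buf);
    return 0;
}
```

(Printing p keeps the compiler from deleting the chase. Even this can be fooled, by TLB misses, or by rdtsc ticking at the reference clock rather than the core clock - which is how a 14-cycle L2 can get reported as 20.)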

Originally Posted by quest55720 View Post
Does anyone know if Bobcat was auto-designed or hand-tuned? Because if it was auto-designed, he is dead wrong and just making excuses - blaming the bean counters instead of the engineers.

Remember this:
[Image: Bobcat floorplan slide from Hot Chips - BobcatHotChips_August24_8pmET_NDA-17_575px.jpg]


You get that kind of floorplan layout when using synthesis tools, or so I am told. (I'm not a design guy.)

Also, unless I am mistaken, Intel's iGPU is also heavily synthesized.
You can actually tell that Sandy Bridge's GPU uses a different place & route methodology than the core does, because the P&R blocks appear to be abutting, while inside the core you can see gaps between all the pieces. I might label a picture later to clarify what I mean. I'm pretty sure I'm seeing large amounts of P&R even inside the Sandy Bridge core (or at least giving dmens et al. something to laugh about ;)).

Intel started using automated tools with Prescott. OK, bad example, because that chip underperformed too, but it doesn't mean Intel doesn't do it.
Yeah, Prescott looks like it had a relatively large amount of P&R for an Intel design.

I think what the synthesis tools give in terms of layout is more of a symmetrical layout rather than a more random one. Makes sense. That's why GPUs work well with it: they're based on largely repeated structures.
I'm not sure I'm interpreting those words correctly, but I disagree completely (with my interpretation of your words ;)). P&R tools actually produce very "random" results, and can take a very regular piece of logic and turn it into a rat's nest. For example, when a human builds e.g. a queue to store data from instructions that haven't yet written to the L1 cache, they'll build a very regular grid; if you throw the same RTL at a synthesis tool, you can end up with an unrecognizable mess. Now, that mess may actually meet all of your constraints and save you man-months, but it doesn't look pretty and it's not regular. I might try digging up some die photos and labelling them later.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
^

It's really hard to figure out if a design is synthesized or hand-designed by looking at a die without some kind of color map like the Bobcat die shot above. You load the design and do select_cell/s_cell *hierarchy* on each of the first levels of hierarchy; on a hand-designed chip the hierarchies would be very orderly, while on a mostly synthesized design they'd be blobs, like the Bobcat picture.

The SB core has funny-shaped edges because the Haifa design team was very aggressive - what else is new? :)

AFAIK AMD/NVIDIA synthesize one of each of their functional units (ALU, shader, sampler, etc.), then stitch them together at the top level with another run of P&R with locked block placement.

Funny story: I've actually gotten "regular-looking" synthesis results by constraining the cells with so many directives that the tool couldn't really do much except what I said. I call it almost-hand-design, LOL.
 

Joseph F

Diamond Member
Jul 12, 2010
3,522
2
0
No, friends at Intel tell me they always have projects running.

We only see like 30% of what actually gets done.

They do lots of experimentation with die materials; I heard they are even playing with graphene as well.

They build mock CPUs about the size of CDs which work and can do about 1 teraflop of calculations...

They try to get insane I/O speeds with SSDs and other combinations of sorts...
They put active servers in cooking ovens to see how long it takes for the entire system to go POP.

They do lots of funny stuff according to my friend, stuff we don't really hear about.

Santa, I've been a good boy this year, and I want one of these with a compatible motherboard. :awe:
 
Oct 14, 2011
93
1
0
^

It's really hard to figure out if a design is synthesized or hand-designed by looking at a die without some kind of color map like the Bobcat die shot above. You load the design and do select_cell/s_cell *hierarchy* on each of the first levels of hierarchy; on a hand-designed chip the hierarchies would be very orderly, while on a mostly synthesized design they'd be blobs, like the Bobcat picture.

The SB core has funny-shaped edges because the Haifa design team was very aggressive - what else is new? :)

AFAIK AMD/NVIDIA synthesize one of each of their functional units (ALU, shader, sampler, etc.), then stitch them together at the top level with another run of P&R with locked block placement.

Funny story: I've actually gotten "regular-looking" synthesis results by constraining the cells with so many directives that the tool couldn't really do much except what I said. I call it almost-hand-design, LOL.

I suppose you could build it in Minecraft and see if it is. :p
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
I seem to remember having an in-depth discussion with CTho when this originally came out, where he said that current tools usually produce designs that are more efficient than hand designs, and usually more compact as well. He showed examples and gave the reasoning (since there have been years upon years of circuit design, the tools can apply the many tricks people have found over that time to reduce size, latency, interference, and power). Since he is the only actual chip designer I know, I took his word for it, and still don't think that this argument holds a lot of water.
I can see how my post could very easily be interpreted as me personally wholeheartedly agreeing with cmaier.

To clarify, I did not mean it to be: "Well, he is now obviously right and the people who ridiculed him have egg on their faces because they have clearly been proven wrong and cmaier is right".

Instead, what I meant, and what I should have written, was: "when cmaier complained/moaned about AMD, he predicted the failure of BD, and things finally turned out to support his side of the story. In a way, especially with x-bit sensationalizing this as if it were news, he now gets the vindication he believes he deserves."

If CTho said what cmaier said is wrong, then it is wrong. That would be the end of the discussion for me, no more questions. I would trust what CTho says, even if he just woke up and hasn't had his morning coffee yet, more than what cmaier says at any time of day, since I have had no prior experience with cmaier at all.


EDIT: Looks like I'm two pages behind on this discussion, and CTho has already popped in. :thumbsup:
 

WhoBeDaPlaya

Diamond Member
Sep 15, 2000
7,414
402
126
Given AMD's limited resources relative to Intel, it's no surprise that they have to rely on more automation in the synth and/or PnR stages. Even Intel isn't immune - with the ever-increasing transistor budget and feature creep, humans are a little hard-pressed to keep up.

Here's a relevant slide (a little dated, but it still applies):

[Image: design productivity gap slide - prod-trend.png]



There's also verification, which makes design look like a cakewalk in terms of the man-hours involved.
 

Dadofamunky

Platinum Member
Jan 4, 2005
2,184
0
0
If, one day years from now, Intel were run by foreign investors instead of domestic engineers, the decline would be similar to AMD's.

The day that happens is the day America just needs to turn itself over to the Chinese. Thank God Intel keeps funding skunk works projects. That's still quintessential Silicon Valley, which is nothing but good for this country.

So far as automated chip layout goes, the ATI side has done quite well with it. I find it hard to believe that the mandated SoC approach is the only thing that went wrong. Can you imagine tweaking every individual transistor in a 2B-transistor layout? Nah. Perhaps the team melted down and had a lot of key resignations, and the leftovers and replacements just somehow had to finish the job. That happens countless times in the valley.

Also, others have mentioned Llano and Bobcat, both of which are much better than Intel's offerings in their market segments AFAIA. It's really too bad AMD doesn't have a better mobile strategy, because some of their products conceivably could be very successful; Intel certainly isn't spotless in that area and is in fact quite vulnerable. Apple was very savvy in their purchase of that semiconductor CPU company, weren't they? Everyone pooh-poohed it at the time. Now look at them. I bet Intel wishes they'd snapped them up. Or wishes they could buy ARM, which seems to have been around since the dinosaurs but is suddenly kicking major ass in the mobile market - at Intel's expense. But I digress.

Also, bear in mind that all of us crapping on Bulldozer are serious enthusiasts and don't reflect the broader market. Unfortunately for AMD, the thermal issues with BD compared to SB are just deadly, especially with the 2nd-tier and 3rd-tier OEMs that typically use their CPU products. It's just so much EASIER to build a high-speed system with SB. It's flat-out a better product.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I found my old reply to Martimus discussing automation vs. hand design:
I wonder if AMD will come out with a second-generation Bobcat that is optimized by hand, instead of being designed completely by a program as Bobcat is said to be? It is that second part that keeps me from being too excited about this release, as I wonder how truly efficient the design can be.

I saw a really interesting poster from Intel at DAC comparing semicustom hand-implementation to fully automated synthesis, place & route. They found that hand implementation actually didn't buy them anything - in fact, the semicustom design consumed dramatically more power and area while gaining only a trivial amount of performance (~1%?). If you think about it, there are a few reasons that place&route can beat a human:

1) Humans can design fantastic bit-slices, but bit-slices aren't always optimal: hand design tends to leave a lot of empty space and waste a lot of power. For example, if you have a shifter feeding an adder (like some ugly instruction sets allow), the adder needs the lower bits to be available before the upper bits. A human isn't going to be able to optimize the shifting logic separately at every bit, and is either going to plop down one high-speed shifter optimized for bit 0 everywhere or, best case, break the datapath into a few chunks and use progressively smaller (lower-power, slower) shifters for each block of e.g. 16 bits. A tool can optimize every bit differently.

Some structures are really pathological for humans, like multipliers. The most straightforward way to place them is a giant parallelogram, which leaves two large unused triangles. You can get into some funky methods of folding multipliers to cut down on wasted space, but it gets complicated fast (worrying about routing tracks, making sure you are still keeping the important wires short, etc). A place&route tool can create a big, dense blob of logic that uses area very efficiently.

2) Modern place&route tools have huge libraries of implementations for common structures that they can select. For example, Synopsys has something called DesignWare, which provides an unbelievable selection of circuits for (random example) adders, targeting every possible combination of constraints (latency, power, area, probably tradeoffs of wire delay vs. gate delay, who knows what else). A human doing semicustom implementation doesn't actually have to beat a computer - he has to beat every other human who has attacked the problem before, and had their solution incorporated into these libraries.

3) An automated design can adapt quickly to changes. You have to break a semicustom design up into pieces and create a floorplan for the design, giving each piece an area budget and planning which directions its data comes from/goes to (e.g. "the multiplier's operands come from the left"). Once the designs are done, you now have to jiggle things around to handle parts that came in over/under budget, and you end up with a lot of whitespace. If, half way through the project, you realize you want to make a large change, you may find that too much rework is required and you're stuck with a suboptimal design.

Plop a quarter-micron K7 on top of a 32nm Llano... is it really likely that the same floorplan has been optimal since the days when transistors were slow and wires were fast, through to the days when wires are slow and transistors are fast? Engineers always talk about logic and SRAM scaling differently, yet the L1 caches appear to take a pretty similar amount of area. Shouldn't 7 process generations have caused enough churn that a complete redesign would look pretty different, even from a very high level? With an autoplaced design, you can try all sorts of crazy large-scale floorplan changes with minimal effort. If you try a new floorplan with a hand-placed design, you won't know for sure that it works until you've redesigned every last piece. You could discover a nasty timing path pretty late, and suddenly be in big trouble. It's interesting to see how on the original K7 the area was used pretty efficiently - pretty much every horizontal slice is the same width. The Llano image doesn't look quite as nice. For what it's worth, you can do similar comparisons with Pentium Pro/P2/P3/Banias/etc. On a related note, the AMD website used to have a bunch of great high-res photos of various processors. Anyone know where to find them now?

4) Not all engineers are the best engineers. You might be able to design the most amazing multiplier in the world, but a company might have a hard time finding 100 of you, and big custom designs require big teams.

If you look carefully at die photos of some mainstream Intel processors, it looks like they've actually been using a lot of automated place & route since at least as far back as Prescott. This blurry photo of Prescott shows a mix of what appears to be custom or semi-custom logic at the bottom and top-right, as well as a lot of what appears to be auto-placed logic (note the curvy boundary of logic and what looks like whitespace (darker) left of and above the center... humans just don't do that). I've also read a paper by a company involved in Cell (I think it was Toshiba) that found that an autoplaced version of Cell was faster and smaller than the original semicustom implementation.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
I found my old reply to Martimus discussing automation vs. hand design:

The counter-argument is that Design Compiler cannot tackle the critical path even with the best possible RTL seed. Even if the tool recognizes the critical path and assigns the most aggressive library (which it often refuses to do for various reasons), the tools never end up getting it exactly right, because initial placement is still based on density/routability heuristics, which is nothing more than a best guess.

No tool can be expected to get things "exactly right", but the errors it makes can be very time-consuming to fix, e.g. underestimated density results in scenic wires. Wires don't scale, so those are big problems in a high-frequency design.

With hand design, the risk is somewhat managed in the sense that you have fewer surprises.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
The counter-argument is that Design Compiler cannot tackle the critical path even with the best possible RTL seed. Even if the tool recognizes the critical path and assigns the most aggressive library (which it often refuses to do for various reasons), the tools never end up getting it exactly right, because initial placement is still based on density/routability heuristics, which is nothing more than a best guess.

No tool can be expected to get things "exactly right", but the errors it makes can be very time-consuming to fix, e.g. underestimated density results in scenic wires. Wires don't scale, so those are big problems in a high-frequency design.

With hand design, the risk is somewhat managed in the sense that you have fewer surprises.
Which is another way of saying that in engineering the BEST solution is not necessarily the one that absolutely maximizes any given parameter; rather, the best solution is one that balances as many competing parameters (debug and verification being among them) as possible.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
The counter-argument is that Design Compiler cannot tackle the critical path even with the best possible RTL seed. Even if the tool recognizes the critical path and assigns the most aggressive library (which it often refuses to do for various reasons), the tools never end up getting it exactly right, because initial placement is still based on density/routability heuristics, which is nothing more than a best guess.

No tool can be expected to get things "exactly right", but the errors it makes can be very time-consuming to fix, e.g. underestimated density results in scenic wires. Wires don't scale, so those are big problems in a high-frequency design.

With hand design, the risk is somewhat managed in the sense that you have fewer surprises.

Sure, but nobody doing aggressive design just pushes the button on Design Compiler/IC Compiler and tapes out what the tools spit out the first time. That's one of the common fallacies I've seen in place&route vs. semi-custom (manually designed standard cell implementation) discussions... a P&R advocate should never claim that the tools are magic, and needs to acknowledge that you need multiple engineers for multiple months, and a custom-design advocate has to understand that the competition is not just pushing the button on a tool (so he can't take his RTL, run it through the automated tools and say, "look, these results suck").

You said in an earlier post that you've dealt with constraining the tool to the point where it's really "almost hand design", and I think that's a common situation for aggressive designs. I still think that after all that work, you get better productivity than you would have from semicustom design. The vast majority of the logic is handled with very little [human] effort, and you can really focus on the critical portions...which you "almost" hand design in the worst cases.

As for wires, Design Compiler's Topographical Mode is supposed to be able to account for that (note to everyone: I did not read that article, just the title & looked at the pictures. It could be garbage, but the title hit the keywords I googled).

I think I disagree with the "surprises" part too... with hand design, until you've implemented all the paths in the RTL, you may miss a critical path or underestimate the area of something. After each RTL change, it can be weeks before you get updated timing information. With P&R, your first run will include every path in the design (and give you rough estimates of the areas of every part of the design) in hours or days; each RTL change can get timing feedback in hours or days. Sure, the quality of the initial design will be awful, but every path will be in your reports and you can have people look through them in parallel. There's also risk in hand design from RTL changes that cause a lot of rip-up (e.g. "oops, we need to store an extra 2 bits in each entry of this queue and add an extra read port"). To be fair about RTL changes, while they're easier in P&R design environments through most of the project, late ECOs can be more difficult to implement in a P&R design (especially if they're repeated in a regular structure like a queue) since all the regularity is gone, and the gates may look nothing like the RTL (and there's no human who made the translation, who can tell you which wires correspond to which RTL signals).

Now, if you're careless with the P&R tools, you can get yourself stuck (e.g. by overoptimizing your design before you've done all the necessary steps, e.g. hold fixing (minimum delay / early mode timing)), but every design technique I've dealt with always gave you enough rope to hang yourself. You should be periodically routing the design to ensure that you aren't going to be surprised by a route congestion issue late in the process, and have your quality of results trashed by scenic routes. Of course, just because you should do this doesn't mean everybody does ;).
 

WhoBeDaPlaya

Diamond Member
Sep 15, 2000
7,414
402
126
Which is another way of saying that in engineering the BEST solution is not necessarily the one that absolutely maximizes any given parameter; rather, the best solution is one that balances as many competing parameters (debug and verification being among them) as possible.
Pretty much, since you could conceptualize everything from even a "simple" op-amp design all the way up to an entire chip as an over-constrained optimization problem.
A nice question for undergrads in VLSI is to ask "Is a 100% yield desirable?" (assuming we're not living in fairyland of course).
...To be fair about RTL changes, while they're easier in P&R design environments through most of the project, late ECOs can be more difficult to implement in a P&R design (especially if they're repeated in a regular structure like a queue) since all the regularity is gone, and the gates may look nothing like the RTL (and there's no human who made the translation, who can tell you which wires correspond to which RTL signals)...
Ugh, some ECOs really make you want to stab someone on the design team. Unfortunately, the customer is always right and they're EDA's customers.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Also, this quote from IDC might be a little bit prescient. http://forums.anandtech.com/showpost.php?p=30358536&postcount=231

OK, I'm just going to throw this one out there as a completely wild-ass, no-valid-reasoning-whatsoever theory/forecast: what if Bobcat, the architecture, turns out to be AMD's Banias?

Not trying to make the parallel that Bulldozer becomes AMD's Netburst so they need Bobcat to save them come 16nm... but look at the big big big picture and scale out the timeline to 3 nodes from now (well, 3 nodes from when BD is introduced, so 32nm -> 22nm -> 16nm)...

Unlike Atom, Bobcat has all the fancy-shmancy (yes, that is a technical engineering term) microarchitectural features of a very modern, very advanced processor: OoO execution, register renaming, yadda yadda.

So go out a few nodes and ask yourself: when power consumption becomes all the more problematic for AMD (just as 90nm and 65nm were for Intel), would you really be at all surprised if someone in the executive team reached down into the low-power design group and said "hey, you think you could take that already super-low-power processor and maybe scale it up just a tad?"

To me, it's handwriting on the wall. Could be I just have too much hope and want to believe a little too much :p (just kidding with you, Scali ;) - you made a good point up there, but don't forget that not everyone wants to live in a fantasy-free world; sometimes it can be fun to dream. Don't feel like it's your job to bring every dreamer here back to reality in the forum; please let a few of us be blissfully ignorant every now and then)
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Also, in the same thread from last year, to prove that we were comparing Bulldozer to Netburst even then, I have yet another post that was perhaps a little too accurate:
http://forums.anandtech.com/showpost.php?p=30359318&postcount=236

I can see the parallels here, and not just on the Banias side. I also see the parallels between Bulldozer and Netburst. Bulldozer has a longer pipeline, with a more aggressive prefetcher and branch prediction logic - things that Intel really pushed with Netburst. Even if the architecture fails, it will give AMD the experience needed to improve their branch prediction in future architectures, so it won't be a worthless exercise for them.

One thing I don't understand is why Intel seems to have both a faster and a more compact cache system than AMD. (I am bringing this up as this is the other major advantage Intel has, which I had attributed to the fact that they needed better cache to compete with AMD's IMC; but thinking about it now, AMD should have been able to catch up by now.) Nehalem's cache is faster (Nehalem (2.66GHz): L1 4 cycles, L2 11 cycles, L3 39 cycles vs. AMD Phenom II X4 920 (2.80GHz): L1 3 cycles, L2 15 cycles, L3 unknown), and yet 24% smaller than Shanghai's cache. (I found here that L3 cache measured 38 cycles on Deneb, which is nearly equal to Nehalem, yet it is far less dense.) I have no experience in how cache works, but I wonder why Intel seems to be able to pack so much more cache into the same area, even at approximately the same speed. There has to be something I am missing here.

Back to Bobcat: I can easily see how AMD's development on that core will affect development of their other architectures, if for no other reason than it will force them to find solutions to problems they could previously put off, since those problems weren't as important in the server/desktop market. They can then use these solutions in future server/desktop architectures, even if they don't derive the overall architecture from Bobcat.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Pretty much, since you could conceptualize everything from even a "simple" op-amp design all the way up to an entire chip as an over-constrained optimization problem.
A nice question for undergrads in VLSI is to ask "Is a 100% yield desirable?" (assuming we're not living in fairyland of course).

The method of Lagrange multipliers is a difficult concept for some to take from the textbook and apply to real-world project management philosophy.
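(For reference, the textbook statement is just the first-order condition: at a constrained optimum, the objective's gradient is a scalar multiple of the constraint's gradient.)

```latex
% Extremize f(x) subject to g(x) = c: at the optimum x*,
\nabla f(x^{\ast}) = \lambda \, \nabla g(x^{\ast}), \qquad g(x^{\ast}) = c
```

The project-management reading of it: once the constraints bind, chasing any single parameter's unconstrained maximum is the wrong target.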

I've met many engineers who could see it when they applied it themselves in the pursuit of solving their own problems in the lab but could not see it in the spirit of the decisions made by upper management.

Forest for the trees.

The engineer referenced at the top of this thread seems like he might be one of those types who just never got it, which is sad, really, if you think about it.

His full name is plastered all over his short-sighted assessments; that's a tough egg-on-face situation to live down and overcome going forward.
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
Well, this is news to me, and it explains a few things for sure. BD with 20% more performance and 20% less power consumption would have been FAR more appealing than what we have here.