AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans

Discussion in 'CPUs and Overclocking' started by thilanliyan, Nov 11, 2009.

  1. JFAMD

    JFAMD Senior member

    Joined:
    May 16, 2009
    Messages:
    565
    Likes Received:
    0
    I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

    If 2 cores get you 180% of the performance of 1, then in simple terms, that extra core is the 50% that gets you the 80%.

    What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module?" So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save? The answer was ~5%.

    Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.
     
  2. cbn

    cbn Lifer

    Joined:
    Mar 27, 2009
    Messages:
    11,645
    Likes Received:
    129
    Does Intel Hyper-Threading reduce efficiency when the processor is dealing with a smaller number of threads?

    For example, could a quad core with Hyper-Threading perform worse than a quad core without it, provided the number of threads does not exceed four? What about when overclocked? If so, how much difference are we talking?
     
  3. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    The answer is that yes it does, but not because of the hardware per se, but because of thread migration and scheduling as dictated by the OS in question.

    Kind of like how Vista without TRIM vs. Win7 with TRIM can make all the difference in the performance of your Intel or OCZ SSD.

    Check out IntelUser's link above where he highlighted the thread scaling difference for the i7 with and without HT in the Euler3D benches.

    Anytime the shared resources become critical to the execution speed of the threads themselves the increase in threads/cores sharing those resources can degrade performance. We just hope those cases are rare.

    Take either a modern-day i7 or PhII with their shared L3$. In theory you could end up with a program that critically depends on the L3$ size for just a single thread; adding a second thread that also needs that L3$ space suddenly results in considerable IMC activity and DRAM fetches, and now all of a sudden both threads are stalling like crazy and your speedup actually becomes worse than if you had just stuck with a single thread. It can happen, and it does happen, though with extreme rarity, but the concerns are legitimate.
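
    A toy model, with completely made-up numbers, just to illustrate the mechanism:

        # Toy model of two threads sharing an L3: once the combined working set
        # spills out of the cache, the extra DRAM fetches can erase the speedup.
        # Every number here is invented purely for illustration.
        L3_MB = 8                    # shared L3 size
        HIT_NS, MISS_NS = 10, 100    # rough cost per unit of work

        def time_per_unit(working_set_mb, threads):
            # crude assumption: hit rate = fraction of the combined working set
            # that still fits in the shared L3
            combined = working_set_mb * threads
            hit_rate = min(1.0, L3_MB / combined)
            return HIT_NS * hit_rate + MISS_NS * (1 - hit_rate)

        single = time_per_unit(6, 1)    # one 6MB working set fits: fast
        dual = time_per_unit(6, 2)      # two 6MB working sets thrash the L3
        print(f"2-thread speedup vs 1: {2 * single / dual:.2f}x")   # 0.50x, i.e. slower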
     
  4. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,845
    Likes Received:
    845
    It will be interesting to see what performance from Bobcat will be like in production chips, though I would think that Bobcat will have process technology on its side (at least) when comparing it to K8. It will also be interesting to see how many cores AMD decides to deploy. I could easily see four Bobcat cores in a netbook, or at least two.

    I would agree that Atom has been "kept in its place", so to speak, mostly by the chipset that hobbles it, but also by some of the other issues you brought up.

    32nm Atom with sleep states could probably be rolled out at 2.5-3GHz in dual-core form and stay within its current 4W power envelope (or something very close to it), or 2GHz in dual-core form at a lower power envelope. Whether or not that would be competitive with a Bobcat dual or quad system remains to be seen, though it would probably be a lot cheaper.

    After that, it's a matter of figuring out what consumers want in different market segments. Personally I think netbook buyers will want more performance with the same battery life, which is where I think Bobcat could win out (at least over next-gen Atom anyway). Then Intel will have to move other products into the netbook market which is where things could get interesting.

    My guess is lack of vision or lack of R&D budget. Also keep in mind that was in an era when Intel was still engaging in incentive programs to keep AMD chips out of certain market segments, which may or may not have had an influence on AMD's ability to penetrate new markets. Or maybe they just didn't see the need for a 300MHz Athlon XP. Or maybe they couldn't get them running at such low voltages reliably based on their own binning practices; just because Tom's could do it with one system to their own satisfaction doesn't necessarily mean that the same parts would have cleared AMD's QA processes. To further complicate things, back then AMD wasn't in as much control of their own platforms as they are now, so it would have been a matter of the chipsets passing Nvidia QA and then the mobos passing OEM QA before being ready for sale.

    I'm sure it could have been done, but I have doubts about how many of the parts AMD and Nvidia generated back then could have run reliably at those speeds and power levels.
     
  5. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Coincidentally Intel just published an article today on semiconductor.net discussing this very subject matter:

    How cool is that? Data and all, 320mV!
     
  6. IntelUser2000

    IntelUser2000 Elite Member

    Joined:
    Oct 14, 2003
    Messages:
    4,526
    Likes Received:
    186
    Possible, but not likely. The "Lincroft" Atom will be on the 45nm SoC process, which is high-performing for an SoC process, but still a loss compared to the 45nm HP process. Quite likely the 32nm Atom, Medfield, will be on the 32nm SoC process.

    Slight mistake. It was from their presentation about various Multi-threading technologies.

    Here: http://data5.blog.de/media/732/3663732_9bc35365d1_l.png

    It might not exactly be Bulldozer, but it was made back when 2009 Bulldozer was being talked about.

    And what JFAMD said: " I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

    If 2 cores get you 180% of the performance of 1, then in simple terms, that extra core is the 50% that gets you the 80%."

    I guess it can be interpreted as: "If a hypothetical single(mini) core Bulldozer based CPU without CMT-like capabilities were compared to the CMT-enabled 1 module/2 core version, the performance improvement would be 80% and die size increase would be 50%".

    There is still an IPC advantage of approximately 10% per clock.

    Core 2 Q9650 3GHz vs Phenom II X4 940 3GHz: http://www.anandtech.com/bench/default.aspx?p=49&p2=80

    You can see that the Phenom II X4 outperforms the Core 2 Quad in the latter part of the multi-threaded benches.

    Now Phenom II X2 550BE 3.1GHz vs Core 2 Duo E8400 3.0GHz

    http://www.anandtech.com/bench/default.aspx?p=56&p2=97

    You can see there are no cases where the Core 2 Duo loses to the slightly higher-clocked Phenom II X2.

    How did it go from losing in some apps at the same clock speed to never losing at a lower clock speed? Probably because the Core 2 Quad is two dual cores kludged together (aka an MCM), while the Phenom II X4 isn't an MCM of two Phenom II X2s.

    Nehalem isn't a kludge core, but a well-architected quad core with a fast memory controller and interconnects.
     
    #156 IntelUser2000, Nov 19, 2009
    Last edited: Nov 19, 2009
  7. DrMrLordX

    DrMrLordX Diamond Member

    Joined:
    Apr 27, 2000
    Messages:
    8,845
    Likes Received:
    845
    So you don't think Intel will bother pushing clockspeed on Medfield much? I really don't think it will be a good idea for Intel not to force the issue of performance in that segment. Though, if Tom's Hardware is to be believed (http://www.tomshardware.com/news/Intel-Atom-Medfield,6671.html - yes, I know it's a fairly old article), AMD is going to avoid the netbook market altogether.
     
  8. JFAMD

    JFAMD Senior member

    Joined:
    May 16, 2009
    Messages:
    565
    Likes Received:
    0
    If you start to deconstruct a bulldozer die, you start to see that the *unique* integer components are a small piece.

    Start with probably half the die being cache (I am guessing, I don't have the numbers, only the die size). Then, you have to consider the northbridge circuits. Then the memory controller and the HT PHY. What you are left with is the bulldozer modules. Of those, the fetch, decode, L2 cache and FPU are all shared, so they would not go away by removing the second core from each module.

    You start to see that if you removed one integer core from each module (by taking out the discrete circuitry), you really are talking about a very small part of the die.
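
    Back-of-the-envelope, with guessed fractions just to show the accounting (again, I don't have the actual layout numbers):

        # Rough die-area budget sketch. The fractions below are guesses, chosen
        # only to show why deleting one integer core per module barely moves the total.
        die = {
            "L3 cache":           0.40,
            "northbridge + IMC":  0.15,
            "HT PHY / misc I/O":  0.05,
            "modules (4x)":       0.40,   # shared fetch/decode, L2, FPU + 2 integer cores each
        }
        # Assume the two dedicated integer cores are ~25% of a module's area;
        # the shared front end, FPU and L2 make up the rest.
        integer_share_of_module = 0.25

        saved = die["modules (4x)"] * integer_share_of_module / 2
        print(f"Dropping one integer core per module saves ~{saved:.0%} of the die")
        # ~5% of total die area, even though half the "cores" are gone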
     
  9. IntelUser2000

    IntelUser2000 Elite Member

    Joined:
    Oct 14, 2003
    Messages:
    4,526
    Likes Received:
    186
    The whole issue with calling something a "Netbook" with processors like Bobcat and Atom is that, depending on what the manufacturer wants to do, it might be a "Netbook" or become a "Notebook"; it's hard to draw the line. Doesn't it seem likely the market will at least put Bobcat into the high-end Netbook segment, given the rough die and power estimations they gave us? The 1-10W range suggests we'll see single- and dual-core versions, with low- and higher-TDP versions of each. It looks like AMD will fill the niche Via is leaving behind, that is, a notch higher than Atom but lower than regular laptop CPUs.

    Possibly, we might see dual-core versions of Medfield and its derivatives. Maybe that and some architectural enhancements. Clock speed increases don't seem too efficient. Dual cores were already hinted at with Moorestown, and an integrated memory controller on an in-order CPU should do well.
     
  10. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    I can't speak for anyone else beyond myself, but I think it is perfectly clear what you are communicating.

    ~10% of the die is integer units; remove half of those integer units and you've reduced the die size by ~5%.

    I can see the flip-side of the question being "if AMD's engineers knew they didn't need to make the module's other shared components as beefy as they did because that 5% would end up being removed then how much smaller would the rest of the die have been?".

    Presumably the shared resources were beefed up to diminish some of the anticipated performance degradation that would come from resource contention, L2$ sizes are larger than they otherwise would have been, etc. (i.e. there is a reason the expected thread scaling efficiency is 80% and not 70% or 50% or 10% for two threads in a bulldozer module)

    So if you removed one half of the integer units and also removed the excess portions of the shared resources (basically re-optimized the architecture to handle one thread instead of two), then how much smaller still would the die have become?

    In that case I imagine the value does approach 50%, sans the non-thread-count-scaling components such as the IMC, etc.

    edit: Something occurred to me since I made this post... perhaps the shared resources would actually be left in place even in a hypothetical bulldozer module in which half the integer units were removed, because one advantage of having shared resources is that 100% of those resources are available to assist the processing rate of a single thread in the event that only a single thread is tasked to a given bulldozer module. So for single-thread-per-module performance reasons those beefed-up shared resources would be retained anyway: they are dual-purpose, and removing half the integer units only eliminates one of the two purposes they serve.
     
    #160 Idontcare, Nov 19, 2009
    Last edited: Nov 19, 2009
  11. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Has it been mentioned explicitly for Bobcat yet whether or not it will be a fusion-product with on-die IGP and possibly dual-purposing as an APU?

    If it is, then that would certainly step up Bobcat's game when it comes to competing with Atom/Via/Ion out there.
     
  12. JFAMD

    JFAMD Senior member

    Joined:
    May 16, 2009
    Messages:
    565
    Likes Received:
    0
    You are absolutely correct. Pretty soon you start to get into angels-dancing-on-the-head-of-a-pin territory. I am a simple guy at heart; I like to address technical things in simple terms. Someone can add an asterisk to just about anything that someone else says, but as long as the general premise is directionally correct, you quickly get into diminishing returns.

    And, if someone wanted to split hairs, I would actually have a lower starting point because it was actually in the 4.something range. I said 5% because there are people who take things too literally, so I tend to be conservative in my statements.
     
  13. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Beautifully stated. That should be included in our forum's TOS.
     
  14. deimos3428

    deimos3428 Senior member

    Joined:
    Mar 6, 2009
    Messages:
    699
    Likes Received:
    0
    It's really not surprising that a dual core scales better than a single core with HT, though, is it? I'm more interested in how it compares to other apples, namely, how does a single BD module scale vs. other dual cores?

    Using the data provided by Idontcare above, it looks like the Opterons are getting 76-92% scaling going from single core to dual, 69-73% going from dual to quad, and 54-58% going from quad to octo. The Xeons weren't faring anywhere near as well, with ranges of 68-76%, 46-53%, and 39-45%. (I ignored the 6-thread info completely, but otherwise there's no fancy math involved here, just dividing the latter score by the former and subtracting one to get the scaling as the number of cores doubles; there's a quick sketch of it at the end of this post.)

    In that light if the BD is scaling at about 80%, it would seem most useful beyond four threads, as we've already got roughly equivalent levels of scaling in the Opteron for four threads or less.
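
    A quick Python sketch of that arithmetic, with placeholder scores (substitute the actual results from Idontcare's table above):

        # Scaling per doubling of cores: divide the score at 2N threads by the
        # score at N threads and subtract one. Scores below are placeholders only.
        scores = {1: 1.00, 2: 1.84, 4: 3.10, 8: 4.85}   # hypothetical throughput

        for n in (1, 2, 4):
            gain = scores[2 * n] / scores[n] - 1        # extra throughput from doubling
            print(f"{n} -> {2 * n} cores: {gain:.0%} scaling")
        # e.g. 1 -> 2 cores prints 84% with these placeholder numbers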
     
  15. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    deimos3428, my current interpretation of the "80%" number is that you would reduce existing thread scaling to 0.8x and arrive at the Bulldozer thread scaling equivalent assuming you could hold all other limitations to scaling constant (degree of software parallelism, latency and bandwidth of interprocessor communications topology, single-threaded IPC capability, etc).

    So if an otherwise equivalent CPU generated a speedup of, say, 6x with 8 cores (so thread scaling efficiency equals 75% for the given app at 8 threads), we should expect an 8-core Bulldozer CPU to produce a speedup of 0.8 * 6 = 4.8x, for an overall thread scaling efficiency of 60% (again, just for that app and only for an 8-thread comparison; this is sketched in code at the end of this post).

    So the next question would naturally be "why make a cpu that gives you 60% thread scaling in app XYZ for 8 threads if I can get 75% thread scaling in the same app for the same number of threads on a different CPU?".

    The answer would be three-fold. First, the "cost" of the bulldozer chip would presumably be less than that of the full-fledged octo-core comparison chip, because incorporating four of those eight thread processors only increased the die size by 5%. So it favors the price/performance end of things.

    Second is that absolute performance might very well still be higher if the IPC per thread is higher despite the lower thread-scaling efficiency (this is the case for 8 threads on bloomfield w/HT vs. a dual-socket shanghai opteron in the Euler3D bench). Again this would speak to price/performance.

    And a third reason would be that despite having lower overall thread efficiency, owing to the reduced footprint of the module itself over that of implementing two fully isolated cores, AMD (and Intel) can elect to "throw more cores" at the problem in an effort to boost absolute performance higher (enter Magny-Cours, Interlagos, Beckton w/HT) regardless of the diminishing thread-scaling efficiency (Amdahl's Law) incurred by doing so.
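
    Sketched out in code, with made-up numbers (this is just my interpretation of the 80% figure, not anything AMD has published):

        # Take the thread scaling an otherwise-equivalent CPU achieves for some
        # app and multiply by 0.8 to estimate what a CMT-style Bulldozer module
        # arrangement would deliver at the same thread count.
        def projected_bd_speedup(baseline_speedup, cmt_factor=0.8):
            return baseline_speedup * cmt_factor

        threads = 8
        baseline = 6.0                              # e.g. 6x speedup on 8 conventional cores
        bd = projected_bd_speedup(baseline)         # 0.8 * 6 = 4.8x
        print(f"baseline: {baseline / threads:.0%} efficiency")        # 75%
        print(f"Bulldozer estimate: {bd:.1f}x, {bd / threads:.0%}")    # 4.8x, 60%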
     
    #165 Idontcare, Nov 19, 2009
    Last edited: Nov 19, 2009
  16. deimos3428

    deimos3428 Senior member

    Joined:
    Mar 6, 2009
    Messages:
    699
    Likes Received:
    0
    Thanks for that excellent explanation.
     
  17. Mothergoose729

    Mothergoose729 Senior member

    Joined:
    Mar 21, 2009
    Messages:
    392
    Likes Received:
    0
    In order for Atom to remain a cutting-edge product for consumers, Intel engineers need to add OOO back into the chip. Without it the processor will always seem "almost fast", and much more competent chips that do have it will quickly come to market and take its place. Intel also needs to get their graphics situation sorted out, and pretty quickly. Consumers expect their laptops to be able to do anything, most specifically surf the internet and play all forms of digital media. Netbooks can't really do that with Intel IGP. Asus just recently announced that they will now ship all their netbooks with Ion; as the first mass producer of Atom netbooks, that should say to Intel, "hint, hint, make better graphics or lose your valuable IGP market".
     
  18. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Why was OOO processing not implemented in Atom to begin with? Was a public reason ever stated?

    I'd assume that while OOO improves performance, it doesn't do so while adhering to the 2:1 rule - "a 2% performance increase can increase power consumption by no more than 1%".
     
  19. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Goto-san has put up his weekly digest with Bulldozer being the topic this week:

    I liked the graphic as it simplistically illustrates the implicit trade-off between the degree of shared resources and the impact on performance and cost from doing so.
     
  20. Martimus

    Martimus Diamond Member

    Joined:
    Apr 24, 2007
    Messages:
    4,386
    Likes Received:
    1
    http://www.anandtech.com/showdoc.aspx?i=3276&p=6

    They may yet make an OOO Atom chip, but I would expect them to continue reducing the power consumption so they can get this chip into smart phones and similar devices. We may see a divergence where they create a more powerful Atom for netbooks, and a less power hungry Atom for smaller devices.

    EDIT: Also, I believe you are confusing Intel's 1% rule with their old 2% rule. It used to be that Intel could add a feature to a microprocessor design if it gave a 1% increase in performance for at most a 2% increase in power. However, this rule was changed so that now Intel may only add a feature if it yields a 1% increase in performance for at most a 1% increase in power consumption. Although I could definitely be wrong when it comes to Atom, and maybe they have the rule you stated specifically for that platform.
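
    As a trivial sketch of how I understand the rule (not Intel's actual process, obviously):

        # A feature is accepted only if its % power cost stays within the allowed
        # multiple of its % performance gain. Old rule: multiple = 2; new rule: 1.
        def accept_feature(perf_gain_pct, power_cost_pct, max_power_per_perf=1.0):
            return power_cost_pct <= perf_gain_pct * max_power_per_perf

        # A feature giving 3% performance for 5% power passes the old rule
        # (5 <= 3 * 2) but fails the new one (5 > 3 * 1).
        print(accept_feature(3, 5, max_power_per_perf=2.0))   # True
        print(accept_feature(3, 5, max_power_per_perf=1.0))   # False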
     
    #170 Martimus, Nov 20, 2009
    Last edited: Nov 20, 2009
  21. Idontcare

    Idontcare Elite Member

    Joined:
    Oct 10, 1999
    Messages:
    21,130
    Likes Received:
    9
    Yeah this is what I was thinking of when I made my post.
     
  22. TuxDave

    TuxDave Lifer

    Joined:
    Oct 8, 2002
    Messages:
    10,577
    Likes Received:
    3
    If you think that's bizarre, maybe 6-7 years ago I was reading a paper on sub-threshold logic where things never really turn off or on. Of course I think they were talking about clock frequencies on the order of kHz.
     
  23. Fox5

    Fox5 Diamond Member

    Joined:
    Jan 31, 2005
    Messages:
    5,958
    Likes Received:
    1
    I got the feeling Atom was going for the absolute lowest production cost and power draw, regardless of power efficiency (since Atom is very inefficient compared to even the 90nm Pentium M, let alone Core 2 Duo).
     
  24. cbn

    cbn Lifer

    Joined:
    Mar 27, 2009
    Messages:
    11,645
    Likes Received:
    129
    To someone like me (who doesn't know much about computer science) higher IPC makes sense. Even if scaling per core is less, the overall effect is still greater.

    On top of that, I wonder how much power their dual-module (quad-core) Bulldozer will draw. I am guessing it won't draw that many watts relative to its processing power.
     
  25. cbn

    cbn Lifer

    Joined:
    Mar 27, 2009
    Messages:
    11,645
    Likes Received:
    129
    I have noticed this L3 cache takes up quite a bit of die area as well as adding quite a few xtors.

    Something tells me this approach isn't really energy efficient or cost effective. But I guess there comes a point where there is nothing else that can be done.