AMD summit today; Kaveri cuts out the middle man in Trinity.

Status
Not open for further replies.

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
where? somewhere between the L1D cache and the load buffer?
Indeed. In the same spot where contiguous data is extracted from a cache line, just eight times in parallel. It's really not a whole lot of extra circuitry. It's basically just simple unidirectional shifters with byte granularity and a narrow 32-bit output (combining multiple ones for 64-bit gather elements or contiguous data). For reference note that Haswell will be capable of 32 floating-point arithmetic operations per cycle. So surely having logic for gathering 8 elements in parallel isn't a big deal.
the obvious goal is to base it on the load buffer [1] in order to gracefully handle cache misses
Exactly. And it also already handles elements that straddle multiple cache lines. Now the only unknowns are how your implementation blends the result register and writes back the mask register. So why would you stop short at four elements per cache line per cycle if you can easily get eight, handle it all in one load unit, and use the other one to write back the mask register?
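For anyone following along who hasn't read the AVX2 documentation, here is a rough Python model of what a masked 8-element gather has to do architecturally: load only the lanes whose mask is set, leave the other lanes of the result register alone (the "blend"), and clear the mask as elements complete (the mask write-back). It's only a sketch. The function name and the list-based "memory" are made up for illustration, addressing is per element rather than per byte, and it says nothing about how Haswell actually implements this in hardware.

```python
def masked_gather(dest, memory, base, indices, mask, scale=1):
    """Model of a masked gather on Python lists (element-addressed, not bytes)."""
    result = list(dest)        # lanes with a clear mask keep the old destination value
    mask_out = list(mask)
    for lane, (index, active) in enumerate(zip(indices, mask)):
        if active:
            result[lane] = memory[base + index * scale]
            mask_out[lane] = False   # the mask element is cleared once its load completes
    return result, mask_out


# Illustrative data: 64 floats standing in for memory, 8 lanes as in a 256-bit gather.
memory = [float(i) * 10 for i in range(64)]
dest = [-1.0] * 8
indices = [3, 0, 17, 5, 42, 8, 1, 60]
mask = [True, True, False, True, True, False, True, True]

gathered, mask_out = masked_gather(dest, memory, 0, indices, mask)
print(gathered)
print(mask_out)
```

Lanes 2 and 5 keep their old value of -1.0 because their mask is clear, and the mask comes back all-false once every requested element has been loaded.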

I mean, you have about 90% of the same circuitry for supporting gather, but since it takes two cycles it's like having 80% higher cost per instruction. AVX2 is all about achieving high throughput at low power consumption. And this also means they're not going to intentionally make things slower just to save a handful of transistors. Saving them is a big waste in performance/Watt.
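To spell out that arithmetic, here is a tiny sketch under the assumption that "cost per instruction" scales as circuitry times the cycles the unit is occupied. That cost model and the function name are my own simplification, not anything from Intel.

```python
def relative_cost_per_gather(circuitry_fraction, cycles_per_gather):
    """Assumed cost model: hardware spent multiplied by cycles the unit is busy."""
    return circuitry_fraction * cycles_per_gather


full_speed = relative_cost_per_gather(1.0, 1)   # all the circuitry, 1-cycle gather
cut_down = relative_cost_per_gather(0.9, 2)     # about 90% of the circuitry, 2 cycles

extra_percent = (cut_down / full_speed - 1.0) * 100.0
print(f"{extra_percent:.0f}% higher cost per gather")
```

With those inputs the cut-down design comes out at 1.8 times the cost of the full-speed one, which is where the 80% figure comes from.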

If you expect them to make it faster in future processors, how about a maskless version which doesn't occupy any arithmetic execution ports? They could even do two of those every cycle! Just compare that to Knights Corner, whose gather width matches the FMA width. Starting out with a 1-cycle gather is clearly a necessity for efficient homogeneous computing. They didn't cheap out on anything else for AVX2...

So I really think you can safely increase your expectations for Haswell.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
And it also already handles elements that straddle multiple cache lines.

Very slowly on SNB/IVB for 256-bit vmovups; that's why it's faster to do two 128-bit unaligned loads and then merge the results. The Intel compiler, for example, does just that.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Very slowly on SNB/IVB for 256-bit vmovups; that's why it's faster to do two 128-bit unaligned loads and then merge the results. The Intel compiler, for example, does just that.
Yes but that's because the load ports are still 128-bit each and they have to sync up to handle 256-bit. Hence dealing with unaligned 256-bit data is very problematic. Haswell will make them 256-bit each so vmovups will become faster than two 128-bit loads.

And none of this is even relevant for gather since each element is only 32 or 64-bit.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
An ARM Cortex A5 CPU would also be fused on the APU for HSA purposes.

Do the what now?

I'm pretty certain this guy has absolutely no idea what he's talking about.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
The problem with APUs is still the memory bandwidth.
Dual-channel DDR3 is holding them back, even with DDR3-2133.

DDR3-2133 = 64 x 2.133 / 8 x 2 = 34.128 GB/s (in dual channel)

The problem is that even a 7750 has around 72 GB/s.

DDR4 isn't going to come soon enough; APUs will need quad-channel RAM soon to feed the IGPs.

4 x DDR3-1866 could feed an APU ~60 GB/s of memory bandwidth.

**** and really, how expensive is it to buy 4 sticks of memory instead of 2 nowadays?
I'm sure the improvement in performance would outweigh what the extra RAM adds to the total system cost.

I wish desktop APUs would come with quad-channel support.


**** edit on DDR4:

DDR4 itself is a DRAM interface specification. Its primary benefits compared to DDR3 include a higher range of clock frequencies and data transfer rates (2133–4266 MT/s compared to DDR3's 800–2133[4][5])


DDR4-4266 = 64x4.266 / 8 x 2 = 68.3 GB/s (dual channel DDR4-4266 memory)

I guess 2 sticks of really fast DDR4 could do the trick too..... man we need DDR4 to come out soon.
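The numbers in this post all come from the same formula: transfer rate times bus width in bytes times number of channels. Here is a quick sketch of it. The function name is made up, and these are theoretical peaks assuming a 64-bit channel, not measured throughput.

```python
def peak_bandwidth_gbs(transfer_rate_mts, channels, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s for a DDR-style memory interface."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


HD7750_GBS = 72.0  # figure quoted in the post above

configs = {
    "DDR3-2133 dual channel": (2133, 2),
    "DDR3-1866 quad channel": (1866, 4),
    "DDR4-4266 dual channel": (4266, 2),
}

for name, (rate, channels) in configs.items():
    gbs = peak_bandwidth_gbs(rate, channels)
    print(f"{name}: {gbs:.1f} GB/s ({gbs / HD7750_GBS:.0%} of a 7750)")
```

Dual-channel DDR3-2133 lands at a bit under half of the 7750's 72 GB/s, while quad-channel DDR3-1866 or dual-channel DDR4-4266 would get much closer.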
 
Aug 11, 2008
10,451
642
126
The problem with APUs is still the memory bandwidth.
Dual-channel DDR3 is holding them back, even with DDR3-2133.

DDR3-2133 = 64 x 2.133 / 8 x 2 = 34.128 GB/s (in dual channel)

The problem is that even a 7750 has around 72 GB/s.

DDR4 isn't going to come soon enough; APUs will need quad-channel RAM soon to feed the IGPs.

4 x DDR3-1866 could feed an APU ~60 GB/s of memory bandwidth.

**** and really, how expensive is it to buy 4 sticks of memory instead of 2 nowadays?
I'm sure the improvement in performance would outweigh what the extra RAM adds to the total system cost.

I wish desktop APUs would come with quad-channel support.


**** edit on DDR4:

DDR4 itself is a DRAM interface specification. Its primary benefits compared to DDR3 include a higher range of clock frequencies and data transfer rates (2133–4266 MT/s compared to DDR3's 800–2133[4][5])


DDR4-4266 = 64x4.266 / 8 x 2 = 68.3 GB/s (dual channel DDR4-4266 memory)

I guess 2 sticks of really fast DDR4 could do the trick too..... man we need DDR4 to come out soon.

Yes, all those projected improvements are very nice. Especially on the GPU side though, I would be very surprised to see performance equal to a HD7750, especially in anything memory bandwidth limited. If they can pull it off though, I would have to consider it very seriously if the price is right relative to Intel + discrete.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
B/W could be provided by using stacked RAM. Thousands of TSVs can be more power efficient than adding 2 DDR4 channels. Latency could improve as well, and the parallelism would allow accessing many blocks of DRAM at once.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
more about kaveri...
FM2 socket
1H of 2013
15-25% more IPC
DDR3-2133MHz memory and a 4MB L3 Cache
IGP based on 8xxx series..."superior performance than a HD 7750"

http://wccftech.com/amds-kaveri-bas...s-steamroller-cores-compatibility-fm2-socket/

meh....too good to be true
All of them are plausible.

15-25% IPC? Well, Steamroller is a significant redesign of Bulldozer, not just a simple tweak like Piledriver. They'll still be behind Intel.

DDR3-2133? I don't see this being remotely unreasonable for either camp.

4MB L3? They're changing from 32nm to 28nm, so there's potential for more real estate being available for L3 cache. We already know memory addresses are unified, and without L3, that benefit would be much smaller.

As far as graphics performance goes, Haswell GT3 will be an HD 7750 level part, so why not Kaveri?
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Security does not = HSA purposes.

They didn't add the ARM cores for anything remotely involved with HSA and GPGPU. They did it because it was a simple fix in filling a gap in their features/security list.

Ahh sorry, I didn't realize he was talking about HSA.

You are right ;)
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
The GPU part has a lot of question marks.

The link claims the same number of SPs as Trinity, or 384 SPs. AMD said at 2013 Financial Analyst Day the 2013 Kaveri would have 8 CUs, or 512 SPs. That doubt is also backed by performance. The 7750 has 512 SPs.
Yeah, I'm thinking the shader count is off. Not to mention, Kaveri will be running at a lower clock speed than Cape Verde.
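To put rough numbers on the shader count and the clock speed point, here is a small sketch. It assumes 64 SPs per GCN compute unit and 2 FLOPs per SP per clock (one fused multiply-add), which is how GCN peak rates are normally quoted. The 7750 runs at 800 MHz; the Kaveri clocks in the loop are purely hypothetical placeholders, not leaked specs.

```python
SPS_PER_GCN_CU = 64          # stream processors in one GCN compute unit
FLOPS_PER_SP_PER_CLOCK = 2   # one fused multiply-add counts as two FLOPs


def shader_count(compute_units):
    return compute_units * SPS_PER_GCN_CU


def peak_gflops(shaders, clock_mhz):
    """Theoretical single-precision peak in GFLOPS."""
    return shaders * FLOPS_PER_SP_PER_CLOCK * clock_mhz / 1000.0


kaveri_sps = shader_count(8)            # 8 CUs, as stated at the analyst day
hd7750_gflops = peak_gflops(512, 800)   # HD 7750: 512 SPs at 800 MHz

print(f"8 CUs = {kaveri_sps} SPs")
print(f"HD 7750 peak: {hd7750_gflops:.1f} GFLOPS")
for clock_mhz in (600, 700, 800):       # hypothetical Kaveri GPU clocks
    print(f"{kaveri_sps} SPs at {clock_mhz} MHz: {peak_gflops(kaveri_sps, clock_mhz):.1f} GFLOPS")
```

With the same 512 SPs, any clock below 800 MHz puts the theoretical peak below the 7750 before memory bandwidth even enters the picture.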
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
The link claims the same number of SPs as Trinity, or 384 SPs. AMD said at 2013 Financial Analyst Day the 2013 Kaveri would have 8 CUs, or 512 SPs. That doubt is also backed by performance. The 7750 has 512 SPs.

good catch!

but, well... kaveri is probably under tape-out phase, number of shaders might change wildly...

unless AMD uses RCM for GPUs as well... I doubt that they will reach 7750 performance within a 100W TDP
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Estimates put Kaveri (at least the desktop version) at a 900MHz GPU clock.

Chances are the mobile version will clock quite high as well, considering the mobile GCN parts clock quite high (500MHz+).

It'll be interesting to see how AMD deals with the cache/bus width issue they're struggling with, or certainly will be struggling with given the amount you need to feed a strong GPU. Intel's approach is nestled in their fab advantage and cache (remember cache reacts really well to die shrinks) whereas AMD will likely resort to something else in the near future.

Stacking perhaps? packaged RAM? Both of those would be interesting. The latter would require a redesign of the motherboard and could potentially mean you wouldn't have to buy RAM for your system nor would there be channels for it (obviously :p).

The benefits of DDR4 aren't that great, and that's one of the reasons it's taking so long to come to market. Triple channel and quad channel RAM costs money and inflates the overall cost of the platform (remember they're making larger and thus fewer chips per wafer as well). DDR4 isn't going to change anything radically nor provide significantly increased throughput capacity, but at least it's an option. I don't think it'll be enough to feed the on-die GPUs, considering just how fast both companies are increasing their on-die GPU performance.

Whatever it is AMD decides to do they'll certainly have to get creative. Apparently, Intel already has :p
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
good catch!

but, well... kaveri is probably under tape-out phase, number of shaders might change wildly...

unless AMD uses RCM for GPUs as well... I doubt that they will reach 7750 performance within a 100W TDP

Doesn't a 7750 have a max TDP of around 50 watts?

Now remove some of the components that are present on the card, and the GDDR5...
In IGP form, that 50-watt TDP discrete card becomes maybe a 30-watt TDP IGP, or something like that?

I could see an IGP matching a 7750 for horsepower; however, it would lack memory bandwidth.
So even if it were fast at lower resolutions, you couldn't use it well at higher resolutions.
Having at best "half" the memory bandwidth of a 7750 is the real issue.

The best case would be DDR3-2133 users, and even they would probably run into memory bandwidth issues.

*** If AMD reaches IGP performance near the 5770-7750 level, that would be pretty amazing though.
There are still 5770s and 7750s selling for around $100 as discrete cards.

Getting that level of performance from an IGP has to count for something.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Doesn't a 7750 have a max TDP of around 50 watts?

Now remove some of the components that are present on the card, and the GDDR5...
In IGP form, that 50-watt TDP discrete card becomes maybe a 30-watt TDP IGP, or something like that?

I could see an IGP matching a 7750 for horsepower; however, it would lack memory bandwidth.
So even if it were fast at lower resolutions, you couldn't use it well at higher resolutions.
Having at best "half" the memory bandwidth of a 7750 is the real issue.

The best case would be DDR3-2133 users, and even they would probably run into memory bandwidth issues.

*** If AMD reaches IGP performance near the 5770-7750 level, that would be pretty amazing though.
There are still 5770s and 7750s selling for around $100 as discrete cards.

Getting that level of performance from an IGP has to count for something.

They're already hitting a roadblock and it's been that way since the Llano.

http://www.realworldtech.com/fusion-llano/2/

The Fusion GPUs have a dedicated non-coherent interface to the memory controller (the Radeon Memory Bus or Garlic, shown with a dotted line) for commands and data. The bus is 256-bits (32B) wide in each direction and is replicated for each memory channel (2x32B read and 2x32B write for Llano, half for Zacate). Garlic operates on the Northbridge clock – up to 720MHz for notebook versions of Llano and 492MHz for Zacate. This is a factor of 2-3X more bandwidth than memory can provide (roughly 17GB/s measured), which is needed to handle bursts of memory transactions (e.g. texture reads).
The GPU has a separate interface for sending memory requests that target the coherent system memory. The Fusion Control Link (or Onion) is a 128-bit (16B) bi-directional bus that feeds into a memory ordering queue shared with the coherent requests from each of the 4 cores. Onion runs at up to 650MHz for notebook variants of Llano (10.4GB/s read + 10.4GB/s write) and 492MHz for Zacate. An arbiter in the IFQ is responsible for selecting coherent requests (based on memory ordering) to send to the memory controller. Desktop versions of Llano will probably run Garlic and Onion faster still, given the extra power budget.
The memory controller arbitrates between coherent (i.e. ordered) and non-coherent accesses to memory. Llano has two 64-bit channels of DDR3 memory that must operate independently, while the smaller Fusion cousin only has a single channel. The GPU memory is interleaved across both channels for maximum streaming bandwidth and requests will close DRAM pages after an access. In contrast, system memory is optimized for latency and locality; contiguous requests will tend to stay to one memory channel and keep DRAM pages open. The memory can run up to 1.86GT/s for a total of 29.8GB/s memory bandwidth on Llano. It also contains an improved hardware prefetcher that tracks 8 different strides or sequence of strides and speculatively fills into the memory controller (rather than the caches).
In contrast, Sandy Bridge has tighter integration – using the on-die ring interconnect and L3 cache. Data is passed through the ring interconnect, but can be shared either through the cache or memory. The ring interconnect is 32B wide with 6 agents and operates at the core frequency (>3GHz). Data usually comes from either the 4 slices of the L3 cache or the memory controller, which resides in the system agent. The peak bandwidth is over 400GB/s, but the practical bandwidth is lower since many accesses have to go through multiple stops on the ring interconnect. The Sandy Bridge power management is also fully unified for both CPU and GPU, so that when one is idle, the other may ‘borrow’ the thermal headroom.
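The Garlic and Onion figures in the quoted article can be checked with simple width times clock arithmetic. This is a sketch assuming one transfer per clock on each bus, which is consistent with the 10.4GB/s quoted for Onion. The function name is made up for illustration.

```python
def bus_bandwidth_gbs(width_bytes, clock_mhz, links=1):
    """Peak bandwidth of a bus in GB/s, assuming one transfer per clock per link."""
    return width_bytes * clock_mhz * 1e6 * links / 1e9


# Notebook Llano figures from the quoted article.
garlic_read = bus_bandwidth_gbs(32, 720, links=2)   # 32B wide, replicated per memory channel
onion_read = bus_bandwidth_gbs(16, 650)             # 16B wide, single link
measured_memory = 17.0                              # GB/s, "roughly 17GB/s measured"

print(f"Garlic read: {garlic_read:.2f} GB/s")
print(f"Onion read: {onion_read:.1f} GB/s")
print(f"Garlic vs measured memory: {garlic_read / measured_memory:.1f}x")
```

Garlic's read side works out to roughly 46 GB/s, about 2.7 times the measured memory bandwidth, which fits the article's "factor of 2-3X" claim. In other words the GPU's link to the memory controller has headroom; the DRAM behind it does not.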

Onion and Garlic are still the two seasonings of choice for Trinity.

To add to the above, though, Trinity does now feature a "borrow" type of TDP style turbo (that's actually more efficient and responsive than Ivy Bridge).

llano-2.png


You can't rely on DDR memory like that in order to provide the necessary speed and throughput. It's just not gonna' happen. We know this from GPUs: wider bus generally means much better performance. You can overclock as much as you want, but as soon as you're hitting the ceiling you're going to be limited. Sandy and Ivy both feature a "dedicated" portion of the RAM for the HDxxxx graphics. Kaveri will change things in that the unified address space should help significantly quicken certain tasks but ultimately you're still hitting that same roadblock in garlic (not onion, really, as that's the CPU side and that implementation, and in turn Trinity's implementation, isn't necessarily starving the CPU of resources).

If you're sticking to the same approach and relying very heavily on an already cramped bus and DDR widths (and speeds, but like I said, the speeds aren't going to be enough here) you're going to hit diminishing returns on throughput (GFLOPs, I guess).

So... maybe tie it in with cache, sort of like Intel is doing + unified address space? That should help improve performance significantly, particularly as far as compute goes, but it's going to be an uphill battle against a competitor who's going to be making smaller chips and capable of slapping more cache on the die. Furthermore, it doesn't necessarily gel with their HSA agenda (you'll need a butt-load of cache to replace those DDRs for big block writes/reads).

Like I said, AMD will have to get very creative. Sticking to FM2 for Kaveri means that whatever they've done isn't going to be that creative step they'll need for the big improvements. Granted, Intel still has a ways to go with respect to catching AMD on the graphics front (AMD is so far ahead that they're actually limiting their own performance...). Haswell is apparently a big step but don't expect anything extraordinary. More EUs + more cache should spell better performance but don't expect miracles.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,672
2,547
136
DDR3-2133MHz memory and a 4MB L3 Cache
IGP based on 8xxx series..."superior performance than a HD 7750"

No way, no how. Even at the very upper limit that DDR3-2133 theoretically gives, 34GB/s, you can access ~500 MB per frame when running at 60fps. That's not good enough. Realistically, we are talking half that. Memory bandwidth will cap this thing hard.
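The per-frame budget is just peak bandwidth divided by frame rate. Here is a quick sketch using the theoretical dual-channel DDR3-2133 peak. The function name is made up, and the "half" efficiency factor is the rough guess from the post, not a measurement.

```python
def per_frame_budget_mb(bandwidth_gbs, fps, efficiency=1.0):
    """Memory traffic available per frame in MB at a given frame rate."""
    return bandwidth_gbs * 1e9 * efficiency / fps / 1e6


PEAK_DDR3_2133_DUAL = 34.128  # GB/s, theoretical dual-channel peak

ideal = per_frame_budget_mb(PEAK_DDR3_2133_DUAL, 60)
realistic = per_frame_budget_mb(PEAK_DDR3_2133_DUAL, 60, efficiency=0.5)

print(f"Theoretical: {ideal:.0f} MB per frame at 60fps")
print(f"At half efficiency: {realistic:.0f} MB per frame at 60fps")
```

The theoretical figure is about 569 MB per frame, so the ~500 MB above is a slightly conservative rounding, and the half-efficiency case is under 300 MB. That budget also has to be shared with the CPU cores.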

(edit) I somehow loaded an old version of this page -- didn't notice that everyone else had pointed it out too.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
I don't understand why AMD is so pigheaded though.

There should be quad-channel motherboards with 4x DDR3 slots,
so the memory bandwidth for the APUs is doubled.

It's needed if they're going to aim for 7750-like performance.

(If someone says "value segment", I'll kick their arse; the price increase can't be that much compared to the performance gain.)
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
I don't understand why AMD is so pigheaded though.

There should be quad-channel motherboards with 4x DDR3 slots,
so the memory bandwidth for the APUs is doubled.

It's needed if they're going to aim for 7750-like performance.

(If someone says "value segment", I'll kick their arse; the price increase can't be that much compared to the performance gain.)

That costs money. Triple and quad channel configurations are expensive.
 