Vega refresh - Expected? How might it look?

Mar 11, 2004
23,444
5,852
146
Where are people getting that AMD confirmed a refresh of Vega on 14nm+? I see people saying that but don't know where it's coming from. Just because they showed that Vega would be on 14nm+ doesn't mean it is the Vega 64/56 version. They are supposed to have a smaller and allegedly a larger (Vega 20, with 32GB HBM2) version of Vega on the way. Plus it could be referring to an APU (not sure if people have found out for sure whether the first Ryzen one will be or not, but I could see them pushing an updated one out next year, hopefully in time for back to school).

Hopefully we'll see a tweaked version of Vega, but I'm not sure why people are hyping that right now, as it's likely not due for months, if not close to a year.

When it was already a year late they should have just waited and fixed up the last remaining issues. Stuff like that is forgivable if you launch first, and a total joke if you launch a year after a competitor where the product cycle is a year long.

They also just immediately squandered hard-won reputation from Ryzen with a garbage launch for a mediocre product. Whether it eventually becomes a decent product with the issues and features fixed - who knows - but you get only 1 first impression.

I don't agree. I think they absolutely had planned on things being better (namely HBM2 production, but it also sounds like the interposer is a main culprit for the low production thus far; while that sucks, working with interposer designs is going to be important, even Intel is touting it, and Nvidia is likewise going to be using it too, thus AMD feels it's best to cut their teeth on it earlier to try to get an advantage in dealing with it, so the hardships now will hopefully pay off later). They had to get production going to get cards out for both developers and themselves to work with on the software side. Between that and AMD needing all the revenue they can bring in as quickly as they can, they can't wait until they have enough stock for a wide release. I think they fully planned on launching without all the features enabled, as they thought the baseline performance was enough to match the 1080 and 1070 (which seems to be the case), and then would work on enabling other big features (which they knew would take time, but they have to have cards to actually work with to get there) to spoil the Volta launch (legitimately, that is how significant an effect those features can have on performance: it would put them in another performance class entirely; we'll see if they can actually get them there ever, let alone in time for Volta).

I think AMD expected to be in higher production before now. I think that's partly why Nvidia released the 1080 Ti (and were willing to keep the price reasonable): they were expecting to be spoiling a more imminent Vega launch (and were hedging their bets that performance might be exceptional). I think AMD manually showed, by basically going in and creating two versions of different scenes (normal versus culled, for instance), how stuff like primitive shaders and the NGG path improve performance (and possibly efficiency/power use as well), hence some of what has been said.

I feel that their biggest mistake with Vega was a mistake of product placement and marketing. If they had instead released it as a "prosumer" compute and feature card that also happened to be able to game well enough to be relevant, then a lot of the butthurt out there would be diminished. As we're seeing, it mines reasonably well (though not as power efficiently as desired) and it seems capable of high end OpenCL computing within its price range. If it was sold for that purpose, instead of as a gaming card first, I think the reaction would have been more... charitable. Plus, it would give them extra volume to continue to work out software and gaming bugs.

Sometimes I read posts in this subforum and really have to wonder what people are thinking. They did release Vega as a prosumer card. I disagree, as we see people are actively looking for every reason to be "butthurt" about this stuff (and especially AMD GPUs), and we've been seeing this for years now. I take it you missed how "butthurt" people were about Vega FE? It was competitive enough that it forced Nvidia to follow suit by making Titan a true "prosumer" card, and yet we still saw incessant bitching about even Vega FE being a "total disaster" (to the point people said they should have just outright killed Vega entirely and never even brought it to market in any form). Except they're not going to get that volume from that market alone, plus if they don't have cards for gamers to use, game developers are not going to spend any time targeting them (unless AMD puts in all the effort, which they just don't have the resources to do; not to mention developers already were complaining that AMD wasn't working with them enough). So it would delay progress on supporting the Vega features that make it worthwhile, meaning it would then have the same problems, just in the future.

There are two things people are not taking into account. AMD likely couldn't just stop Vega production even if they wanted to (which they don't, but this is to try to put this "they should've killed Vega off and not done it at all" nonsense to bed), as they likely already had contracts for that production (well in advance, before Vega taped out and also probably before HBM2 entered production), so they'd be on the hook for the costs with no product to sell to recoup them (and despite what some people on here seem to believe from their "basic economics lessons", selling something even if you don't make money on it is better than being stuck paying for it without any return; I'm also personally very skeptical of the claims that AMD's other products are taking up all of GF's wafers to the point that AMD is losing out on overall production of them by putting some towards Vega, while Vega production helps them with the wafer deal, which is why it was produced at GF in the first place). Also, without cards to work with, they can't enable the features and improve the performance, as they wouldn't have hardware to test with (as many people condemning GCN point out, theoretical performance is meaningless and the hard part is getting as much real-world performance out of the hardware as possible).
 

Jackie60

Member
Aug 11, 2006
118
46
101
Remember the X1800XT, then a month or two later the X1900/1950XT? Do that, AMD. I cancelled the X1800, bought a GeForce 7900 512MB or whatever, and bam, out comes the X1900/1950. That was how to do a refresh. That was ATI, IIRC, not AMD.
 

xpea

Senior member
Feb 14, 2014
458
156
116
I don't agree. I think they absolutely had planned on things being better (namely HBM2 production, but it also sounds like the interposer is a main culprit for the low production thus far; while that sucks, working with interposer designs is going to be important, even Intel is touting it, and Nvidia is likewise going to be using it too, thus AMD feels it's best to cut their teeth on it earlier to try to get an advantage in dealing with it, so the hardships now will hopefully pay off later).
BS from the so-called tech journalists. Anyone in the industry knows that an interposer the size of Vega 10 + 2 stacks of HBM2 produces yields above 98% these days...
 

LightningZ71

Platinum Member
Mar 10, 2017
2,559
3,249
136
I would argue that FE was more a "pro" card than a prosumer card. Its pricing was way up there, and they were very explicit that it was for pro-type workloads.

Vega 56/64 is a performance-competitive design for high-end modern gaming (which I define as 1070 to 1080 Ti level performance). It isn't besting those cards on price/performance or performance/watt, but it's in the ballpark for price, and power, at least to a big chunk of enthusiasts (who are arguably the target market), isn't much of an issue. However, these cards were not released in a vacuum; they were released in the middle of a mining craze. AMD had to know multiple months in advance that the cards were going to be attractive to miners and that they'd sell as many of them as they could make at a price competitive with Nvidia's offerings in the mainstream market (Titan is above that market). They also couldn't have been oblivious to its power/performance shortcomings, unless they are just really bad at running a company (which, while arguable in many areas, managing to survive competing with a 900lb gorilla like Intel should at least bear out a base level of competence). Knowing those figures, they should have gone for the prosumer market and focused on compute and creative uses for the card with a promise of "effective" gaming performance. The hash rates would have come out shortly, and with the large memory sizes on the cards, and what's required for Ethereum these days, they had to realize that they'd sell as many of them as they could make at a useful price no matter what.

I think that they absolutely had to get something out. Covering fixed costs and at least a portion of the variable costs will always be a better play than not making anything at all. I argue that they could have asked for $50 more for the cards and, given the demand for cards in the market, they would still have sold them as quickly as they could make them.

I can't see this particular iteration of the card ever being power/performance competitive. I think that AMD will be forced to spin another improved design and process for it in the window between now and consumer Volta hitting the market. It doesn't have to be perfect; it just needs to drop power demand by 10% or more and boost effective performance by at least 10-20%, at least in the DX12 games that they have been touting. Much of this can be realized through improved drivers as features are ironed out and enabled.
 
May 11, 2008
22,565
1,472
126
Agree that they are always too early and too techie. But let's see where Vega lands when they finally enable NGG; that is what makes Vega interesting from a gamer's perspective.

And let's also remember they had to address DX11 and the single-threaded CPU performance lock, and Nvidia's lock on DX11 multi-threaded performance. DX12 and Vulkan did that. The uptake is dog slow. But forcing those standards in is a major long-term strategic win against two far bigger opponents. It was a necessary move, and the too-forward-looking tech had a part in winning it via the consoles.

And if we look at Vega, it's made to be scalable. Surely it's tech made to also be used in next-gen consoles, built from the ground up with IF at its core.
And while HBCC is uninteresting for gamers short term, it's IMO outright revolutionary for some of the pro market.
But yeah, we are waiting for the NGG path and some extra HBM2 efficiency to top it off.

I wonder if HBCC would simplify the use of features such as virtual textures that are streamed in when needed. It would mean huge unique levels could be designed and used more easily.
 

maddie

Diamond Member
Jul 18, 2010
5,158
5,545
136
I wonder if HBCC would simplify the use of features such as virtual textures that are streamed in when needed. It would mean huge unique levels could be designed and used more easily.
It does. Here is a benefit of HBCC: it allows a huge address space for data and fine-grained caching of data exceeding the directly attached video memory capacity.

Software isn't written to manage the L1, L2, and L3 caches while running. The CPU automatically manages this. The same thing should happen here.

We can now imagine the HBCC as elevating the HBM2 memory block into an L3 cache equivalent. Ever since Fiji came out and a lot of people were belittling the 4GB, I've written that this would be possible.

The extended address space advantage does not need a very fast, low-latency memory pool to work, but will obtain advantages if it's available. It simplifies the use of very large working sets.

The use for gaming, etc., needs the video card memory to have very high bandwidth, in excess of what the card will need to feed its computational units. This memory must time-slice between supplying the GPU and replacing old data with new data from off-card storage. I say this to suggest that what those telling AMD to use GDDR5X, for example, are proposing might not work, as AFAIK it has higher latency. I don't know about GDDR6.

Also, since we now have more MB of data in flight for a given time, seeing that not all of the needed data is stored once but exchanged on the fly, the power used will increase. This only applies to cards with memory fully saturated.

We get:
Low latency memory needed
High bandwidth needed
Low power memory desired
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
We have to ask zlatan about this, but my guess would be that the next-gen consoles would be the market force needed to drive this?
I mean, to me it looks like even new game models are needed to really push this thinking.
 

maddie

Diamond Member
Jul 18, 2010
5,158
5,545
136
We have to ask zlatan about this, but my guess would be that the next-gen consoles would be the market force needed to drive this?
I mean, to me it looks like even new game models are needed to really push this thinking.
Remember the 4GB Vega demo that AMD showed? The increase in FPS, both mins and average? It appears to work transparently with existing games.

As to working more efficiently with special coding? I don't know.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
In reality, no memory storage is equivalent to L1, L2, L3 latencies.
HBCC does not elevate HBM2 into L3 cache. HBM2 has huge latencies when viewed in relation to L1, L2, L3, which is why it is known as main memory. The latency of accessing system memory through HBCC is huge compared to HBM2 memory access. Not only do you have to deal w/ system memory access latency, which is ~70ns, but you have to communicate across PCIe, which has a huge access latency.

Cut out the marketing words and colorful analogies and you get to hard numbers:
[image: k0t1e.png]

L1 = 0.5/1ns
L2 = 7ns
L3 = ~20ns
HBM2 = ~50ns
-----
PCIe latency (one-way) = 90ns (lowest estimate)
So, multiply this by 2 and add on DDR4 system memory access time (70ns).
So, best case, you're looking at 250ns latency to get something from main memory onto the GPU. Likely you can slap another 100ns-200ns onto this for actually pulling the memory into HBM2 once it's there and evicting what's in there.
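
The estimate above can be written out as a few lines of Python. This is only a sketch of the arithmetic: the constants are the figures quoted in this post, not measurements, and the function name `remote_fetch_ns` and its parameters are made up for illustration.

```python
# Latency figures as quoted in the post, in nanoseconds (not measured values).
PCIE_ONE_WAY_NS = 90   # "lowest estimate" for one PCIe traversal
DDR4_ACCESS_NS = 70    # system memory access time
HBM2_ACCESS_NS = 50    # approximate local HBM2 access time


def remote_fetch_ns(pcie_one_way=PCIE_ONE_WAY_NS,
                    dram=DDR4_ACCESS_NS,
                    page_in_overhead=0):
    """Request over PCIe, system DRAM access, reply over PCIe, plus any
    extra time spent installing the page in HBM2 and evicting old data."""
    return 2 * pcie_one_way + dram + page_in_overhead


best_case = remote_fetch_ns()
print(best_case)                              # 250
print(remote_fetch_ns(page_in_overhead=100))  # 350
print(remote_fetch_ns(page_in_overhead=200))  # 450
print(best_case / HBM2_ACCESS_NS)             # 5.0
```

On these numbers a fetch from system memory costs about five times a local HBM2 access before any page-in overhead is counted.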

HBCC is nothing more than a dynamic memory paging engine whose performance has not been established. It's nowhere on scale w/ L1,L2,L3 on die cache and nowhere on scale w/ HBM2 memory access time. Dynamic paging is nothing new and until Radeon publishes hard numbers and details about it, it is nothing more than pie in the sky marketing.

Memory handling engines are very hard to craft and can have huge latency creep if not implemented properly. The fact that Vega is such a low performer even w/ huge compute potential highlights this and likely points to memory management latencies causing memory starvation. I recall looking into Nvidia vs Radeon performance on a particular flow, and the Radeon was like 10x slower for a particular memory operation. I looked into why, and it had to do w/ the hardware pipeline they use for PCIe communication. It's stuff like this that gets glossed over in high-level marketing language like HBCC. When someone makes an amazing new feature that has great performance, typically you see numbers and details attached. When it's bogus and FUD, you get colorful analogies and non-standard language. HBCC is probably a leftover piece of hardware from their workstation cards, which have such a controller for extended onboard storage on the video card. As such, don't expect this to operate in anything like the efficient way the marketing language would have you believe. Furthermore, there's no way this behaves in any complicated way w/o developer coding. As such, it's literally just a memory paging controller that makes wild assumptions about what needs to be paged in and out... Many times it will be wrong or page in blocks that aren't needed. The thing to remember is that it has to evict things from HBM2 in order to page in new memory. When implemented wrong, or under highly volatile memory access, this can lead to memory thrashing, which actually leads to worse performance. I really wish they had stuck to industry terms and just published hard figures, but you can see how it obviously sells products to misinformed people.
 

maddie

Diamond Member
Jul 18, 2010
5,158
5,545
136
HBCC is nothing more than a dynamic memory paging engine whose performance has not been established. It's nowhere on scale w/ L1,L2,L3 on die cache and nowhere on scale w/ HBM2 memory access time. Dynamic paging is nothing new and until Radeon publishes hard numbers and details about it, it is nothing more than pie in the sky marketing.

What do absolute ns latencies have to do with what I'm saying? The fact is that video cards have their data accessed at latencies = video card values, whether HBM2, GDDR5 or GDDR5X.

You are missing the point totally, and you are wrong in saying "nowhere on scale w/ HBM2 memory access time."


HBCC allows the entire dataset, if exceeding the video card memory capacity, to operate as if all of the data can be accessed at HBM2 latency. It isolates the PCIe memory latency from the video RAM. After all, you're dealing with GBs of cache here, so your analogy of memory thrashing falls flat, as traditional operational models had very low video memory capacity. Surely you must know that thrashing increases significantly if the buffer is too small, and that there comes a buffer size at which thrashing will not occur. You really shouldn't use that example without proper analysis.
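
The buffer-size point can be illustrated with a toy simulation. The sketch below assumes a plain LRU replacement policy and a cyclic access pattern over a fixed working set. Both are assumptions chosen because they make the cliff easy to see; nothing in this thread says how HBCC actually picks pages to evict, and real game workloads are not cyclic scans. The helper name `lru_miss_rate` is made up for illustration.

```python
from collections import OrderedDict


def lru_miss_rate(accesses, capacity):
    """Replay a page-access trace against an LRU cache holding `capacity`
    pages and return the fraction of accesses that missed."""
    cache = OrderedDict()
    misses = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return misses / len(accesses)


WORKING_SET = 64   # distinct pages touched per pass (hypothetical)
PASSES = 10
trace = [page for _ in range(PASSES) for page in range(WORKING_SET)]

for capacity in (16, 32, 63, 64, 128):
    print(capacity, lru_miss_rate(trace, capacity))
```

With this trace every capacity below 64 pages misses on every single access, while 64 pages and above miss only on the first pass (10% of accesses). That is the threshold behaviour described above; whether a real workload sits above or below the threshold depends on its working set, which this sketch does not model.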

edit:
This should refute your post. How is this possible if what you say is correct?
Remember the 4GB Vega demo that AMD showed? The increase in FPS, both mins and average?

edit 2:
Additionally, the original L2 & L3 CPU caches were actually off die, as the node limitations at the time didn't allow enough transistors. There is no rule saying that an L2 or L3 must have X latency.

https://www.techopedia.com/definition/17183/level-3-cache-l3-cache
quote:
"The L3 cache is usually built onto the motherboard between the main memory (RAM) and the L1 and L2 caches of the processor module. This serves as another bridge to park information like processor commands and frequently used data in order to prevent bottlenecks resulting from the fetching of these data from the main memory. In short, the L3 cache of today is what the L2 cache was before it got built-in within the processor module itself.

The CPU checks for information it needs from L1 to the L3 cache. If it does not find this info in L1 it looks to L2 then to L3, the biggest yet slowest in the group. The purpose of the L3 differs depending on the design of the CPU. In some cases the L3 holds copies of instructions frequently used by multiple cores that share it. Most modern CPUs have built-in L1 and L2 caches per core and share a single L3 cache on the motherboard, while other designs have the L3 on the CPU die itself."
 
May 11, 2008
22,565
1,472
126
In the uppermost right corner of that picture, should that not read HDD instead of SSD?
I think there is a typo in that picture.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
Memory requirements explode outside of gaming.

As Raja has said, memory cost makes up a bigger and bigger part of a GPU. It simply needs to be addressed. We can also look at 3GB vs 6GB 1060 prices, or 4GB vs 8GB Polaris. It's IMO a pretty steep cost vs the benefit. Most of the time it's doing nothing.

Treating the memory as a cache is surely here to stay, whatever the name, and whether PCIe adds 50 or 200 ns.
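
How much an extra 50 or 200 ns matters depends on how often an access actually has to go over PCIe. A rough sketch using the standard average-access-time formula (hit time plus miss rate times miss penalty) is below. The 50 ns local figure is the HBM2 estimate quoted earlier in the thread, the penalties are the two values mentioned here, and the miss rates are made-up inputs rather than anything measured on Vega.

```python
def effective_access_ns(miss_rate, local_ns=50.0, miss_penalty_ns=200.0):
    """Average access time when local memory is treated as a cache:
    every access pays local_ns, and misses additionally pay miss_penalty_ns."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return local_ns + miss_rate * miss_penalty_ns


for penalty in (50.0, 200.0):             # "50 or 200 ns" added by PCIe
    for miss_rate in (0.001, 0.01, 0.1):  # hypothetical miss rates
        avg = effective_access_ns(miss_rate, miss_penalty_ns=penalty)
        print(f"penalty={penalty:.0f}ns miss_rate={miss_rate} avg={avg:.2f}ns")
```

At low miss rates the average stays close to the local figure with either penalty, and the size of the penalty only starts to dominate once misses become frequent.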

Yes, AMD said they could make it work using only the driver side and showed a demo. But hey, as it is now it's disabled by default, and I am pretty sure that's because it hurts performance. And I am also sure we need to see the NGG path enabled and memory efficiency go up before they start working on this feature for games, if not fixes to basics like the AVFS, which seems weird. It must be outrageously complex to implement HBCC working only on the driver side.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
What do absolute ns latencies have to do what I'm saying? The fact is that video cards have their data accessed at latencies = video card values, whether HBM2, GDDR5 or GDDR5X.

Missing the point totally and you are wrong in saying "nowhere on scale w/ HBM2 memory access time."
It has everything to do w/ what you're saying. Otherwise you're not saying anything. HBCC is an attempt to reduce memory paging latency from GPU across to CPU land. Thus, the only thing of value to be discussed are latencies, relative latencies, and actual performance. The community doesn't profit from a regurgitation of marketing language or incorrect analogies. I stated the hard numbers and facts behind how this works. It's the only thing of value that needs to be discussed.

If it doesn't cut the paging time successfully or was implemented like crap (likely, given that Radeon has serious issues in their memory management pipeline) then this is nothing more than additional hardware wasting energy (not the first time w/ Radeon). Furthermore, there's no way to do this effectively w/o developer codified support as you, using common sense, won't know what to effectively page in and out, which would lead to worse performance (cache eviction -> memory thrashing): kicking useful, frequently accessed things out of GPU memory and replacing them with crap. These things aren't easy to get right. It's one of the sole reasons why some of both AMD and Radeon's hardware pipelines have been so bad: bad memory caching and memory management subsystems.
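To make the eviction -> thrashing point concrete, here is a minimal sketch in Python. It assumes a plain LRU replacement policy and a made-up capacity of 8 pages; nothing here is known about how HBCC actually chooses victims, so the function name `lru_hit_rate` and all the numbers are purely illustrative.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Replay a page access trace against an LRU-managed pool and
    return the fraction of accesses that were already resident."""
    resident = OrderedDict()  # page id -> True, ordered oldest to newest use
    hits = 0
    for page in accesses:
        if page in resident:
            hits += 1
            resident.move_to_end(page)        # mark as most recently used
        else:
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict least recently used page
            resident[page] = True             # page in the missing page
    return hits / len(accesses)

capacity = 8  # hypothetical number of pages that fit in local memory

# Working set fits: loop over exactly 8 pages.
fits = [p % capacity for p in range(800)]
# Working set is one page too big: loop over 9 pages.
overflow = [p % (capacity + 1) for p in range(800)]

print(lru_hit_rate(fits, capacity))      # almost every access hits
print(lru_hit_rate(overflow, capacity))  # every access misses
```

With a looping access pattern, going one page over capacity takes the hit rate from nearly 100% to zero, because LRU always evicts exactly the page that is needed next. A real controller with a smarter policy, or hints from the application, could do far better on this trace, which is the whole argument about needing to know the access pattern.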

HBCC allows the entire dataset, if exceeding the video card memory capacity, to operate as if all of the data can be accessed at HBM2 latency.
No, it does not. You are 100% wrong. The performance is defined directly by the latency of paging things in and out.

It isolates the PCIe memory latency from the video RAM. After all, you're dealing with GBs of cache here, so your analogy of memory thrashing falls flat as traditional operational models had very low video memory capacity. Surely you must know that thrashing increases significantly if the buffer is too small, and that there comes a buffer size where thrashing will not occur. You really shouldn't use that example without proper analysis.
Ah, so you seem to somewhat know what you're talking about. At scale, PCIe memory latency can be mitigated across a larger transfer. However, where do you put the larger transfer once it gets to the GPU if HBM2 is full? You evict stuff from HBM2... Where does that go? The larger the input, the larger the eviction. What if you just evicted something you need? How to decide? So, clearly, as with all caching, there is the potential for thrashing. How much? How well does it perform? Well, don't you think if it was impressive that AMD would have detailed it? If this requires no input from developers then where are the benchmarks showing how it performs on non-canned code? The details that are out from reviewers show it performs worse than or marginally similar to having it disabled, which highlights the fluff.
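The "latency can be mitigated across a larger transfer" part is just arithmetic, sketched below in Python. The 1 microsecond fixed cost and the 16 GB/s link rate are assumed round numbers for illustration, not measured PCIe or HBCC figures, and `effective_bandwidth` is a hypothetical helper name.

```python
LINK_LATENCY_S = 1e-6      # assumed fixed per-transfer cost (illustrative)
LINK_BANDWIDTH_BPS = 16e9  # assumed peak link rate in bytes/s (illustrative)

def transfer_time(size_bytes):
    """Simple model: fixed latency plus size divided by peak bandwidth."""
    return LINK_LATENCY_S + size_bytes / LINK_BANDWIDTH_BPS

def effective_bandwidth(size_bytes):
    """Bytes per second actually achieved for one transfer of this size."""
    return size_bytes / transfer_time(size_bytes)

for size in (4 * 1024, 64 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    share_of_peak = effective_bandwidth(size) / LINK_BANDWIDTH_BPS
    print(f"{size:>10} bytes: {share_of_peak:.1%} of peak")
```

Under these assumptions small page-sized transfers are dominated by the fixed cost and large ones approach the peak rate. The catch raised above still applies: a bigger incoming transfer needs a correspondingly bigger eviction on the other side.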

You have a complete hardware module attached to the pipeline. Obviously they tested it and proofed it during development. So, where are the numbers? Latency? etc. Nowhere to be found... Just a marketing slide mention. So, I really should use my fundamental understanding of computer architecture to cut through bullshit and be critical of something that hasn't been proven...

edit:
This should refute your post. How is this possible if what you say is correct?
Remember the 4GB Vega demo that AMD showed? The increase in FPS, both mins and average?

edit 2:
Additionally, the original L2 & L3 CPU caches were actually off die as the node limitations at the time didn't allow enough transistors. There is no rule saying that a L2 or L3 must have X latency.
I recall the demo. It's what is known as a canned demo as it hasn't been reproduced. Thus, there is no telling how this performs and/or if it requires extended developer code.
My skepticism is 100% warranted. I question why you keep siding w/ unsubstantiated marketing and try to twist your way through known architectural limitations. Not sure if you're shilling or only understand this on a marginal level.

https://www.techopedia.com/definition/17183/level-3-cache-l3-cache
quote:
"The L3 cache is usually built onto the motherboard between the main memory (RAM) and the L1 and L2 caches of the processor module. This serves as another bridge to park information like processor commands and frequently used data in order to prevent bottlenecks resulting from the fetching of these data from the main memory. In short, the L3 cache of today is what the L2 cache was before it got built-in within the processor module itself.

The CPU checks for information it needs from L1 to the L3 cache. If it does not find this info in L1 it looks to L2 then to L3, the biggest yet slowest in the group. The purpose of the L3 differs depending on the design of the CPU. In some cases the L3 holds copies of instructions frequently used by multiple cores that share it. Most modern CPUs have built-in L1 and L2 caches per core and share a single L3 cache on the motherboard, while other designs have the L3 on the CPU die itself."
I know exactly how this stuff works.
We're talking about pinned virtual memory being accessed by a DMA controller integrated into the GPU's memory management:

Vega-HBCCslide.jpg

There's no reason to confuse the language or complicate things. All it does is page things in and out of non-local memory and integrate them into the local memory, evicting things to do so. No reason to go on ad nauseam about L1/L2/L3, which this is nothing of the sort.
 
Last edited:
  • Like
Reactions: xpea

ub4ty

Senior member
Jun 21, 2017
749
898
96
In the upper most right corner of that picture. Should that not read HDD instead of SSD ?
There is a typo in that picture.
It's an old picture. There are newer ones w/ updated components and respective latencies.
Picked the 1st one I found while googling as it conveys the point of wildly differing access latencies for respective components and scales therein. People often throw around components while ignoring the access latencies.

Also, to put this marketing gimmick to rest: all modern video cards have HBCCs, as it's nothing more than a DMA controller paging memory and communicating back and forth through system memory. The only question is how it's implemented, how well it performs, and what code/driver support it needs to function. Radeon hasn't provided any of these details, thus it's nothing more than fanciful marketing until then.

Hilariously, Nvidia outperforms Radeon cards by 10-fold factors when it comes to this area of the pipeline. So, don't expect some earth-shattering change when the details come, or in real world non-canned performance.

I'm really getting tired of the Vega b.s and I was 100% on board waiting for its arrival.
The more I look into the technical details about the card the more I understand how little of note these marketed features are. I don't judge AMD CPU division in the same light. They seem to have actually gotten themselves together and fixed such glaring issues in their hardware pipeline.
 
  • Like
Reactions: xpea

maddie

Diamond Member
Jul 18, 2010
5,158
5,545
136
Furthermore, there's no way to do this effectively w/o developer codified support as you, using common sense, won't know what to effectively page in and out, which would lead to worse performance

However, where do you put the larger transfer once it gets to the GPU if HBM2 is full? You evict stuff from HBM2... Where does that go? The larger the input, the larger the eviction. What if you just evicted something you need? How to decide?

I know exactly how this stuff works.
Nothing personal and not trying to upset you.

Just one question?

How do CPUs use their caches? Does the programmer need to write special code to use the caching algos?
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Nothing personal and not trying to upset you.

Just one question?

How do CPUs use their caches? Does the programmer need to write special code to use the caching algos?

You're not upsetting me. I'm simply trying to cut through marketing b.s. and get at what matters. There seems to be a gap in communication across our posts. I know a great deal of what I am talking about and you seemingly do as well. Thus, it seems information is just being conveyed wrong. I don't take anything personally. So, let's just resolve the gap.

I've made chips. I've written low level code.
I know what flaws and issues can arise. As such, I know how to see beyond marketing b.s. That's what I'm speaking about. If we get hung up on comp arch 101, we'll be wasting time here.

If you've truly created a new caching mechanism that doesn't reside on chip and thus isn't tightly integrated into the immediate compute data pipeline, you are prone to implementing an underperforming memory subsystem because you either:
A) Don't have developer involvement to utilize it most efficiently
B) Thought you covered all the memory access patterns yourself but were wrong

AMD and Radeon have done this before. This is not speculation.. If you need a primer on such analysis and how prominent companies can screw up such things it can be found here : https://www.nextplatform.com/2015/04/28/thoughts-and-conjecture-on-knights-landing-near-memory/

As such, my commentary stands and is sound.

As for L1/L2/L3 on a CPU, a GPU and its memory management and hierarchy are not like a CPU's, especially when you begin talking about something like HBCC. As such, there's no reason to try to compare L1/L2/L3 that functions transparently (whatever that means) to HBCC. That being said, I can write code right now that makes L1/L2/L3 cache performance horrid on any processor. So obviously there's more to the 'developer involvement' story. If you want zero developer involvement w.r.t. a cache paging mechanism, you're going to have a slew of cases where you have issues, or you could be so amazing that you've covered them all.
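As a rough illustration of "code that makes cache performance horrid", here is a toy direct-mapped cache model in Python. The geometry (64 lines of 8 words) and the name `count_misses` are assumptions made up for the example; real CPU caches are set-associative and far more elaborate, so treat this as a sketch of the principle only.

```python
def count_misses(addresses, num_lines=64, line_words=8):
    """Count misses for a word-address trace in a toy direct-mapped cache."""
    tags = [None] * num_lines          # which memory block each line holds
    misses = 0
    for addr in addresses:
        block = addr // line_words     # memory block containing this word
        index = block % num_lines      # the single line this block may occupy
        if tags[index] != block:
            tags[index] = block        # fill the line, evicting what was there
            misses += 1
    return misses

n = 4096
cache_words = 64 * 8                   # total capacity of the toy cache

# Cache friendly: walk memory in order, 1 miss per 8-word line.
sequential = list(range(n))

# Cache hostile: alternate between two addresses exactly one cache size
# apart, so both map to the same line and keep evicting each other.
conflicting = [(i % 2) * cache_words for i in range(n)]

print(count_misses(sequential))   # 512
print(count_misses(conflicting))  # 4096
```

The hostile trace touches only two distinct words yet misses on every single access, while the friendly one streams through 4096 words with an eighth of the misses. The hardware is equally "transparent" in both cases; the difference comes entirely from how the code lays out and walks its data.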

So, I get the tone of your remark. However, you're yet again missing the point. So, let me ask you: Have you ever designed a chip's caching subsystem? Have you ever written low level code in consideration of a processor's memory hierarchy? If not, and you want to understand my commentary a little more, google the various discussions on memory hierarchy design/caching and developer involvement/code design awareness. If you understand all of this, this shouldn't be such a struggle, as we'd be talking the same language.

> mfw HBCC but on an nvidia card
hpgmg_nvvp_unified_memory_details.png


What is (onDemand paging)....

Cliff notes :
> On demand dynamic paging of memory to and from GPUs already exists. It is enabled by a DMA controller on the GPU paging game data in and out. Every modern GPU has this. You can call this whatever you want : transparent/opaque. If developers don't code their systems for it, it performs like arse. So, you can have fully transparent migration it will just make algorithmic assumptions which may or may not make your program run like trash. The same goes for CPUs for people who ignorantly write code that doesn't account for L1/L2/L3 cache hierarchies. It runs and it may run fast enough to you but underneath could be a nightmare.

Really nothing more to discuss here until they publish more details and performance numbers. A comp. arch 101 discussion isn't going to elevate this any further :kissingclosed:
 
Last edited:
  • Like
Reactions: evilr00t and xpea

ub4ty

Senior member
Jun 21, 2017
749
898
96
Memory requirements explode outside of gaming.

As Raja has said, memory cost on a GPU constitutes a bigger and bigger part. It simply needs to be addressed. We can also look at 3GB vs 6GB 1060 prices or 4 vs 8GB Polaris. It's IMO a pretty steep cost vs the benefit. Most of the time it's doing nothing.

Treating the memory as a cache is surely here to stay, whatever the name, and whether PCIe adds 50 or 200 ns.

Yes, AMD said they could make it work using only the driver side and showed a demo. But hey. As it is now it's disabled by default and I am pretty sure it's because it hurts performance. And I am also sure we need to see the NGG path enabled and memory efficiency go up before they start working on this feature for games, if not fix the basics like the AVFS that seems weird. It must be outrageously complex to implement HBCC working only on the driver side.

Every card already pages memory in and out from main memory on demand. Think about it for a second... LOL

So, until they give more details and performance numbers this is just a rebranding of a common feature, or some new hardware that may or may not boost performance marginally. It's not a game changer but it's marketed as that. The game changer is in their workstation and pro-line cards, and the game is changed because they have the memory actually on the GPU connected to HBCC, not offboard and across PCI-E. It's literally like they just transplanted the architecture meant for something far higher performance and, for cost cutting reasons, left things in a frankenstein state, which explains why they market these features yet many of them aren't functional... Literally trying to figure out how to enable them for a frankenstein architecture not meant for them while marketing it to people as a revolutionary concept.
 
  • Like
Reactions: xpea

maddie

Diamond Member
Jul 18, 2010
5,158
5,545
136
Good post.

I'm not as pessimistic as you, however, mainly because of the high bandwidth available and the use of the large #GB of HBM2 as a cache. I see this as mitigating a lot of the traditional problems. A main case of this is when you have to flush & reload a lot of the data due to nonlocality.

As you say however, we'll have to wait & see.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
In 2017, the usage of the word pessimistic has become a meme for people who can't stomach the truth.. Don't out yourself like that and glad we could come to a consensus :sunglasses:. That's what these drawn out exchanges are for : understanding a higher truth.

Every modern card does what HBCC does. The only question is if they've managed to do it better w/ this additional or reworded packaging. All that comes down to is hard numbers. Do I get excited when I hear these terms and see the slides? Sure... Then I ask the deeper questions. I've gone through enough companies' architectures vs. marketing slides lately to have a bone to pick w/ the flat out lying that occurs about such features. In some cases, such new approaches to caching were so bad that they literally disabled the feature in the hardware pipeline.

IMO if you want to trace where this cache controller came from, locate the point in time and slide deck in which Raja was rambling on about putting M.2 SSDs on the video card to boost storage and serve as a yuge local cache. That's where this HBCC phenomenon came from. It's a feature meant for and more appropriately used on their higher end cards that's now on the lower ones but will be gimped and scaled down to a point of non-use.

I am critical of Radeon but Nvidia pulls the same crap. Sadly, the marketing slides don't get updated thus you don't find this out until someone probes the hardware and finds the gimping.

Edit : Here you go .. Radeon Pro (SSG)
https://pro.radeon.com/en-us/product/pro-series/radeon-pro-ssg/
http://www.legitreviews.com/amd-radeon-pro-ssg-m-2-ssds-board-1tb-flash_184641
http://www.pcgamer.com/amd-radeon-pro-ssg-pairs-vega-with-2tb-of-memory/

This is what an HBCC is really supposed to be used for. Why is it referenced on consumer Vega? Because it's just sitting on the common die and they figured they'd market it as a feature and reuse it for something. That something, given that it's not ready at launch, literally seems to be a moving finish line.

And here's a "pessimistic" reviewer highlighting the pro use case for what HBCC was actually meant for :
Note how much he highlights how great this technology is w/ flash storage on the video card, in that it cuts down on the huge PCI-E latency. So, congrats, you got a cut-down hand-me-down pro feature w/ the local flash cut out and redirected over PCI-E to main system memory... all w/ the marketing fluff of being called HBCC. I already have a rough idea of the performance. Now I look forward to how they actually implemented it.
 
Last edited:
  • Like
Reactions: xpea

eek2121

Diamond Member
Aug 2, 2005
3,420
5,066
136
Where are people getting AMD confirmed a refresh of Vega on 14nm+? I see people saying that but don't know where its coming from. Just because they showed that Vega would be on 14nm+ doesn't mean it is the Vega 64/56 version. They are supposed to have smaller and allegedly a larger (Vega 20, with 32GB HBM2) version of Vega on the way. Plus it could be speaking about an APU (not sure if people have found out for sure if the first Ryzen one will be or not, but could see them pushing an updated one out next year, hopefully in time for back to school).

Hopefully we'll see tweaked version of Vega, but I'm not sure why people are hyping that right now, as its likely not due for months if not close to a year.



I don't agree. I think they absolutely had planned on things being better (namely HBM2 production, but also sounds like the interposer is a main culprit to the low production thus far; which while that sucks, working with interposer designs is going to be important, even Intel is touting it, and Nvidia is likewise going to be using it too, thus AMD feels its best to cut their teeth on it earlier to try and get an advantage in dealing with it so the hardships now they hope will pay off later). They had to get production going to get cards out for both developers and themselves to work with on the software side. Between that and because AMD needs all the revenue they can bring in as quickly as they can, they can't wait til they have enough stock for wide release. I think they fully planned on launching without all the features enabled as they thought the baseline performance was enough to match the 1080 and 1070 (which seems to be the case), and then would work on enabling other big features (which they knew would take time, but they have to have cards to actually work with to get there) to spoil the Volta launch (legitimately that is how significant those features can have on performance is that it would put them in another performance class entirely; we'll see if they can actually get them there ever, let alone in time for Volta).

I think AMD expected to be in higher production before now. I think that's partly why Nvidia released the 1080Ti (and were willing to keep price reasonable): they were expecting to be spoiling a more imminent Vega launch (and were hedging their bets that performance might be exceptional). I think AMD manually showed, by basically going in and creating two versions of different scenes (normal versus culled, for instance), how stuff like primitive shaders and the NGG path improve performance (and possibly efficiency/power use as well), hence some of what has been said.



Sometimes I read posts in this subforum and really have to wonder what people are thinking. They did release Vega as a prosumer card. I disagree, as we see people are actively looking for every reason to be "butthurt" about this stuff (and especially AMD GPUs), and we've been seeing this for years now. I take it you missed how "butthurt" people were about Vega FE? It was competitive enough that it forced Nvidia to follow suit by making Titan a true "prosumer" card, and yet we still saw incessant bitching about even Vega FE being a "total disaster" (to the point people said they should have just outright killed Vega entirely and never even brought it to market in any form). Except they're not going to get that volume from that market alone, plus if they don't have cards for gamers to use, game developers are not going to spend any time targeting them (unless AMD puts in all the effort, which they just don't have the resources to do that; not to mention developers already were complaining that AMD wasn't working with them enough). So it would delay progress on supporting Vega features that makes it worthwhile, meaning it would then have the same problems just in the future.

There's two things people are not taking into account. AMD likely couldn't just stop Vega production even if they wanted to (which they don't, but this is to try to put this "they should've killed Vega off and not done it at all" nonsense to bed), as they likely already had contracts for that production (well in advance, before Vega taped out and also before HBM2 probably entered production), so they'd be on the hook for the costs with no product to sell to recoup them (and in spite of what some people on here seem to believe in spite of their "basic economics lessons", selling something even if you don't make money on it is better than being stuck paying for it without any return; I'm also personally very skeptical of the claims that AMD's other products are taking up all of GFs wafers to the point that AMD is losing out on overall production of them by putting some towards Vega, while Vega production helps them with the wafer deal which is why it was produced at GF in the first place). Also, without cards to work with, they can't enable the features and improve the performance as they wouldn't have hardware to test with (as many people condemning GCN point out, theoretical performance is meaningless and the hard part is getting as much real world performance out of the hardware as possible).

It makes a ton of sense on AMD's part to do a refresh. 14nm+ is speculated to be a performance optimized process. Indeed if you do some digging, you'll find evidence to support this claim...but I'm not going to post the links out of respect for the privacy of certain folks. At any rate, Navi won't be shipping in volume until 2019. There will be a Vega refresh in 2018. Expect 4 stacks of HBM, higher clocks, mature drivers, and lower TDP. The Vega you see today is the result of AMD trying to rush a card out the door on 14LPP. It never should have happened, but AMD had no choice. If they had been able to use another foundry, Vega would have run circles around the competition.
 

ub4ty

Senior member
Jun 21, 2017
749
898
96
Once they get the money, they need to cut completely new dies for Vega w/o all the extra power sucking pro features that will either be gimped or are sitting disabled in the background. There's a reason why Nvidia has so many die cuts for the same micro-architecture and why they use far less power than RX Vega per unit of performance. I can respect Radeon for what they've done on their budget and understand it's quite amazing. I also understand they likely didn't have the budget to cut completely separate dies for consumers. In a way they signaled this by reducing how much they were pushing RX Vega and how it rolled out. I'm just surprised, with it out now, how much they continue to highlight features that I know are either disabled or going to be gimped. It's here that they are setting up false expectations that will cost them in the long run because they're setting people up for disappointment. Now, could I be wrong? Yeah... but don't you think they'd confirm it and not leave it up in the air if they intend to keep and make such features functional? They'd have to have brain dead marketing/etc. not to... So, there it is.

I was hoping for magic but am slowly realizing it's not gonna happen... they have a pro line that costs 4-40x as much to sell, after all.

The Vega you see today is literally a cut die made for pro compute workloads w/ a GPU pipeline shoved in. Shrinking gate size doesn't fix this. Spinning a custom consumer die w/o all the pro level features does.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,596
136
It's the software driving things forward today.
Protocols are the technical platform of most importance.
The hardware is made to carry those new protocols. The hardware is the servant.
Adding a fancy name to it doesn't change that. Be it IF or HBCC or some fancy software-centric shader from NV or AMD.
 
May 11, 2008
22,565
1,472
126
Naa. Ms is one thousandth of a second.
So 1000MB/s is pci ssd speed.

Yeah, you are right.
Two weeks with hardly any sleep again, less than 3 to 4 hours a day. I was getting blind to details. :(
Luckily, I am now in the period of catching up again. I slept 12 hours in a row today. :)
 
  • Like
Reactions: krumme

richaron

Golden Member
Mar 27, 2012
1,357
329
136
Every modern card does what HBCC does. The only question is if they've managed to do it better w/ this additional or reworded packaging. All that comes down to is hard numbers. Do I get excited when I hear these terms and see the slides? Sure... Then I ask the deeper questions. I've gone through enough company's architectures vs. marketing slides lately to have a bone to pick w/ the flat out lying that occurs about such features. In some cases, such new approaches to caching were so bad that they literally disabled the feature in the hardware pipeline.

Dude you've talked a lot about how everything already does the "HBCC" thing, but you are missing most of the points.

I think you need to accept/realise the huge CPU/driver overhead advantage of a dGPU "directly" accessing virtual memory over a conventional PCIe link (or any link for that matter). HBCC with hardware virtual memory and direct access improves over the conventional methods without even talking about the HSA/hUMA/IF hardware capabilities. The alternative, and what you seem to think is comparable, is a relative mess of CPU-run drivers and memory conversions with a massive latency disadvantage.
 
Last edited: