Will AMD support AVX-512 and Intel TSX?

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
It's unfortunate that AMD was too late to the game to bring a competing transactional memory extension like their proposed ASF, so it would be in their best interest to support an already established extension like TSX. Will AMD support it in their upcoming new microarchitecture, or will they eventually support it in the relatively near future, if they can at all?

As there are many versions of AVX-512 out there, it would be splendid if AMD were to get the basic AVX-512 support that Skylake-X processors will already have: AVX-512 F, CD, VL, BW, and DQ.

There's another bonus too: if AMD were to support all of these extensions, and if the next generation of game consoles were to use AMD APUs again, it would cement mainstream usage of these extensions in common performance-critical applications sooner rather than later, since Intel wouldn't easily be able to fuse off the functionality in lower-tier CPUs such as Pentiums or Celerons. (Consoles have support for AVX, but Intel's latest Skylake Pentiums don't even support AVX!)
 

NTMBK

Lifer
Nov 14, 2011
10,239
5,026
136
Meh, I'm honestly not that fussed about AVX-512, especially for games consoles. The current consoles execute AVX on 128-bit vector units, and that seems like the most natural width for game code. 4 element FP32 vectors let you accelerate the vast majority of 3D game code, because XYZW homogeneous co-ordinates and 4x4 transform matrices are a fundamental part of 3D game design. 16-element vectors are a lot trickier to utilize efficiently, and when you start getting data wide enough to saturate AVX-512 you would be better offloading to the integrated GPU- which already works in the same memory pool.

I would rather see console CPUs focus on making branchy, difficult to predict scalar code faster. Stick with 128-bit SIMD units, and spend your transistor budget on branch prediction, cache hierarchy, OoO execution width and windows, and so on. Leave the vector maths to the GPU.
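The fit between vec4 math and 128-bit SIMD can be made concrete: transforming a homogeneous XYZW point by a 4x4 matrix is four 4-wide FP32 dot products, i.e. exactly one 128-bit lane's worth of work per matrix row. A minimal pure-Python sketch of the data shapes involved (no actual SIMD, just an illustration):

```python
# Transform a homogeneous XYZW point by a 4x4 row-major matrix.
# Each row dot product is a 4-wide FP32 operation -- exactly the
# width of one 128-bit SIMD register on a console-class CPU.
def mat4_transform(m, v):
    return [sum(m[row][i] * v[i] for i in range(4)) for row in range(4)]

# Translation by (1, 2, 3): the w = 1 lane picks up the translation column.
translate = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
point = [5.0, 5.0, 5.0, 1.0]
print(mat4_transform(translate, point))  # [6.0, 7.0, 8.0, 1.0]
```

In real engine code each of those row operations maps onto a single 128-bit multiply-add, which is why widening the units beyond 128 bits buys little for this bread-and-butter workload.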
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Meh, I'm honestly not that fussed about AVX-512, especially for games consoles. The current consoles execute AVX on 128-bit vector units, and that seems like the most natural width for game code. 4 element FP32 vectors let you accelerate the vast majority of 3D game code, because XYZW homogeneous co-ordinates and 4x4 transform matrices are a fundamental part of 3D game design. 16-element vectors are a lot trickier to utilize efficiently, and when you start getting data wide enough to saturate AVX-512 you would be better offloading to the integrated GPU- which already works in the same memory pool.

I would rather see console CPUs focus on making branchy, difficult to predict scalar code faster. Stick with 128-bit SIMD units, and spend your transistor budget on branch prediction, cache hierarchy, OoO execution width and windows, and so on. Leave the vector maths to the GPU.

Yeah. My guess is Jaguar got AVX for that reason; it's hardly usable in a tablet or low-watt 13".
As for the branchy code, I think the solution is in the game engine, not predictors, caches, and whatnot. It's a wall anyway. I don't know how Dan Baker's new engine model without a main thread looks from that perspective. The solution needs to come from this kind of thinking, e.g. with new hardware to back it up.
I am sure our new consoles will see a solution to this integrated in the hardware.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
And surely 512-wide vectors are nonsense in a CPU, as explained by Agner. It's like asking an elephant to get into your car for a ride. There are better ways, and the dog-slow uptake of even AVX2 proves it. Not to mention that even plain AVX adoption has only just started.
 
  • Like
Reactions: Drazick

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
AVX512 will remain forever niche.

AMD would be better served dedicating the die area toward something else. Or not at all and having a smaller die.
 
  • Like
Reactions: Drazick

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
AMD plans to only support Advanced Synchronization Facility and 512-bit ISAs with their Clustered Multithreaded architectures.

The particular implementation of AMD's 512-bit will always allow for FlexFPU-ness.
1x 8x64 or 16x32 => 2x 4x64 or 4x 8x32 => 4x 2x64 or 8x 4x32
Either, by slotted functionality (One unit with four almost independent datapaths) or full units(four 128-bit units, with front-end and back-end trickery).

Doing the math based on the last update in 2015, the CPU supporting these TSX/AVX512-like abilities would launch in 2019.
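The "FlexFPU-ness" decomposition above (1x 512-bit, 2x 256-bit, or 4x 128-bit) can be modeled simply: a 16-lane FP32 operation gives identical results whether executed in one, two, or four passes over narrower datapaths; only the pass count (and thus throughput) changes. A hedged pure-Python sketch of that splitting:

```python
# A 512-bit vector op over 16 FP32 lanes executed on narrower
# datapaths by chunking: 16 lanes/pass (native 512-bit),
# 8 lanes/pass (2x 256-bit), or 4 lanes/pass (4x 128-bit).
def vec_add_split(a, b, lanes_per_pass):
    out, passes = [], 0
    for start in range(0, len(a), lanes_per_pass):
        out.extend(x + y for x, y in zip(a[start:start + lanes_per_pass],
                                         b[start:start + lanes_per_pass]))
        passes += 1
    return out, passes

a = [float(i) for i in range(16)]
b = [1.0] * 16
full, p1 = vec_add_split(a, b, 16)    # 1 pass  (512-bit datapath)
half, p2 = vec_add_split(a, b, 8)     # 2 passes (256-bit datapaths)
quarter, p4 = vec_add_split(a, b, 4)  # 4 passes (128-bit datapaths)
assert full == half == quarter        # same result, different throughput
print(p1, p2, p4)  # 1 2 4
```

Whether the hardware does this as one unit with four semi-independent datapaths or as four 128-bit units stitched together is invisible to software; only the cycle count differs.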
 

inf64

Diamond Member
Mar 11, 2011
3,703
4,030
136
AVX512 will remain forever niche.

AMD would be better served dedicating the die area toward something else. Or not at all and having a smaller die.
AMD is likely going to go full 256-bit FP units with Zen2 or Zen3 (my guess is Zen2). Then they will support AVX512 the same way they support AVX256 today (executing it via 128-bit units over 2 cycles).
Going full 512-bit FP is basically not going to happen on 7nm, as you will blow up the power budget if you want to keep the clocks the same, and complexity will skyrocket. AVX512 will be a super-niche feature, irrelevant for 90+% of the server/workstation market and 99% of the desktop market.
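The "double pumping" scheme described above reduces to simple arithmetic: a wide op issued on narrower units costs width divided by unit width cycles per op. A toy calculation (illustrative, not measured):

```python
# Back-of-envelope cost of executing wide vector ops on narrower
# units ("double pumping"), as Zen does for 256-bit AVX on its
# 128-bit pipes. Cycles per op = op width / unit width.
def cycles_per_op(op_bits, unit_bits):
    assert op_bits % unit_bits == 0
    return op_bits // unit_bits

# Zen today: AVX2 (256-bit) on 128-bit units -> 2 cycles per op.
print(cycles_per_op(256, 128))  # 2
# Hypothetical Zen2/Zen3: AVX-512 on 256-bit units -> 2 cycles per op.
print(cycles_per_op(512, 256))  # 2
# Worst case: AVX-512 on today's 128-bit units -> 4 cycles per op.
print(cycles_per_op(512, 128))  # 4
```

The upside of this approach is ISA compatibility at half the datapath area and power; the downside is half the peak vector throughput versus a native implementation.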
 
  • Like
Reactions: Drazick

LTC8K6

Lifer
Mar 10, 2004
28,520
1,575
126
Why would AMD support AVX512 in Zen2 if it is not needed for anything anytime soon? Zen2 isn't that far off.
 
Last edited:
  • Like
Reactions: Drazick

inf64

Diamond Member
Mar 11, 2011
3,703
4,030
136
Why would AMD support AVX512 in Zen+ if it is not needed for anything anytime soon? Zen+ isn't that far off.
Zen+ is not the same thing as Zen2. AMD plans a 14nm++ revision of the current offerings; Zen2 is the 7nm product, and chances are it won't be ready until Q4 2018 due to GF's 7nm ramp plans.

PS: AMD supports AVX256 now, and it is not needed at all, since apps that use that ISA are very scarce.
 
  • Like
Reactions: Drazick

LTC8K6

Lifer
Mar 10, 2004
28,520
1,575
126
Zen+ is not the same thing as Zen2. AMD plans for a 14nm++ revision of current offerings. Zen2 is 7nm product and all the chances are that this won't be ready until the Q4 of 2018 due to GF 7nm ramp plans.

PS AMD supports AVX256 now and it is not needed at all since the apps that use that ISA are very scarce.
I meant Zen2 though.
 
  • Like
Reactions: Drazick

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
Doesn't this have to do with market segments, though? I mean, a fair number of people and businesses who chose X99 would have been using systems that relied on both many PCIe lanes and 'wider' AVX2. So if AMD wants to target that market it needs to be competitive there, and that will possibly include AVX2/AVX512.

If you look at for example some of the content creation benchmarks that were done the results were "skewed" because of this. And so the comments ended up being that if you need a specific task to be done faster then a Ryzen won't be the best chip for you, regardless of whether or not it performs better for the money in other areas.

So curiously with Threadripper some will have an interesting choice to make between a bunch of extra PCIe lanes and wider AVX....
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
AMD is likely going to go full 256bit FP units with Zen2 or Zen3 (my guess in Zen2). Then they will support AVX512 the same way they support AVX256 today(executing them via 128bit units in 2 cycles).

Yes, sorry, that's what I meant! Quasi support is grand.

Native support is a... marginal.... use of die space IMO.
 
  • Like
Reactions: Drazick

Yakk

Golden Member
May 28, 2016
1,574
275
81
I don't see AVX-512 as a high priority, or even needed for AMD as noted above. That type of work is better suited to the GPU and APU workloads which AMD also makes.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
Doesn't this have to do with segments on the market though? I mean, a fair amount of people and businesses who chose x99 would have been using systems that used both many PCIe lanes and 'wider' AVX2. So if AMD wants to target that market it needs to be competitive there, and that'll possibly include AVX2 / AVX512.

If you look at for example some of the content creation benchmarks that were done the results were "skewed" because of this. And so the comments ended up being that if you need a specific task to be done faster then a Ryzen won't be the best chip for you, regardless of whether or not it performs better for the money in other areas.

So curiously with Threadripper some will have an interesting choice to make between a bunch of extra PCIe lanes and wider AVX....
Depends on how you look at it. We know that Intel has an AVX offset (the CPU slows down when running AVX due to higher power demands), so we don't know what X299 AVX performance is going to look like. On the server side, AMD, by using four independent dies, is going to allow for much higher clocks (my guess is 25-33% higher than Intel's 24c offerings). So realistically AMD's Ryzen R7 performance is on par with the i7, but down on clock speed. Threadripper will probably be closer on clock speed due to lower stock speeds on the i9s and the AVX offsets, but will be even on cores, so it will be way down on AVX2 performance while also being half the price. EPYC will be clocked higher, have far more cores and more memory bandwidth, so the difference between EPYC and a 24c Xeon is going to be really damn small. So outside the two actual occasions where someone is using AVX512 (note: I didn't say use cases), the only market where AVX2 is a true weakness for AMD is the workstation/HEDT market with Threadripper, and there it will be roughly half the price. Also, it looks like Threadripper will be available nearly six months before the 14-18c options, which evens the playing field: Threadripper would realistically be competing with a 12c i9 that costs 50% more while having 25% fewer threads. With a clock advantage and a core advantage but a per-core AVX disadvantage, it would be slower than the i9, but not by much.

TLDR: Kind of got ranty, but the point is that Zen gives AMD a lot of legroom on AVX2 by supplying more resources for it, even if per-core performance is lower than Intel's. The only platform where it would be an issue is X399, but even there the high-core-count parts are going to be ghosts for a while. Threadripper will maintain R7-like clock options and realistically only competes up to the 12c i9, which is priced 50% above AMD's fastest 16c option. All in all, for what AMD "loses" in AVX2, it doesn't actually lose much.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
TLDR; Kind of got ranty but the point being that Zen gives AMD a lot of leg room in AVX2 by suppling more resources for AVX2 even if the per core performance is less than Intel. The only platform it would be an issue on would be X399. But even there the high cores are going to be ghosts for awhile. ThreadRipper will maintain R7 like clock options and realistically is only competing up to the 12c i9 at 50% more than the fastest 16c option. All in all for what AMD "loses" in AVX2. It doesn't actually lose too much.

I should have stated that I was talking about X399/Threadripper, not the server parts. I sort of agree that it doesn't lose that much, but I think the perception in part of the target segment is different: they'll look at some of the benchmarks and write it off as an underperforming chip based on the one task it performs significantly worse on.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
I don't see AVX-512 as a high priority, or even needed for AMD as noted above. That type of work is better suited to the GPU and APU workloads which AMD also makes.

I should have noted I was talking about X399/Threadripper. I think there's a type of workstation setup where AVX work is run on the CPU, and if it weren't, that work would have to be programmed onto the GPU instead, as far as I understand it. That's not something people want to do, because, well, why spend that money?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,355
1,548
136
AMD plans to only support Advanced Synchronization Facility and 512-bit ISAs with their Clustered Multithreaded architectures.

And from what tea leaves did you divine this?

Just FYI: AMD has no more clustered multithreaded architectures. They ditched CMT; no one is working on anything derived from it anymore.
 
  • Like
Reactions: inf64

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Here - this is something useful for comparing with and without AVX2.

Courtesy of Dell:
[Chart: Dell LS-DYNA benchmark results, Broadwell with and without AVX2 (5342.Figure1.JPG)]

We can make the rough and very approximate assumption that Zen is equivalent to Broadwell minus AVX2 for simplicity.

So, in the LS-DYNA run, BDW benefits from AVX2 by between 14 and 24% (with 24% being a big outlier - the other 4 comparisons being 13.9%-14.9%).
From that, we could roughly say that AVX2 speeds up BDW by 20%: a very conservative assumption in Intel's favour, given it's more like 15%.

Further assuming that any software you have is not licensed per thread...

If you have the choice between a 6 core BDW or a 8 core Zen, then per clock you'll get more out of Zen.
If you have the choice between a hypothetical 12 core BDW or a 16 core Zen, then per clock you'll get more out of Zen.
If you have the choice between a hypothetical 14 core BDW or a 16 core Zen, with our 20% assumption, you'll get more out of BDW (although it's likely to be a wash).
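The three comparisons above can be checked with a few lines of arithmetic, using the post's own assumptions (Zen is Broadwell-minus-AVX2 per core, and AVX2 buys BDW roughly 20%):

```python
# Rough per-clock comparison under the post's assumptions: Zen is
# Broadwell-minus-AVX2 per core, and AVX2 buys BDW ~20% in LS-DYNA.
AVX2_SPEEDUP = 1.20  # the post's generous-to-Intel figure

def bdw_wins(bdw_cores, zen_cores):
    # True if AVX2-enabled BDW beats Zen per clock, core-for-core.
    return bdw_cores * AVX2_SPEEDUP > zen_cores

print(bdw_wins(6, 8))    # False: 7.2 "BDW units" vs 8 for Zen
print(bdw_wins(12, 16))  # False: 14.4 vs 16
print(bdw_wins(14, 16))  # True: 16.8 vs 16, though close to a wash
```

The crossover sits right where the core-count gap shrinks below the assumed AVX2 multiplier, which is why the 14c-vs-16c case is the only one that flips.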
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,654
136
I should have stated that I was sort of talking about x399/Threadripper, not the server parts. I sort of agree that it doesn't lose that much, but I think the perception in part of the target segment is different. They'll look at some of the benchmarks and determine that it's an underperforming chip based on that task that it performs significantly less well on.
But is it really underperforming? If the rumor of the top 16c chip at $1k is true, it will be priced at the cost of a 10c Intel chip. Clocks will be a wash, with TR possibly holding an all-core advantage when running AVX2, and it will have a six-core advantage. So while it will probably be down on AVX, the difference will be marginal at best. Going all out and spending 50% more on the CPU to get the 12c i9 is the only case where AMD really misses out on AVX performance, and the only place in Zen's roadmap where it has any measurable loss to Intel that isn't gaming-related.

Think about it this way. Against the 6c, 8c, and 10c i9s, AMD can match or beat AVX2 performance with a chip priced the same or cheaper, with tons of other resources available. It's only once you step outside AMD's price brackets, saying you want the best there is even as poor value, that the i9 lineup looks better. So really this isn't a question of AMD being a poor AVX2 performer at its pricing spots; it's AMD bowing out at a price-escalation point it isn't going to compete at anyway.

Sure, some people will claim AVX2 performance matters more than anything else and will get the 12c, or the 18c once that is ever out. But those parts really aren't competing with AMD, because AMD has nothing in those price brackets. The real question, outside the people who use that number to convince themselves to buy the 12c+ options, is how many people need AVX2 performance so badly that all the other advantages are wiped out by being a little slower (16c TR vs 12c i9), or by options as insanely priced as the 14c+.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Meh, I'm honestly not that fussed about AVX-512, especially for games consoles. The current consoles execute AVX on 128-bit vector units, and that seems like the most natural width for game code. 4 element FP32 vectors let you accelerate the vast majority of 3D game code, because XYZW homogeneous co-ordinates and 4x4 transform matrices are a fundamental part of 3D game design. 16-element vectors are a lot trickier to utilize efficiently, and when you start getting data wide enough to saturate AVX-512 you would be better offloading to the integrated GPU- which already works in the same memory pool.

I would rather see console CPUs focus on making branchy, difficult to predict scalar code faster. Stick with 128-bit SIMD units, and spend your transistor budget on branch prediction, cache hierarchy, OoO execution width and windows, and so on. Leave the vector maths to the GPU.

Ehh, I wouldn't use an integrated GPU for everything when it has a far more restricted programming model than a CPU and very little register space too. There are simply too many pitfalls for an integrated GPU to entirely replace the tasks a CPU's SIMD extension would handle.

First of all, there's an overhead cost to be paid for CPU-GPU communication despite all of AMD's gallant efforts with HSA, whereas a SIMD extension is integrated into the CPU itself.

Second, there aren't many established frameworks out there that embrace heterogeneous programming. Many app writers would rather have an ISA extension than deal with another ISA altogether; look at how slowly CUDA adoption has progressed, and then there's the Cell processor, a spectacular example of a failed heterogeneous design. (It didn't help that it also had an extreme NUMA system.) The consensus is that symmetry is highly valued in hardware design.

Third, I feel that GPUs are arguably too wide for their own good to take advantage of the many different vectorizable loops that a CPU's SIMD units would. A growing trend with GPUs is that you need an extreme amount of data-level parallelism to exploit them, and a related problem is the limited number of vector registers a GPU has per thread. With a CPU, the common worst case is spilling to the L2 or even the L3 cache, but with a GPU you'll run your caches dry very quickly, and soon enough you'll hit the wall that is main memory, which causes significant slowdowns once you're practically bandwidth-bound.

In short, GPUs have too many pitfalls in comparison to CPU SIMD. I think AVX-512 strikes the ideal balance between programming model and parallelism, and it makes vectorization at the compiler level easier compared to AVX2 or previous SIMD extensions. Game consoles should have AVX-512, since it meets the needs of game engine developers and arguably of hardware designers, with lower latency as a bonus side effect.

AVX-512 is also low-hanging fruit for increasing the performance of next-gen console CPUs. (We're already at our limit on the CPU side for current-gen consoles, and there will always be loops too small to see a speedup on GPUs, so a wider SIMD would come in handy along with a vastly more refined programming model.) I believe AVX-512 is the console manufacturers' salvation for their perf/mm^2 and perf/watt targets, considering that ALUs are cheap these days. They can't increase clocks without sacrificing perf/watt, and they definitely do not want something as beefy as Ryzen, sacrificing so much die space for relatively little gain in ILP (graphics and physics programmers are going to get mad about where all that die space went), so I don't think it's a coincidence that AVX-512 would slot in nicely in these scenarios ...
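One concrete reason AVX-512 eases compiler vectorization is its per-lane mask registers: a loop body guarded by an `if` can run branch-free, with the mask deciding which lanes commit their results. A pure-Python model of that predication (the real hardware uses mask registers like `k1` on masked instructions; the function name here is just illustrative):

```python
# Model of AVX-512 style masked execution: every lane computes,
# but only lanes whose mask bit is set commit their result;
# masked-off lanes keep the destination's old value.
def masked_add(dst, a, b, mask):
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]

# Vectorizing "for i: if a[i] > 0: dst[i] = a[i] + b[i]" without branches:
a = [3.0, -1.0, 5.0, -2.0]
b = [10.0, 10.0, 10.0, 10.0]
dst = [0.0, 0.0, 0.0, 0.0]
mask = [x > 0 for x in a]            # compare step -> mask register
print(masked_add(dst, a, b, mask))   # [13.0, 0.0, 15.0, 0.0]
```

With SSE/AVX2 the compiler has to emit blend-and-select sequences or give up on such loops; built-in masking is what lets it vectorize conditional bodies directly.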
 
  • Like
Reactions: Carfax83 and dogen1

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
I believe AVX-512 is the console manufacturers salvation in their perf/mm^2 and perf/watt targets when we consider that ALUs are cheap these days. They can't increase clocks without sacrificing perf/watt and they definitely do not want something as beefy as Ryzen to sacrifice so much die space for relatively little gains in ILP (graphics and physics programmers are going to get mad where all that die space went into) so I don't think it's coincidence that AVX-512 would slot in nicely in these scenarios ...

That is full of contradictions.
 
  • Like
Reactions: Drazick

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
AMD plans to only support Advanced Synchronization Facility and 512-bit ISAs with their Clustered Multithreaded architectures.

The particular implementation of AMD's 512-bit will always allow for FlexFPU-ness.
1x 8x64 or 16x32 => 2x 4x64 or 4x 8x32 => 4x 2x64 or 8x 4x32
Either, by slotted functionality (One unit with four almost independent datapaths) or full units(four 128-bit units, with front-end and back-end trickery).

Doing the math based on the last update in 2015 would mean the CPU which would support TSX-AVX512-like abilities would launch 2019.

Just like how they 'planned' on supporting FMA4, XOP and CVT16, and how did that turn out?

Your post is nothing more than conjecture. ASF was only ever a proposal to begin with, and TSX already has far greater market share and developer support, so it's too late for AMD to gain any traction with its own transactional memory technology ...
 

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
The least you could do is elaborate ...

"I believe AVX-512 is the console manufacturers salvation in their perf/mm^2"

Except that, per mm^2, including AVX512 is very poor value compared with processing every other instruction.


"and perf/watt targets"

Except that you have to radically downclock the core to remain within thermal margins. That introduces all sorts of issues in code parallelism, where it's no longer a simple matter of feeding every core similar instructions, and in decoupling the core running AVX512 from the uncore feeding the other threads.


"They can't increase clocks without sacrificing perf/watt"

They can. Process shrinks. IIRC neither console on a store shelf is on the latest process at the minute.

Note also that AVX512 sacrifices clocks to fit in the same power envelope as other operations.


"and they definitely do not want something as beefy as Ryzen"

As opposed to a core that devotes significant space to delivering one subset of instructions to the detriment of others?


"to sacrifice so much die space for relatively little gains in ILP"

If you have AVX512, you're probably going to be sacrificing the number of cores you can have (to fit in the same mm^2 and power budget).


"so I don't think it's coincidence that AVX-512 would slot in nicely in these scenarios ..."

From a programmer's perspective, it may be nice to have. But if you are wise enough to appreciate the losses you'll take elsewhere in the CPU to deliver that AVX512 (all other things being equal), then it makes little sense.
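The downclocking point can be made concrete with a toy Amdahl-style model: widening the vectors multiplies throughput for the vectorized part only, while the AVX-512 license clock drop taxes everything running on the core. The offset and workload numbers below are illustrative placeholders, not measurements:

```python
# Toy model of the AVX-512 frequency-offset tradeoff. The width
# gain applies only to vector code; the downclock applies to all
# work on the core. Numbers are illustrative, not measured.
def effective_speedup(width_gain, clock_factor, vector_fraction):
    # Time for the vector part shrinks by width_gain * clock_factor;
    # the scalar part only sees the slower clock.
    vec = vector_fraction / (width_gain * clock_factor)
    scalar = (1.0 - vector_fraction) / clock_factor
    return 1.0 / (vec + scalar)

# 2x width (512 vs 256), 15% downclock, 90%-vector kernel: a win.
print(round(effective_speedup(2.0, 0.85, 0.9), 2))  # 1.55
# Same downclock but only 20% of the work vectorizes: a net loss.
print(round(effective_speedup(2.0, 0.85, 0.2), 2))  # 0.94
```

This is the crux of the argument above: unless a large fraction of the workload actually vectorizes at 512-bit width, the clock penalty makes the whole core slower.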
 
  • Like
Reactions: Drazick