Haswell Integrated GPU = Larrabee3?


IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
According to the first post there will be 16-core parts. How many cores will be active remains to be seen, of course. Likewise, Knights Corner is said to have "more than 50 cores", which likely means 64 with several disabled, depending on yield.

I think we're giving too much attention to a resume. About 2 years ago, there was a slide that said 12 cores max on Ivy Bridge EX, and 16 for Haswell. I assume targets change as the product gets closer to reality.

Yes, but notice that Haswell will double the computing density with FMA, while Knights can't pull that trick anymore.

How do you know that? Stampede, the 2013 supercomputer that uses Knights Corner, has 10 PFlops, 8 of which come from KC. The successor to Knights Corner is said to increase Stampede's output to 15 PFlops or more, suggesting the co-processor part will double.

Assuming Haswell EX is a 16-core part @ 3GHz, it'll have a theoretical peak of 768 GFlops. 2x Knights Corner, using my estimates, will result in 2.4 TFlops. That's still a big gap.
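For reference, the arithmetic behind that 768 GFlops figure; a quick sanity check assuming two 256-bit FMA units per core (not confirmed, just the usual estimate) and counting double precision:

Code:
#include <cstdio>

int main() {
    // Hypothetical Haswell EX: 16 cores @ 3GHz, and (assumption) two
    // 256-bit FMA units per core. In double precision each unit does
    // 4 lanes x 2 FLOPs per cycle.
    const double cores = 16, ghz = 3.0;
    const double flops_per_cycle = 2 /*units*/ * 4 /*DP lanes*/ * 2 /*FMA*/;
    printf("peak = %.0f GFlops\n", cores * ghz * flops_per_cycle); // 768
}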

Granted, but DDR4 is on its way as well and CPUs require less RAM bandwidth due to massive caches.

Yes, comparing DDRx to GDDR5 and its derivatives, sure. Four channels of DDR4-3200 will result in about 100GB/s, but the GDDR5 in the Radeon HD 6970 is already at 175GB/s. Also, in practice, 4 DDR4 channels will remain closer to 50GB/s than 100GB/s, because DDR generations only reach their maximum frequencies at the end of their lifetime. We're also talking about the EX, which uses very conservative speeds. But let's say it uses DDR4-2133, for 68GB/s.
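The arithmetic behind those bandwidth numbers, for reference (each 64-bit channel moves 8 bytes per transfer):

Code:
#include <cstdio>

int main() {
    // Peak DRAM bandwidth = channels * transfer rate * 8 bytes/transfer.
    printf("4x DDR4-3200: %.1f GB/s\n", 4 * 3200e6 * 8 / 1e9); // 102.4
    printf("4x DDR4-2133: %.1f GB/s\n", 4 * 2133e6 * 8 / 1e9); //  68.3
}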

Knights Ferry had caches way bigger than regular GPUs' too. Of course, not as big as the EX's.

Most importantly, it's a co-processor that's not designed to replace Xeons entirely. And you seem to be right that it doesn't make sense. But who knows what Intel's real goal is?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
The real reason Haswell's GPU should be based on Gen X is that the driver support is already in place thanks to its predecessors. A complete revamp is bad news.

Do you think all Haswell parts will have an IGP? And do you happen to know how many GFLOPS 20 Gen X EUs amount to?
Sandy Bridge achieves 8GFlops/EU @ 1GHz, and Ivy Bridge is said to double that with enhanced co-issue. But even if it doubled again, would it really matter for compute? Would they bother to stick something like DP units in there? You don't see high DP performance even on enthusiast cards, only on workstation variants. Why would they do it differently on integrated?
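For what it's worth, here's where such numbers would land, assuming the 8GFlops/EU comes from a 4-wide MAD per clock (4 lanes x 2 FLOPs) and treating the Ivy Bridge doubling as a rumor:

Code:
#include <cstdio>

int main() {
    // Sandy Bridge: 4-wide MAD per EU per clock = 8 FLOPs/clock,
    // i.e. 8 GFlops per EU at 1GHz (the doubling below is rumored).
    const double flops_per_eu = 4 * 2;
    printf("20 EUs @ 1GHz: %.0f GFlops\n", 20 * flops_per_eu);            // 160
    printf("with doubled co-issue: %.0f GFlops\n", 2 * 20 * flops_per_eu); // 320
}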

And how much room would there be to stick it in the non-PC parts like the EPs and EXs? Are there enough gains to be worth the trouble?
 
Last edited:

Khato

Golden Member
Jul 15, 2001
1,206
251
136
The real reason Haswell's GPU should be based on Gen X is that the driver support is already in place thanks to its predecessors. A complete revamp is bad news.

Heh, but with the current state of Intel drivers, would a complete revamp really be all that bad? It's pretty much a given that they've learned a fair amount from Sandy Bridge, seeing as it's their first real feedback on 3D gaming performance. Taking that and starting from scratch, rather than attempting to correct the flaws in the current driver, sure sounds like it would result in a superior platform for Ivy Bridge that they could continue to build upon.

Interested as I am to see how Ivy Bridge performs, what I'm really looking forward to is Haswell. It should be awesome.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Point taken, but we were talking about converging LRBni and AVX, which both logically split the vectors into 128-bit lanes. So while the minimum width is indeed 128-bit

As explained, the AVX2 any-to-any permute instructions truly require a 256-bit width. I just noticed in the advance program for ISSCC a description of a permute engine that may well be the one in Haswell:
http://www.miracd.com/ISSCC2012/WebAP/

see top of page 26:
"A 280mV-to-1.1V 256b Reconfigurable SIMD Vector Permutation Engine with 2-Dimensional Shuffle in 22nm CMOS"


Note that the permute instructions are very useful for speeding up some nasty cases. For example, it wasn't possible before to rotate a full 256-bit register; now, with the proper offsets, you can achieve left and right rotations of eight 32-bit elements, and a lot more.
The way it's described as "2-dimensional", I suppose it's very fast, with single-clock throughput.
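A minimal sketch of the kind of thing this enables, using the AVX2 intrinsics (VPERMPS via _mm256_permutevar8x32_ps); a full-register element rotation like this wasn't expressible as a single instruction before:

Code:
#include <immintrin.h>
#include <cstdio>

int main() {
    // Rotate 8 packed floats left by one element across the whole
    // 256-bit register with the AVX2 any-to-any permute (VPERMPS).
    __m256  v   = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i idx = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0); // rotate left
    __m256  r   = _mm256_permutevar8x32_ps(v, idx);

    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++) printf("%.0f ", out[i]);     // 1 2 3 4 5 6 7 0
    printf("\n");
    return 0;
}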

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Heh, but with the current state of Intel drivers, would a complete revamp really be all that bad? It's pretty much a given that they've learned a fair amount from Sandy Bridge, seeing as it's their first real feedback on 3D gaming performance.

Do you remember the G965? That one really was nothing but a display renderer, at least initially. Most owners couldn't accept that it could actually be slower than its predecessor when the "paper" specs were far superior. It took a year to enable hardware features, by which time the G35 had arrived, though with minimal improvements (and I mean minimal, like 0-5%). But every generation after that was a huge improvement, in all aspects.

I'm pretty confident a complete revamp (and I mean beyond something like VLIW4 to GCN, or HD Graphics 3000 to Ivy Bridge graphics) would be a lot worse than what they are doing, and a repeat of the G965-to-Sandy Bridge stretch. Sandy Bridge graphics is a lot better than most people here say; of course it isn't equivalent to the competition, but a lot of people bang their heads over things that aren't deal-breakers.

Even the Sandy Bridge issues that are pointed out in a LOT of, you know, the internet, are due to hardware, not drivers, like the lower texture quality and the lack of 23.976 playback. And guess what Ivy Bridge fixes?

And really, how many successful product launches have there been that were nearly 100% overhauls? It especially did not work for Intel: Itanium, Larrabee, and the Pentium 4. You could say something a lot different about the successors, the ones that learned from those mistakes.

Let's not ignore that Knights Corner, coming in a not-too-far-away 2013, is said to forgo all the 3D-graphics-related features that were in Knights Ferry, the original Larrabee. The chances of Haswell being based on that are very small for that reason alone.

Don't forget this either:
[Image: intel_graphics_roadmap.png]
 
Last edited:

Khato

Golden Member
Jul 15, 2001
1,206
251
136
And really, how many successful product launches have there been that were nearly 100% overhauls? It especially did not work for Intel: Itanium, Larrabee, and the Pentium 4. You could say something a lot different about the successors, the ones that learned from those mistakes.

Let's not ignore that Knights Corner, coming in a not-too-far-away 2013, is said to forgo all the 3D-graphics-related features that were in Knights Ferry, the original Larrabee. The chances of Haswell being based on that are very small for that reason alone.

Sorry, what made perfect sense given the driver tangent I was going off on does indeed take on a different meaning if viewed from the primary context of the thread. Definitely should have been clearer that I was speaking about a complete driver revamp for the GenX architecture. As you state, the basic drivers for the current architecture were started back with BWR, when 'good enough graphics' was still the mantra and little more than Windows compliance mattered. Since I don't get into the driver side of things at all, I'm quite curious whether the current drivers are really a good thing for the architecture going forward, or whether some legacy assumptions that can't be worked around will drag it down.

As for Knights Corner... I forget whether anything was explicitly said about the few 3D graphics blocks being removed from the design?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
To be more precise, I'll say where unrolling is unnecessary: there are a lot of cases with not a single spill in the unrolled version and the unrolled version still not faster; that's why I can't understand your POV on the matter
I'm sure there are plenty of anecdotal examples of that. If performance is limited by bandwidth or port contention, then unrolling won't help, even if the absence of spilling doesn't make it worse. But it's not the common case, at least once you eliminate the bandwidth bottleneck by using AVX-128 (an issue that Haswell most likely won't have). The majority of software suffers from instruction and memory latencies to some extent.
And since you use the IP/trade-secrets excuse when it's so easy to devise a convincing example once there's something concrete to show, I'm afraid I'll not learn much from you
I could post a big chunk of OpenCL code here, but what exactly do you hope to learn from that? It's not as if you can execute it without having the entire application, and I'm terribly sorry, but I cannot share these applications. Also note that the memory latencies depend on the input data. I'm sure there are open-source applications out there which use OpenCL and which could be made to run on the CPU, but that is as "easy" for you as it is for me. I'm not convinced it's worth the effort just for the purpose of this conversation.

So, with all due respect, it should simply be common knowledge to anyone familiar with assembly programming that instruction latencies matter. Bulldozer achieves worse performance than Sandy Bridge because of it. But that doesn't mean there is no room left for further improvement. AVX-1024 instructions, executed in four cycles, offer a massive improvement in latency hiding.

At the very least, I hope you don't need any convincing that it helps cover cache miss latencies, and that this would in many cases improve performance?
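To make the latency-hiding argument concrete, here's the textbook case (a toy example, not from any of the applications discussed). A reduction with a single accumulator is capped by the addition latency chain; giving the scheduler independent chains via spill-free unrolling is what hides it, and a multi-cycle AVX-1024 instruction would create that independent work implicitly:

Code:
#include <immintrin.h>

// One accumulator: every vaddps waits on the previous one, so the loop
// runs at the ADD latency (3 cycles on Sandy Bridge), not the ALU rate.
// Both functions assume n is a multiple of 16, for brevity.
float dot_chain(const float* a, const float* b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    float t[8]; _mm256_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
}

// Two independent accumulators: the chains overlap and hide part of
// the latency, with no spilling.
float dot_unrolled(const float* a, const float* b, int n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                 _mm256_loadu_ps(b + i)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(a + i + 8),
                                                 _mm256_loadu_ps(b + i + 8)));
    }
    float t[8]; _mm256_storeu_ps(t, _mm256_add_ps(acc0, acc1));
    return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
}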
The workload that scales the best in our series of 3D models for the regression tests is at a 31% speedup from HT; note that power consumption isn't significantly up with HT enabled, according to the ASUS AI Suite
Indeed, SMT helps. I wasn't contesting that. But most of what SMT does, AVX-1024 would also provide (for vector workloads), at a lower cost (and power consumption would go down instead of slightly up).

In fact, AVX-1024 is almost like running four threads with identical AVX-256 instructions in lock-step. The difference is that AVX-1024 doesn't require four times the instructions (also think of the scalar code in between), and because it guarantees lock-step execution, the memory accesses are more regular, so your cache efficiency goes up. So if you like SMT, you should definitely be in favor of AVX-1024!

And they can still be used together. But note that if 2-way SMT offers 30%, then 4-way SMT may only offer 5% on top of that. That's just not worth doubling the associated resources. AVX-1024 has a much better cost/gain ratio.
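To illustrate the sequencing idea (hypothetical, of course; nothing about AVX-1024 is disclosed): one 1024-bit FMA would be decoded once, then stepped over the existing 256-bit unit as four slices on four consecutive cycles, which is exactly the four-lock-step-threads picture:

Code:
#include <immintrin.h>

// Hypothetical model of AVX-1024 on 256-bit hardware: the front-end
// sees ONE instruction; execution steps through four 256-bit slices.
struct v1024 { __m256 slice[4]; };

v1024 fma1024(const v1024& a, const v1024& b, const v1024& c) {
    v1024 r;
    for (int cycle = 0; cycle < 4; ++cycle)          // one slice per cycle
        r.slice[cycle] = _mm256_fmadd_ps(a.slice[cycle],  // FMA3, as in
                                         b.slice[cycle],  // Haswell's AVX2
                                         c.slice[cycle]);
    return r;
}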
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
That's 14nm, and it's certainly not a lot of chip area, even if we get 512-bit wide registers, which I'm expecting as the next natural step
The problem is that with 512-bit execution it's not just the register space that has to increase, but also every data bus, every ALU, every load/store port; cache bandwidth has to be four times what we have today; and so on. That's a far bigger investment than what is needed for AVX-1024 executed on 256-bit units.

It's possible that at some point in the future the execution width will increase to 512-bit, but not likely before we get AVX-1024 (or AVX-512) executed in multiple cycles.
Don't forget that AVX-1024 isn't disclosed yet, and remember the FMA4 debacle
Again, Intel disclosed that "AVX is designed to support 512 or 1024 bits in the future". They wouldn't say stuff like that if they didn't intend to at least implement it by executing these wide vectors in multiple cycles.

And AMD disclosed that it will support FMA3. So nothing is stopping them from implementing AVX-1024 either, once they've completed AVX2 support.
You seem to think the register file takes a lot of chip area; are you sure?
No, I'm not worried about the effect on the total chip size at all. The problem is that register files have to service many reads and writes simultaneously and still be very fast, and this is affected by their size. It also affects the layout of other structures which are critical to performance. Anyway, it helps a lot to have a smaller process.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Definitely should have been clearer that I was speaking about a complete driver revamp for the GenX architecture.

That does make a whole lot of difference then. Maybe the comparison is like how Intel didn't overhaul the clocking on the 6-series chipset, so it didn't have the ability to properly output 24p. Do we trust Intel to do everything again from scratch, though?

As for the Knights Corner... I forget whether there was anything explicitly said about it having the few 3D graphics blocks removed from the design?
There were texture units on KF, and they took about 1/6th of the die. To be honest, they never directly said it's taken out on KC, but with the whole repurposing towards HPC, and the iGPU team saying things about fixed function being better in some cases, doesn't that make it more or less a done deal?

Oh, and maybe I read something about that on RWT and in other technical discussions, but I can't find it now. I will keep it in mind, though. :p
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
The majority of software suffers from instruction and memory latencies to some extent.

Sure, it's critical for branchy scalar integer code, but not really for throughput-oriented code, where instruction throughput and memory bandwidth are far more important limiters; I'm sure you know that.

For example, if you have a kernel dominated by low-throughput instructions such as VSQRTPS/PD and VDIVPS/PD, throughput is your key limiter, something that your idea for AVX-1024 will not improve at all, unlike SMT, which allows you to execute instructions from another thread and thus minimize the impact of these low-throughput instructions.
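For illustration, a toy kernel of mine along those lines; the divider isn't fully pipelined, so the VSQRTPS/VDIVPS throughput sets the speed no matter how the surrounding work is scheduled:

Code:
#include <immintrin.h>

// Normalizing 2D vectors: one VSQRTPS and two VDIVPS per 8 elements.
// These low-throughput instructions dominate; wider or better-scheduled
// ops over the same divider wouldn't make it faster.
void normalize(const float* x, const float* y,
               float* nx, float* ny, int n) {      // n multiple of 8
    for (int i = 0; i < n; i += 8) {
        __m256 vx  = _mm256_loadu_ps(x + i);
        __m256 vy  = _mm256_loadu_ps(y + i);
        __m256 len = _mm256_sqrt_ps(_mm256_add_ps(          // VSQRTPS
            _mm256_mul_ps(vx, vx), _mm256_mul_ps(vy, vy)));
        _mm256_storeu_ps(nx + i, _mm256_div_ps(vx, len));   // VDIVPS
        _mm256_storeu_ps(ny + i, _mm256_div_ps(vy, len));   // VDIVPS
    }
}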


Bulldozer achieves worse performance than Sandy Bridge because of it.

I have not tested my code on Bulldozer, so I can't comment on it, but from the data points I have seen, I'd say throughput-oriented code suffers mostly from the lackluster cache bandwidth.


At the very least, I hope you don't need any convincing that it helps cover cache miss latencies, and that this would in many cases improve performance?

Only if the cache line size is increased to 128 bytes, I'd say; otherwise you'll stall more often than with 4 independent 32-byte operations. Without studying a simulation, I suppose we can't say anything conclusive.

Indeed, SMT helps. I wasn't contesting that. But most of what SMT does, AVX-1024 would also provide (for vector workloads), at a lower cost
In some cases, yes, but not all; think of the examples with low-throughput instructions.

So if you like SMT, you should definitely be in favor of AVX-1024!
I'm not against it; I just said I don't think it's likely, as the next logical step, to have it executed over 256-bit hardware (i.e. not improving the theoretical peak throughput per core), for all the reasons given above in this thread.

But note that if 2-way SMT offers 30%, then 4-way SMT may only offer 5% on top of that

Where did you find this 5% figure? One may ask why POWER7, MIC, Oracle Tx, ... all come with 4 hardware threads per core or more.
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
It's possible that at some point in the future the execution width will increase to 512-bit, but not likely before we get AVX-1024 (or AVX-512) executed in multiple cycles.

I don't see why. I will consider it a big fail if the theoretical peak per core is unchanged in Skylake vs. Haswell; they can't pull the FMA trick another time, as you said.

If the 22nm MIC can have 64 full-width 512-bit units, I don't really see why a (16-core) 14nm Skylake can't have 32 full-width 512-bit units (2 units per core).
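Putting rough numbers on it (core counts, unit counts, and clocks are all my guesses; single precision):

Code:
#include <cstdio>

int main() {
    // FLOPs/cycle = units * 16 SP lanes (512-bit) * 2 (FMA).
    // Every figure below is speculative.
    double knc = 64 * 1 * 16 * 2 * 1.2e9; // 64 cores, 1 unit, ~1.2GHz
    double sky = 16 * 2 * 16 * 2 * 3.0e9; // 16 cores, 2 units, ~3GHz
    printf("22nm MIC guess:     %.1f TFlops\n", knc / 1e12); // ~2.5
    printf("14nm 16-core guess: %.1f TFlops\n", sky / 1e12); // ~3.1
}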

Again, Intel disclosed that "AVX is designed to support 512 or 1024 bits in the future". They wouldn't say stuff like that if they didn't intend to at least implement it by executing these wide vectors in multiple cycles.
Public intent and products are two different things; we all know that plans may change due to changing market conditions. Anyway, I'm not against this since I see AVX-512 as the most likely next step; I just find it rather unlikely to see AVX-1024 introduced at the same time as AVX-512, and I can't really imagine AVX-512 being introduced without increasing the peak throughput.
 
Last edited:

Khato

Golden Member
Jul 15, 2001
1,206
251
136
That does make a whole lot of difference then. Maybe the comparison is like how Intel didn't overhaul the clocking on the 6-series chipset, so it didn't have the ability to properly output 24p. Do we trust Intel to do everything again from scratch, though?

Haha, quite true! But then again, perhaps drivers have been such an issue for them because they're dealing with the remnants of a 'good enough' driver stack? Maybe I'm just being optimistic, but Intel certainly seems to be making rapid progress on features as well as performance (e.g. the one screenshot of HD2500 demonstrating decent anisotropic filtering).

There were texture units on KF, and they took about 1/6th of the die. To be honest, they never directly said it's taken out on KC, but with the whole repurposing towards HPC, and the iGPU team saying things about fixed function being better in some cases, doesn't that make it more or less a done deal?

Oh, and maybe I read something about that on RWT and in other technical discussions, but I can't find it now. I will keep it in mind, though. :p

I forget, was there a more recent update on Larrabee than this blog? Since that merely states no discrete graphics in the short term, and point #4 states that research will continue. Given that, I'd expect the KNC design to have been mostly complete at the time of that article; unless the decision to drop graphics had been made a while ago, it probably was in the design already. And if they're intending to continue research into the possibility, then it'd likely make sense to keep the texture sampling in the design so as to be able to run graphics testing on actual silicon. Yeah, a bunch of postulation :) I guess the other query there would be whether or not any of the texture sampling hardware could actually be useful for HPC?
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I think we're giving too much attention to a resume. About 2 years ago, there was a slide that said 12 cores max on Ivy Bridge EX, and 16 for Haswell. I assume targets change as the product gets closer to reality.
That resume appears to be more recent than the slide you're referring to. And besides, Ivy Bridge is much less of a throughput architecture than Haswell will be. So only the core count of the latter is relevant to this discussion.
How do you know that? Stampede, the 2013 supercomputer that uses Knights Corner, has 10 PFlops, 8 of which come from KC. The successor to Knights Corner is said to increase Stampede's output to 15 PFlops or more, suggesting the co-processor part will double.
You missed my point. Thanks to FMA, Haswell will likely double the peak GFLOPS at a relatively minor increase in transistor budget per core. In contrast, Knights Corner's successor can't double the throughput without pretty much doubling the transistor cost.
Assuming Haswell EX is a 16-core part @ 3GHz, it'll have a theoretical peak of 768 GFlops. 2x Knights Corner, using my estimates, will result in 2.4 TFlops. That's still a big gap.
That's a meaningless comparison without knowing the die size(s). A dual-socket system would put the CPU back in the picture, while a quad-socket system would exceed the performance of two Knights Corner chips.

And besides the total system cost, which option is most attractive also depends on effective performance. GPUs typically only achieve part of their peak throughput in real-world applications, and I assume the MIC is no different. CPUs are much better at dealing with intermittent scalar/sequential code, and they're also better at handling irregular memory accesses.
Yes, comparing DDRx to GDDR5 and its derivatives, sure. Four channels of DDR4-3200 will result in about 100GB/s, but the GDDR5 in the Radeon HD 6970 is already at 175GB/s. Also, in practice, 4 DDR4 channels will remain closer to 50GB/s than 100GB/s, because DDR generations only reach their maximum frequencies at the end of their lifetime. We're also talking about the EX, which uses very conservative speeds. But let's say it uses DDR4-2133, for 68GB/s.

Knights Ferry had caches way bigger than regular GPUs' too. Of course, not as big as the EX's.
Again, keep multi-socket systems in mind. Also, the need for RAM bandwidth depends a lot on the number of threads and the available cache memory. With fewer threads, a larger part of the working set can fit into the cache and there are fewer evictions caused by contention.
Most importantly, it's a co-processor that's not designed to replace Xeons entirely. And you seem to be right that it doesn't make sense. But who knows what Intel's real goal is?
Knights Corner is a remnant of the Larrabee project. Its main goal appears to be to recover some of the investment in a high-margin market. Other than that, it serves as a nice proving ground for high-throughput technology; for instance, Haswell will very likely borrow the same gather implementation. But in the long run I don't see how the convergence would leave much of a future for the MIC.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Sandy Bridge achieves 8GFlops/EU @ 1GHz, and Ivy Bridge is said to double that with enhanced co-issue.
As far as I know there can only be co-issue of another FMA when there's no transcendental operation. And I doubt it can do transcendentals at the same rate as FMA. Hence I'd be surprised if the peak FLOPS per EU doubled.
But even if it doubled again, would it really matter for compute? Would they bother to stick something like DP units in there? You don't see high DP performance even on enthusiast cards, only on workstation variants. Why would they do it differently on integrated?

And how much room would there be to stick it in the non-PC parts like the EPs and EXs? Are there enough gains to be worth the trouble?
I don't think general purpose compute on Ivy Bridge's IGP will be much more relevant than it was on Sandy Bridge, if that's what you're asking. And with a doubling of the CPU throughput with Haswell, there just isn't any time to create momentum for GPGPU.

But GPUs do still have a significant advantage in terms of power efficiency, which is why I expect Intel to implement something like AVX-1024 after Haswell.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Haha, quite true! But then again, perhaps drivers have been such an issue for them because they're dealing with the remnants of a 'good enough' driver stack? Maybe I'm just being optimistic, but Intel certainly seems to be making rapid progress on features as well as performance (e.g. the one screenshot of HD2500 demonstrating decent anisotropic filtering).

Yes, but I'd like to point out the AF improvement is due to hardware. Sandy Bridge's hardware isn't capable of doing the same. I guess better drivers can improve compatibility and reduce issues like artifacts.

The biggest issues pointed out with Sandy Bridge seem to be 23.976 not actually being 23.976, image quality (most of which comes down to AF), and compatibility. The first two are definitely brought on par (OK, the first isn't perfect, but neither is the competition). For the third point, Portal 2 was one example of big developer support for Sandy Bridge's iGPU, while Carmack suggests future integrated graphics, even Intel's, will be acceptable. The recognition brought on by Sandy Bridge, and dev tools like GPA, will improve compatibility. Even if just for a tick in the box, Sandy Bridge's GPU is certified for 50 games, while Ivy Bridge will be at 100.

(Intel brought out GPA around the time of Ironlake graphics. I've said it would do a lot in getting developers to optimize for Intel graphics, not just AMD/Nvidia, not least because it's a darn good, comprehensive tool. And we see Sandy Bridge is far better than its predecessor in terms of games being able to run and play.)

I forget, was there a more recent update on Larrabee than this blog? Since that merely states no discrete graphics in the short term, and point #4 states that research will continue.
Do we expect Intel to really enter the discrete graphics market? I mean, other than professional 3D workstation parts? And if we go by rumors, the architecture is a brand new one. Kirk, their Data Center guy, has said that while per-thread performance is behind, future variants will use Atom cores and will get closer to Xeons. I might as well predict it exists not only to go against Nvidia in GPGPU, but also against future many-core ARM servers (like Calxeda).

I think this is the most on-topic thread I've read recently. :D
 
Last edited:

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
That is highly surprising considering that Haswell is confirmed to add AVX2 support, which includes 'gather' and fused multiply-add instructions. That makes it practically equivalent to the Larrabee instruction set. So it doesn't make sense to have CPU and IGP cores with instruction sets which are very similar, yet not identical.

When GPUs started to have similar vertex and pixel processing capabilities, their cores unified. The primary thing that is lacking from AVX2 is power efficiency. But that can be solved with AVX-1024: The AVX encoding is known to be extendable to 1024-bit instructions. By executing them in four cycles on the existing 256-bit SIMD units, the power-hungry front-end of the CPU can be clock gated to achieve considerable power savings and rival the efficiency of a GPU.

.... still working on my PhD to understand this one ;]
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
.... still working on my PhD to understand this one ;]

Note that the link provided by CPUarchitect
http://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/
isn't written by an Intel employee and doesn't look like a reliable source for discussing AVX and the VEX encoding; for example, the author states that AVX can work only in 64-bit mode, which is factually wrong.

BTW, the timings he provides, with a perfect 2x speedup from SSE to AVX, don't match at all what we see in practice, probably due to a very poorly optimized SSE version.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Intel® AVX: New Frontiers in Performance Improvements and Energy Efficiency

thanks, ""May 2008", nice memories, FMA was also announced with 4 operands in these ancient days (*)

I see that they were promising no more than AVX-512 for packed integers:
"
Future Vector Integer support to 256 and 512 bits.
"

Also note how it was phrased for AVX-1024, "even 1024 bits", as if 512 bits were more likely.


Out of curiosity, I just checked the latest Intel® Advanced Vector Extensions Programming Reference (319433-011) and there is no mention of 512-bit or 1024-bit variants; in fact, on page 4-3 it's stated that
"
Vector length encoding: This 1-bit field represented by the notation VEX.L. L= 0
means vector length is 128 bits wide, L=1 means 256 bit vector. The value of this
field is written as VEX.128 or VEX.256 in this document to distinguish encoded
values of other VEX bit fields.
"
I was thinking there was a provision for more lengths here; my bad.
So the AVX-512 and AVX-1024 encoding isn't disclosed so far, right? It would have to use some other reserved bits; do you know which ones? I mean, do you have a concrete idea of the encoding for these?
IMO it would have been cleaner to have a 2-bit field for the vector length if they were planning 4 different lengths from the start, don't you think?

* : Intel® Advanced Vector Extensions Programming Reference, March 2008 (319433-002)
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I don't see why. I will consider it a big fail if the theoretical peak per core is unchanged in Skylake vs. Haswell; they can't pull the FMA trick another time, as you said.
Like I said before, better power efficiency can also be traded for better performance. Note for instance that the Sandy Bridge-E disables two cores so it doesn't exceed the TDP target. By saving 25% of power per core it would have been able to enable those cores and achieve 33% higher performance (assuming good yield). Alternatively they could clock things higher, or do a bit of both.
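The arithmetic behind that, using Sandy Bridge-E's 8-on-die/6-active configuration:

Code:
#include <cstdio>

int main() {
    // Sandy Bridge-E ships with 6 of 8 cores active to stay within TDP.
    // At 25% less power per core, all 8 fit the same power budget:
    printf("power: %.2fx the 6-core budget\n", 8 * 0.75 / 6.0);  // 1.00
    // And 8 cores instead of 6 is a 33% throughput gain:
    printf("performance: +%.0f%%\n", (8.0 / 6.0 - 1.0) * 100);   // +33
}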

You need to realize that at 14 nm and below, transistors will be incredibly cheap, so adding more cores isn't much of a problem. The real problem is the increase in power consumption this would bring. So anything that improves the clock gating opportunities is a very attractive proposition.

Also, while AVX-1024 executed on 256-bit units indeed wouldn't increase peak performance per core per clock, in reality that peak performance is hardly ever reached. At least part of that is to blame on cache misses and imperfect scheduling, and AVX-1024 does help with those, so it achieves higher effective performance per core per clock.
If the 22nm MIC can have 64 full-width 512-bit units, I don't really see why a (16-core) 14nm Skylake can't have 32 full-width 512-bit units (2 units per core).
The MIC is an in-order architecture at a relatively low clock frequency. And it has just one 512-bit unit per core, while you're asking for two per core on a much more complex out-of-order architecture at a higher clock frequency. And since Sandy Bridge actually has three 256-bit units it can be argued that any 512-bit design would also need three units, not two, to avoid reducing the performance for legacy workloads. And you also need twice the bandwidth per core compared to the MIC.

The ALUs themselves are really quite small, so that's not the problem. The real problem is getting data in and out of them while still meeting timing constraints. And with a superscalar out-of-order architecture at high frequency that's a whole lot harder. Even at 14 nm it just isn't realistic to expect 512-bit units.

If you want higher throughput, just add more cores. Sandy Bridge-E has 8 cores at 32 nm, so at 14 nm they could pack 32 cores into the same area. But like I said above, you need AVX-1024 executed in four cycles to keep the power consumption acceptable.
I'm not against this since I see AVX-512 as the most likely next step; I just find it rather unlikely to see AVX-1024 introduced at the same time as AVX-512, and I can't really imagine AVX-512 being introduced without increasing the peak throughput
I hope that by now it has become clear that peak throughput per core per clock doesn't determine total effective throughput. You have to take into account how many cores can be used, what their switching activity will be like, and how efficient they will be. If you look at the big picture, AVX-1024 executed on 256-bit units is far more likely to be implemented long before they'd consider adding 512-bit units.

One last thing to keep in mind is that Intel is going to use the same core architecture for everything ranging from Xeons in supercomputers to low-power chips for ultrabooks. In particular the latter would make having 512-bit units impossible to justify, while the Xeons can simply have lots and lots of cores.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
The real problem is the increase in power consumption this would bring.

Sure, hence all the research on ultra-low-voltage operation; look at Intel's forthcoming disclosures at ISSCC 2012 in a few days.
Note that Haswell will enjoy twice the theoretical peak throughput of Ivy Bridge per core on the very same process (including twice the cache bandwidth, your words), assuming the 2 FMA units per core, which are much needed. It's not a stretch to think it can be doubled again 2 years later on a process more than 2x denser, and very probably at lower voltage, particularly given Intel's push toward exascale. And yes, I know that mobile parts will have fewer cores than server parts.

If you look at the big picture, AVX-1024 executed on 256-bit units is far more likely to be implemented long before they'd consider adding 512-bit units.

As already said, I feel it's *very unlikely*; now it will be interesting to see what other people have to say (for example, a poll among software/hardware guys).

Anyway, we will see one day which one of us was better at predicting the future, but I'm afraid we will have to wait a couple more years to find out. BTW, have you already predicted something controversial right in the past?
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
thanks, ""May 2008", nice memories, FMA was also announced with 4 operands in these ancient days (*)

I see that they were promising no more than AVX-512 for packed integers:
"
Future Vector Integer support to 256 and 512 bits.
"

Also note how it was phrased for AVX-1024, "even 1024 bits", as if 512 bits were more likely.
Yes, that's indeed oddly phrased. But as you noted yourself, the FMA specification has changed in the meantime, and it will have been 7 years since that announcement before we can expect the register width to double again. So they've had lots of time to evaluate all options. The only thing that is certain is that the ISA is extendable up to 1024-bit, and integer operations should be no exception.
Out of curiosity, I just checked the latest Intel® Advanced Vector Extensions Programming Reference (319433-011) and there is no mention of 512-bit or 1024-bit variants; in fact, on page 4-3 it's stated that
"
Vector length encoding: This 1-bit field represented by the notation VEX.L. L= 0
means vector length is 128 bits wide, L=1 means 256 bit vector. The value of this
field is written as VEX.128 or VEX.256 in this document to distinguish encoded
values of other VEX bit fields.
"
I was thinking there was a provision for more lengths here; my bad.
So the AVX-512 and AVX-1024 encoding isn't disclosed so far, right? It would have to use some other reserved bits; do you know which ones? I mean, do you have a concrete idea of the encoding for these?
IMO it would have been cleaner to have a 2-bit field for the vector length if they were planning 4 different lengths from the start, don't you think?
Page 4-4 states that "The 3-byte VEX leaves room for future expansion with 3 reserved bits. REX and the 66h/F2h/F3h prefixes are reclaimed for future use."

Yes, it would have been "cleaner" to have a 2-bit field for the vector length, but it really doesn't matter from a hardware perspective; it will only annoy compiler writers. Not pinning down a second bit from the start allows them to still choose between using one of the reserved bits or a prefix. Also, the last VEX byte doesn't have room for 2 length bits anyway.
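For reference, the documented 3-byte VEX field layout, plus a toy decode of the existing length bit; which reserved bit a future AVX-512/AVX-1024 would claim is exactly the part that isn't disclosed:

Code:
#include <cstdio>

int main() {
    // 3-byte VEX prefix, as documented:
    //   byte 0: 0xC4
    //   byte 1: ~R ~X ~B | m-mmmm  (m-mmmm selects the 0F/0F38/0F3A
    //           opcode maps; most of its 32 encodings are unused)
    //   byte 2: W | ~vvvv | L | pp (L: 0 = 128-bit, 1 = 256-bit;
    //           pp encodes the implied 66/F2/F3 prefix)
    unsigned char vex[3] = {0xC4, 0xE2, 0x4D};  // example bytes only
    unsigned L  = (vex[2] >> 2) & 1;
    unsigned pp =  vex[2]       & 3;
    printf("VEX.L = %u -> %u-bit vector, pp = %u\n", L, L ? 256u : 128u, pp);
    return 0;
}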
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Sure, hence all the research on ultra-low-voltage operation; look at Intel's forthcoming disclosures at ISSCC 2012 in a few days.
Near-threshold computing is an order of magnitude slower. It's great for leaving your system idle for days, but doesn't actually help reduce power for high throughput situations.
Note that Haswell will enjoy twice the theoretical peak throughput of Ivy Bridge per core on the very same process (including twice the cache bandwidth, your words), assuming the 2 FMA units per core, which are much needed. It's not a stretch to think it can be doubled again 2 years later on a process more than 2x denser, and very probably at lower voltage, particularly given Intel's push toward exascale. And yes, I know that mobile parts will have fewer cores than server parts.
Like I said before, it's only viable when there's a significant shift in workloads (i.e. the majority of software uses vectorization successfully). But it takes many years before the majority of systems will even be capable of AVX2. So it presents a significant risk for Intel to widen the SIMD units to 512-bit in the 2015 timeframe. Unlike AVX-1024 executed on 256-bit units, it takes a significant number of transistors, which could otherwise have been used for additional cores or another feature. And due to the routing difficulties it would probably cost either speed or additional power consumption (due to larger transistors needed to meet the timing demands).

That's a very risky gamble, with little to gain in terms of effective performance once you take power limits and efficiency into account. In contrast, with my proposal there is practically no risk. If the adoption of AVX-1024 is slow, nothing is lost. And if it is adopted fairly quickly, they've still got plenty of time to widen the execution units before that becomes a better alternative than adding more cores (if ever).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
The only thing that is certain is that the ISA is extendable up to 1024-bit,
Sure, since a single extra bit will be enough to select AVX-512 and AVX-1024, and as you say below there are 3 free bits plus the reserved prefixes.

and integer operations should be no exception.
Indeed, there is no technical reason to limit the width of the integer instructions. I really hope that from now on (AVX2) the integer instruction width will stay in sync with the FP instructions, as has been the case on MIC since KF.

Page 4-4 states that "The 3-byte VEX leaves room for future expansion with 3 reserved bits. REX and the 66h/F2h/F3h prefixes are reclaimed for future use."
So the exact encoding for AVX-512 and AVX-1024 isn't disclosed yet, and we can't really talk at the moment as if it's something that will come for sure.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Near-threshold computing is an order of magnitude slower. It's great for leaving your system idle for days, but doesn't actually help reduce power for high throughput situations.

This one doesn't look 10x slower, for example (page 26 of the advance program):

"10.3: A 1.45GHz 52-to-162GFLOPS/W Variable-Precision Floating-Point Fused Multiply-Add Unit with Certainty Tracking in 32nm CMOS"


Like I said before, it's only viable when there's a significant shift in workloads (i.e. the majority of software uses vectorization successfully).

Good point, though IIRC you were mentioning that, thanks to gather, a lot of code will be vectorizable (not that I buy the argument much).

But it takes many years before the majority of systems will even be capable of AVX2. So it presents a significant risk for Intel to widen the SIMD units to 512-bit in the 2015 timeframe.
I don't really see the risk, since it's easy to clock gate (or even power gate, AFAIK) unused lanes, as is already the case on Sandy Bridge for the upper 128 bits. It will also take less chip area to double the SIMD width than to double the number of cores for the same peak FLOPS.

Unlike AVX-1024 executed on 256-bit units,
It looks like your idea would be more difficult to design and validate than a simple doubling of the SIMD width, but I'm a software guy.

In contrast, with my proposal there is practically no risk.
There is the risk that the competition (IBM, Fujitsu, Oracle) reaches the market with far more powerful solutions before you do, because you stopped increasing the peak FLOPS.
 
Last edited: