Intel "Haswell" Speculation thread


denev2004

Member
Dec 3, 2011
105
1
0
Sandy Bridge has a 256-bit multiplier (MUL) and 256-bit adder (ADD) unit. If Haswell had only one 256-bit fused multiply-add (FMA) unit, it would severely hurt floating-point performance. Legacy software doesn't use the FMA instructions, so only one MUL or ADD can be executed each clock cycle, instead of both simultaneously. And even software which does use FMA won't be able to achieve the same performance as Sandy Bridge because FMA requires a dependent MUL and ADD, which isn't always the case. Also given that gather is all about throughput computing it wouldn't make sense to cripple performance in the execution units. It would also fly against the goal of achieving higher performance/Watt.

Note also that Bulldozer features two 128-bit FMA units per module, on a 32 nm process. So it won't be an issue for Intel to equip Haswell with two 256-bit FMA units on 22 nm. The 256-bit paths are already there in Sandy Bridge.

So there should be no doubt that Haswell will feature two 256-bit FMA units, thereby doubling the peak throughput.

Knight's Corner is an in-order architecture aimed at the HPC market. It doesn't compete against desktop CPUs and it doesn't (have to) support legacy applications that make use of SSE. So its design is not an indication that Haswell would only have a single FMA unit.
Won't there be a problem in the memory subsystem, given how much data has to be read to feed two widened 256-bit AVX2 FMA units? At the very least, the power cost should increase a lot.

I once heard some people saying an FMA+MUL or FMA+ADD configuration would be helpful. Kinda like that idea now :)

It just sounds strange if a CPU aimed at the HPC market has the same number of execution units per core as a PC CPU....
 

denev2004

Member
Dec 3, 2011
105
1
0
That's actually a good thing, and it means fewer watts per core. Remember desktop Sandy was 4 cores max and Haswell is 6.
Are you sure?

Up to 2~6 cores available in consumer market
35, 45, 65, 95W TDP for desktop processors.

Is the consumer market strictly the same thing as desktop processors?
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
The "Haswell New Instructions" look particularly delicious, though.

That is going to be the best part about Haswell IMO. I feel that these new instructions could be some of the best ever from Intel.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Do you have a link? I haven't seen this mentioned.

I hope it's not from Wikipedia. They've got a lot of baseless claims about Haswell in there.

We can't rule it out, of course. Perhaps they'll get it as a top SKU. It's like when they announced the quad-core Kentsfield by surprise.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I am pretty sure AMD's Next Generation Core will push the boundaries even further.
Push what boundaries? The computational density of GPU architectures has been going down for the last several architectures. They have to invest ever more transistors into creating a fully programmable design that can handle more diverse workloads.

With the CPU's computational density for parallel workloads going up instead, it's clear they're converging ever closer together. So it's impossible to conceive of a future for GPGPU.
 

CHADBOGA

Platinum Member
Mar 31, 2009
2,135
832
136
Push what boundaries? The computational density of GPU architectures has been going down for the last several architectures. They have to invest ever more transistors into creating a fully programmable design that can handle more diverse workloads.

With the CPU's computational density for parallel workloads going up instead, it's clear they're converging ever closer together. So it's impossible to conceive of a future for GPGPU.
Don't tell piesquared that.

He has his heart set on GPGPU being AMD's secret weapon to slay Intel. :awe:
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Won't there be a problem in the memory subsystem, given how much data has to be read to feed two widened 256-bit AVX2 FMA units? At the very least, the power cost should increase a lot.
Not really. The Core 2 Duo doubled the L2 cache bandwidth and doubled the SIMD processing width (and added x86-64 support), and yet it hardly consumed more power than the Core Duo on the same process node. So increasing the vector throughput really doesn't have much of an impact on a CPU's power consumption. Note that GPUs have even wider vectors and many more execution units, yet all the data traffic this generates is no particular cause for concern.

All that matters is performance per Watt. With AVX2, performance will go up considerably, while the peak power consumption shouldn't be much higher than Ivy Bridge. Note that Ivy Bridge will be exceptionally power efficient thanks to the 22 nm Tri-Gate technology. And since the process will have matured by the time Haswell goes into mass production, I'm expecting the power consumption to be practically the same.
I once heard some people saying an FMA+MUL or FMA+ADD configuration would be helpful. Kinda like that idea now :)
No, they're both unattractive. An FMA unit can execute three types of instructions, while a MUL or ADD unit can only execute one type. It's better to have an FMA unit execute a MUL or ADD than not use a unit at all in two out of three cases. Note that all current GPU designs only have FMA units. GF104 even has three of them, not any other combination of FMA/MUL/ADD.
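
To make that concrete, here's a minimal intrinsics sketch (nothing Haswell-specific, just AVX/FMA as compilers expose it today; build with something like gcc -mavx -mfma). The point is that a MUL is just an FMA with an addend of zero and an ADD is an FMA with a multiplier of one, so a dedicated MUL or ADD unit adds nothing an FMA unit can't already do:

Code:
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(3.0f);
    __m256 b = _mm256_set1_ps(4.0f);
    __m256 c = _mm256_set1_ps(5.0f);

    /* The three instruction types one FMA unit can execute: */
    __m256 fma = _mm256_fmadd_ps(a, b, c);                    /* a*b + c        */
    __m256 mul = _mm256_fmadd_ps(a, b, _mm256_setzero_ps());  /* a*b + 0 == a*b */
    __m256 add = _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), c); /* a*1 + c == a+c */

    float f[8], m[8], s[8];
    _mm256_storeu_ps(f, fma);
    _mm256_storeu_ps(m, mul);
    _mm256_storeu_ps(s, add);
    printf("fma=%.1f mul=%.1f add=%.1f\n", f[0], m[0], s[0]); /* 17.0 12.0 8.0 */
    return 0;
}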

The thing is, an FMA unit by itself isn't all that expensive (GPUs have tons of them). The relatively expensive part is all the routing between the register file and the ALUs and the forwarding network between ALUs. With an FMA+MUL configuration you'd have five input operands and two output operands to route. To have two FMA units you only need to route 8 instead of 7 in total. It would be a real shame to not go that extra step and end up with a lot of hardware that is only rarely used. Note again that Bulldozer already has two 128-bit FMA units at 32 nm. Widening them to 256-bit shouldn't be much of a problem at 22 nm.

Another practical issue with an asymmetric configuration is how do you dispatch instructions between the execution units? In an FMA+MUL configuration, when do you execute a MUL instruction on the FMA unit and when on the MUL unit? There isn't much time for making complex load balancing decisions this deep into the pipeline. A symmetric configuration makes it a lot simpler.
It just sounds strange if a CPU aimed at the HPC market has the same number of execution units per core as a PC CPU....
Knight's Corner has many more cores than Haswell. Also note that Knight's Corner has only one scalar unit per core, while Haswell should have three like previous designs. So the desktop CPU becoming capable of high-throughput workloads doesn't make it any less suited for legacy workloads. Furthermore, HPC products are aimed at software that contains a lot of explicit parallelism (using large vectors and matrices). AVX2 should also be useful for extracting parallelism from code that is seemingly scalar, by having compilers auto-vectorize loops that have independent iterations. So it's similar hardware technology, but it's put to use in different cases.
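
To illustrate that last point, here's a minimal sketch of the kind of "seemingly scalar" loop a compiler could auto-vectorize (the function name is made up for the example). Targeting AVX2/FMA, e.g. gcc -O3 -mavx2 -mfma, the loop body would be expected to compile down to 256-bit FMA operations processing eight floats per iteration, because every iteration is independent:

Code:
#include <stddef.h>

/* c[i] depends only on a[i], b[i] and c[i], so iterations are independent
   and the compiler is free to process 8 floats at a time with 256-bit FMA. */
void muladd_arrays(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i] + c[i];
}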
 

janas19

Platinum Member
Nov 10, 2011
2,352
1
0
Not really. The Core 2 Duo doubled the L2 cache bandwidth and doubled the SIMD processing width (and added x86-64 support), and yet it hardly consumed more power than the Core Duo on the same process node. So increasing the vector throughput really doesn't have much of an impact on a CPU's power consumption. Note that GPUs have even wider vectors and many more execution units, yet all the data traffic this generates is no particular cause for concern.

All that matters is performance per Watt. With AVX2, performance will go up considerably, while the peak power consumption shouldn't be much higher than Ivy Bridge. Note that Ivy Bridge will be exceptionally power efficient thanks to the 22 nm Tri-Gate technology. And since the process will have matured by the time Haswell goes into mass production, I'm expecting the power consumption to be practically the same.

No, they're both unattractive. An FMA unit can execute three types of instructions, while a MUL or ADD unit can only execute one type. It's better to have an FMA unit execute a MUL or ADD than not use a unit at all in two out of three cases. Note that all current GPU designs only have FMA units. GF104 even has three of them, not any other combination of FMA/MUL/ADD.

The thing is, an FMA unit by itself isn't all that expensive (GPUs have tons of them). The relatively expensive part is all the routing between the register file and the ALUs and the forwarding network between ALUs. With an FMA+MUL configuration you'd have five input operands and two output operands to route. To have two FMA units you only need to route 8 instead of 7 in total. It would be a real shame to not go that extra step and end up with a lot of hardware that is only rarely used. Note again that Bulldozer already has two 128-bit FMA units at 32 nm. Widening them to 256-bit shouldn't be much of a problem at 22 nm.

Another practical issue with an asymmetric configuration is how do you dispatch instructions between the execution units? In an FMA+MUL configuration, when do you execute a MUL instruction on the FMA unit and when on the MUL unit? There isn't much time for making complex load balancing decisions this deep into the pipeline. A symmetric configuration makes it a lot simpler.

Knight's Corner has many more cores than Haswell. Also note that Knight's Corner has only one scalar unit per core, while Haswell should have three like previous designs. So the desktop CPU becoming capable of high-throughput workloads doesn't make it any less suited for legacy workloads. Furthermore, HPC products are aimed at software that contains a lot of explicit parallelism (using large vectors and matrices). AVX2 should also be useful for extracting parallelism from code that is seemingly scalar, by having compilers auto-vectorize loops that have independent iterations. So it's similar hardware technology, but it's put to use in different cases.

+1
 

Sweepr

Diamond Member
May 12, 2006
5,148
1,142
131
I'm personally expecting a ~10-15% increase in IPC over Ivy Bridge (~15-20% over Sandy Bridge) with a focus on low power consumption and closing the gap in IGP performance with AMD APUs. The IPC increase + >4GHz stock clock for the most powerful quad-core parts should be enough to convince LGA1156/1366 and even some LGA1155 users to upgrade.
 

jpiniero

Lifer
Oct 1, 2010
14,605
5,225
136
Yeah, I wouldn't expect 6 cores+GPU on Haswell. It's pretty likely that the increase in TDP is due to the new features and the 95W part could also have higher clock speeds.
 

Obsoleet

Platinum Member
Oct 2, 2007
2,181
1
0
I'm personally expecting a ~10-15% increase in IPC over Ivy Bridge (~15-20% over Sandy Bridge) with a focus on low power consumption and closing the gap in IGP performance with AMD APUs. The IPC increase + >4GHz stock clock for the most powerful quad-core parts should be enough to convince LGA1156/1366 and even some LGA1155 users to upgrade.

The question is, closing the gap with today's AMD APUs? Or closing the gap with AMD's APUs in 2013?
Just a thought.

My i7 640M has serviceable video already. IMO they just need a little bump to put it over the edge into acceptability.

I'm most excited for a Haswell ultrabook. Actually, I'm excited for ultrabooks from Ivy Bridge onward, thanks to Tri-Gate. I'm failing to get excited about a new desktop, even Haswell. I'm leaning more towards keeping my current rig running than upgrading in 2013 or 2014.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
The question is, closing the gap with today's AMD APUs? Or closing the gap with AMD's APUs in 2013?
APUs are heavily bandwidth limited. AMD can't increase the processing power without increasing the bandwidth accordingly. And they're already pushing it with 1866 MHz DDR3. Of course they can add more channels or adopt DDR4, but that's going to severely increase the cost.

Furthermore, a modest quad-core Haswell CPU will have close to 500 GFLOPS in flexible processing power. An A8-3850 is at 480 GFLOPS for the GPU. So all Intel has to do is assist the IGP with the CPU cores. AVX2 makes them highly suitable for vertex and geometry shaders. When unplugged they could restrict the graphics processing to the IGP to save some power.
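
For reference, that ~500 GFLOPS figure is just peak-rate arithmetic; the clock speed in the sketch below is my own assumption for a quad-core part, not a confirmed Haswell spec:

Code:
#include <stdio.h>

int main(void)
{
    /* Speculative quad-core Haswell configuration. */
    double cores     = 4.0;
    double fma_units = 2.0;  /* two 256-bit FMA units per core      */
    double sp_lanes  = 8.0;  /* 256 bits / 32-bit single precision  */
    double flops_fma = 2.0;  /* one FMA = one multiply + one add    */
    double clock_ghz = 3.8;  /* assumed clock, not a confirmed spec */

    double gflops = cores * fma_units * sp_lanes * flops_fma * clock_ghz;
    printf("Peak: %.0f GFLOPS\n", gflops);  /* ~486 GFLOPS, close to 500 */
    return 0;
}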

After Haswell it could evolve into a fully homogeneous architecture with AVX-1024. These 1024-bit instructions would be executed in four cycles on the 256-bit AVX units, very much like on a GPU. That has the advantage of allowing the CPU's power-hungry front-end to be clock gated for 3/4 of the time, and it offers latency hiding. The advantage over a heterogeneous architecture is that the full chip can be used for graphics processing (or any other task for that matter).
 

denev2004

Member
Dec 3, 2011
105
1
0
Not really. The Core 2 Duo doubled the L2 cache bandwidth and doubled the SIMD processing width (and added x86-64 support), and yet it hardly consumed more power than the Core Duo on the same process node. So increasing the vector throughput really doesn't have much of an impact on a CPU's power consumption. Note that GPUs have even wider vectors and many more execution units, yet all the data traffic this generates is no particular cause for concern.

All that matters is performance per Watt. With AVX2, performance will go up considerably, while the peak power consumption shouldn't be much higher than Ivy Bridge. Note that Ivy Bridge will be exceptionally power efficient thanks to the 22 nm Tri-Gate technology. And since the process will have matured by the time Haswell goes into mass production, I'm expecting the power consumption to be practically the same.
Well, I think bandwidth is not the only problem.
More data means a bigger register file. Sandy Bridge uses a new register scheme to reduce this problem, but Haswell won't get the chance to pull that trick again.
Vector performance per Watt is essential, but it's not the only thing. Scalar performance per Watt, which matters for high-IOPS applications, is also a point we need to think about.

No, they're both unattractive. An FMA unit can execute three types of instructions, while a MUL or ADD unit can only execute one type. It's better to have an FMA unit execute a MUL or ADD than not use a unit at all in two out of three cases. Note that all current GPU designs only have FMA units. GF104 even has three of them, not any other combination of FMA/MUL/ADD.

The thing is, an FMA unit by itself isn't all that expensive (GPUs have tons of them). The relatively expensive part is all the routing between the register file and the ALUs and the forwarding network between ALUs. With an FMA+MUL configuration you'd have five input operands and two output operands to route. To have two FMA units you only need to route 8 instead of 7 in total. It would be a real shame to not go that extra step and end up with a lot of hardware that is only rarely used. Note again that Bulldozer already has two 128-bit FMA units at 32 nm. Widening them to 256-bit shouldn't be much of a problem at 22 nm.
I don't know much about EE, so to me it just seems like an FMA unit is going to cost more die area, both for the unit itself and for the memory subsystem = =

BTW, I once heard that DLP (data-level parallel) designs are inefficient at dealing with branches, is that correct? :colbert:
But it really sounds like relying on DLP may reduce the number of cores, and then a multi-threaded application with lots of branches will have problems.

Another practical issue with an asymmetric configuration is how do you dispatch instructions between the execution units? In an FMA+MUL configuration, when do you execute a MUL instruction on the FMA unit and when on the MUL unit? There isn't much time for making complex load balancing decisions this deep into the pipeline. A symmetric configuration makes it a lot simpler.
Can't they find a way to deal with it? There are some asymmetric designs that I know of...
Anyway, it sounds like a standalone MUL instruction is better off going to the MUL unit.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Well, I think bandwidth is not the only problem.
More data means a bigger register file. Sandy Bridge uses a new register scheme to reduce this problem.
AMD Bulldozer also has a physical register file (PRF), if that's what you're referring to.
Vector performance per Watt is essential, but it's not the only thing. Scalar performance per Watt, which matters for high-IOPS applications, is also a point we need to think about.
There are very few things left that can be done to improve scalar performance per Watt, without sacrificing single-threaded performance. One thing they could do is to extend macro-op fusion so that a scalar move instruction followed by a dependent arithmetic instruction is decoded as one non-destructive instruction.

It adds some complexity to the decoding stages, which consumes a bit of power, but fused instructions are executed in half the time and they also free up a slot in the decoded instruction cache and schedulers so other instructions benefit too. Although in theory it could double the execution rate, in practice an IPC improvement of 5-10% for scalar code would be more realistic. Still, if it only consumes 1% more power (all else being equal) then it's a worthwhile performance/Watt improvement.
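
As a rough sketch of the pattern such fusion would target (the fused decoding itself is hypothetical, a proposal rather than something any current decoder does):

Code:
/* Destructive two-operand x86 arithmetic forces a register copy when a
 * source value has to be preserved, e.g. the pair
 *     mov eax, ebx
 *     add eax, ecx
 * An extended macro-op fuser could decode that mov+add pair as a single
 * non-destructive "eax = ebx + ecx" operation (hypothetical example).
 */
int add_preserving_sources(int x, int y)
{
    int r = x;   /* the scalar move             */
    r += y;      /* the dependent arithmetic op */
    return r;    /* x and y both stay available */
}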
I don't know much about EE, so to me it just seems like an FMA unit is going to cost more die area, both for the unit itself and for the memory subsystem = =
FMA units are tiny. The RV770 has 800 of them, at just under a billion transistors on a 55 nm process. A quad-core Haswell only needs 64 of them, on a 22 nm process, five years later. So it's really not a lot to ask for. And they're surrounded by so much other logic that if you replaced them with MUL and ADD units instead, it would probably have less than a 1% impact on the transistor count.

As for the memory subsystem note that even accessing a single byte of data requires fetching a 64-byte cache line. Sandy Bridge can read 2 x 16-byte, but it would be straightforward to extend it to 2 x 32-byte. Again note that GPUs have had much higher aggregate L1 cache bandwidth for many years now, while still achieving good power efficiency. Likewise the L2 cache bandwidth per core hasn't changed since 2006, so you shouldn't worry about doubling it on a process with three times smaller feature size.

It may seem like overkill for scalar code, but that actually doesn't matter. It would just take only half the time to transfer a cache line, after which that bus can go back to sleep. So the average power consumption remains the same, and might even be lower thanks to a simpler design.
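
To put the 2 x 32-byte figure in perspective: two 256-bit AVX loads already cover a full 64-byte cache line. A minimal sketch (alignment and values chosen purely for illustration):

Code:
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* One 64-byte cache line worth of floats (16 x 4 bytes), 64-byte aligned. */
    _Alignas(64) float line[16];
    for (int i = 0; i < 16; i++)
        line[i] = (float)i;

    /* Two 32-byte AVX loads cover the whole line, so a 2 x 32-byte L1 read
       path could deliver a full cache line per cycle. */
    __m256 lo = _mm256_load_ps(line);      /* bytes  0..31 */
    __m256 hi = _mm256_load_ps(line + 8);  /* bytes 32..63 */

    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(lo, hi));
    printf("%.1f\n", out[0]);              /* 0 + 8 = 8.0 */
    return 0;
}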
BTW, I once heard that DLP (data-level parallel) designs are inefficient at dealing with branches, is that correct? :colbert:
Yes. Each of the SIMD components undergoes the same operation, so when you want other operations to be performed on some of them, the old results have to be thrown out and the new values have to be computed with different instructions, and then merged back in.

The problem gets worse with wider vectors. So for code that takes many different control paths it's a good thing that AVX supports vectors that are no wider than the execution units, and that it has multiple execution units.

Of course the opposite is true for less branchy code. With AVX-512 and AVX-1024 executed on 256-bit units you could get better latency hiding and lower power consumption, like on a GPU, but unlike a GPU you can still use AVX-256 to improve efficiency for more granular code.
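
Here's a minimal sketch of what branchy code looks like once it's vectorized: both sides of the branch are computed for all lanes and a mask merges the results, so the wider the vector, the more potentially wasted work (plain AVX intrinsics, values made up):

Code:
#include <immintrin.h>
#include <stdio.h>

/* Scalar logic:  out = (a > 0) ? a * 2 : a - 1;
   SIMD version: compute BOTH paths for all 8 lanes, then blend by mask. */
int main(void)
{
    __m256 a    = _mm256_setr_ps(-3, 1, -2, 4, 0, 7, -1, 5);
    __m256 mask = _mm256_cmp_ps(a, _mm256_setzero_ps(), _CMP_GT_OS);

    __m256 taken     = _mm256_mul_ps(a, _mm256_set1_ps(2.0f)); /* a * 2 */
    __m256 not_taken = _mm256_sub_ps(a, _mm256_set1_ps(1.0f)); /* a - 1 */
    __m256 out       = _mm256_blendv_ps(not_taken, taken, mask);

    float r[8];
    _mm256_storeu_ps(r, out);
    for (int i = 0; i < 8; i++)
        printf("%.1f ", r[i]);  /* -4.0 2.0 -3.0 8.0 -1.0 14.0 -2.0 10.0 */
    printf("\n");
    return 0;
}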
Can't they find a way to deal with it? There are some asymmetric designs that I know of...
Anyway, it sounds like a standalone MUL instruction is better off going to the MUL unit.
Then you're not taking advantage of the ability to execute two MUL instructions simultaneously.

Note that a pair of FMA units can execute FMA+FMA, FMA+ADD, FMA+MUL, MUL+ADD, MUL+MUL, and ADD+ADD. The latter two also help legacy code. With an FMA+MUL configuration where a MUL instruction is always executed by the MUL unit, you can only execute FMA+MUL or ADD+MUL. And it doesn't help legacy code. So it's insane to go through the trouble of implementing FMA+MUL and get so little benefit, while with a tiny bit more effort you get a vastly superior solution.

Yes, it's possible to make an FMA+MUL configuration also handle MUL+MUL, but you have to be careful not to block FMA instructions from executing. For instance, take the following code: MUL, MUL, FMA, FMA. If they're all independent and you start by executing MUL+MUL, then the two FMAs will take two more cycles, for a total of three cycles. If instead you execute MUL+FMA twice, it completes in two cycles. Making such a decision is certainly possible in theory, but in practice it requires time during pipeline stages that are already critical for achieving good clock rates.

With dual FMA units you don't have that problem. So for all the above reasons it is most likely that Haswell will have dual FMA units.
 

denev2004

Member
Dec 3, 2011
105
1
0
AMD Bulldozer also has a physical register file (PRF), if that's what you're referring to.
But Nehalem doesn't have one.

There are very few things left that can be done to improve scalar performance per Watt, without sacrificing single-threaded performance. One thing they could do is to extend macro-op fusion so that a scalar move instruction followed by a dependent arithmetic instruction is decoded as one non-destructive instruction.
What I mean is just that even if we can't improve it, we should think of ways not to reduce it by a lot...

Note that a pair of FMA units can execute FMA+FMA, FMA+ADD, FMA+MUL, MUL+ADD, MUL+MUL, and ADD+ADD. The latter two also help legacy code. With an FMA+MUL configuration where a MUL instruction is always executed by the MUL unit, you can only execute FMA+MUL or ADD+MUL. And it doesn't help legacy code. So it's insane to go through the trouble of implementing FMA+MUL and get so little benefit, while with a tiny bit more effort you get a vastly superior solution.
Isn't Intel's AVX unit made from two 128-bit units, which means legacy code can also be supported?

BTW, there is another opinion

Anyway my opinion is that the optimal design choice is probably FMA+ADD. This is not just for power/cost reasons but also performance: an FMA unit will have very similar latency for MUL but significantly higher latency for ADD than a standalone unit. AMD had to compromise ADD latency on Bulldozer, which doesn't seem ideal to me at this point in time.
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
Isn't Intel's AVX unit made from two 128-bit units, which means legacy code can also be supported?

What he means is that given an FMA + MUL config where you force all MULs to go to the MUL unit and not to the FMA unit, you will gain zero perf since it basically becomes ADD + MUL, which is what Sandy Bridge already has.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
What he means is that given an FMA + MUL config where you force all MULs to go to the MUL unit and not to the FMA unit, you will gain zero perf since it basically becomes ADD + MUL, which is what Sandy Bridge already has.

Crap, you mean we aren't going to get something for nothing after all? :p

I've often wondered just how much opportunity is even out there for ISA-based performance improvements to IPC. After tweaking on x86 for what now, 40yrs?

Is there much left to improve upon (in the ISA itself) or is it really the microarchitecture that delivers the performance/watt and IPC improvements here out?
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
Crap, you mean we aren't going to get something for nothing after all? :p

I've often wondered just how much opportunity is even out there for ISA-based performance improvements to IPC. After tweaking on x86 for what now, 40yrs?

Is there much left to improve upon (in the ISA itself) or is it really the microarchitecture that delivers the performance/watt and IPC improvements here out?

Oh there's definitely good stuff left.
 

AtenRa

Lifer
Feb 2, 2009
14,001
3,357
136
APUs are heavily bandwidth limited. AMD can't increase the processing power without increasing the bandwidth accordingly. And they're already pushing it with 1866 MHz DDR3. Of course they can add more channels or adopt DDR4, but that's going to severely increase the cost.
Llano uses 1866MHz, Trinity will use 2133MHz. Memory prices drop every year and memory speeds rise too. 2013-2014 will see the introduction of DDR4, and perhaps Haswell will utilize a DDR4 memory controller too.

Furthermore, a modest quad-core Haswell CPU will have close to 500 GFLOPS in flexible processing power. An A8-3850 is at 480 GFLOPS for the GPU. So all Intel has to do is assist the IGP with the CPU cores. AVX2 makes them highly suitable for vertex and geometry shaders. When unplugged they could restrict the graphics processing to the IGP to save some power.
Well, Llano has 480 GFLOPS in 2011, Trinity will have 50% more compute power in 2012, and the next-gen AMD Fusion chip on 22nm SOI HKMG in 2013 will probably have double Trinity's compute power (GCN). That is only the compute from the iGPU ALUs; add the CPU FP units as well, and AMD's 22nm APUs with GCN will be a beast in FP compute power.

After Haswell it could evolve into a fully homogeneous architecture with AVX-1024. These 1024-bit instructions would be executed in four cycles on the 256-bit AVX units, very much like on a GPU. That has the advantage of allowing the CPU's power-hungry front-end to be clock gated for 3/4 of the time, and it offers latency hiding. The advantage over a heterogeneous architecture is that the full chip can be used for graphics processing (or any other task for that matter).
Well, that's where AMD is going with Fusion. But right now it seems AMD has the lead in FP compute power, and I don't see Haswell being able to change that with AVX alone.