Anand's Haswell Architecture article follow up questions...

Hulk

Diamond Member
With Haswell on the horizon I finally got around to taking a deep dive into Anand's architecture article from last year. First I have to say it's an amazing read. Anand is in my opinion the best tech writer in the world. He has a gift. He's always been good at it and he gets better all the time. There is so much between the lines...

Anyway, I have a few nit-picky questions I'm hoping the CPU gurus around here can help me with.

1. On page two "Platform Retargeting" Anand writes
"There will be four client focused categories of Haswell, and I can only talk about three of them now. There are the standard voltage desktop parts, the mobile parts and the ultra-mobile parts: Haswell, Haswell M and Haswell U."

This seems to indicate the three Haswells are Haswell, Haswell M, and Haswell U

Then a little farther down that page he writes
"It's the Haswell U/ULT parts that brings about the dramatic change. These will be a single chip solution, with part of the voltage regulation typically found on motherboards moved onto the chip's package instead. "

This seems to imply that Haswell U/ULT are both the 3rd Haswell because he said he couldn't discuss the 4th Haswell. Or perhaps he is telling us something?

But then on Page 4 "The Fourth Haswell" he writes
"Just before this year's IDF Intel claimed that Haswell ULT would start at 10W, down from 17W in Sandy/Ivy Bridge. Finally, at IDF Intel showed a demo of Haswell running the Unigine Heaven benchmark at under 8W:"

Since this section is titled "The Fourth Haswell," are we to assume that the 4th Haswell is the ULT part that is sub-10W? The one that he couldn't discuss? Or perhaps he's leaving us to infer that there will be desktop parts ("Haswell"), mobile parts in the standard 35W range at higher clocks than current mobile CPUs at this TDP ("Haswell M"), mobile parts for ultrabooks in the 17W range ("Haswell U"), and the 4th Haswell in the sub-10W range ("Haswell ULT")?

I'm thinking that Anand is "saying" without "saying" that the 4th Haswell is an ultralow voltage part intended for tablets and perhaps even smaller devices. He is making that point not with his words but with the Intel IDF demo. This leads me to suspect there is still a surprise in store for us as to how low (and small) Haswell will go.

2. Where exactly does the "front end" of the execution engine end and the "back end" begin? Or more to the point, is the Decode Queue the front end or the back end?


3. "Haswell's Wide Execution Engine" page, Anand writes
"Simply being able to pick from more instructions to execute in parallel is one thing, we haven't seen an increase in the number of parallel execution ports since Conroe."

Anand is always deliberate in his writing, and I'd like to know what he was getting at. I'm pretty sure it's just a typo and he meant to write the following, but I'm not certain:
"Simply being able to pick from more instructions to execute in parallel is one thing, BUT we haven't seen an increase in the number of parallel execution ports since Conroe."

4. Under section “Decoupled L3 Cache” Anand writes
“Ivy Bridge saw the addition of a small graphics L3 cache to mitigate this situation, but ultimately giving the on-die GPU independent access to the big, primary L3 cache without worrying about power concerns was a big issue for the design team.”
I’m not completely understanding this. I think it means the GPU received its own cache with Ivy Bridge, as Anand writes “the addition of a small graphics L3 cache.” Or does this mean a portion of the L3 was dedicated to the GPU? And then the next sentence confuses me even more. I think he is saying the Intel design team knew the question of whether or not to give the GPU independent access to the CPU L3 was a big deal, but they ultimately decided with Ivy to keep the CPU+uncore and GPU on separate frequency domains?
Also, now that Haswell has returned to the 3 clock domain design, is the “small graphics L3 cache” from Ivy still there?
I don’t understand the 2nd to last sentence of this section.
“There are now dedicated pipes for data and non-data accesses to the last level cache.”
Finally the last sentence of this section.
“Haswell’s memory controller is also improved, with better write throughput to DRAM. Intel has been quietly telling memory makers to push for even higher DDR3 frequencies in anticipation of Haswell.”
I take this to mean that Intel knows that if there is a slight memory weakness with Haswell, it comes from the increased latency of the L3, which they hope can be mitigated by pushing memory manufacturers for faster main memory. As usual Anand puts quite a bit in between the lines, but you’ve gotta really read to pull it out.

5. Just a comment. Gotta love Anand’s style. So great. Who else could equate writing well threaded code for independent tasks with the visualization of grabbing a low hanging apple off a tree!
“Parallelizing truly independent tasks is the low hanging fruit, but it’s the tasks that all access the same data structure that can create problems.”

6. It seems as though Intel deliberately over-engineers either the front end or the back end of the execution engine just a little bit, then catches up to and surpasses that end in the next tock or two. The widening of Haswell's back end seems very significant to me. I'm thinking we're going to finally see a wider-than-4-unit front end with the next tock? Possible?
 

jpiniero

Lifer
I do think what Anand was referring to is the 10/"7" W processor. That's still too much for a real tablet, but it could make for a really thin ultrabook.
 

Idontcare

Elite Member
I wonder what kind of clockspeed they could get out of Haswell if it was TDP constrained to just 1W max power consumption? Think they could get 200MHz?
 

Blandge

Member
I wonder what kind of clockspeed they could get out of Haswell if it was TDP constrained to just 1W max power consumption? Think they could get 200MHz?

I suspect it would be unstable, and/or Atom performs significantly better at 1W.
 

Idontcare

Elite Member
I suspect it would be unstable, and/or Atom performs significantly better at 1W.

I doubt Atom would perform better, but Atom would be a hell of a lot cheaper to manufacture given the die size and development expense.

Running Haswell at 1W is the "let them eat cake" edition :D I was just curious how low the clocks would have to be in order for it to be stable. Presumably there is a clockspeed at which it is stable (enough volts) and it doesn't use more than 1W...even if that clockspeed is just 100KHz.
 

Hulk

Diamond Member
I doubt Atom would perform better, but Atom would be a hell of a lot cheaper to manufacture given the die size and development expense.

Running Haswell at 1W is the "let them eat cake" edition :D I was just curious how low the clocks would have to be in order for it to be stable. Presumably there is a clockspeed at which it is stable (enough volts) and it doesn't use more than 1W...even if that clockspeed is just 100KHz.

Perhaps quantum mechanics rears its head and the threshold operating voltage is quantized regardless of clockspeed. Just a guess.
 

Charles Kozierok

Elite Member
I can't answer any of this definitively, but my speculations...

This seems to imply that Haswell U/ULT are both the 3rd Haswell because he said he couldn't discuss the 4th Haswell. Or perhaps he is telling us something?

I think there is going to be a ULT part and a ULX part, differing only in power use. Both will have the PCH integrated. Maybe Anand was referring to those two products here.

Since this section is titled the 4th Haswell are we to assume that the 4th Haswell is the ULT part that is sub 10W? That one that he couldn't discuss?

My guess is it's the ULX.

2. Where exactly does the "front end" of the execution engine end and the "back end" begin? Or more to the point, is the Decode Queue the front end or the back end?

The usual dividing line between the front and back ends is between decoding and scheduling/dispatching, which is also where you move from in-order to out-of-order processing. Anand does mention this in the article. But of course there is no formal division; it's just a way of looking at things.

Anand is always deliberate in his writing and I'd like to know what he was getting at? I'm pretty sure that's just a typo and he meant to write the following but I'm not sure?

Probably; the comma should be a semi-colon.

I’m not completely understanding this. I think it means the GPU received its own cache with Ivy Bridge, as Anand writes “the addition of a small graphics L3 cache.” Or does this mean a portion of the L3 was dedicated to the GPU?

I believe the GPU in IB has its own L3 cache. Cf: http://www.realworldtech.com/ivy-bridge-gpu/6/

I take this to mean that Intel knows that if there is a slight memory weakness with Haswell it comes from the increased latency of the L3.

Not sure how you concluded that from what he wrote. I think they just want faster memory because it will improve graphics performance.

5. Just a comment. Gotta love Anand’s style. So great. Who else could equate writing well threaded code for independent tasks with the visualization of grabbing a low hanging apple off a tree!

"Low hanging fruit" is actually a pretty common idiom meaning that you start with the stuff that's easy to do before you move on to the more difficult things. :)
 

Homeles

Platinum Member
Hulk said:
Also, now that Haswell has returned to the 3 clock domain design, is the “small graphics L3 cache” from Ivy still there?
Yes, it is still there. Die shots of Haswell confirm this. It is the object in pink on the extreme left (one of the two there; I'd guess the leftmost).

http://imgon.net/di-TDNO.png

“Haswell’s memory controller is also improved, with better write throughput to DRAM. Intel has been quietly telling memory makers to push for even higher DDR3 frequencies in anticipation of Haswell.”
I take this to mean that Intel knows that if there is a slight memory weakness with Haswell it comes from the increased latency of the L3. Which they hope can be mitigated by pushing memory manufacturers for faster main memory. As usual Anand puts quite a bit in between the lines but you’ve gotta really read to pull it out.
I don't think it's a result of the L3 taking a latency hit; I see it as it just being another evolution in Intel's IMC. Since Conroe, it's taken a step forward with every generation with the possible exception of Westmere.
 

Idontcare

Elite Member
Perhaps quantum mechanics rears its head and the threshold operating voltage is quantized regardless of clockspeed. Just a guess.

Well that is true, there is a point (threshold voltage) below which the device won't reliably operate regardless of clockspeed.

I'm thinking of the test where the CPU is set to that voltage and then one steps the clockspeed down as needed to get power under 1W.

Now it may be the case that the static leakage is just too great, even at threshold voltage the device may have a power-consumption floor that prevents the CPU from getting below say 6 or 7W.

For example my 3770k, using the equations of state determined here, at 35C and 100MHz with an assumed Vthreshold of 0.5V (not unreasonable) my CPU would have 2.6W of static losses and a mere 0.54W of dynamic power usage...so ~3W overall.

Haswell is supposed to use special finfet xtors though, probably designed to have even lower leakage current. And if you cut the chip in half (2C/4T instead of 4C/8T) the leakage gets cut in half as well. So that 2.6W becomes 1.3W, and lower still from special xtors. Dynamic power gets cut in half as well, from 0.54W to 0.27W.

So it might be possible...but then one definitely has to wonder, as Blandge pointed out, if a 100MHz 2C/4T Haswell would actually outperform a much less expensive Atom at the same power consumption. Probably not. I think Blandge was right.
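The back-of-envelope arithmetic above can be sketched in a few lines. This is only a check of the post's own figures; the 2.6W/0.54W split for a 3770K and the assumption that halving the die halves both components are the poster's estimates, not measured data:

```python
# Sketch of the 1 W feasibility estimate, using the post's figures for a
# 3770K at 35C and 100 MHz with an assumed Vthreshold of 0.5 V.
static_4c8t = 2.6    # W of static (leakage) losses, 4C/8T (poster's estimate)
dynamic_4c8t = 0.54  # W of dynamic (switching) power at 100 MHz

# Cutting the chip to 2C/4T is assumed to halve both components.
static_2c4t = static_4c8t / 2      # 1.3 W
dynamic_2c4t = dynamic_4c8t / 2    # 0.27 W

total_2c4t = static_2c4t + dynamic_2c4t
print(f"2C/4T at 100 MHz: ~{total_2c4t:.2f} W")  # ~1.57 W
```

Even before any FinFET leakage savings, the 2C/4T estimate lands around 1.6W, still above the 1W target, which is why the post only calls it "might be possible."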
 

Hulk

Diamond Member
Well that is true, there is a point (threshold voltage) below which the device won't reliably operate regardless of clockspeed.

I'm thinking of the test where the CPU is set to that voltage and then one steps the clockspeed down as needed to get power under 1W.

Now it may be the case that the static leakage is just too great, even at threshold voltage the device may have a power-consumption floor that prevents the CPU from getting below say 6 or 7W.

For example my 3770k, using the equations of state determined here, at 35C and 100MHz with an assumed Vthreshold of 0.5V (not unreasonable) my CPU would have 2.6W of static losses and a mere 0.54W of dynamic power usage...so ~3W overall.

Haswell is supposed to use special finfet xtors though, probably designed to have even lower leakage current. And if you cut the chip in half (2C/4T instead of 4C/8T) the leakage gets cut in half as well. So that 2.6W becomes 1.3W, and lower still from special xtors. Dynamic power gets cut in half as well, from 0.54W to 0.27W.

So it might be possible...but then one definitely has to wonder, as Blandge pointed out, if a 100MHz 2C/4T Haswell would actually outperform a much less expensive Atom at the same power consumption. Probably not. I think Blandge was right.

Looking at the benches quickly, I think a 5x clockspeed advantage for Ivy is a good estimate, which means it would need about 400MHz to actually be better performing.
 

Hulk

Diamond Member
Not sure how you concluded that from what he wrote. I think they just want faster memory because it will improve graphics performance.


Exactly. Anand didn't write anything about Intel urging memory manufacturers toward faster memory when talking about CPU/L3 latency, only when discussing (for a couple of paragraphs) the difficult decision to move away from the "extremely fast" L3 GPU access of Sandy Bridge. The increased L3/GPU latency, coupled with the more efficient GPU, will definitely stress the graphics memory subsystem. And as I wrote, I don't think it's a weakness per se; just that if there is a weak spot to look for when the part is available for testing, this is something that deserves a good look.

As for the low hanging fruit comment, yeah I know what that means. I was just commenting on how much I enjoy Anand's style.
 

IntelUser2000

Elite Member
So it might be possible...but then one definitely has to wonder as Blange pointed out if a 100MHz 2C/4t Haswell would actually outperform a much less expensive Atom at the same power consumption at that point. Probably not. I think Blange was right.

Based on SpecInt2k, Atom at 1.6GHz should get about 700 points while Ivy Bridge should get 1.2 points per MHz. Let's say Haswell gets that to 1.3.

So Haswell would need to be at 540MHz to be like a 1.6GHz Atom in single thread. Let's say 500MHz because scaling with clock speed isn't linear. Atom gets better Hyperthreading gains so in Multi-threading it would need maybe 600MHz or so.

Interesting fact: on my Ultrabook, running the Dhrystone and Whetstone benchmarks at a fixed 800MHz used 4.5W. The CPU itself used less than 2.8W at that point, and the uncore takes nearly 2W regardless of CPU load or frequency, so to go below 1W they would need far less CPU frequency and a radical re-engineering of the non-core functions. That's not even mentioning that the benchmark isn't as intensive as Linpack, which is what TDP is for nowadays. That's why the Atom exists: to scale down below the levels of their Core line.
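The parity arithmetic above works out as follows. This is a quick sketch using the post's SpecInt2k figures, which are the poster's estimates (the ~700-point Atom score and the 1.3 points/MHz Haswell assumption are not official numbers):

```python
# Single-thread parity estimate from the post's numbers (SpecInt2k).
atom_score = 700           # Atom @ 1.6 GHz (poster's figure)
haswell_pts_per_mhz = 1.3  # assumed slight uplift over Ivy Bridge's ~1.2/MHz

parity_mhz = atom_score / haswell_pts_per_mhz
print(f"Haswell clock for Atom single-thread parity: ~{parity_mhz:.0f} MHz")
# ~538 MHz, which the post rounds to ~500 MHz (sub-linear scaling) or
# ~600 MHz once Atom's stronger Hyper-Threading gains are factored in.
```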
 
Based on SpecInt2k, Atom at 1.6GHz should get about 700 points while Ivy Bridge should get 1.2 points per MHz. Let's say Haswell gets that to 1.3.

So Haswell would need to be at 540MHz to be like a 1.6GHz Atom in single thread. Let's say 500MHz because scaling with clock speed isn't linear. Atom gets better Hyperthreading gains so in Multi-threading it would need maybe 600MHz or so.

Interesting fact: On my Ultrabook, 800MHz fixed frequency running Dhrystone and Whetstone benchmarks used 4.5W. They would need far less CPU frequency and radically engineer the non-core functions as the CPU used less than 2.8W at that point, if they want to go below 1W, because the uncore takes nearly 2W regardless of CPU load or frequency points. That's not even mentioning that the benchmark isn't as intensive as Linpack, which is what TDP is for nowadays. That's why the Atom exists. To scale down below levels of their Core line.

And people wonder why with Haswell, Intel decoupled the L3$...;)
 

Erva

Member
I have another question if that's ok.

How come the desktop versions have a higher TDP than the previous generation? i5-3570K: 77W i5-4570: 84W. Isn't Haswell supposed to have lower power consumption?
 

ShintaiDK

Lifer
I have another question if that's ok.

How come the desktop versions have a higher TDP than the previous generation? i5-3570K: 77W i5-4570: 84W. Isn't Haswell supposed to have lower power consumption?

On-package VRM. Platform consumption might still be lower.