Intel "Haswell" Speculation thread


BenchPress

Senior member
Nov 8, 2011
I think it's possible that they just don't. When they were talking about the added execution port, they only mentioned that it frees ports 0 and 1 for FMA, not that it would increase IPC for existing integer code. (Which, if it were fully connected, it absolutely would. Why not advertise it?)

This leads me to believe that perhaps it doesn't forward to/from 0 and 1, and just exists so that loop counters, branches and such can be managed while 0 and 1 are dedicated to vector loads.
According to ARCS001 slide 12, ports 0+1 and 6+5 are symmetric when it comes to scalar integer operations and branch. So it seems possible that there's no forwarding between these pairs. Perhaps there's some instruction dependency analysis going on before scheduling, so that dependent ones are dispatched to the same pair of ports. With Hyper-Threading it's trivial to know which are independent. That way the second branch unit also starts to make a lot more sense...

It would also mean the IPC gain for single-threaded code would be minimal or even non-existent if there weren't other improvements. Hence they wouldn't advertise it. Heck, it would bear some resemblance to Bulldozer. :hmm:
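To picture the scenario being described, here is a hypothetical example (mine, not from the post): a vectorized multiply-accumulate loop whose only scalar work is the loop counter and the branch. The intrinsics are real AVX2/FMA intrinsics, but the port assignments in the comments are this thread's speculation, not confirmed Haswell behavior.

```c
#include <immintrin.h>
#include <stddef.h>

void fma_loop(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {     /* increment + compare: scalar integer ops */
        __m256 vb = _mm256_loadu_ps(b + i);      /* loads: AGU/load ports                   */
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 va = _mm256_loadu_ps(a + i);
        va = _mm256_fmadd_ps(vb, vc, va);        /* FMA: the work ports 0 and 1 would do    */
        _mm256_storeu_ps(a + i, va);             /* store                                   */
    }                                            /* loop branch: what a separate branch/ALU
                                                    port could absorb, per the speculation  */
}
```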
 

BenchPress

Senior member
Nov 8, 2011
Engineer 1: Add Port 6
Engineer 2: No
Engineer 1: Do it, it's really awesome if we do
Engineer 2: Ok

:D

Edit: So on one hand you make a complicated forwarding network to simplify the scheduler (it doesn't have to care which ALU forwards to which ALU)... or you have a simple network and a complicated scheduler. Just saying you have to pick the lesser of two evils.
If the added latency of having no bypass between 0+1 and 5+6 at all is small enough for that to be an option, then I believe the scheduler doesn't have to get any more complicated. It already has to be able to issue instructions which have operands coming from the register file anyway.

But that's a big 'if' of course. I don't know what the latency is for writing a PRF and reading from it again these days.
 

TuxDave

Lifer
Oct 8, 2002
If the added latency of having no bypass between 0+1 and 5+6 at all is small enough for that to be an option, then I believe the scheduler doesn't have to get any more complicated. It already has to be able to issue instructions which have operands coming from the register file anyway.

But that's a big 'if' of course. I don't know what the latency is for writing a PRF and reading from it again these days.

I want to write a bunch.... but I can't. You're really overemphasising a reduction in design complexity in exchange for a lot of architectural complexity (the scheduler WILL have a bad day) AND a potential performance hit. And Intel has a super awesome design team. Just saying...
 

BenchPress

Senior member
Nov 8, 2011
I want to write a bunch.... but I can't. You're really overemphasising a reduction in design complexity in exchange for a lot of architectural complexity (the scheduler WILL have a bad day) AND a potential performance hit. And Intel has a super awesome design team. Just saying...
Thanks for the hints. I read up on (unified) scheduler design as much as I could find and I now realize how uniform bypass latencies indeed keep things way simpler.

It still seems like a big feat to have four integer execution ports which could all execute instructions back-to-back. Has that even been done before? I believe it either sacrifices clock speed, or increases power consumption, or the "super awesome design team" has outdone itself and maximized the potential of the 22 nm process to squeeze it all in with no major compromises. Haswell is very power efficient so I guess that narrows it down.

One other suggestion I've stumbled upon is to use a form of width pipelining to save on bypass time. But maybe I'm making things more complicated than they have to be again. I'm just baffled by the addition of another arithmetic execution port, and hope it doesn't come at a significant cost.
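For readers unfamiliar with the term, here is a toy model of what width pipelining could look like for a 64-bit add: the operation is split into two 32-bit halves produced on consecutive cycles, so the critical bypass path only has to carry 32 bits plus a carry per cycle. This is only a sketch of the idea as I read it, not a description of any real Haswell datapath.

```c
#include <stdint.h>

typedef struct {
    uint32_t lo;      /* available after "cycle 1" */
    uint32_t hi;      /* available after "cycle 2" */
} split64;

split64 width_pipelined_add(uint64_t a, uint64_t b)
{
    split64 r;
    /* cycle 1: low 32-bit half, produce the carry */
    uint64_t lo_sum = (uint64_t)(uint32_t)a + (uint32_t)b;
    r.lo = (uint32_t)lo_sum;
    uint32_t carry = (uint32_t)(lo_sum >> 32);
    /* cycle 2: high 32-bit half consumes the carry */
    r.hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry;
    return r;
}
```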
 

BenchPress

Senior member
Nov 8, 2011
How about this idea: Most arithmetic code is 32-bit, while 64-bit is mainly used for pointers and thus typically not on a critical path. So it would probably work out fine if 32-bit arithmetic had a latency of 1 cycle and 64-bit arithmetic had a latency of 2 cycles.

Unlike making the bypass latency between ports 0+1 and 5+6 longer, I don't think it would complicate scheduling. The latency of the operations wouldn't depend on which ALU they're coming from. It would just depend on their width.

One complication is that you can't have a 64-bit operation started two cycles ago be completed at the same time as a 32-bit operation started a cycle ago. So a 64-bit operation can only be followed by an independent 64-bit operation. But given that each execution port has a twin, that doesn't seem like an issue either!

The end result would be that Haswell doesn't have to sacrifice clock speed, and IPC could be slightly higher!
 

Nemesis 1

Lifer
Dec 30, 2006
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.
 

Tuna-Fish

Golden Member
Mar 4, 2011
How about this idea: Most arithmetic code is 32-bit, while 64-bit is mainly used for pointers and thus typically not on a critical path.

This is not true at all. Pointers are very much on the critical path -- in fact, since pointer operations are often followed by loads, and since the CPU scheduler often empties while waiting for a load (even an L2 hit takes so long that the execution units can usually clear the scheduler), every cycle you delay a pointer operation (and the issuing of the load that follows) means a completely lost cycle of execution for all execution units. Combine that with code that really likes objects (so practically every pointer access involves at least an add), and 2-cycle 64-bit ops would be a disaster.
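A minimal illustration of this point, using a hypothetical example of my own: in a pointer-chasing loop, the pointer update and the load it feeds form a serial dependency chain, so any extra latency on the pointer arithmetic is paid on every single iteration.

```c
#include <stddef.h>

struct node {
    int          payload;
    struct node *next;
};

long sum_list(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->payload;   /* address math + load for the field            */
        p = p->next;         /* dependent load feeding the next iteration    */
    }                        /* the next trip cannot start until p arrives   */
    return sum;
}
```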
 

cytg111

Lifer
Mar 17, 2008
"broadwell brings an end to cabinet pcs"

- the second the magic code fairy reveals the one true language to parallelize them all is the second it really becomes all about moar coars. The second after that, the big-ass power-venting cabinets will be back.
Wait for it ... wait for it ...
 

Ajay

Lifer
Jan 8, 2001
One complication is that you can't have a 64-bit operation started two cycles ago be completed at the same time as a 32-bit operation started a cycle ago. So a 64-bit operation can only be followed by an independent 64-bit operation. But given that each execution port has a twin, that doesn't seem like an issue either!

Why not? Especially since "Increasing size of buffers internally, giving us larger OoO window" from the AT Blog. So as long as the 64b & 32b operations are independent, I see no problem.

Again, my experience @ the ISA level is with RISC (mainly i860/960, PPC 750 & 32b Mips). Given the load/store nature of RISC, I would think that 64b would be in the critical path, but I don't know much about CISC architectures (aside from what I've already forgotten from my MPU Design & Comp Arch classes :$).
 

BenchPress

Senior member
Nov 8, 2011
This is not true at all. Pointers are very much on the critical path -- in fact, since pointer operations are often followed by loads, and since the CPU scheduler often empties while waiting for a load (even an L2 hit takes so long that the execution units can usually clear the scheduler), every cycle you delay a pointer operation (and the issuing of the load that follows) means a completely lost cycle of execution for all execution units. Combine that with code that really likes objects (so practically every pointer access involves at least an add), and 2-cycle 64-bit ops would be a disaster.
Thanks for pointing that out, but note that anything that fits the (many) x86 addressing modes will be executed by the AGUs at ports 2 and 3, with no added latency. So I don't think having other 64-bit arithmetic take two cycles would be anywhere near a disaster. Having an extra execution port should offset that and even offer higher IPC.
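As a hedged illustration of the addressing-mode argument (the example code is mine, not from the thread): the field access below involves base + index*scale + displacement arithmetic, and compilers typically fold all of it into the load's addressing mode, e.g. something like mov eax, dword ptr [rdi + rsi*8 + 4] on x86-64, so the AGU on a load port does the math instead of a general-purpose ALU.

```c
/* Hypothetical example: indexing an array of 8-byte structs and reading a
 * field at offset 4, pointer math that an x86 addressing mode can absorb. */
struct pair { int key; int value; };

int get_value(const struct pair *arr, long i)
{
    return arr[i].value;   /* base + i*8 + 4, all inside one addressing mode */
}
```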
 

BenchPress

Senior member
Nov 8, 2011
Why not? Especially since "Increasing size of buffers internally, giving us larger OoO window" from the AT Blog. So as long as the 64b & 32b operations are independent, I see no problem.
The problem is that if you start a 1-cycle operation on the same port right after a 2-cycle operation, the results would be ready on the same cycle. And you simply can't send two results down the same result bus. So the 1-cycle operation simply has to wait, or execute on another port. Starting another 2-cycle operation on the same port is no problem though, so there's no loss of throughput. And as I noted before, having 'twin' ports makes port contention between 1-cycle and 2-cycle operations very unlikely.
Again, my experience @ the ISA level is with RISC (mainly i860/960, PPC 750 & 32b Mips). Given the load/store nature of RISC, I would think that 64b would be in the critical path, but I don't know much about CISC architectures (aside from what I've already forgotten from my MPU Design & Comp Arch classes :$).
Indeed, for RISC this would be a problem. But with CISC the majority of pointer arithmetic is part of the addressing mode, which can be executed by an independent AGU instead of requiring the generic ALUs.
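Here is a tiny sketch of the write-back constraint described above, under the stated assumption of one result bus per port. It is a toy model of the proposed scheduling rule, not of actual Haswell hardware: a 1-cycle op issued the cycle after a 2-cycle op on the same port would try to complete in the same cycle, while another 2-cycle op would not.

```c
#include <stdio.h>

int main(void)
{
    /* latencies of ops issued on one port in consecutive cycles */
    int lat[] = { 2, 1, 2, 2, 1 };
    int n = (int)(sizeof lat / sizeof lat[0]);
    int prev_done = -1;    /* completion cycle of the previous op on this port */

    for (int issue = 0; issue < n; issue++) {
        int done = issue + lat[issue];
        if (done == prev_done)
            printf("op %d (lat %d, issued cycle %d): completes cycle %d -> "
                   "collides with the previous op on the result bus\n",
                   issue, lat[issue], issue, done);
        else
            printf("op %d (lat %d, issued cycle %d): completes cycle %d -> ok\n",
                   issue, lat[issue], issue, done);
        prev_done = done;
    }
    return 0;
}
```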
 

Cerb

Elite Member
Aug 26, 2000
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.
No, that will take a while yet. The primary reason for our boxes, today, is not anything that Haswell removes. We have them for drives, special bay devices, and expansion cards.

AIOs will only get more popular, and even power users will use MicroATX more and more, so our cabinets will shrink, and low-profile cards will become ever more popular as well, but it will be quite some time yet before they go away.
 

BenchPress

Senior member
Nov 8, 2011
By the way, not all 64-bit operations would have to take 2 cycles. Bitwise operations are trivial so they can still be 1 cycle.

But having things like 64-bit addition/subtraction and shift/rotate take 2 cycles would make it a lot easier to have enough time to bypass the results between four ALUs without having to lower the clock frequency. And that would mean there's hope for overclockers after all!
 

Revolution 11

Senior member
Jun 2, 2011
No, that will take a while yet. The primary reason for our boxes, today, is not anything that Haswell removes. We have them for drives, special bay devices, and expansion cards.

AIOs will only get more popular, and even power users will use MicroATX more and more, so our cabinets will shrink, and low-profile cards will become ever more popular as well, but it will be quite some time yet before they go away.
People have been saying for years (decades?) that the desktop is going to die. It never does. Granted, the market is slowly declining and in a mature state, but I think there will always be room for desktop PCs. Scaling up performance is much harder in a compact form (tablet/laptop) for several reasons already mentioned.
 

Ajay

Lifer
Jan 8, 2001
The problem is that if you start a 1-cycle operation on the same port right after a 2-cycle operation, the results would be ready on the same cycle. And you simply can't send two results down the same result bus. So the 1-cycle operation simply has to wait, or execute on another port. Starting another 2-cycle operation on the same port is no problem though, so there's no loss of throughput. And as I noted before, having 'twin' ports makes port contention between 1-cycle and 2-cycle operations very unlikely.

OK, I think I must have misread something. You are talking about pushing two instructions down one port, a 2-cycle op and a 1-cycle op - so you wind up with a pipeline hazard. Am I following you correctly now? And thanks for the link to pipeline widening. I understand the P4 architecture much better now (and I remember going over this in my Comp. Arch. class).

So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops? [at least for bursty code, I realize it can't be sustained since L2$ and L3$ can't keep up, never mind main memory]

Thanks.
 

TuxDave

Lifer
Oct 8, 2002
So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops? [at least for bursty code, I realize it can't be sustained since L2$ and L3$ can't keep up, never mind main memory]

It doesn't have to be. You still have physical register files for the next hierarchy of memory. And if you want to go balls to the wall of design, remember that each op has multiple sources and so by my math you'll actually need far more than 8 ways total.
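A back-of-the-envelope version of this point, with assumed numbers rather than disclosed Haswell figures: even a modest count of sources per op across four arithmetic ports adds up to far more operand reads per cycle than a couple of L1D ports could supply, which is why most operands come from the register file and the bypass network.

```c
#include <stdio.h>

int main(void)
{
    int alu_ports      = 4;  /* arithmetic ports (0, 1, 5, 6)                    */
    int srcs_per_op    = 3;  /* assumed worst case, e.g. an FMA reads 3 sources  */
    int l1d_read_ports = 2;  /* assumed L1D read ports                           */

    int operand_reads = alu_ports * srcs_per_op;
    printf("worst-case operand reads per cycle: %d\n", operand_reads);
    printf("that already exceeds 8, and the L1D has only %d read ports,\n"
           "so most sources must come from the register file or bypass network\n",
           l1d_read_ports);
    return 0;
}
```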
 

Ajay

Lifer
Jan 8, 2001
It doesn't have to be. You still have physical register files for the next hierarchy of memory. And if you want to go balls to the wall of design, remember that each op has multiple sources and so by my math you'll actually need far more than 8 ways total.

Thanks! Seems like Intel will need a balls-to-the-wall design to hit their targets on AVX2 performance, at least. I suppose there could be fewer (or equal) ports to the register files, but AVX ops could be prioritized in the scheduler. I should read RW's SB uArch overview to get a clear picture of what existed b/4 Haswell, so I can make more sense of this. It's just that, from a top-level view, Haswell is the first CPU in recent history that has piqued my interest (since Core 2).
 

CHADBOGA

Platinum Member
Mar 31, 2009
It's just that, from a top-level view, Haswell is the first CPU in recent history that has piqued my interest (since Core 2).

Haswell is the proper next gen over Core 2.

Everything since Core 2 has just been a Core 2 derivative, so Haswell sets the CPU platform for Intel's performance-oriented line of CPUs for the next 5 to 6 years.
 

BenchPress

Senior member
Nov 8, 2011
OK, I think I must have misread something. You are talking about pushing two instructions down one port, a 2-cycle op and a 1-cycle op - so you wind up with a pipeline hazard. Am I following you correctly now?
Yes. The scheduler would be prevented from issuing a 1-cycle operation right after a 2-cycle operation, to avoid this hazard.

But because Haswell's port 6 can execute the same operations as port 0, and port 5 the same operations as port 1, blocking the scheduler shouldn't have much of an effect, at least not until it runs out of 64-bit instructions. At that point you lose a cycle and the port can start taking 32-bit instructions. But it should still be better than having only three execution ports, which suffer from more contention.

Anyway, I'm just theorizing out loud. Purely single-threaded workloads that are performance-critical have become very rare. So they may just have solved the increased bypass latency problem by lowering the clock frequency a tad and relying on a sufficient increase in IPC for other workloads. With 33% extra execution ports and increased out-of-order execution buffers that shouldn't be too hard.
And thanks for the link to pipeline widening. I understand the P4 architecture much better now (and I remember going over this in my Comp. Arch. class).
You're welcome. I've only just recently discovered that myself, actually. Oh, and it's "width pipelining", not "pipeline widening". The latter actually refers to increasing the number of (micro-)instructions the overall CPU pipeline can handle. Width pipelining is done at the ALU level.
So, is the L1 cache ported 8 ways so that the CPU can sustain 8 single cycle ops?
It can probably access 8 cache banks per cycle, to efficiently support the gather operation, but that's not the same thing as having 8 ports. Sandy Bridge could access 6 banks per cycle...

Anyway, I think you're confused, so to avoid any more confusion let me quickly recap: Haswell has 8 execution ports, but only 4 of them are for arithmetic operations. So even if every instruction accesses memory, you need 4 cache ports tops, not 8. Haswell has two read ports and one write port. But a lot of instructions only use registers, not memory accesses. The register file has lots of ports, and the bypass network also supplies many operands that are the result of a recently executed instruction.

The reason the cache has many banks and few ports is that each bank can only service one memory operation per cycle. Trying to do more than one results in a bank conflict. The slides on Haswell claim it avoids all bank conflicts (for aligned data), and the cache line size remained the same, which means each 64-byte cache line must be split across 16 32-bit banks. This fits in with the gather support, if they increased the number of simultaneous bank accesses to 8. The reason Sandy Bridge could already access 6 64-bit banks was to efficiently support unaligned 128-bit accesses.
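To make the banking speculation concrete, here is a small sketch under the assumptions stated above (16 banks of 32 bits each, bank chosen by the low address bits); none of this is confirmed Haswell behavior. A gather of 8 elements would be conflict-free exactly when its addresses fall into 8 distinct banks.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS 16u                 /* assumed: 16 x 32-bit banks        */

static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr >> 2) % NUM_BANKS);   /* 4-byte granularity    */
}

int gather_is_conflict_free(const uintptr_t addr[8])
{
    unsigned used = 0;                /* bitmask of banks already touched  */
    for (int i = 0; i < 8; i++) {
        unsigned b = bank_of(addr[i]);
        if (used & (1u << b))
            return 0;                 /* two lanes hit the same bank       */
        used |= 1u << b;
    }
    return 1;
}

int main(void)
{
    uintptr_t stride4[8], same[8];
    for (int i = 0; i < 8; i++) {
        stride4[i] = 0x1000 + 4u * i;  /* consecutive 32-bit elements      */
        same[i]    = 0x1000 + 64u * i; /* same bank every 64 bytes         */
    }
    printf("unit-stride gather conflict-free: %d\n", gather_is_conflict_free(stride4));
    printf("64-byte-stride gather conflict-free: %d\n", gather_is_conflict_free(same));
    return 0;
}
```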
 

Nemesis 1

Lifer
Dec 30, 2006
"broadwell brings an end to cabinet pcs"

- the second the magic code fairy reveals the one true language to parallelize them all is the second it really becomes all about moar coars. The second after that, the big-ass power-venting cabinets will be back.
Wait for it ... wait for it ...

Ya, I was referring to PC cabinets, not server PCs. Hell, as good as Haswell sounds, the next shrink brings what exactly? You're correct, the servers are going to need smaller leak-proof cabinets in the future.

Wait for it . Wait for it.
 

Ajay

Lifer
Jan 8, 2001
It can probably access 8 cache banks per cycle, to efficiently support the gather operation, but that's not the same thing as having 8 ports. Sandy Bridge could access 6 banks per cycle...

Anyway, I think you're confused, so to avoid any more confusion let me quickly recap: Haswell has 8 execution ports, but only 4 of them are for arithmetic operations. So even if every instruction accesses memory, you need 4 cache ports tops, not 8. Haswell has two read ports and one write port. But a lot of instructions only use registers, not memory accesses. The register file has lots of ports, and the bypass network also supplies many operands that are the result of a recently executed instruction.

The reason the cache has many banks and few ports is that each bank can only service one memory operation per cycle. Trying to do more than one results in a bank conflict. The slides on Haswell claim it avoids all bank conflicts (for aligned data), and the cache line size remained the same, which means each 64-byte cache line must be split across 16 32-bit banks. This fits in with the gather support, if they increased the number of simultaneous bank accesses to 8. The reason Sandy Bridge could already access 6 64-bit banks was to efficiently support unaligned 128-bit accesses.

Thanks for that, very helpful :thumbsup:

Another question I have - would it be worth it to Intel to increase pipeline lengths a bit to maintain higher clocks (mainly for marketing) though with a higher branch misprediction penalty? Does Haswell improve on branch prediction or do anything to reduce the penalty? I read something about Haswell and branch prediction, but I can't find it (may have been speculation anyway).
 

Edrick

Golden Member
Feb 18, 2010
Even though I really like what Haswell brings to the table as far as compute goes, it also brings a heavy heart. I see Broadwell as the end of the big case sitting alongside your desk. Haswell really is a step forward, and sadly we as a community have to move with that step. It looks like that Dick Tracy watch is near. Along with new compute power comes new form. Haswell accelerates that new form factor change. Broadwell brings an end to cabinet PCs.

Want to bet on this?

I don't know what you have been smoking lately, but you have been making these grand assumptions on the forums with not much fact to back them up.

What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling. And we will be seeing 8-core desktop CPUs soon, which will require a well-vented PC case. Also, as 4K monitors become common, graphics cards will need to continue to grow in power, which in turn puts out a lot of heat. And as SSD prices keep coming down, PCIe SSDs will become commonplace in the next 2-3 years, bypassing the SATA bottleneck, which will require more case space.

So please think of the larger picture.
 

Cerb

Elite Member
Aug 26, 2000
What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling.
IB CPUs just have crappy TIM. That doesn't make them hotter, when it comes to cooling the enclosure (1 Watt = 1 Watt, though degC may vary). Indirect cooling of anything but the fastest CPUs out there has been doable for years, now, and can be done affordably, today, even without water. If you want to go all out, you could use a Scythe Ninja or TR HR-02 to cool most OCed CPUs, without even having a fan on the heatsink, and without many loud fans forcing lots of CFM through the case (well, depending on the case--in an SFF you might have to force things a bit more). It would take some care, especially with the cheaper Ninja, but it's doable. You've basically got to pull the kind of airflow control tricks that OEMs do.

An i3 would be no trouble to cool with only a quiet case fan, even in a cramped SFF case (say, 2-3x the volume of a nettop). You end up wanting or needing a regular computer case even if you have passive CPU cooling and a passive power supply, just because USB can't always do everything. If we had an external interface as cheap as USB, while being as performant as PCIe 4-8x (especially if it could be ganged across more cables), then we could start doing tiny cases, with whole-case HSFs (IE, like notebooks) and modular multi-box systems all over the place, as almost everything could be a universally changeable external peripheral.

The cooling issues we have are mostly matters of shaving pennies and of maintaining component interchangeability without loss of reliability, and less matters of real technical difficulty, except for the fastest and hottest CPUs. The Dick Tracy watch may be far off, but a nettop- or Mac Mini-sized powerful computer is not commonly available more because of a lack of sufficient demand than because it is too hard to do.

How many people want a computer smaller than a Shuttle type, with one or no PCIe slots, limited panel IO (there's just not room!), and then having to rely mainly on USB for every non-special-function device added after the first two SATAs? OK, now of those people, how many are willing to pay more than for a standard ATX setup in order to get less system flexibility? A small percentage of Apple PC customers, and anyone buying a notebook/tablet/etc.

If I'm paying more, I want faster PCIe (more lanes), more PCIe slots, faster SATA, more SATA ports (maybe even SAS), more maximum memory, lots of options to cool cards using those expansion slots, and lots of panel IO options (which includes being able to stuff spare expansion slot covers and drive bay covers with extra goodies, if I want).

You're right about Nemesis 1 being wrong about them going away, but I argue that cooling is a secondary, maybe even tertiary, problem with respect to standard large cases going away. Cooling is an annoying issue because big cases in which parts can be swapped for any other parts prevent part makers from making assumptions about cooling, and allow Intel and AMD to use small and cheap HSFs; that in turn means 3rd-party cooler makers and case designers must make fairly generic, flexible coolers and enclosures, which end up more robust and expensive than if they were made with a specific implementation in mind (a known mobo and PSU, for instance).
 

Nemesis 1

Lifer
Dec 30, 2006
Want to bet on this?

I don't know what you have been smoking lately, but you have been making these grand assumptions on the forums with not much fact to back them up.

What does shrinking a CPU process have to do with PC cases? They still run hot (think IB) and need proper cooling. And we will be seeing 8-core desktop CPUs soon, which will require a well-vented PC case. Also, as 4K monitors become common, graphics cards will need to continue to grow in power, which in turn puts out a lot of heat. And as SSD prices keep coming down, PCIe SSDs will become commonplace in the next 2-3 years, bypassing the SATA bottleneck, which will require more case space.

So please think of the larger picture.

Gee whiz, guy, I'm just going by what Intel said at IDF. What's inside the computer is going to be less known than the new form factors. Intel said this, along with the desktop becoming less important and small compute devices being the way forward. Intel's words, not mine. I read all the IDF stuff. If you didn't, don't hang it on me. Do your own reading (research).