Hard Ball
Senior member
- Jul 3, 2005
- 594
- 0
- 0
Originally posted by: Nemesis 1
Originally posted by: Hard Ball
Originally posted by: IntelUser2000
Originally posted by: soccerballtux
What's the big deal about this? How is it going to function being such a different chip from ATI/Nvidia's monolithic approach? Isn't this just a bunch of Atom processors stuck together?
It's much different. Atom, by comparison, has a heftier branch predictor, better integer units, and a data prefetcher, optimized for single-thread performance and de-emphasizing FP.
In Larrabee, to make the cores smaller, those things are much simplified, but they beefed up the FP units a lot, which will be critical for graphics processing. Each Larrabee core will have 4x the capability of the FP/SSE units in Core 2.
The Atom core alone is too big to have enough cores and be a competitive graphics core anyway.
Agree with most of what you said, except this:
Atom in comparison has a heftier branch prediction
I am sorry but you guys lost me. I have a Larrabee diagram, and that diagram does NOT show a branch prediction unit on Larrabee at all, only decoders. As has already been stated, Intel can do this in software through the compiler. Here's the diagram. Notice there is NO branch prediction unit, unlike on other Intel core diagrams.
http://www.eweek.com/c/a/IT-In...side-Intel-Larrabee/4/
The VPU pipeline in Larrabee uses predication to reduce the penalty of branch misprediction: mask registers record the polarity of the branches, so branch prediction matters only when the bits across the relevant mask register have uniform polarity, which is a small percentage of the time under most workloads. So it's not really comparable to the branch prediction unit of a mostly scalar processor. It takes far less area to implement, and it accomplishes something quite different from the large BTB + BHT + loop predictor + RAS of a standard x86 microarchitecture.
I'm not sure what is really confusing here. Almost all microprocessors with a pipeline of significant depth have some type of branch prediction; otherwise the only recourse would be to delay the fetch of instructions after a branch until the branch unit resolves it. Larrabee just does not have the robust branch prediction that most current microprocessors have, such as the multilevel correlating BHTs and tournament predictors you see on some of the more complex designs today.
The real trick with Larrabee is that control flow of the execution trace can actually be routed through the vector pipeline, by using predication in the vector units with mask registers, which are basically bits that dictate the destination registers/mem addresses of each of the lanes of a vector instruction. The VPU in Larrabee is essentially a vec16 ALU. In the cases where run-time control flow dictates that some lanes of a vector instruction take a different polarity from others, both the target and fall-through instructions are executed, with the results of each written to the appropriate destination as specified by the mask register; when the mask register contains only uniform control flow, only the appropriate branch is executed. In the vector-heavy software that Larrabee is envisioned to excel at, the front-end BP has very little effect on the efficiency of the overall execution; the predication scheme handles most of the heavy lifting.
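To make the mask-register idea concrete, here is a minimal C sketch of predicated vec16 execution. It is not Larrabee's actual ISA; the names (`predicated_select`, `taken`, `fallthru`) are made up for illustration. Both sides of the branch are computed across all 16 lanes, and a 16-bit mask picks which result each lane keeps; the uniform-mask check shows the shortcut case where only one side would need to run.

```c
#include <stdint.h>

#define LANES 16

/* Illustrative only: simulate predicated execution of
   "out[i] = mask[i] ? a[i]+b[i] : a[i]-b[i]" across a vec16 unit.
   Bit i of 'mask' plays the role of lane i of a Larrabee mask register. */
void predicated_select(const int32_t *a, const int32_t *b,
                       int32_t *out, uint16_t mask)
{
    int32_t taken[LANES], fallthru[LANES];

    /* Execute BOTH sides of the branch for every lane. */
    for (int i = 0; i < LANES; i++) {
        taken[i]    = a[i] + b[i];   /* "if" side   */
        fallthru[i] = a[i] - b[i];   /* "else" side */
    }

    /* The mask register dictates which result each lane keeps. */
    for (int i = 0; i < LANES; i++)
        out[i] = ((mask >> i) & 1) ? taken[i] : fallthru[i];
}

/* Uniform polarity across the mask: the case where only one side
   would actually need to execute (and where prediction could help). */
int mask_is_uniform(uint16_t mask)
{
    return mask == 0x0000 || mask == 0xFFFF;
}
```

The point of the sketch is the cost model: divergent control flow costs "both sides executed," not a pipeline flush, which is why a heavyweight front-end predictor buys so little here.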
The cache hierarchy and cache control of Larrabee are also quite different from a conventional x86. In particular, some explicit cache-control instruction modes (in addition to explicit prefetch) are provided, such as modes that can mark lines for early eviction, and even some explicit control of the coherence scheme (actually allowing some lines to be explicitly invalidated, to support a scratchpad mode for regions of the cache).
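The closest you get to this on mainstream x86 today is the standard prefetch-hint and cache-flush instructions, so here is a hedged C sketch using those as rough analogues. To be clear, `PREFETCHNTA` (a non-temporal hint that discourages the line from polluting the cache) and `CLFLUSH` (explicit line eviction) are much weaker than the per-line eviction-hint and invalidation modes described for Larrabee; this only shows the general shape of software-directed cache control. The function name and loop structure are my own illustration.

```c
#include <stddef.h>
#include <immintrin.h>  /* _mm_prefetch, _mm_clflush (SSE/SSE2) */

/* Illustrative only: copy a buffer while telling the cache hierarchy
   that neither the source nor the destination lines will be reused.
   64 is the x86 cache line size in bytes. */
void stream_copy_hint(const char *src, char *dst, size_t n)
{
    /* Hint: fetch source lines non-temporally (mark for early eviction). */
    for (size_t i = 0; i < n; i += 64)
        _mm_prefetch(src + i, _MM_HINT_NTA);

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];

    /* Explicitly flush the written lines out of the cache. */
    for (size_t i = 0; i < n; i += 64)
        _mm_clflush(dst + i);
}
```

Larrabee went further by exposing such control per line and per cache region, which is what makes the scratchpad-style usage possible.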
