Long Latency Operations Handling and Reduction in Mem Stall Time
And here's the actual explanation of the mechanism:
The target of the Netburst replay mechanism was the (supposedly) predictable L1 miss, which baked in an assumption about the predominant latency of that miss type. This resulted in a rigid replay system that relied on a highly predictable latency for the average case (L1 miss but L2 hit), assumed at the time to cover the vast majority of workloads; that turned out to be one of the genuinely flawed assumptions that ultimately sank the Netburst line. The mechanism here is theoretically similar, but with a nearly diametrically opposite set of assumptions and implementations. Here the long-latency mechanism is designed to deal primarily with long-latency L2 misses, and it is implemented with a much more flexible scheme that can tolerate the large variations in latency experienced when accessing shared cache/SRAM.
The design maps two levels of dependency structure onto the instruction stream: one at the instruction level, as in any Tomasulo-style speculative OoO architecture, and a coarser-grained level of dependency among slices. Two of the big topics, slice initiation (choosing heads of dependency trees within the register dependency graph of the instruction stream) and slice elongation/growth (partitioning the instruction stream into slices based on dependencies relative to the seed instructions), I will deal with at a later time. For now it suffices to say that the slices are formed such that the large majority of register-to-register dependencies are captured within individual slices; in other words, the number of true dependencies that flow from an architectural live-out register of any logically earlier slice to an architectural live-in of a logically later slice is minimized. The long-latency reordering mechanism takes advantage of this coarse-grained organization of the instruction stream, which provides natural places for restarting the parts of the execution stream that must wait on some long-latency (primarily memory) operation.
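To make the slice idea concrete, here is a toy model of slice formation that tries to keep register-to-register dependencies inside a slice. The instruction encoding, the greedy assignment policy, and the cross-slice dependency count are all my own illustration, not the actual slicing hardware (which I'll cover later):

```python
# Hypothetical toy model of slice formation: group instructions so that
# register-to-register dependencies mostly stay within one slice.
# The tuple format (pc, dest_reg, src_regs) and the greedy policy are
# illustrative assumptions, not the real initiation/elongation algorithm.

def form_slices(instrs):
    """instrs: list of (pc, dest_reg, src_regs).
    Returns (slices, cross_deps): slices as lists of PCs, plus a count
    of true dependencies that escaped across a slice boundary."""
    writer_slice = {}   # architectural reg -> slice index of its last writer
    slices = []
    cross_deps = 0
    for pc, dest, srcs in instrs:
        # Prefer the slice that produced one of our sources ("elongation");
        # otherwise this instruction seeds a new slice.
        home = None
        for s in srcs:
            if s in writer_slice:
                if home is None:
                    home = writer_slice[s]
                elif writer_slice[s] != home:
                    cross_deps += 1   # dependency crosses a slice boundary
        if home is None:
            home = len(slices)
            slices.append([])
        slices[home].append(pc)
        writer_slice[dest] = home
    return slices, cross_deps
```

Two independent dependency chains end up in two separate slices; an instruction that reads from both chains is the kind of cross-slice live-in the real mechanism tries to minimize.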
http://farm4.static.flickr.com...50370_378244110f_o.jpg
The upper-right (top-level) block of the schematic is the main mechanism for dealing with long-latency operations (I will call it the LL-block as shorthand from this point on). In the steering mechanism of the pipeline, each cluster has two associated steering buffers; one is alternately being filled by the steering mechanism with a subset of the instruction stream while the other is being drained for execution. As an instruction slice drains from one of the steering buffers into the integer cluster (one instruction at a time), the slice is simultaneously copied into one of the IQs in the LL-mechanism (which we will call the reschedule IQ from this point on). There is one such reschedule IQ for each of the clusters, so in a sense, while the slice is being sent to the cluster scheduler for execution, it is already being speculatively rescheduled for another round of execution in case the re-execution trigger event occurs. The trigger event is described as an L2 miss (it can also include any other event with LL potential, but I will focus on L2 misses here), and this event is guaranteed to be detected as early as possible during slice execution in each of the integer clusters (which handle all loads, including FP and SIMD loads), partly with the aid of fast-pathed loads, which are loads whose memory address does not depend on the computation of any preceding instruction in the slice. The fast-path mechanism is another extensive discussion for a later time.
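The dual-buffer steering and the simultaneous copy into the reschedule IQ can be sketched like this; the class and function names are made up for illustration, and a real implementation would of course be wired logic, not sequential code:

```python
# Toy sketch of the steering path: as a slice drains from a steering
# buffer into the cluster scheduler, the same instructions are mirrored
# into that cluster's reschedule IQ, pre-staging a possible replay.
# All names here are illustrative assumptions.

class Cluster:
    def __init__(self):
        self.scheduler = []        # instructions dispatched for execution
        self.reschedule_iq = []    # speculative copy kept by the LL-block

def drain_slice(cluster, steering_buffer):
    """Drain one steering buffer into a cluster, one instruction at a
    time, mirroring each instruction into the reschedule IQ."""
    while steering_buffer:
        instr = steering_buffer.pop(0)
        cluster.scheduler.append(instr)       # sent for execution
        cluster.reschedule_iq.append(instr)   # copy for possible re-execution
```

The point of the mirror copy is that if the trigger event fires, the slice is already sitting in the LL-block and does not have to be re-fetched or re-steered.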
If no L2 miss is detected during slice execution, the corresponding entries in the appropriate reschedule IQ are designated done and never enter the LL-block's long-term buffer. If an L2 miss is detected during slice execution, the corresponding entries in the reschedule IQ are set to enter the long-term buffer. When the instructions enter the long-term buffer, each is marked with the PC of the seed instruction of the logically preceding slice that corresponds to the architectural register of one of its source operands; in other words, these are potentially register values passed from previous epochs (an epoch being the logical time period corresponding to each record of the architectural register state), should the current instruction be determined to be an exposed read. The other source operand is necessarily dependent on the seed instruction of the current slice, which will become apparent once slice elongation is thoroughly explained. Each slice in the buffer is also marked with the LL-operation that caused its residence there, and is cleared for return to the main pipeline once the L2 miss returns the requested data.
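The done-vs-park decision on the reschedule IQ might look like the following; the dictionary fields and the `miss_id` tag are purely my own placeholders for whatever the hardware actually records:

```python
# Hypothetical handling of the trigger event: with no L2 miss the
# reschedule-IQ entries are designated done and discarded; on an L2 miss
# the slice is parked in the long-term buffer, tagged with the miss that
# caused its residence, to be cleared when the data returns.
# Field names are illustrative assumptions.

def retire_or_park(reschedule_iq, l2_miss, long_term_buffer, miss_id=None):
    if not l2_miss:
        reschedule_iq.clear()                 # entries designated done
        return
    long_term_buffer.append({
        'instrs': list(reschedule_iq),        # slice awaiting replay
        'blocking_miss': miss_id,             # cleared when the data returns
    })
    reschedule_iq.clear()
```

So the common, miss-free case costs nothing beyond the mirror copy; only a missing slice consumes long-term buffer space.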
Once the slice is cleared for refilling one of the cluster schedulers, it first traverses a filter for previous-epoch architectural live-outs, which is basically a partial-lookup CAM structure along with some logic that determines which register reads need to be replaced by values already held in the epoch ISA registers. So at the end of this live-out filter stage, any instruction that contains an exposed read has its register payload replaced with the seed PC of the logically preceding slice in which that architectural register was logically last written. A status bit is also set with each instruction that is determined to be an exposed read.
Now we have to mention another essential structure tied to this process: the result shift registers (RSRs) for the ISA, with one shift register for each architecturally defined register. Each RSR contains the architectural state at the end of each epoch; in other words, for any RSR (each corresponding to one architectural register), each epoch entry contains either the value shifted from the previous entry (corresponding to the execution of a single slice) or an updated value from the destination physical register of some instruction mapped to that architectural register according to the cluster RAT. And since the architectural state is updated as the slice executes, the updated value in the RSR must also logically be the last write to the architectural register for the corresponding epoch.
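A behavioral model of one such RSR is short; the depth and the interface are my assumptions, chosen only to show the shift-or-update behavior at each epoch boundary:

```python
# Toy model of one result shift register (RSR), one per architectural
# register: at each epoch boundary it either shifts the previous epoch's
# value forward (the slice did not write this register) or records the
# slice's logically last write. Depth 4 is an arbitrary illustration.

class RSR:
    def __init__(self, depth=4):
        self.entries = [None] * depth   # entries[0] = most recent epoch

    def end_epoch(self, last_write=None):
        """Close an epoch: install the slice's last write to this
        architectural register, or propagate the old value if unwritten."""
        newest = last_write if last_write is not None else self.entries[0]
        self.entries = [newest] + self.entries[:-1]

    def read_epoch(self, epochs_back):
        """Architectural value as of `epochs_back` epochs ago."""
        return self.entries[epochs_back]
```

An epoch in which the register is untouched simply carries the old value forward, so a later exposed read always finds the logically last write at the epoch it indexes.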
So once the slice has passed through the filter with the relevant register payloads replaced, the architectural RSRs of the relevant epochs are read, and the instructions that have an exposed read of a previous epoch's architectural live-out are remapped to instructions containing the value from the relevant RSR state. In essence, each exposed live-in instruction replaces a register operand with an immediate operand. The original renamed RAT of the slice is updated so that the physical registers corresponding to the architectural live-ins are freed up and can be reused as renamed registers for future epochs. The remapped slice then enters the refill instruction buffer to wait for cluster resources to become available to restart execution. This process would be controlled by the integer ICU, and depends on the steering mechanism's resource monitor.
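The remap step (register operand becomes immediate, physical register freed back to the pool) could be modeled like this; the dict-based instruction encoding and the RAT/free-list shapes are made-up stand-ins for the real structures:

```python
# Sketch of the remap step: an instruction flagged as an exposed read has
# its register source replaced by the concrete value read from the
# relevant epoch's RSR entry, i.e. it becomes an immediate operand, and
# the physical register mapped to that live-in is returned to the free
# list. The instruction format here is an illustrative assumption.

def remap_exposed_reads(slice_instrs, rsr_values, rat, free_list):
    """slice_instrs: list of dicts with 'src' and 'exposed' fields.
    rsr_values: architectural reg -> value at the relevant epoch."""
    for instr in slice_instrs:
        if instr['exposed']:
            arch = instr['src']
            instr['imm'] = rsr_values[arch]   # register read -> immediate
            instr['src'] = None
            phys = rat.pop(arch, None)        # unmap the architectural live-in
            if phys is not None:
                free_list.append(phys)        # physical reg reusable next epoch
    return slice_instrs
```

After this pass the slice carries no dependence on earlier epochs' register state, which is what lets it refill a cluster scheduler and restart cleanly.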
So essentially, if this works ideally in concert with the slicing and steering mechanisms, it eliminates much of the stall time associated with L2 misses, which usually cannot be hidden through reordering in a conventional OoO core. I'm sure some things are still unclear; please feel free to ask, and I'll provide whatever answers I can.