
Future Microarchitectures


Keysplayr

Elite Member
Jan 16, 2003
21,214
50
91
Originally posted by: twjr
What about moving from binary to trinary (btw I have no idea if it's even possible, just want to throw it out there)?
What would that be? Ones, Zeros and Quarks? :D
Sorry. Please go on guys.
 

Nemesis 1

Lifer
Dec 30, 2006
11,379
0
0
Look at Intel's Haswell for 2012. The info is lacking, but clearly Haswell is a CPU+GPU, with Sandy Bridge and Ivy Bridge converting x86 to whatever.
 

twjr

Senior member
Jul 5, 2006
619
201
116
Originally posted by: Keysplayr
Originally posted by: twjr
What about moving from binary to trinary (btw I have no idea if it's even possible, just want to throw it out there)?
What would that be? Ones, Zeros and Quarks? :D
Sorry. Please go on guys.
If you look at the Wiki link IDC gave me for ternary computing, the way it was presented there, the digits were -1, 0 and +1. But like I said, I really have no idea what I'm talking about. I was really throwing it out there to see if anyone could comment on it, so as to expand my knowledge.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
OK, let's provide some more concrete explanation of the schematic. This will have to be done over a number of installments; I'll put one up whenever I have some spare time. Since I answered some questions at Ars Technica about potential similarities between this and Netburst, particularly in terms of the replay mechanism, I should probably cite that post first:

Originally posted by c2418 of arstechnica:
You probably shouldn't care about the trolls and their FUD. I thought that AMD had better things to do than employ someone to search around the internet and try to intimidate people, but obviously they follow in the RIAA's footsteps. Good for them; I hope they choke on their secrets.

Anyway, I'm also interested in a more compact design; this one is rather confusing. But it seems kinda Netbursty to me, like hat monster said. However, Netburst had a lot less parallelism IIRC.

What's the pipeline length? Also, what do you mean by "FP scoreboard"? Is it the classic scoreboarding (the old one), and shouldn't that be obsolete with speculation and renaming? I can't see from the image whether it's using any speculation at all.

So, while we're at it (ISA discussion, that is), do you arstechnicans know anything about the new Cortex? They say it's "out of order", but do they mean dynamic scheduling or speculation? Is dynamic scheduling without speculation any good?

EDIT: or if you don't want questions about ARM in discussion about your architecture, I can move it elsewhere.
I really appreciate your sentiment. I never thought that these posters were actually from AMD or anything, and certainly not anyone who knows the details of the actual BD design (they would certainly know better than to equate it to my design here; and they are probably having a pretty good belly laugh if they see that someone is actually trying to equate the two).

I think you and Hat are onto something; it is Netburst-like in one single respect. Netburst, for better or worse, did implement a very primitive form of slicing mechanism, but one that relies on a fixed time expectation tuned to the typical L1-miss case. I had brief discussions in the forum last year, here:
http://episteme.arstechnica.co...009259831#328009259831
and also here:
http://episteme.arstechnica.co...001359831#366001359831
The original idea is not bad (IMO), but the way it's implemented and the compromises that had to be made due to a number of other features that either compete for on-die resources or have some type of conflicts with it doomed it to failure from the beginning.

This mechanism here, although similar to that of Netburst in the broadest sense, is much more elaborate; it is designed to deal with several types of latency expectations, and with numerous contingencies on top of that. I will get into much more detail when I have time for a serious write-up for this forum.

The FP scoreboard there simply tracks the completed instructions' renamed registers and the consequent RAW hazards associated with those physical destination registers. As you can see, the pipeline's scheduling mechanisms are split at a very early stage in the core, so the FP pipe never sees the direct execution of the branch instructions; it only receives cues about the legality of retiring certain sets of FP instructions at certain cycles, determined by what happens in the integer pipeline. This is done by having the integer ICU set a series of PC limits for the FP pipeline, which appears in the schematic as the "ready horizon"; it determines the time horizon of when certain FP instructions that affect the consistent memory state (such as FSTORE) are permitted to retire from the F-ROB.
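As a rough illustration of the retirement cue (my own Python sketch, not part of the schematic; the `pc` field and entry format are made up), the "ready horizon" could gate in-order F-ROB retirement like this:

```python
# Toy model of the "ready horizon": the integer ICU publishes a PC limit,
# and FP instructions that touch memory state (e.g. FSTORE) may retire
# from the F-ROB only once their PC falls at or below that horizon.
def retire_from_frob(f_rob, ready_horizon):
    """Pop and return the in-order F-ROB entries cleared by the horizon."""
    retired = []
    while f_rob and f_rob[0]["pc"] <= ready_horizon:
        retired.append(f_rob.pop(0))
    return retired

f_rob = [{"op": "FADD", "pc": 0x10},
         {"op": "FSTORE", "pc": 0x14},
         {"op": "FMUL", "pc": 0x20}]
# With the horizon at 0x18, the FADD and FSTORE retire; the FMUL waits.
retired = retire_from_frob(f_rob, 0x18)
```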

I think Hat gave you very good answers for the rest of your questions already.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Long Latency Operations Handling and Reduction in Mem Stall Time

And here's the actual explanation of the mechanism:

The target of the Netburst replay mechanism is predictable (or so it was thought) L1 misses; the design assumed a predominant latency for that type of miss. This resulted in a rigid replay system that relied on a highly predictable latency for the average case (L1 miss but L2 hit) in what was assumed, at the time, to be the vast majority of workloads; that turned out to be one of the really flawed assumptions that ultimately sank the Netburst line. The mechanism here is theoretically similar, but with a nearly diametrically opposite set of assumptions and implementations. Here the long-latency mechanism is designed to deal primarily with long-latency L2 misses, and is implemented with a much more flexible mechanism that can handle the large variations in latency that might be experienced when accessing shared cache/SRAM.

The design has two levels of dependency structure mapped onto the instruction stream: one at the instruction level, as in any Tomasulo-style speculative OoO architecture, but also a coarser-grained level of dependency among slices. Two of the big topics, slice initiation (choosing heads of dependency trees within the register dependency graph of the instruction stream) and slice elongation/growth (separating the instruction stream into slices based on dependencies in relation to seed instructions), I will deal with at a later time. For now, it suffices to say that the slices are formed such that the large majority of register-to-register dependencies are captured within individual slices; in other words, the number of true dependencies that flow from an architectural live-out register of any logically earlier slice to an architectural live-in of a logically later slice is minimized. The long-latency reordering mechanism takes advantage of this coarse-grained organization of the instruction stream, which provides natural places for restarting parts of the execution stream that must wait on some long-latency (primarily memory) operation.
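As a rough illustration of the slicing idea (my own sketch with a made-up instruction encoding; the actual initiation/elongation algorithms are described later in the thread), a greedy pass that grows slices along register dependencies might look like:

```python
# Each instruction is (dest_reg, src_regs). An instruction joins the slice
# of the most recent producer of one of its sources; with no in-flight
# producer it becomes the seed of a new slice. Cross-slice source operands
# that remain are the minimized live-in/live-out dependencies.
def form_slices(instrs):
    slices = []          # list of lists of instruction indices
    producer_slice = {}  # register -> index of slice holding its last writer
    for i, (dest, srcs) in enumerate(instrs):
        home = next((producer_slice[s] for s in srcs if s in producer_slice),
                    None)
        if home is None:
            home = len(slices)
            slices.append([])        # instruction i seeds this new slice
        slices[home].append(i)
        producer_slice[dest] = home
    return slices
```

For example, two independent dependency chains end up in two slices, with only the final combining instruction creating a cross-slice (exposed) read.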

http://farm4.static.flickr.com...50370_378244110f_o.jpg

The upper-right (top-level) block of the schematic is the main mechanism for dealing with long-latency operations (I will call this the LL-block as shorthand from this point on). In the steering mechanism of the pipeline, each cluster has two associated steering buffers; one of these is alternately filled by the steering mechanism with a subset of the instruction stream while the other is drained for execution. As the instruction slices drain into the integer clusters from one of the steering buffers (one instruction at a time), each slice is simultaneously copied to one of the IQs in the LL-mechanism (which we will call the reschedule-IQ from this point on). There is one such reschedule-IQ for each of the clusters; so in a sense, while the slice is being sent to the cluster scheduler for execution, it is already being speculatively rescheduled for another round of execution in case the re-execution trigger event occurs. The trigger event is described as an L2 miss (it can also include any other event with LL potential, but I will focus on L2 misses here), and this event is ensured to be detected as early as possible during slice execution in each of the integer clusters (which handle all loads, including FP and SIMD loads), partially with the aid of fast-pathed loads: loads whose memory address does not depend on the computation of any preceding instruction in the slice. This fast-path mechanism is another extensive discussion for a later time.

If no L2 miss is detected during slice execution, the corresponding entries in the appropriate reschedule-IQ are designated done, and will not enter the LL-block's long-term buffer. If an L2 miss is detected during slice execution, the corresponding entries in the reschedule-IQ are set to enter the long-term buffer in the LL-block. When the instructions enter the long-term buffer, they are marked with the PC of the seed instruction of the logically preceding slice that corresponds to the architectural register of one of the current instruction's source operands; in other words, these are potentially register values passed from previous epochs (an epoch is the logical time period corresponding to each record of the architectural register state), if the current instruction is determined to be an exposed read. The other source operand is necessarily dependent on the seed instruction of the current slice, which will become apparent once slice elongation is thoroughly explained. Each slice in the buffer is also marked with the LL-operation that caused its residence in the buffer, and will be cleared for return to the main pipeline once the L2 miss returns the requested data.
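The done/long-term-buffer decision can be sketched as follows (an illustrative Python model with invented names, not the actual hardware):

```python
from collections import deque

# A copy of each dispatched slice waits in the reschedule-IQ; if the slice
# completes with no L2 miss the copy is discarded, otherwise it is parked
# in the long-term buffer, tagged with the miss that caused it, until the
# requested data returns.
def settle_slice(slice_insns, l2_missed, long_term_buffer, miss_tag=None):
    reschedule_iq = deque(slice_insns)   # speculative copy made at dispatch
    if not l2_missed:
        reschedule_iq.clear()            # designated done; never buffered
        return False
    long_term_buffer.append((miss_tag, list(reschedule_iq)))
    return True
```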

Once the slice is cleared for refilling one of the cluster schedulers, it first traverses a filter for previous-epoch architectural live-outs, which is basically a partial-lookup CAM structure along with some logic that determines the register reads which need to be replaced by values already in the epoch ISA registers. So at the end of this live-out filter stage, any instruction that contains an exposed read has its register payload replaced with the seed PC of the logically preceding slice where that architectural register was logically last written, and a status bit is set on each instruction that is determined to be an exposed read.

Now, we have to mention another essential structure tied to this process: the result shift registers (RSRs) for the ISA, with one shift register for each architecturally defined register. Each of these RSRs contains the architectural states at the end of each epoch; in other words, for any RSR (each corresponds to an architectural register), each epoch entry contains either the value shifted from the previous entry (corresponding to the execution of a single slice), or an updated value from the destination physical register of some instruction that is mapped to that architectural register according to the cluster RAT. And since the updates of architectural state are done as the slice executes, the updated value in the RSR must also be logically the last write to the architectural register for the corresponding epoch.
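A minimal Python model of the per-register RSRs (illustrative only; the depth and interface are my own assumptions):

```python
# One RSR per architectural register; each epoch entry holds that
# register's architectural state at the end of the epoch. Index 0 is the
# current epoch; entries shift at each epoch boundary, carrying the value
# forward unless a slice's logically last write overwrote it.
class RSRFile:
    def __init__(self, regs, depth=4):
        self.rsr = {r: [None] * depth for r in regs}

    def write(self, reg, value):
        # Logically last write to `reg` within the current epoch's slice.
        self.rsr[reg][0] = value

    def end_epoch(self):
        # Shift: the current state becomes the previous epoch's record.
        for r, entries in self.rsr.items():
            self.rsr[r] = [entries[0]] + entries[:-1]

    def read(self, reg, epochs_back):
        # Architectural live-out of `reg` from `epochs_back` epochs ago.
        return self.rsr[reg][epochs_back]
```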

So once the slice has passed through the filter with the relevant register payloads replaced, the architectural RSRs of the relevant epochs are read, and the instructions that have exposed reads of previous-epoch architectural live-outs are remapped to instructions containing the value from the relevant RSR state. In essence, each exposed live-in instruction replaces a register operand with an immediate operand. The original renamed RAT of the slice is updated so that the corresponding physical registers of the architectural live-ins are freed up, and can be reused as renamed registers for future epochs. At this point the remapped slice enters the refill instruction buffer, waiting for the cluster resource to become available to restart execution. This process would be controlled by the integer ICU, and is dependent on the steering mechanism's resource monitor.
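The remap step might be sketched like this (illustrative; the instruction fields and the `rsr_read` callback are my own invention):

```python
# Final stage of the live-out filter: each instruction flagged as an
# exposed read has its stale register operand replaced with the value
# read from the relevant RSR epoch entry, turning it into an immediate.
def remap_exposed_reads(slice_insns, rsr_read):
    remapped = []
    for insn in slice_insns:
        if insn.get("exposed_read"):
            value = rsr_read(insn["src_reg"], insn["epoch"])
            insn = {k: v for k, v in insn.items() if k != "src_reg"}
            insn["imm"] = value      # register operand -> immediate operand
        remapped.append(insn)
    return remapped
```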

So essentially, if this works ideally in concert with the slicing and steering mechanisms, it eliminates much of the stall time associated with L2 misses, which usually cannot be hidden through reordering in a conventional OoO core. I'm sure some things are still unclear; please feel free to ask, and I'll provide whatever answers I can.


 

DrMrLordX

Lifer
Apr 27, 2000
16,645
5,655
136
Originally posted by: twjr
Originally posted by: Keysplayr
Originally posted by: twjr
What about moving from binary to trinary (btw I have no idea if it's even possible, just want to throw it out there)?
What would that be? Ones, Zeros and Quarks? :D
Sorry. Please go on guys.
If you look at the Wiki link IDC gave me for ternary computing, the way it was presented there, the digits were -1, 0 and +1. But like I said, I really have no idea what I'm talking about. I was really throwing it out there to see if anyone could comment on it, so as to expand my knowledge.
Ternary is supposed to be a more efficient way of representing values than binary. Also, in some works of fiction, ternary computers are supposed to be capable of sentience due to their ability to handle "fuzzy logic" (negative no). Anyone familiar with the old White Wolf PnP RPG Mage: The Ascension would remember that the Virtual Adepts made use of ternary (mislabeled "trinary" in the game texts) computers as magickal foci.
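For reference, the -1/0/+1 scheme mentioned above is balanced ternary. A quick Python sketch of the encoding (my own illustration, not from any poster):

```python
def to_balanced_ternary(n):
    """Encode an integer as balanced-ternary trits (-1, 0, +1),
    least significant first."""
    if n == 0:
        return [0]
    trits = []
    while n != 0:
        r = n % 3          # 0, 1, or 2
        if r == 2:         # a 2 is written as -1 with a carry into the next trit
            r = -1
            n += 1
        trits.append(r)
        n //= 3
    return trits

def from_balanced_ternary(trits):
    """Decode least-significant-first balanced-ternary trits to an integer."""
    return sum(t * 3**i for i, t in enumerate(trits))
```

One nicety of the scheme: negation is just flipping every trit, so no separate sign bit or two's-complement convention is needed.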
 

Idontcare

Elite Member
Oct 10, 1999
21,127
55
81
Originally posted by: Idontcare
Originally posted by: Ben90
i was reading a long time ago that pretty soon we aren't going to be able to shrink the xtors anymore, and after that the only way to increase performance will be through a more efficient architecture or through physically increasing die size; obviously both of these have their limitations as well, such as the speed of electricity. so i was wondering if there is any talk of development of stuff like 3D cores or multi-layered cores, and how possible/probable/realistic it is for something like that to happen

obviously this won't be looked into seriously for at least a decade or more, but it seems we are hitting a physical wall and pretty soon we might need a fundamental change in how things are made. just wanna hear your take on it because you know a lot more than i do lol
Intel and Samsung have both said shrinking down to the 5nm region is possible, but regardless of what the actual number for the shrink limit is, there will come a day when we hit it. So then what?

The 3D architecture model is intriguing, and doable. Early commercial implementations will be most feasible on simple designs of course, namely memory.

Elpida Develops 3-D Stacked 8-Gbit DRAM

That is an early example of a functional method of implementing 3D ICs with TSVs (through-silicon vias). An even earlier example exists in the flash world, where chips were stacked (but separately wire-bonded) within the package.

So it is happening; the fabrication techniques are being developed, optimized, and improved upon every year. For high-performance logic CMOS, the design tools need to be much more mature to make it a practical second choice versus traditional 2D design. Many issues still need to be addressed by the validation software, from thermals to electrical cross-talk, power distribution, etc.

But it is feasible and will become more and more practical over time as the field matures.
A follow-up to this line of thought was recently published:

The 4 Horsemen of 3D IC

Two areas in particular are on the critical path for volume production of 3D TSV integrated circuits: Design for Manufacturing (DFM) and Design for Test (DFT). She offered the following 3D EDA roadmap: by 2010/11, architectural evaluation engines will need to be available to break out of incremental, evolutionary growth. At some point in the near future, and hopefully before 2013, there will be standards in place for test and 3D IP component compatibility.

http://www.semiconductor.net/b..._Horsemen_of_3D_IC.php
Hard Ball - if you find my continuing to entertain this side-topic within your thread to be more nuisance than supportive of the thread then please just let me know (pm works) and I'll start a new thread instead of cluttering yours. ;)
 

aigomorla

Cases and Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
18,291
986
126
Originally posted by: magreen
I'd bet there's < 5 registered members of AT who can read that schematic.
And their names are:
IDC
TCHo
Rubycon
Dmens
Jabberx ?? i think that's her name...


LOL...
 

plonk420

Senior member
Feb 6, 2004
324
16
81
it's a pity that this couldn't be implemented in a FOSS CPU design... (assuming the similarities to AMD CPUs are correct). sounds like people in the know are decently intrigued by it...
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: Idontcare

Hard Ball - if you find my continuing to entertain this side-topic within your thread to be more nuisance than supportive of the thread then please just let me know (pm works) and I'll start a new thread instead of cluttering yours. ;)
Not a problem; future microarchitectural designs are certainly strongly tied to process tech, so your contributions are entirely appropriate here.

 
