Future Microarchitectures

Hard Ball

Senior member
Jul 3, 2005
594
0
0
I had a little time on my hands between busy spells, so I drew up a rough design schematic for a future microarchitecture, which I originally posted on Anandtech a few days ago. Hopefully, I can spur some good, in-depth, and substantive conversations about microarchitectural trends in general purpose ICs over the next period of time. Although it would be most beneficial and interesting to those who actively study and do research in the area (like myself) or make a living in this field, I'm sure many self-trained computer architecture enthusiasts have plenty to contribute to a meaningful discussion as well.

This is a conceptual design that incorporates a number of likely trends in future commercial microarchitectures in the next number of years. It is actually roughly based on a commercial microarchitecture that might be coming on market in the near to medium term (depends on what you consider "near"); but I have altered, omitted, and replaced a number of architectural elements and mechanisms so that nothing useful in terms of the original design or specifications could be deciphered, while still ending up with a functional conceptual design. The really important things are useful concepts and trends in designs anyway.

http://farm3.static.flickr.com...71934_3c430e8bf3_b.jpg

Or here is a link to a larger version:

http://news.webshots.com/photo...4584010090923268CSKvKF

This design is emblematic of some classes of new features that might be heavily incorporated for increased ILP and lower thermal consumption / instruction throughput. These are design concepts that take into account: how best to balance the physical RF latency vs. size vs. #ports trade-off; how to deal with long latency instructions and their dependents, especially when it comes to misses beyond the private cache of the core; dependence steering of instructions based on register dependency graphs; the related concept of clustered FUs, as well as companion techniques to clustering such as banked way predictions and independent LSQs; and a number of other trends.

I'll be glad to provide any answers to general and specific questions provided that they are within ethical bounds, which means that there are things that I will have to be silent on. But hopefully there will be interest, and this may be a starting point for some substantive discussion involving many people here.

If people have trouble seeing the overall schematic, here are higher res quadrant views:

http://farm3.static.flickr.com...48484_6dcc4ed75b_b.jpg

http://farm3.static.flickr.com...51718_a0fe9349ab_b.jpg

http://farm3.static.flickr.com...57575_438c79acf0_b.jpg

http://farm3.static.flickr.com...61925_168baa3a42_b.jpg
 

faxon

Platinum Member
May 23, 2008
2,109
1
81
my eyes sort of glazed over when i looked at those lol. are you sure this shouldnt have been posted in HT? :D
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
I'd bet there's < 5 registered members of AT who can read that schematic.
And their names are:
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
What do you do for a living? There is quite a large amount of information there but it lacks enough organization to clarify its details.

What is x2? What is x3?

What is the separation of the blocks (FADD/FMOV/SIMD vs FMUL/FMOV/SIMD vs FSTORE/FMOV) supposed to communicate?

What is the difference between a black line and a green line? What's the difference between a small green arrow and a big green arrow?

I think you need more than one picture (aka a generic picture showing what each major section is) and then focus on each section with more detail and actual text.
 

Triskain

Member
Sep 7, 2009
57
8
71
Dear Hardball,

this concept is suspiciously similar to one that is widely discussed on different boards. AMD's future Bulldozer architecture to be exact. Here is a diagram (based on patents from AMD, read more on this blog) of it:

Concept Diagram

The similarities are easy to spot. Two integer clusters, the FPU arrangement, the way the front end is laid out, the LSU and cache arrangement etc. You said you made alterations to distinguish it from another microarchitecture which would account for the differences.

Now the question is do you have access to inhouse AMD information and if yes, how did you manage to publish this without breaking an NDA. Maybe you put it together through patent based research like Dresdenboy did, but the amount of differences and information that according to Dresdenboy are not from patents is too big for your concept to be derived from the same sources.

I would be extremely interested in hearing your explanation.

Greetings, Triskaine
 

morfinx

Member
Mar 10, 2005
54
0
0
Now the question is do you have access to inhouse AMD information and if yes, how did you manage to publish this without breaking an NDA. Maybe you put it together through patent based research like Dresdenboy did, but the amount of differences and information that according to Dresdenboy are not from patents is too big for your concept to be derived from the same sources.

I would be extremely interested in hearing your explanation.

Same.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Sigh. My apologies Hardball, I sure hope you anticipated this manner of response to your OP so that at least you aren't entirely disappointed by its reception.

Even if magreen is right and there are only five folks around here capable of having a fruitful and cerebral dialogue with you regarding your general design of a future microarchitecture, that is no reason for its existence in this forum to be negated or questioned.

I would ask, and expect, that if folks don't know how to conduct themselves in an engaging and tactful discussion (and implying NDA violations is NOT that) then they would simply refrain from thread crapping.

A high-brow dialogue between you, pm, tuxdave, and CTho9305 (I'm positive I've accidentally neglected to include a few others here) would be beneficial to a lot of us lurkers and readers in the background, we'll all learn something provided the thread doesn't get bloated with a lot of noise and distractions. Good luck.
 

MODEL3

Senior member
Jul 22, 2009
528
0
0
Also my apologies Hard Ball.

My knowledge about this stuff is too limited to try to contribute to a possible dialogue.

But I'll watch, hoping that I'll learn something.
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
I sure hope no one mistook my post as a way of dismissing the OP. I do design mostly in the lower left corner of your diagram and so I can give a ton of input there but the rest of the architecture is mostly gobbily gunk to me. :)

My issue is where are you representing AVX/FMA/AES/etc... since those are already being announced? (checked on Wiki before opening my mouth)
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: TuxDave
I sure hope no one mistook my post as a way of dismissing the OP.

Can't speak for anyone else but I did not take it that way at all. Keep up the cerebral discussion; hopefully Hard Ball is on her way back this direction and she'll help you keep the thread alive.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Wow, didn't expect the thread to have taken several turns so soon.

To IlllI, faxon, magreen, Ben90 and Tuxdave:

It may have been appropriate to put it in the HT subforum, but I didn't think of that until I had posted here. I guess I thought this would be the most appropriate place to post CPU related information, but on second thought it may not have been. It's already here, though, so maybe next time I will consider HT.

I'll give more explanation as time goes on, responding to whatever questions you may have, provided that I'm not obliged not to answer them. The graphic does contain tons of acronyms and short-hands, and I probably shouldn't make assumptions about what type of terms and representations the posters here are or are not used to. But if I gave full descriptions of everything, this entire schematic would be covered with nothing except black text.

I will give more explanations about these if necessary, or some others here who work close to this area probably would also know them and point out; like TuxDave probably would have a good idea of what's going on for the most part.

Originally posted by: TuxDave
What do you do for a living? There is quite a large amount of information there but it lacks enough organization to clarify its details.

What is x2? What is x3?

What is the separation of the blocks (FADD/FMOV/SIMD vs FMUL/FMOV/SIMD vs FSTORE/FMOV) supposed to communicate?

What is the difference between a black line and a green line? What's the difference between a small green arrow and a big green arrow?

I think you need more than one picture (aka a generic picture showing what each major section is) and then focus on each section with more detail and actual text.

Tux;

Yep, there is a lot of information being crammed into this schematic; sorry I wasn't able to make it more clear. I probably should have separated it into large functional blocks with full descriptions, but this is just something I whipped up in 2-3 hours, not really meant to be a full specification of a microarchitectural design. It's more for sharing ideas on uarchitectural techniques and being a didactic tool; quality control here isn't really high at all. If I have more time in the future, I can probably break this down into detailed portions with a lot more explanation.

Onto the specifics...

By the x2 and x3 that you asked about, I mean the number of read/write ports to the FP physical reg-file that are connected to the particular datapaths represented by those arrows in the graph, r/w obviously depending on the direction of the arrows. Each of the data lines and buses is represented as a wide and colored arrow/line when it's vertical in the graphic, and as a relatively thick line when horizontal, except in cases where there isn't space or it is very awkward to put in. The narrower ones are 64-bit WB buses that are only used for single data ops; the wider ones are 128-bit datapaths that can carry FP or SIMD operands/results.

For the blocks that you asked about (darker blocks within the large execution backend of the FP pipe), these are meant to be FUs that have common data inputs/outputs and common control inputs. But each cycle an instruction is issued to only one of these smaller blocks (FADD/FMOV/FSTORE/etc.) within the FU. You can just assume that the input control signal set from the FP IC has bit(s) that a decoder uses to route the rest of the signal lines, as well as the data input(s), to the correct function subunit. Note that there are two sets of control inputs to two of the FUs, which also have two outputs each; this in part makes it feasible for an FU to execute either one 128-bit SIMD instruction or two 64-bit SISD instructions through the toggle of a mode bit in the control lines. This is probably as much as I can say, and the rest I will have to remain silent on; but since you work in this area, and specifically in FP logic, there is a decent chance you know the details already.
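
To make the mode-bit idea concrete, here is a minimal behavioral sketch in Python. Everything in it is an assumption made for illustration (the `execute_fu` name, the two-lane tuple representation of a 128-bit operand, and the tiny `OPS` table); it is not the mechanism of the actual design, which is not disclosed in this thread.

```python
from typing import Tuple

Lanes = Tuple[float, float]  # hypothetical: a 128-bit operand as two 64-bit lanes

# Hypothetical sub-unit operations inside one FU (names borrowed from the diagram).
OPS = {
    "FADD": lambda x, y: x + y,
    "FMOV": lambda x, y: x,  # move ignores the second operand
}

def execute_fu(simd_mode: bool, ctrl: Tuple[str, ...], a: Lanes, b: Lanes) -> Lanes:
    """Model one FU cycle.

    simd_mode=True : one control set, the same op is applied to both 64-bit lanes.
    simd_mode=False: two control sets, each lane executes its own scalar op.
    """
    if simd_mode:
        if len(ctrl) != 1:
            raise ValueError("SIMD mode takes exactly one control set")
        op = OPS[ctrl[0]]
        return (op(a[0], b[0]), op(a[1], b[1]))
    if len(ctrl) != 2:
        raise ValueError("scalar mode takes exactly two control sets")
    return (OPS[ctrl[0]](a[0], b[0]), OPS[ctrl[1]](a[1], b[1]))

simd_result = execute_fu(True, ("FADD",), (1.0, 2.0), (3.0, 4.0))
scalar_result = execute_fu(False, ("FADD", "FMOV"), (1.0, 2.0), (3.0, 4.0))
```

In SIMD mode the single control set drives both lanes; in scalar mode the two control sets correspond to the two control inputs and two outputs mentioned above.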

Don't worry too much about the color codes of the lines, they are mainly there so that seeing some of the long and winding paths in the graphic won't be nearly impossible. As said earlier, the data lines are wide and colored if vertical (where possible), and control lines are just lines, the thicker lines usually represent two or more sets of control signals / instructions, but it's just a general description, it's not fastidiously drawn to a uniform code.

Sorry to be a bit terse, since I don't have much time, and more to reply. Feel free to follow up. I don't want to have one really long post, so I will reply to the rest of the posters in more posts.

Edit: oops, forgot Ben90's name.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
To triskain, theman, & morfinx:

Originally posted by: Triskain
Dear Hardball,

this concept is suspiciously similar to one that is widely discussed on different boards. AMD's future Bulldozer architecture to be exact. Here is a diagram (based on patents from AMD, read more on this blog) of it:

Concept Diagram

The similarities are easy to spot. Two integer clusters, the FPU arrangement, the way the front end is laid out, the LSU and cache arrangement etc. You said you made alterations to distinguish it from another microarchitecture which would account for the differences.

Now the question is do you have access to inhouse AMD information and if yes, how did you manage to publish this without breaking an NDA. Maybe you put it together through patent based research like Dresdenboy did, but the amount of differences and information that according to Dresdenboy are not from patents is too big for your concept to be derived from the same sources.

I would be extremely interested in hearing your explanation.

Greetings, Triskaine

I admire your enthusiasm in looking forward to studying the microarchitecture of future x86 designs. That is not my intent for this thread at all, though; my intent is rather to explore general trends in future general purpose MPUs and possible techniques that may be used to raise ILP utilization and lower power / throughput, as well as techniques relevant to future microarchitecture trends such as those in TLS (thread level speculation).

In response to your question on the BD uarchitecture that's coming up, I can't really comment on whether this particular design has anything to do with the microarchitecture of any particular unreleased and undisclosed design from a specific vendor. This design may be based on something, but it is worked to be general in nature, and to capture a number of general trends that will surface in the next few years, and types of architectural elements that have a good chance of being put to use in the near to medium term future.

All of the information contained in the design, if any of it has to do with any future design from a specific vendor, has either already been (1a) removed, (1b) substantially altered so that no vendor specific information exists any longer in the design, or (1c) replaced by design elements with similar function but using different techniques; or alternatively, this information is both (2a) readily available in published technical literature from an industry or academic source, AND (2b) known to all competing vendors in the industry to the last detail (so if company X is using this technique, then competing companies Y and Z are already fully aware of it).

I hope that I've made this very clear. Feel free to ask me about specific elements of this conceptual design, I will answer what I can.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: Idontcare
Sigh. My apologies Hardball, I sure hope you anticipated this manner of response to your OP so that at least you aren't entirely disappointed by its reception.

Even if magreen is right and there are only five folks around here capable of having a fruitful and cerebral dialogue with you regarding your general design of a future microarchitecture, that is no reason for its existence in this forum to be negated or questioned.

I would ask, and expect, that if folks don't know how to conduct themselves in an engaging and tactful discussion (and implying NDA violations is NOT that) then they would simply refrain from thread crapping.

A high-brow dialogue between you, pm, tuxdave, and CTho9305 (I'm positive I've accidentally neglected to include a few others here) would be beneficial to a lot of us lurkers and readers in the background, we'll all learn something provided the thread doesn't get bloated with a lot of noise and distractions. Good luck.



Originally posted by: MODEL3
Also my apologies Hard Ball.

My knowledge about this stuff is too limited to try to contribute to a possible dialogue.

But I'll watch, hoping that I'll learn something.

Thanks, I really appreciate both of your sentiments. Hope that both of you can actively participate and contribute to this, whether to provide relevant information or ask relevant questions.

If I'm not mistaken, Idontcare, I think you work in the fabrication business; if that's the case, I'm sure you would have plenty to say about process tech of future designs.
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
Sorry for thread crapping hard ball, i was REALLY tired and was laughing my ass off drawing that picture in paint (so at least one person enjoyed it lol) but you seem extremely knowledgeable with this stuff so i might have a couple questions for you:

i was reading way long time ago that pretty soon we arnt going to be able to shrink the xtors anymore and after that the only way to increase performance will be through a more efficient architecture or through physically increasing die size; obviously both of these have their limitations as well such as the speed of electricity, so i was wondering if there is any talk of development of stuff like 3D cores or like multi layered cores and how possible/probable/realistic it is for something like that to happen

obviously this wont be looked into seriously for at least a decade or more but it seems we are hitting a physical wall and pretty soon we might need a fundamental change of how things are made and just wanna hear ur take on it because you know a lot more than i do lol
 

twjr

Senior member
Jul 5, 2006
627
207
116
What about moving from binary to trinary (btw I have no idea if its even possible just want to throw it out there)?
 

alyarb

Platinum Member
Jan 25, 2009
2,444
0
76
transistors work by allowing or disallowing the flow of electricity. is there a third state transistors can occupy without going into a completely different concept of computing machine?
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
i noticed this hypothetical design uses separate schedulers for FP, INT and SIMD (K-style) as opposed to the P6 onwards style of unified scheduling. interesting choice but that is a less efficient design on power vs performance (in my opinion) because the general workload tends to saturate one set but not the other, whereas a unified scheduler is available for all work.

i agree that physical register file is the way to go.

there's a couple things in the diagram which nehalem has implemented... i couldn't find anything on google so i assume their existence has not been disclosed yet.
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
Well quantum computers you can have 0 or 1, or 0 and 1 at the same time... i have absolutely no idea how it works, but its pretty weird....google it if u wanna learn about it
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: twjr
What about moving from binary to trinary (btw I have no idea if its even possible just want to throw it out there)?

The word you are looking for there, twjr, is ternary; trinary is less commonly used in the vernacular of the field. Ternary computer

Originally posted by: alyarb
transistors work by allowing or disallowing the flow of electricity. is there a third state transistors can occupy without going into a completely different concept of computing machine?

Yes, this is actually where the world of analog computing could make a comeback. MLC flash chips operate on a philosophy similar to what you are describing: a given memory cell can store variable quantities of charge, and the read transistor can sense these variable quantities, so the memory cell is capable of storing multiple bits of data.

The sky is the limit when we break away from true binary cmos logic, but the reason we haven't done it yet is because the sky is also the limit in terms of what can go wrong and the complexity of the design, so why do it yet? But there will come a day when it makes sense to explore the realm beyond binary cmos.
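
As a rough illustration of the multi-level idea, here is a small Python sketch of a cell that stores two bits as one of four charge levels. The normalized charge scale, the `THRESHOLDS` and `LEVEL_CENTERS` values, and the function names are all assumptions made for this example; real MLC parts use their own reference voltages and bit encodings.

```python
import bisect

# Hypothetical normalized reference levels separating four charge states.
THRESHOLDS = [0.25, 0.50, 0.75]
# Hypothetical target charge for each stored value 0..3 (two bits per cell).
LEVEL_CENTERS = [0.125, 0.375, 0.625, 0.875]

def program_cell(value: int) -> float:
    """Return the charge placed on the cell to store a 2-bit value."""
    if not 0 <= value <= 3:
        raise ValueError("a 4-level cell stores values 0..3")
    return LEVEL_CENTERS[value]

def read_cell(charge: float) -> int:
    """Sense the charge and return the 2-bit value whose band it falls in."""
    return bisect.bisect_right(THRESHOLDS, charge)

# Round trip with no disturbance recovers every value.
round_trip = [read_cell(program_cell(v)) for v in range(4)]
# A large enough charge drift pushes the cell into the neighbouring band.
drifted = read_cell(program_cell(0) + 0.2)
```

The narrower the bands, the more bits per cell, but also the less drift the cell tolerates before a read returns the wrong value, which is the "what can go wrong" side of leaving strict binary levels.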

Originally posted by: Ben90
i was reading way long time ago that pretty soon we arnt going to be able to shrink the xtors anymore and after that the only way to increase performance will be through a more efficient architecture or through physically increasing die size; obviously both of these have their limitations as well such as the speed of electricity, so i was wondering if there is any talk of development of stuff like 3D cores or like multi layered cores and how possible/probable/realistic it is for something like that to happen

obviously this wont be looked into seriously for at least a decade or more but it seems we are hitting a physical wall and pretty soon we might need a fundamental change of how things are made and just wanna hear ur take on it because you know a lot more than i do lol

Intel and Samsung have both said shrinking down to the 5nm region is possible, but obviously regardless of what the actual number is for shrink limits, there will come a day when we hit it, so then what?

The 3D architecture model is intriguing, and doable. Early commercial implementations will be most feasible on simple designs of course, namely memory.

Elpida Develops 3-D Stacked 8-Gbit DRAM

That is an early example of a functional method of implementing 3D ICs with TSV (through-silicon via). An even earlier example exists in the flash world where chips were stacked (but separately wire-bonded) within the package.

So it is happening, the fabrication techniques are being developed, optimized, improved upon every year. For high-performance logic CMOS the design tools need to be much more mature to make it a practical second-choice versus traditional 2D design. So many issues need to be addressed by the validation software from thermals to electrical cross-talk, power distribution, etc.

But it is feasible and will become more and more practical over time as the field matures.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: Ben90
Sorry for thread crapping hard ball, i was REALLY tired and was laughing my ass off drawing that picture in paint (so at least one person enjoyed it lol) but you seem extremely knowledgeable with this stuff so i might have a couple questions for you:

i was reading way long time ago that pretty soon we arnt going to be able to shrink the xtors anymore and after that the only way to increase performance will be through a more efficient architecture or through physically increasing die size; obviously both of these have their limitations as well such as the speed of electricity, so i was wondering if there is any talk of development of stuff like 3D cores or like multi layered cores and how possible/probable/realistic it is for something like that to happen

obviously this wont be looked into seriously for at least a decade or more but it seems we are hitting a physical wall and pretty soon we might need a fundamental change of how things are made and just wanna hear ur take on it because you know a lot more than i do lol

Not a problem; never considered your humour to be thread crapping, all in good fun.

I don't deal with the fab side of the business, so Idontcare and others probably would be more knowledgeable than I on these matters. It's good to have them here to answer questions about process tech.

I see what you are saying; CMOS transistors don't shrink equally in all dimensions for any given generation, so going to stacked ICs would probably bring another level of variability that the architecture and logic design will have to be concerned with.

From the perspective of high-level architecture, the area that seems to benefit most from 3D stacking is stacking SRAM or eDRAM with compute logic, and perhaps eventually flash storage as well. One of the biggest obstacles to many-core CMPs in the next decade will be designing an efficient memory hierarchy that guarantees both a compatible consistency model for all processing elements on an IC and the bandwidth and latency that high-thread-count software needs to take advantage of them.

Obviously at a certain point, a switch, a request queue, and a series of FIFO buffers, coupled with traditional coherence protocols (such as MESI or a derivative), would fail to provide enough bandwidth for software's shared data structures, even for data that is cached somewhere in the part of the mem hierarchy on the IC. Bus-based protocols that have a single point of logical serialization per chip must give way to a directory-based approach (see Istanbul's probe filter as an initial step).
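
A very simplified Python sketch of why a directory helps: it counts the invalidation probes a write generates under broadcast snooping versus a directory that tracks sharers. The `Directory` class and `broadcast_write_probes` function are hypothetical names for this illustration, and the model ignores the actual MESI states, data transfers, and directory lookup traffic.

```python
class Directory:
    """Toy sharer-tracking directory: line address -> set of core ids with a copy."""

    def __init__(self, n_cores: int) -> None:
        self.n_cores = n_cores
        self.sharers = {}

    def read(self, core: int, addr: int) -> None:
        # A read adds the requesting core to the line's sharer set.
        self.sharers.setdefault(addr, set()).add(core)

    def write(self, core: int, addr: int) -> int:
        """Make `core` the only holder; return the number of invalidation probes sent."""
        others = self.sharers.get(addr, set()) - {core}
        self.sharers[addr] = {core}
        return len(others)

def broadcast_write_probes(n_cores: int) -> int:
    """With broadcast snooping, every other core is probed on each write."""
    return n_cores - 1

d = Directory(n_cores=16)
d.read(0, 0x40)
d.read(3, 0x40)
directory_probes = d.write(3, 0x40)           # only core 0 holds another copy
snoop_probes = broadcast_write_probes(16)     # all 15 other cores are probed
```

With few sharers per line, the directory's probe count stays small as the core count grows, while the broadcast count grows linearly with the number of cores.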

And in concert with that, a scalable point-to-point network of some kind would be necessary, with a topology of hypercube, mesh, torus, or some other configuration; and new methods of routing coherence signals on die must also be adopted to make these interconnect topologies efficient. Such topologies must be implemented so that the point-to-point connections do not vary too much in latency, while guaranteeing high bandwidth from each node router to its local set of last level shared cache on die. This is where stacked IC would come in very handy, and would probably be almost necessary when the interconnect network reaches a certain scale.
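
To put rough numbers on the latency variability between these topologies, here is a short Python sketch that computes minimal hop counts between nodes for a 2D mesh, a 2D torus, and a hypercube. The function names and the row-major node numbering are assumptions for this example, and it assumes minimal routing with no contention.

```python
def mesh_hops(a: int, b: int, width: int) -> int:
    """Manhattan distance between nodes numbered row-major on a 2D mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)

def torus_hops(a: int, b: int, width: int, height: int) -> int:
    """Like the mesh, but each dimension wraps around."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, width - dx) + min(dy, height - dy)

def hypercube_hops(a: int, b: int) -> int:
    """Number of differing address bits (Hamming distance)."""
    return bin(a ^ b).count("1")

def diameter(hops, n_nodes: int) -> int:
    """Worst-case hop count over all node pairs."""
    return max(hops(a, b) for a in range(n_nodes) for b in range(n_nodes))

# 16 nodes arranged as a 4x4 mesh, a 4x4 torus, and a 4-dimensional hypercube.
mesh_diameter = diameter(lambda a, b: mesh_hops(a, b, 4), 16)
torus_diameter = diameter(lambda a, b: torus_hops(a, b, 4, 4), 16)
cube_diameter = diameter(hypercube_hops, 16)
```

For 16 nodes the worst case is 6 hops on the mesh and 4 on both the torus and the hypercube; the gap between best and worst case is one simple measure of the latency variability mentioned above.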

It's certainly an interesting area to watch in the next 10-15 years. Idontcare and others working with process tech, please feel free to chime in and fill in missing information.
 

Hard Ball

Senior member
Jul 3, 2005
594
0
0
Originally posted by: dmens
i noticed this hypothetical design uses separate schedulers for FP, INT and SIMD (K-style) as opposed to the P6 onwards style of unified scheduling. interesting choice but that is a less efficient design on power vs performance (in my opinion) because the general workload tends to saturate one set but not the other, whereas a unified scheduler is available for all work.

i agree that physical register file is the way to go.

there's a couple things in the diagram which nehalem has implemented... i couldn't find anything on google so i assume their existence has not been disclosed yet.

Yes, astute observation.

But part of being able to do dependency steering lies in the size of the window of instructions that you would be examining. The FP instructions should probably be categorically exempt from further examination in this regard once the decode stage has clarified their identity. Shunting the FP instructions into the same schedule window would effectively shrink the size of the steering window that you would be working with, and given the same amount of hardware dedicated to steering, that would mean less accurate dependency slices.

A pipeline with dependency steering has an important consideration: the three-way tradeoff between the effective size of the steering window, the accuracy of the steering mechanism by whatever criteria you are measuring (usually by destination register dependency graph), and the variety of instruction types that you are able to consider. For that reason, many steering logic designs only consider narrow types of instructions for steering control or slice initiation, such as mem loads or branches. As much as we can, we need to filter out irrelevant information, which in this case would include the vast majority of FP instructions.

So even though the parts of the pipeline control logic that schedule instructions into appropriate reservation stations may seem somewhat less efficient, the deficit in efficiency would (hopefully) be more than made up by higher utilization of function units, fewer pipeline stalls, and more power efficient physical register file and forwarding mechanisms.
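
The steering argument above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the steering logic of any real design: the `Instr` tuple, the `steer` function, the two-cluster default, and the "follow the producer of the first known source register, else pick the least loaded cluster" policy are all hypothetical choices made for illustration.

```python
from collections import deque, namedtuple

# Hypothetical instruction record: destination register, source registers, FP flag.
Instr = namedtuple("Instr", "dest srcs is_fp")

def steer(instrs, n_clusters=2, window=8):
    """Assign each integer instruction to a cluster; FP instructions bypass steering."""
    producer = {}          # register -> cluster of its most recent in-window producer
    in_window = deque()    # destination registers currently inside the steering window
    load = [0] * n_clusters
    assignment = []
    for ins in instrs:
        if ins.is_fp:
            # Filtered out at decode: FP ops do not consume steering window entries.
            assignment.append("FP")
            continue
        # Follow the producer of the first source that is still tracked.
        cluster = next((producer[s] for s in ins.srcs if s in producer), None)
        if cluster is None:
            cluster = load.index(min(load))  # no known dependence: balance load
        load[cluster] += 1
        assignment.append(cluster)
        if ins.dest is not None:
            producer[ins.dest] = cluster
            in_window.append(ins.dest)
            if len(in_window) > window:
                old = in_window.popleft()
                if old not in in_window:
                    producer.pop(old, None)  # fell out of the window: stop tracking
    return assignment

program = [
    Instr("r1", (), False),
    Instr("r2", (), False),
    Instr("f0", ("f1",), True),
    Instr("r3", ("r1",), False),
    Instr("r4", ("r2",), False),
]
clusters = steer(program)
```

Because the FP instruction never enters `in_window`, the same number of window entries covers more integer instructions, which is the effect described above of keeping FP out of the steering window.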
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
A high-brow dialogue between you, pm, tuxdave, and CTho9305 (I'm positive I've accidentally neglected to include a few others here) would be beneficial to a lot of us lurkers and readers in the background, we'll all learn something provided the thread doesn't get bloated with a lot of noise and distractions. Good luck.
It's flattering to see my name mentioned, but I've never been a logic designer or microarchitect. I've done clock route design, and circuit design, but mostly I do silicon debug and test. My expertise nowadays is in structural test... and I noticed there's no JTAG port on the design. :) I looked over the high-level schematic and recognized some of it, could understand other bits of it and was confused by the rest of it. I'll read the thread with interest and go back to lurking, but I don't have much to add other than that I find the subject intriguing.