Is it time to consider a new processor instruction set?

Hulk

Diamond Member
Oct 9, 1999
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall, I'm wondering if perhaps it's time for the powers that be to consider a new instruction set.

Quite a bit of effort has gone into "working around" the x86 instructions: micro-ops, macro-ops, fusion of these ops, SSE, AVX... Perhaps it's finally time to start from scratch and design a microprocessor with up-to-date instructions?

On the positive side, most current apps that aren't super compute-hungry would run just fine in a virtual or emulation mode, so people could slowly make the transition to native apps.

I would think the biggest obstacles to this would be Intel and Microsoft. Intel especially has years and years of knowledge when it comes to the x86 instructions. Starting over wouldn't wipe out Intel's lead in this area, but it would certainly allow time for others to make up ground. Same thing with Microsoft: they have a lock on the market right now, and a paradigm shift could leave an opening for an OS upstart.

What do you think?
 

Exophase

Diamond Member
Apr 19, 2012
The thing with x86 is that its byte-granular, variable-length nature lets you add pretty much any kind of instructions you want. AVX instructions with the VEX format look like an almost totally different encoding. You could practically call it a separate instruction set.

The cost is a bit of instruction-size overhead in differentiating these different instruction sets; you have to support all of them, which increases demand on the decoders; and you have to support arbitrarily aligned, byte-variable-length instructions, which increases that demand further. But in the higher-end stuff Intel plays in, this is peanuts, and even at the lower end (Atom) the cost is small enough to be offset by a lot of other factors, and it's continually shrinking.
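To make the "almost a separate instruction set" point concrete, here's a rough C sketch using the standard immintrin.h intrinsics (assumes GCC or Clang with -mavx; nothing here is specific to any one CPU). The 128-bit operations map to the SSE-era instructions, the 256-bit ones to the VEX-encoded AVX instructions, and the same decoders have to swallow both:

```c
/* Sketch only: the 128-bit intrinsics correspond to SSE-width instructions
 * (addps, or their VEX.128 forms depending on compiler flags), while the
 * 256-bit ones require the VEX-encoded AVX instructions (vaddps ymm).
 * Both end up in the same binary and flow through the same decoders. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float r[8];

    __m128 x = _mm_loadu_ps(a);              /* 128-bit (SSE-width) loads */
    __m128 y = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(x, y));
    printf("sse-width add: %.0f ... %.0f\n", r[0], r[3]);

    __m256 u = _mm256_loadu_ps(a);           /* 256-bit, VEX-encoded AVX */
    __m256 v = _mm256_loadu_ps(b);
    _mm256_storeu_ps(r, _mm256_add_ps(u, v));
    printf("avx-width add: %.0f ... %.0f\n", r[0], r[7]);
    return 0;
}
```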

You won't see x86 getting into smaller embedded stuff though.
 

Plimogz

Senior member
Oct 3, 2009
I dunno -- but then, I don't know enough about CPU architecture and instructions and such to have much of an informed opinion.

However, wrt
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall
I would happily settle for some much better multi-threading. I know, I know, it's harder to program for, and there are some strict theoretical limits to how far parallelism can get us. But this has been the case for years... And what is sometimes left by the wayside when people underline how little pure clock speed has increased -- even if it's acknowledged in the same breath that IPC has steadily increased since the P4 -- is that we are getting more cores. And die shrinks offer pretty easy access to more and more cores.

Now if only there could be a nice revolution in how many moar cores most software leverages, we'd easily see the kind of scaling we were accustomed to when clocks were 'rapidly' going from 400 MHz to 4000 MHz.
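To put a face on the kind of code that does scale with core count, here's a minimal pthreads sketch (the thread count, array size, and per-element work are all made up for illustration) that just slices an independent loop across workers:

```c
/* Embarrassingly parallel loop split across worker threads. The per-element
 * work is independent, so it scales with cores; the setup and the final
 * join/accumulate are the serial part Amdahl's law cares about. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

struct slice { int begin, end; double sum; };

static void *worker(void *arg) {
    struct slice *s = arg;
    double acc = 0.0;
    for (int i = s->begin; i < s->end; i++)
        acc += data[i] * data[i];            /* independent per-element work */
    s->sum = acc;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = i * 0.001;

    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {     /* the join is the serial tail */
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    printf("sum of squares = %f\n", total);
    return 0;
}
```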

Or so it seems to me, anyway.
 

TuxDave

Lifer
Oct 8, 2002
I have a personal bias... but here's what I think anyways. So take it with a grain of salt. :p

A lot of the evidence that I've seen internally and externally doesn't point towards the ISA being the problem. If we were to start over and only use modern instructions with fixed-length formats designed to be easily decoded -- trading memory footprint for hardware simplicity -- the gain would be in the low single-digit percentages.

Maybe people more informed on the software side could chime in. The CPU already exposes some aspects of its hardware to developers via perfmon signals. I see those as a trial-and-error system that software can keep tweaking to minimize the number of "bad events". However, if we exposed more of the hardware and had less hardware-agnostic software, maybe we could afford to start removing a lot of the general-purpose hardware used to speed it up. Maybe software could tell hardware to act weird and drop all instructions that have a cache miss.
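For the curious, the software-visible end of those perfmon signals on Linux looks roughly like this -- a hedged sketch using the generic perf_event_open(2) interface to count cache misses around a piece of code (the loop being measured is just filler):

```c
/* Count hardware cache misses for a region of code via perf_event_open(2).
 * This is the kernel's generic counter interface, not any vendor tool. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sink = 0;                  /* stand-in for the real work */
    for (long i = 0; i < 1000000; i++)
        sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```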

Maybe that's a good idea. Maybe I'm just punting on the work. :p
 

Hulk

Diamond Member
Oct 9, 1999
I would happily settle for some much better multi-threading. I know, I know, it's harder to program for, and there are some strict theoretical limits to how far parallelism can get us.


I agree with you here. What applications exactly are so difficult to multi-thread effectively? We know that video applications are extremely well-suited to parallel operation, although even that becomes more problematic when both the GPU and CPU need to be fully utilized. 3D rendering is very "multi-core-able."

I know games are quite difficult to multi-thread well.

What else?
 

Charles Kozierok

Elite Member
May 14, 2012
There really isn't any need. I remember reading somewhere (can't remember) that Intel and AMD have gotten the in-order/out-of-order/in-order design down to the point where going straight RISC would yield at most a couple of percentage points of performance improvement.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
There really isn't any need. I remember reading somewhere (can't remember) that Intel and AMD have gotten the in-order/out-of-order/in-order design down to the point where going straight RISC would yield at most a couple of percentage points of performance improvement.
Well, at least outside techie circles like this one, performance improvement isn't the goal anymore. Power reduction is.

Realistically, if any instruction set were to replace x86, it would be ARM. Although I always liked the idea of Transmeta's JIT compiler for x86-to-something else.
 

zephyrprime

Diamond Member
Feb 18, 2001
Instruction sets are really old news as a resource to mine for more performance. What we really need is a radically different processor architecture that can support more parallelism. Something that isn't a von Neumann machine.
 

Exophase

Diamond Member
Apr 19, 2012
Instruction sets are really old news as a resource to mine for more performance.

People say things like this a lot, but Intel has been adding full sets of new instructions roughly every processor "tock", with AVX and AVX2 being the most aggressive changes in years. So at least Intel thinks mining instruction sets isn't old news.
 

Cerb

Elite Member
Aug 26, 2000
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall, I'm wondering if perhaps it's time for the powers that be to consider a new instruction set.
Like ARM? Not bad.
Like IA64? D: (though Poulson is quite impressive)

Quite a bit of effort has gone into "working around" the x86 instructions: micro-ops, macro-ops, fusion of these ops, SSE, AVX... Perhaps it's finally time to start from scratch and design a microprocessor with up-to-date instructions?
Work has gone into working around other ISAs' instructions, too. Micro-ops, fused ops, cracking ops... IBM and MIPS have done it, too, under different guises. That stuff, along with ucoded instructions, is not really bad. The ISA is there to tell the CPU what to do. The CPU can do it however it sees fit, as long as it does it correctly.

Oh, also, things like SSE, and AVX, are not work-arounds, but additions. They may fix weaknesses, but they make use of new hardware capability, and do so by acting differently. Every RISC ISA that's lasted has also gotten multiple different sets of such extensions. Scalar and vector operations are fundamentally different, in practical code. They are mostly pointless to have, without special hardware to make use of them.

Someone could surely come up with an instruction set that could fix x86's weaknesses. OTOH, it would need to catch on, and that presents a problem. An x86 CPU's decoders are going to be power-hungry, and need to be running more than a typical RISC's. Flags could be remade more cleanly, too. That stuff could be fixed with a fairly compact IS, but would it be worth the software development and support effort afterwards?

And, could someone do it, without screwing something up, in the process?

IA32 and x86-64 are by no means elegant, but they got a lot of things right. You could probably make page table trees a bit smaller, but it would be easy to do memory management worse than x86. 2-operand encoding, with heavy register overwriting, as it turned out, forced them to make plainly better CPUs -- not needing so many registers*, but instead renaming by replacing, and even memory aliasing and HW stack management.

While far from the strictest, and still needing some barriers, x86's memory model is fairly strict, in so far as the order instructions run in one thread must be the order memory appears to have been written to in the instruction stream, and cache coherency necessitates that other threads respect the apparent order from that thread. (The subtle difference between must be and must appear to be is surprisingly important, allowing reading and writing bypass optimizations, both in your code and in the CPU itself.) This can be seen as opposed to weak ordering, in which newer hardware could break your old code unless it has barriers (yes, some CPUs did, and still may, rely on statically scheduling 'winners' of race conditions), while certain barrier semantics can necessitate either pipeline flushing or excessive checkpointing.
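As a small illustration of where that bites (a sketch with C11 atomics, not anyone's production code): the classic message-passing pattern below happens to be preserved by x86's strong ordering even with plain stores at the hardware level, but a weakly ordered CPU -- or the compiler -- is free to reorder it unless the release/acquire pair is there:

```c
/* Message passing with C11 atomics: the consumer must not see ready == 1
 * while still reading the old value of data. The release store and acquire
 * load make that guarantee explicit and portable; on x86 the hardware
 * ordering is strong enough that they map to plain stores and loads. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int data  = 0;
static atomic_int ready = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    /* release: everything written above is visible once 'ready' reads as 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    /* acquire: pairs with the release store in producer() */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                    /* spin */
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```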

Microcoded instructions can be implemented with dedicated special paths just for that CPU, and take fewer external instructions than equivalent non-ucode counterparts, including long RISC sequences. They look bad in isolation, but considering they amount to a handful of instructions in a whole program, they're not bad at all. Many exist purely for legacy purposes, too; those are most likely going to be on the slow side, and will surely be designed to take up as little die space as possible.

You could do it, yes. But we have x86, and while ARM's memory model can be annoying, it's well-known, both to humans and compilers, and isn't all that bad. As long as we have the memory wall, ISAs will mostly remain uninteresting, in a practical sense, because the ones we have are good enough that ISA differences represent a fairly small amount of difference in actual performance.

* Not that the added GPRs in x86-64 were unwelcome additions!
 

bononos

Diamond Member
Aug 21, 2011
Which old instruction sets are dropped in current CPUs? MMX? Was x87 implemented as microcode starting from the Core CPUs, or was it dropped?
 

Charles Kozierok

Elite Member
May 14, 2012
Instruction sets are really old news as a resource to mine for more performance. What we really need is a radically different processor architecture that can support more parallelism. Something that isn't a von Neumann machine.

We can already parallelize the crap out of things. The problem is Amdahl's law, and a new architecture isn't going to solve it.
 

Ajay

Lifer
Jan 8, 2001
We can already parallelize the crap out of things. The problem is Amdahl's law, and a new architecture isn't going to solve it.

Amdahl's law is only a problem because too many programmers believe it is 'all that'. Take a look at Gustafson's law. DEC had modeled SMT processors and found that four hardware threads per core was the optimal number. After that, the diminishing performance returns weren't worth the extra xtors required to support more threads.

Intel is still stuck at two threads per core. Computer Science majors need to take algorithm courses which teach them how to extract maximum parallelism from their code, so that it becomes cheaper for ISVs to implement.
 

Cerb

Elite Member
Aug 26, 2000
Amdahl's law is only a problem because too many programmers believe it is 'all that'. Take a look at Gustafson's law.
Gustafson's Law in no way refutes Amdahl's Law. They are merely different perspectives on the same concepts.

Amdahl's Law is "all that" for any application which has strict serial dependencies but more or less indefinite data and time bounds. For a general-purpose algorithm, it's the best you'll be able to get. Gustafson's Law concerns itself with scalability of data sets (and time), which may also be limited. The two are equivalent, but for different problem cases. In the case of Gustafson's Law, scalability only increases linearly if the computation per processor is fixed. If the computation per processor increases as the data size increases, the scalability won't be linear, but rather a curve that flattens out, or approaches an asymptote, as data size and/or processor count increases, just as with Amdahl's Law's typical applications.
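For anyone who wants to see the two side by side, a quick throwaway sketch of the usual formulations (the parallel fraction p and the processor counts are arbitrary):

```c
/* Amdahl (fixed problem size):      S(n) = 1 / ((1 - p) + p/n)
 * Gustafson (problem grows with n): S(n) = (1 - p) + p*n
 * p is the parallelizable fraction of the work. */
#include <stdio.h>

static double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
static double gustafson(double p, double n) { return (1.0 - p) + p * n; }

int main(void) {
    double p = 0.95;                          /* 95% parallelizable, for example */
    for (int n = 1; n <= 64; n *= 2)
        printf("n=%2d  Amdahl=%6.2f  Gustafson=%6.2f\n",
               n, amdahl(p, n), gustafson(p, n));
    return 0;
}
```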

The implication of Amdahl's Law is that some problems will never be worked on with systems like GPUs, instead requiring faster processing from each processor. The implication of Gustafson's Law is that it becomes worthwhile to do more processing, after a point, rather than process more stuff (such as more in-depth data mining, providing real-time statistics, etc., instead of finding more raw data to process)--or, in today's world, just go idle and save electricity.

It would be good to keep in mind that in 1967 there were far more problems out there that computers weren't fast enough for at all, and a simpler, faster processor really was much faster than a complicated unit with many processors, so infinite data/time bounds made much more sense then than in 1988, by which point computers were common business items, able -- and often required -- to process data as fast as or faster than it could be presented to them.

Today, though, you should really be moving to using Gunther's Law, which encompasses both, without the work of deriving one from the other.
http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability
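The formula behind that link, for reference, is usually written C(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1)), where sigma models contention (the Amdahl-like serial fraction) and kappa models coherency cost; with kappa = 0 it collapses back to Amdahl's Law. A tiny sketch, with made-up parameter values:

```c
/* Gunther's Universal Scalability Law: relative capacity at N processors.
 * sigma = contention, kappa = coherency delay; both are made up here. */
#include <stdio.h>

static double usl(double n, double sigma, double kappa) {
    return n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0));
}

int main(void) {
    double sigma = 0.05, kappa = 0.002;       /* hypothetical fitted values */
    for (int n = 1; n <= 64; n *= 2)
        printf("N=%2d  relative capacity=%6.2f\n", n, usl(n, sigma, kappa));
    return 0;
}
```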
DEC had modeled SMT processors and found that four hardware threads per core was the optimal number. After that, the diminishing performance returns weren't worth the extra xtors required to support more threads.
Intel is at 2 for mainstream CPUs, and can feed 2 well. More than 2 threads now, and we'd be back to the poor quality of HT on the P4. Response time matters. Alpha was going for maximum throughput. On real workloads, existing Alphas were able to max out their bus (one of several reasons for the IMC on the K8), and at the time, they could keep on scaling well. That's apples and oranges. The guys behind the SPARC T-series figured more threads would help, for instance, and those CPUs are no slouches in the right setting; even at Oracle's costs (what can you pay, today?), they have managed to provide non-kool-aid drinkers with real value. There's no perfect universal number, nor a perfect way to implement multithreading. 4 may have been the ideal count for the super-wide Alpha-to-be. That does not make for a universal truth.

Intel is still stuck at two threads per core.
Actually, they are up to 4, if you want to go try that counting game. They apply 2 on mainstream CPUs, because we care about more than just keeping the ALUs busy. There have consistently been cases where turning HT off, going back to 1 thread per core, is an improvement. Fewer cases with each generation, but all these years on, it still happens. As long as memory is not instant, it will keep on happening, too, as long as they use shared resources (as opposed to say, fully partitioned SMT).
Computer Science majors need to take algorithm courses which teach them how to extract maximum parallelism from their code, so that it becomes cheaper for ISVs to implement.
Where do you find a course that teaches you algorithms that either (a) cannot exist or (b) have not yet been created? They simply don't exist, for a wide variety of real problems. Then, in some cases, when they do exist, the less-parallel versions are faster, in practice, because the parallel versions have such high overhead. Stones don't bleed.
 

ShintaiDK

Lifer
Apr 22, 2012
Which old instruction sets are dropped in current CPUs? MMX? Was x87 implemented as microcode starting from the Core CPUs, or was it dropped?

None and no.

Only software is starting to drop instruction sets. There is no reason really to remove anything old yet on the hardware side.
 

Idontcare

Elite Member
Oct 10, 1999
Today, though, you should really be moving to using Gunther's Law, which encompasses both, without the work of deriving one from the other.
http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability

Interesting, it looks remarkably similar to Amdahl's Law as modified by Almasi and Gottlieb to account for the time-impact of data coherency and thread propagation. (their work was published 1989, Gunther's was in 1993...maybe he built on their work or independently developed basically the same model?)

[Image: Amdahl's Law augmented by Almasi and Gottlieb]


The Almasi/Gottlieb extension works great when modeling scaling data to determine how non-core hardware features (ram latency/bandwidth, network latency/bandwidth, software models for data parsing and splicing, etc) impact the scalability of the problem set for a given compute model. (the same outcome as Gunther's model as far as I can gather from the wiki page)

[Image: LinX thread scaling]


I like this Gunther's Law; gonna have to check into it more to see what it offers over the prevailing Almasi/Gottlieb model.
 

sm625

Diamond Member
May 6, 2011
This is one of those things that really irks me about AMD. They have both a cpu and gpu in house. And they have a set of drivers that very closely interact with both. Just by simply profiling all the most popular PC games, they could come up with a dozen new instructions that optimize the most-often-executed codepaths to produce an easy 10% framerate boost across the board. Such additions to the microcode would only cost a few million transistors, and with that small investment they could reclaim the gaming and thus the enthusiast crown. But they have completely and utterly failed in such a way it boggles the mind. A 10% boost is conservative compared to what I believe is possible. I'm thinking more like 10% per year, through constant feedback between game profiling, cpu and gpu microcode upgrades, and driver upgrades. And this 10% would go right on top of other types of architectural and process advancements. 7 years after purchasing ATI, AMD cpus should have been doubling the framerates vs intel in games.
 

Cerb

Elite Member
Aug 26, 2000
Interesting, it looks remarkably similar to Amdahl's Law as modified by Almasi and Gottlieb to account for the time-impact of data coherency and thread propagation. (their work was published 1989, Gunther's was in 1993...maybe he built on their work or independently developed basically the same model?)
Second mouse gets the cheese? :) Based on looking at citations in papers, right now it looks like he came at it a bit differently, though (link).
 

Olikan

Platinum Member
Sep 23, 2011
This is one of those things that really irks me about AMD. They have both a cpu and gpu in house. And they have a set of drivers that very closely interact with both. Just by simply profiling all the most popular PC games, they could come up with a dozen new instructions that optimize the most-often-executed codepaths to produce an easy 10% framerate boost across the board. Such additions to the microcode would only cost a few million transistors, and with that small investment they could reclaim the gaming and thus the enthusiast crown. But they have completely and utterly failed in such a way it boggles the mind. A 10% boost is conservative compared to what I believe is possible. I'm thinking more like 10% per year, through constant feedback between game profiling, cpu and gpu microcode upgrades, and driver upgrades. And this 10% would go right on top of other types of architectural and process advancements. 7 years after purchasing ATI, AMD cpus should have been doubling the framerates vs intel in games.

IMO, that's because AMD is behind the curve in its own roadmaps, ISA-wise...
XOP instructions were delayed two years because early Bulldozer was a disaster... I remember that AMD has a technology similar to Intel's TSX; it probably got delayed because of Bulldozer as well...
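For reference, the Intel side of that (TSX's RTM mode) looks roughly like this from C -- a hedged sketch of the usual transaction-plus-fallback pattern with the _xbegin/_xend intrinsics from immintrin.h (needs a TSX-capable CPU and -mrtm; AMD's proposed equivalent was ASF, which as far as I know never shipped):

```c
/* Transactional increment with a spinlock fallback. The transaction reads
 * the fallback lock, so it aborts if another thread grabs that lock, which
 * is what keeps the two paths from racing each other. */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_taken = 0;            /* simple fallback spinlock */
static long counter = 0;

static void increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (atomic_load_explicit(&lock_taken, memory_order_relaxed))
            _xabort(0xff);                   /* lock held: bail out */
        counter++;                           /* runs transactionally */
        _xend();
        return;
    }
    /* Transaction aborted (or TSX unavailable): take the real lock. */
    while (atomic_exchange_explicit(&lock_taken, 1, memory_order_acquire))
        ;                                    /* spin */
    counter++;
    atomic_store_explicit(&lock_taken, 0, memory_order_release);
}

int main(void) {
    increment();
    printf("counter = %ld\n", counter);
    return 0;
}
```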

We should be talking about Excavator+ today, and its ISAs... if AMD hadn't screwed everything up.