Is it time to consider a new processor instruction set?

Hulk

Diamond Member
Oct 9, 1999
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall, I'm wondering if perhaps it's time for the powers that be to consider a new instruction set.

Quite a bit of effort has gone into "working around" the x86 instructions: micro-ops, macro-ops, fusion of these ops, SSE, AVX... Perhaps it's finally time to start from scratch and design a microprocessor with up-to-date instructions?

On the positive side, most current apps that aren't super compute-hungry would run just fine in a virtual or emulation mode, so people could slowly make the transition to native apps.

I would think the biggest obstacles to this would be Intel and Microsoft. Intel especially has years and years of knowledge when it comes to the x86 instructions. Starting over wouldn't wipe out Intel's lead in this area, but it would certainly allow time for others to make up ground. Same thing with Microsoft: they have a lock on the market right now, and a paradigm shift could leave an opening for an OS upstart.

What do you think?
 

Exophase

Diamond Member
Apr 19, 2012
The thing with x86 is that its byte-granular, variable-length nature lets you add pretty much any kind of instructions you want. AVX instructions with the VEX format look like an almost totally different encoding. You could practically call it a separate instruction set.

The cost is a bit of instruction-size overhead in differentiating these different instruction sets; you have to support all of them, which increases demand on the decoders; and you have to support arbitrarily aligned, byte-variable-length instructions, which increases that demand further. But in the higher-end stuff Intel plays in, this is peanuts, and even at the lower end (Atom) the cost is small enough to be offset by a lot of other factors, and it's continually shrinking.
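To make the "almost a separate instruction set" point concrete, here's a rough C sketch using the standard immintrin.h intrinsics (assumes GCC or Clang with -mavx; nothing here is specific to any one CPU). The 128-bit operations map to the SSE-era instructions, the 256-bit ones to the VEX-encoded AVX instructions, and the same decoders have to swallow both:

```c
/* Sketch only: the 128-bit intrinsics correspond to SSE-width instructions
 * (addps, or their VEX.128 forms depending on compiler flags), while the
 * 256-bit ones require the VEX-encoded AVX instructions (vaddps ymm).
 * Both end up in the same binary and flow through the same decoders. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float r[8];

    __m128 x = _mm_loadu_ps(a);              /* 128-bit (SSE-width) loads */
    __m128 y = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(x, y));
    printf("sse-width add: %.0f ... %.0f\n", r[0], r[3]);

    __m256 u = _mm256_loadu_ps(a);           /* 256-bit, VEX-encoded AVX */
    __m256 v = _mm256_loadu_ps(b);
    _mm256_storeu_ps(r, _mm256_add_ps(u, v));
    printf("avx-width add: %.0f ... %.0f\n", r[0], r[7]);
    return 0;
}
```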

You won't see x86 getting into smaller embedded stuff though.
 

Plimogz

Senior member
Oct 3, 2009
I dunno -- but then, I don't know enough about CPU architecture and instructions and such to have much of an informed opinion.

However, wrt
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall
I would happily settle for some much better multi-threading. I know, I know, it's harder to program for, and there are some strict theoretical limits to how far parallelism can get us. But this has been the case for years... And what is sometimes left by the wayside when people underline how little pure clock speed has increased -- even if it's acknowledged in the same breath that IPC has steadily increased since the P4 -- is that we are getting more cores. And die shrinks offer pretty easy access to more and more cores.

Now if only there could be a nice revolution in how many moar cores most software leverages, we'd easily see the kind of scaling we were accustomed to when clocks were 'rapidly' going from 400 MHz to 4000 MHz.
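To put a face on the kind of code that does scale with core count, here's a minimal pthreads sketch (the thread count, array size, and per-element work are all made up for illustration) that just slices an independent loop across workers:

```c
/* Embarrassingly parallel loop split across worker threads. The per-element
 * work is independent, so it scales with cores; the setup and the final
 * join/accumulate are the serial part Amdahl's law cares about. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

struct slice { int begin, end; double sum; };

static void *worker(void *arg) {
    struct slice *s = arg;
    double acc = 0.0;
    for (int i = s->begin; i < s->end; i++)
        acc += data[i] * data[i];            /* independent per-element work */
    s->sum = acc;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = i * 0.001;

    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {     /* the join is the serial tail */
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    printf("sum of squares = %f\n", total);
    return 0;
}
```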

Or so it seems to me, anyway.
 

TuxDave

Lifer
Oct 8, 2002
I have a personal bias... but here's what I think anyways. So take it with a grain of salt. :p

A lot of the evidence that I've seen internally and externally doesn't point towards the ISA being the problem. If we were to start over and only use modern instructions with fixed-length formats designed to be easily decoded -- trading memory footprint for hardware simplicity -- the gain would be in the low single-digit percentages.

Maybe people more informed on the software side could chime in. The CPU already exposes some aspects of its hardware to developers via perfmon signals. I see those as a trial-and-error system that software can keep tweaking to minimize the number of "bad events". However, if we exposed more of the hardware and had less hardware-agnostic software, maybe we could afford to start removing a lot of the general-purpose hardware used to speed it up. Maybe software could tell hardware to act weird and drop all instructions that have a cache miss.
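For the curious, the software-visible end of those perfmon signals on Linux looks roughly like this -- a hedged sketch using the generic perf_event_open(2) interface to count cache misses around a piece of code (the loop being measured is just filler):

```c
/* Count hardware cache misses for a region of code via perf_event_open(2).
 * This is the kernel's generic counter interface, not any vendor tool. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sink = 0;                  /* stand-in for the real work */
    for (long i = 0; i < 1000000; i++)
        sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```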

Maybe that's a good idea. Maybe I'm just punting on the work. :p
 

Hulk

Diamond Member
Oct 9, 1999
I would happily settle for some much better multi-threading. I know, I know, it's harder to program for, and there are some strict theoretical limits to how far parallelism can get us.


I agree with you here. What applications exactly are so difficult to multi-thread effectively? We know that video applications are extremely well-suited to parallel operation, although even that becomes more problematic when both the GPU and CPU need to be fully utilized. 3D rendering is very "multi-core-able."

I know games are quite difficult to multi-thread well.

What else?
 

Charles Kozierok

Elite Member
May 14, 2012
There really isn't any need. I remember reading somewhere (can't remember) that Intel and AMD have gotten the in-order/out-of-order/in-order design down to the point where going straight RISC would yield at most a couple of percentage points of performance improvement.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
There really isn't any need. I remember reading somewhere (can't remember) that Intel and AMD have gotten the in-order/out-of-order/in-order design down to the point where going straight RISC would yield at most a couple of percentage points of performance improvement.
Well, at least outside techie circles like this one, performance improvement isn't the goal anymore. Power reduction is.

Realistically, if any instruction set were to replace x86, it would be ARM. Although I always liked the idea of Transmeta's JIT compiler for x86-to-something else.
 

zephyrprime

Diamond Member
Feb 18, 2001
Instruction sets are really old news as a resource to mine for more performance. What we really need is a radically different processor architecture that can support more parallelism. Something that isn't a von Neumann machine.
 

Exophase

Diamond Member
Apr 19, 2012
Instruction sets are really old news as a resource to mine for more performance.

People say things like this a lot, but Intel has been adding full sets of new instructions roughly every processor "tock", with AVX and AVX2 being the most aggressive changes in years. So at least Intel thinks mining instruction sets isn't old news.
 

Cerb

Elite Member
Aug 26, 2000
With the push toward better efficiency, the stalling of clock speeds, and the impending die-shrink wall, I'm wondering if perhaps it's time for the powers that be to consider a new instruction set.
Like ARM? Not bad.
Like IA64? D: (though Poulson is quite impressive)

Quite a bit of effort has gone into "working around" the x86 instructions: micro-ops, macro-ops, fusion of these ops, SSE, AVX... Perhaps it's finally time to start from scratch and design a microprocessor with up-to-date instructions?
Work has gone into working around other ISAs' instructions, too. Micro-ops, fused ops, cracking ops... IBM and MIPS have done it, too, under different guises. That stuff, along with ucoded instructions, is not really bad. The ISA is there to tell the CPU what to do. The CPU can do it however it sees fit, as long as it does it correctly.

Oh, also, things like SSE, and AVX, are not work-arounds, but additions. They may fix weaknesses, but they make use of new hardware capability, and do so by acting differently. Every RISC ISA that's lasted has also gotten multiple different sets of such extensions. Scalar and vector operations are fundamentally different, in practical code. They are mostly pointless to have, without special hardware to make use of them.

Someone could surely come up with an instruction set that could fix x86's weaknesses. OTOH, it would need to catch on, and that presents a problem. An x86 CPU's decoders are going to be power-hungry, and need to be running more than a typical RISC's. Flags could be remade more cleanly, too. That stuff could be fixed with a fairly compact IS, but would it be worth the software development and support effort afterwards?

And, could someone do it, without screwing something up, in the process?

IA32 and x86-64 are by no means elegant, but they got a lot of things right. You could probably make page table trees a bit smaller, but it would be easy to do memory management worse than x86. 2-operand encoding, with heavy register overwriting, as it turned out, forced them to make plainly better CPUs -- not needing so many registers*, but instead renaming by replacing, and even memory aliasing and HW stack management.

While far from the strictest, and still needing some barriers, x86's memory model is fairly strict, in so far as the order instructions run in one thread must be the order memory appears to have been written to in the instruction stream, and cache coherency necessitates that other threads respect the apparent order from that thread. (The subtle difference between must be and must appear to be is surprisingly important, allowing reading and writing bypass optimizations, both in your code and in the CPU itself.) This can be seen as opposed to weak ordering, in which newer hardware could break your old code unless it has barriers (yes, some CPUs did, and still may, rely on statically scheduling 'winners' of race conditions), while certain barrier semantics can necessitate either pipeline flushing or excessive checkpointing.
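As a small illustration of where that bites (a sketch with C11 atomics, not anyone's production code): the classic message-passing pattern below happens to be preserved by x86's strong ordering even with plain stores at the hardware level, but a weakly ordered CPU -- or the compiler -- is free to reorder it unless the release/acquire pair is there:

```c
/* Message passing with C11 atomics: the consumer must not see ready == 1
 * while still reading the old value of data. The release store and acquire
 * load make that guarantee explicit and portable; on x86 the hardware
 * ordering is strong enough that they map to plain stores and loads. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int data  = 0;
static atomic_int ready = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    /* release: everything written above is visible once 'ready' reads as 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    /* acquire: pairs with the release store in producer() */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                    /* spin */
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```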

Microcoded instructions can be implemented with dedicated special paths just for that CPU, and take fewer external instructions than equivalent non-ucode counterparts, including long RISC sequences. They look bad in isolation, but considering they amount to a handful of instructions in a whole program, they're not bad at all. Many exist purely for legacy purposes, too; those are most likely going to be on the slow side, and will surely be designed to take up as little die space as possible.

You could do it, yes. But we have x86, and while ARM's memory model can be annoying, it's well-known, both to humans and compilers, and isn't all that bad. As long as we have the memory wall, ISAs will mostly remain uninteresting, in a practical sense, because the ones we have are good enough that ISA differences represent a fairly small amount of difference in actual performance.

* Not that the added GPRs in x86-64 were unwelcome additions!
 

bononos

Diamond Member
Aug 21, 2011
Which old instruction sets are dropped in current CPUs? MMX? Was x87 implemented as microcode starting from the Core CPUs, or was it dropped?
 

Charles Kozierok

Elite Member
May 14, 2012
Instruction sets are really old news as a resource to mine for more performance. What we really need is a radically different processor architecture that can support more parallelism. Something that isn't a von Neumann machine.

We can already parallelize the crap out of things. The problem is Amdahl's law, and a new architecture isn't going to solve it.
 

Ajay

Lifer
Jan 8, 2001
We can already parallelize the crap out of things. The problem is Amdahl's law, and a new architecture isn't going to solve it.

Amdahl's law is only a problem because too many programmers believe it is 'all that'. Take a look at Gustafson's law. DEC had modeled SMT processors and found that four hardware threads per core was the optimal number. After that, the diminishing performance returns weren't worth the extra xtors required to support more threads.

Intel is still stuck at two threads per core. Computer Science majors need to take algorithm courses which teach them how to extract maximum parallelism from their code, so that it becomes cheaper for ISVs to implement.
 

Cerb

Elite Member
Aug 26, 2000
Amdahl's law is only a problem because too many programmers believe it is 'all that'. Take a look at Gustafson's law.
Gustafson's Law in no way refutes Amdahl's Law. They are merely different perspectives on the same concepts.

Amdahl's Law is "all that" for any application which has strict serial dependencies but more or less indefinite data and time bounds. For a general-purpose algorithm, it's the best you'll be able to get. Gustafson's Law concerns itself with scalability of data sets (and time), which may also be limited. The two are equivalent, but for different problem cases. In the case of Gustafson's Law, scalability only increases linearly if the computation per processor is fixed. If the computation per processor increases as the data size increases, the scalability won't be linear, but rather a curve that flattens out, or approaches an asymptote, as data size and/or processor count increases, just as with Amdahl's Law's typical applications.
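For anyone who wants to see the two side by side, a quick throwaway sketch of the usual formulations (the parallel fraction p and the processor counts are arbitrary):

```c
/* Amdahl (fixed problem size):      S(n) = 1 / ((1 - p) + p/n)
 * Gustafson (problem grows with n): S(n) = (1 - p) + p*n
 * p is the parallelizable fraction of the work. */
#include <stdio.h>

static double amdahl(double p, double n)    { return 1.0 / ((1.0 - p) + p / n); }
static double gustafson(double p, double n) { return (1.0 - p) + p * n; }

int main(void) {
    double p = 0.95;                          /* 95% parallelizable, for example */
    for (int n = 1; n <= 64; n *= 2)
        printf("n=%2d  Amdahl=%6.2f  Gustafson=%6.2f\n",
               n, amdahl(p, n), gustafson(p, n));
    return 0;
}
```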

The implication of Amdahl's Law is that some problems will never be worked on with systems like GPUs, instead requiring faster processing from each processor. The implication of Gustafson's Law is that it becomes worthwhile to do more processing, after a point, rather than process more stuff (such as more in-depth data mining, providing real-time statistics, etc., instead of finding more raw data to process)--or, in today's world, just go idle and save electricity.

It would be good to keep in mind that in 1967 there were far more problems out there that computers weren't fast enough for at all, and a simpler, faster processor really was much faster than a complicated unit with many processors, so infinite data/time bounds made much more sense then than in 1988, by which point computers were common business items, able -- and often required -- to process data as fast as or faster than it could be presented to them.

Today, though, you should really be moving to using Gunther's Law, which encompasses both, without the work of deriving one from the other.
http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability
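The formula behind that link, for reference, is usually written C(N) = N / (1 + sigma*(N - 1) + kappa*N*(N - 1)), where sigma models contention (the Amdahl-like serial fraction) and kappa models coherency cost; with kappa = 0 it collapses back to Amdahl's Law. A tiny sketch, with made-up parameter values:

```c
/* Gunther's Universal Scalability Law: relative capacity at N processors.
 * sigma = contention, kappa = coherency delay; both are made up here. */
#include <stdio.h>

static double usl(double n, double sigma, double kappa) {
    return n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0));
}

int main(void) {
    double sigma = 0.05, kappa = 0.002;       /* hypothetical fitted values */
    for (int n = 1; n <= 64; n *= 2)
        printf("N=%2d  relative capacity=%6.2f\n", n, usl(n, sigma, kappa));
    return 0;
}
```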
DEC had modeled SMT processors and found that four hardware threads per core was the optimal number. After that, the diminishing performance returns weren't worth the extra xtors required to support more threads.
Intel is at 2 for mainstream CPUs, and can feed 2 well. More than 2 threads now, and we'd be back to the poor quality of HT on the P4. Response time matters. Alpha was going for maximum throughput. On real workloads, existing Alphas were able to max out their bus (one of several reasons for the IMC on the K8), and at the time, they could keep on scaling well. That's apples and oranges. The guys behind the SPARC T-series figured more threads would help, for instance, and those CPUs are no slouches in the right setting; even at Oracle's costs (what can you pay, today?), they have managed to provide non-kool-aid drinkers with real value. There's no perfect universal number, nor a perfect way to implement multithreading. 4 may have been the ideal count for the super-wide Alpha-to-be. That does not make for a universal truth.

Intel is still stuck at two threads per core.
Actually, they are up to 4, if you want to go try that counting game. They apply 2 on mainstream CPUs, because we care about more than just keeping the ALUs busy. There have consistently been cases where turning HT off, going back to 1 thread per core, is an improvement. Fewer cases with each generation, but all these years on, it still happens. As long as memory is not instant, it will keep on happening, too, as long as they use shared resources (as opposed to say, fully partitioned SMT).
Computer Science majors need to take algorithm courses which teach them how to extract maximum parallelism from their code, so that it becomes cheaper for ISVs to implement.
Where do you find a course that teaches you algorithms that either (a) cannot exist or (b) have not yet been created? They simply don't exist, for a wide variety of real problems. Then, in some cases, when they do exist, the less-parallel versions are faster, in practice, because the parallel versions have such high overhead. Stones don't bleed.
 

ShintaiDK

Lifer
Apr 22, 2012
Which old instruction sets are dropped in current CPUs? MMX? Was x87 implemented as microcode starting from the Core CPUs, or was it dropped?

None and no.

Only software is starting to drop instruction sets. There is no reason really to remove anything old yet on the hardware side.
 

Idontcare

Elite Member
Oct 10, 1999
Today, though, you should really be moving to using Gunther's Law, which encompasses both, without the work of deriving one from the other.
http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability

Interesting, it looks remarkably similar to Amdahl's Law as modified by Almasi and Gottlieb to account for the time-impact of data coherency and thread propagation. (their work was published 1989, Gunther's was in 1993...maybe he built on their work or independently developed basically the same model?)

[Image: Amdahl's Law augmented by Almasi and Gottlieb]


The Almasi/Gottlieb extension works great when modeling scaling data to determine how non-core hardware features (ram latency/bandwidth, network latency/bandwidth, software models for data parsing and splicing, etc) impact the scalability of the problem set for a given compute model. (the same outcome as Gunther's model as far as I can gather from the wiki page)

[Image: LinX thread scaling]


I like this Gunther's Law; gonna have to check into it more to see what it offers over the prevailing Almasi/Gottlieb model.
 

sm625

Diamond Member
May 6, 2011
This is one of those things that really irks me about AMD. They have both a cpu and gpu in house. And they have a set of drivers that very closely interact with both. Just by simply profiling all the most popular PC games, they could come up with a dozen new instructions that optimize the most-often-executed codepaths to produce an easy 10% framerate boost across the board. Such additions to the microcode would only cost a few million transistors, and with that small investment they could reclaim the gaming and thus the enthusiast crown. But they have completely and utterly failed in such a way it boggles the mind. A 10% boost is conservative compared to what I believe is possible. I'm thinking more like 10% per year, through constant feedback between game profiling, cpu and gpu microcode upgrades, and driver upgrades. And this 10% would go right on top of other types of architectural and process advancements. 7 years after purchasing ATI, AMD cpus should have been doubling the framerates vs intel in games.
 

Cerb

Elite Member
Aug 26, 2000
Interesting, it looks remarkably similar to Amdahl's Law as modified by Almasi and Gottlieb to account for the time-impact of data coherency and thread propagation. (their work was published 1989, Gunther's was in 1993...maybe he built on their work or independently developed basically the same model?)
Second mouse gets the cheese? :) Based on looking at citations in papers, right now it looks like he came at it a bit differently, though (link).
 

Olikan

Platinum Member
Sep 23, 2011
This is one of those things that really irks me about AMD. They have both a cpu and gpu in house. And they have a set of drivers that very closely interact with both. Just by simply profiling all the most popular PC games, they could come up with a dozen new instructions that optimize the most-often-executed codepaths to produce an easy 10% framerate boost across the board. Such additions to the microcode would only cost a few million transistors, and with that small investment they could reclaim the gaming and thus the enthusiast crown. But they have completely and utterly failed in such a way it boggles the mind. A 10% boost is conservative compared to what I believe is possible. I'm thinking more like 10% per year, through constant feedback between game profiling, cpu and gpu microcode upgrades, and driver upgrades. And this 10% would go right on top of other types of architectural and process advancements. 7 years after purchasing ATI, AMD cpus should have been doubling the framerates vs intel in games.

IMO, that's because AMD is behind the curve in its own roadmaps, ISA-wise...
XOP instructions were delayed two years because early Bulldozer was a disaster... I remember that AMD has a technology similar to Intel's TSX; it probably got delayed because of Bulldozer as well...
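For reference, the Intel side of that (TSX's RTM mode) looks roughly like this from C -- a hedged sketch of the usual transaction-plus-fallback pattern with the _xbegin/_xend intrinsics from immintrin.h (needs a TSX-capable CPU and -mrtm; AMD's proposed equivalent was ASF, which as far as I know never shipped):

```c
/* Transactional increment with a spinlock fallback. The transaction reads
 * the fallback lock, so it aborts if another thread grabs that lock, which
 * is what keeps the two paths from racing each other. */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_taken = 0;            /* simple fallback spinlock */
static long counter = 0;

static void increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (atomic_load_explicit(&lock_taken, memory_order_relaxed))
            _xabort(0xff);                   /* lock held: bail out */
        counter++;                           /* runs transactionally */
        _xend();
        return;
    }
    /* Transaction aborted (or TSX unavailable): take the real lock. */
    while (atomic_exchange_explicit(&lock_taken, 1, memory_order_acquire))
        ;                                    /* spin */
    counter++;
    atomic_store_explicit(&lock_taken, 0, memory_order_release);
}

int main(void) {
    increment();
    printf("counter = %ld\n", counter);
    return 0;
}
```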

We should be talking about Excavator+ today, and its ISAs... if AMD hadn't screwed everything up.