Is it time to consider a new processor instruction set?

djgandy · Apr 18, 2013

TuxDave said:
I have a personal bias... but here's what I think anyways. So take it with a grain of salt.

A lot of evidence that I've seen internally and externally don't point towards the ISA being the problem. If we were to start over and only use modern instructions w/ fixed length formats designed to be easily decoded for hardware simplicity at the cost of memory, it would be in the low single digit percentage.

Maybe people more informed on the software side could chime in. The CPU already exposes some aspects of its hardware to developers via perfmon signals. I see those as a trial and error system that software can keep tweaking to minimize the number of "bad events". However, if we exposed more of the hardware and had less hardware agnostic software, maybe we can afford to start removing a lot of the general purpose hardware used to speed it up. Maybe software can tell hardware to act weird and drop all instructions that have a cache miss.

Maybe that's a good idea. Maybe I'm just punting on the work.

Correct. The ISA is one of the parts of the CPU that has technical documentation that is open to the public so it may seem significant as it can be criticised much more easily. The rest is all secret because it is HARD stuff and a Intel don't want the competition knowing about it. Instruction decode is easy in comparison to what the rest of the chip does.

Idontcare · Apr 18, 2013

Cerb said:
Second mouse gets the cheese? Based on looking at citations in papers, right now, it looks he came about it a bit differently, though (link).

Or could be more of a Newton versus Leibniz type situation in which both independently come up with the same idea but their mathematical notation differ (as do their personal ego's in terms of having credit go to themselves) and in the end the credit goes to whomever wins in the court of public opinion

(FWIW, for folks who aren't aware of the Leibniz–Newton calculus controversy it is a worthy read, hubris is a trait of humanity, it appears we cannot be seperated from it)

Charles Kozierok · Apr 18, 2013

Heh. I just wrote up a whole bunch of stuff about Leibniz's stepped reckoners. Fascinating stuff.

"And now that we may give final praise to the machine we may say that it will be desirable to all who are engaged in computations which, it is well known, are the managers of financial affairs, the administrators of others' estates, merchants, surveyors, geographers, navigators, astronomers ... But limiting ourselves to scientific uses, the old geometric and astronomic tables could be corrected and new ones constructed by the help of which we could measure all kinds of curves and figures ... it will pay to extend as far as possible the major Pythagorean tables; the table of squares, cubes, and other powers; and the tables of combinations, variations, and progressions of all kinds ... Also the astronomers surely will not have to continue to exercise the patience which is required for computation.... For it is unworthy of excellent men to lose hours like slaves in the labor of calculation which could safely be relegated to anyone else if the machine were used."
-- Gottfried Wilhelm Leibniz, 1685

To get back on topic, people would be shocked if they knew just how little of a modern CPU's real estate is actually responsible for processing.

Sleepingforest · Apr 18, 2013

Charles Kozierok said:
To get back on topic, people would be shocked if they knew just how little of a modern CPU's real estate is actually responsible for processing.

I'll bite. How little?

Charles Kozierok · Apr 18, 2013

Bah. I had a nice graphic I saw somewhere in my research travels that showed nicely how basically die sizes are staying roughly the same because they keep increasing L2 cache sizes, and the actual logic is now basically tucked into the corner. I can't find it.

I seem to recall the number being in the 10-20% range.

Sleepingforest · Apr 18, 2013

So it's all basically cache?

Charles Kozierok · Apr 18, 2013

Cache is a big part of it, but there's also all the I/O stuff, memory controller, and lots of little system bits and pieces (sometimes using that awful term "uncore"). And then the IGP in some cases.

Most of the extra transistors we're getting from Moore's Law are not going into processing logic.

nehalem256 · Apr 18, 2013

Llano die-shot. And note some of the CPU portion is the L1 cache.

TuxDave · Apr 18, 2013

Charles Kozierok said:
Bah. I had a nice graphic I saw somewhere in my research travels that showed nicely how basically die sizes are staying roughly the same because they keep increasing L2 cache sizes, and the actual logic is now basically tucked into the corner. I can't find it.

I seem to recall the number being in the 10-20% range.

You are probably referring to something similar to this:

http://marchamilton.wordpress.com/2012/03/06/flops-becoming-almost-free-with-intels-avx/

Let's be clear about what this shows. The arithmetic units are tucked in the corner and so for theoretical computation throughput, that's what you're looking at. However, some may think the rest of the die is for legacy and "bloat". It's NOT.

Instead you have an hardware and architectural ideas that enable high performance hardware for a wide variety of workloads. If you have very simple workloads (like linpack), perhaps you would prefer a CPU that is simpler in architecture as long as it's packed with execution units. Individual CPUs may choke left and right but since you just have a ridiculously large pile of work to do, as long as a lot of the CPUs are still chugging away you're happy. Something like Xeon Phi where you have a smaller core but a stupidly large vector processing unit.

It's definitely a tradeoff and as we start looking for more performance, we have to start understanding more about what type of workloads are we running and how well is it understood (so that software can handle some bad cases in the compile-stage). I'm no compiler guy, and so while I can't offer much original thought for this but from what I see from others, there's a lot of nice "research-ish" things out there focusing on understanding workloads to more efficiently use hardware. At least in my view, you can look # of execution units and look at the IPC of your software. If they're not close, that means execution unit count/area is not limiting your performance. Something else may.

(previous disclaimer still holds... and I'm a hardware guy, not a software guy. So I may be talking out of my ass in the last paragraph)

Idontcare · Apr 18, 2013

Sleepingforest said:
So it's all basically cache?

Its all the "un-core" stuff they shove onto the die to boost IPC and lower power consumption.

IMO it is a tenuous argument to posit that cache is not included in the microarchitecture footprint given that within the processor core the prefechters and circuitry are all optimized in terms of power, latency, clockspeed and so forth on the fundamental premise that the cache will be available to feed the cores.

Not a dig on Charles, just a passing observation regarding the fuzzy line between core and uncore when the cores are sized as they are because they were designed with the implicit expectation of the uncore itself existing alongside.

Take away that cache and the uncore and tell the core-design team they can't rely on that stuff for performance enablement and suddenly you'd find yourself looking at some rather strikingly different core layouts (and footprints) as they try and create a low-latency low-clockspeed core microarchitecture that solely depends on the system ram to feed it.

The two go hand in hand, you get small efficient cores because you go big on the throwing xtors and die-space at all the uncore stuff IMO.

Ajay · Apr 18, 2013

Cerb said:
Gustafson's Law does in no way refute Amdahl's Law. They are merely perspectives on the same concepts.

Amdahl's Law is, "all that," for any application which has strict serial dependencies, but more or less indefinite data and time bounds. For a general-purpose algorithm, it's the best you'll be able to get. Gustafson's Law concerns itself with scalability of data sets (and time), which may also be limited. The two are equivalent, but for different problem cases. In the case of Gustafson's Law, scalability only increases linearly if the computation per processor is fixed. If the computation per processor increases with data size increasing, the scalability won't be linear, but rather, a curve that flattens out, or approaches an asymptote, as data size and/or processors increases, just as with Amdahl's Law's typical applications.

The implication of Amdahl's Law is that some problems will never be worked on with systems like GPUs, instead requiring faster processing from each processor. The implication of Gustafson's Law is that it becomes worthwhile to do more processing, after a point, rather than process more stuff (such as more in-depth data mining, providing real-time statistics, etc., instead of finding more raw data to process)--or, in today's world, just go idle and save electricity.

It would be good to keep in mind that in 1967, there were far more problems out there that computers weren't fast enough for at all, and simpler faster processors were really much faster than complicated units with many processors, so infinite data/time bounds would make much more sense, than in 1988, by which computers were common business items, able--and often required--to process data as fast or faster than it could be presented to them.

Today, though, you should really be moving to using Gunther's Law, which encompasses both, without the work of deriving one from the other.
http://en.wikipedia.org/wiki/Neil_J._Gunther#Universal_Law_of_Computational_Scalability
Intel is at 2 for mainstream CPUs, and can feed 2 well. More than 2 threads now, and we'd be back to the poor quality of HT on the P4. Response time matters. Alpha was going for maximum throughput. On real workloads, existing Alphas were able to max out their bus (one of several reasons for the IMC on the K8). At the time, they could keep on scaling well. That's apples and oranges. The guys behind the SPARC T-series figured more would help, FI, and those CPUs are no slouches, in the right setting, and even at Oracle's costs (what can you pay, today?), have managed to provide non-kool-aid drinkers with real value. There's not a perfect universal number, nor perfect way to implement multithreading. 4 may have been the ideal count for the super-wide Alpha-to-be. That does not make for a universal truth.

Actually, they are up to 4, if you want to go try that counting game. They apply 2 on mainstream CPUs, because we care about more than just keeping the ALUs busy. There have consistently been cases where turning HT off, going back to 1 thread per core, is an improvement. Fewer cases with each generation, but all these years on, it still happens. As long as memory is not instant, it will keep on happening, too, as long as they use shared resources (as opposed to say, fully partitioned SMT).
Where do you find a course that teaches you algorithms that either (a) cannot exist or (b) have not yet been created? They simply don't exist, for a wide variety of real problems. Then, in some cases, when they do exist, the less-parallel versions are faster, in practice, because the parallel versions have such high overhead. Stones don't bleed.

Sigh

Yes, I've heard of Gunther's law and others. The reality is that it will take allot of time and effort (and academic research) to find algorithms to break seemingly serial code into parallel. Sure, there will always be serial code, we will be stuck with that unless someone discovers a compute model completely different from that of a Turing Machine. None the less, we are finding more and more algorithms which accelerate a variety of tasks by using parallel computing. In some case, these algorithms are only able to approximate the correct solutions, but when enough iterations are executed, the error falls below the hardware intrinsic LSB error, the fact that it's an approximation no longer matters (except of multi-word large number and string processing).

DEC's analysis is dated now and only applied to the workloads of the time. Also, just to be clear, I was talking about hardware threads per core, not per CPU. One of the reasons Intel sticks with only two hardware cores, where thread 0 has precedence over thread 1, is to maintain maximum ST performance with a more economical number of xtors. They must have found it to be less expensive to add another core, than add the complexity of adding additional hardware threads (probably in part due to the x86 ISA versus a full RISC ISA). The business of turning off HT in certain situations is likely based on the overhead of managing mutexes and possible some overhead within the processor itself. TSX offers the opportunity for reducing that overhead, at the cost of writing the extra code needed to handled situations when an exception thrown (I haven't fully digested the spec yet, I will when I have the right hardware to test this out).

My basic point is that it will take a sea change in the way software engineers analysis and design projects to make maximum use multi threading and this will need to start in university classes. I don't have all the answers, but I think that we, as software engineers can do a better job if we are trained well enough that it doesn't take much extra time to write (which managers hate) and the tools are improving to debug multi threaded code (I expect that compilers going forward will do more multi-threading for us, just as they will soon do auto-vectorization for us for the upcoming Haswell architecture. What ever lower the business cost barrier of entry to multi-threading is what will move this technology forward.

I've been writing thread coded since 1996, I guess I expected us to be further along that path by now with the introduction of multi-cored consumer CPUs early in the first decade of this century. Maybe I was expecting too much.

Idontcare · Apr 18, 2013

Ajay said:
Sigh Yes, I've heard of Gunther's law and others. The reality is that it will take allot of time and effort (and academic research) to find algorithms to break seemingly serial code into parallel. Sure, there will always be serial code, we will be stuck with that unless someone discovers a compute model completely different from that of a Turing Machine. None the less, we are finding more and more algorithms which accelerate a variety of tasks by using parallel computing. In some case, these algorithms are only able to approximate the correct solutions, but when enough iterations are executed, the error falls below the hardware intrinsic LSB error, the fact that it's an approximation no longer matters (except of multi-word large number and string processing).

Back when it was my life to care about this stuff I came up with the not-so-novel concept of what I called "heterogeneous computing" in which the serial code was intentionally processed on a silly-special hardware configuration while all the parallelizable stuff was farmed out to a sea of slow-as-molasses cores.

At the time I used a processor clocked at 1GHz with a small army of subservient cores clocked at 500MHz to generate the physical data proving out the merits of the model.

Not exactly hardware-agnostic, nor unique, but it basically is the exact same idea captured in the motivation by Intel and AMD to create the whole turbo-core/boost situation on their processors.

(note I am by no means claiming any credit whatsoever for their products or ideas, quite the opposite in fact, my concepts were so generic and trivial that I'm pretty sure anyone and everyone thought of it at the same time back then)

But the question of what to do with the unavoidable unparallelizable stuff has dogged computer science people since well before Amdahl's day.

Conceptually it has dogged mankind going back tens of thousands of years when the first slave-owner had to figure out how best to maximize the productivity of his slaves with himself being the "task master" coordinating their activities in the field. (my apologies if this topic offends anyone, I do not mean to invoke emotional reflections of such hardships visited upon many peoples of this planet

)

Ajay · Apr 18, 2013

Idontcare said:
Or could be more of a Newton versus Leibniz type situation in which both independently come up with the same idea but their mathematical notation differ (as do their personal ego's in terms of having credit go to themselves) and in the end the credit goes to whomever wins in the court of public opinion

(FWIW, for folks who aren't aware of the LeibnizNewton calculus controversy it is a worthy read, hubris is a trait of humanity, it appears we cannot be seperated from it)

People need to be more aware of Leibniz calculus. One of my nephew's physic teacher is, and Leibniz calculus provides some conceptionally less difficult solutions to some basic differentiation and integration problems. It could help some Science/Engineering majors with at least their calculus 1 & 2 classes. I was only 1 of the 30% of incoming science and engineering students who were not required to take remedial math classes in there freshmen year (based on a test we took the fist week of classes). I was shocked, 70% were not adequately prepared for doing the math necessary to complete their major!.

Cogman · Apr 18, 2013

Ajay said:
My basic point is that it will take a sea change in the way software engineers analysis and design projects to make maximum use multi threading and this will need to start in university classes. I don't have all the answers, but I think that we, as software engineers can do a better job if we are trained well enough that it doesn't take much extra time to write (which managers hate) and the tools are improving to debug multi threaded code (I expect that compilers going forward will do more multi-threading for us, just as they will soon do auto-vectorization for us for the upcoming Haswell architecture. What ever lower the business cost barrier of entry to multi-threading is what will move this technology forward.

Functional programming + the actor model is really want you are looking for if you want to get crazy threaded performance.

I'm sure that better methods will come along, but for now that really is the best we got. The problem is that functional programming isn't generally taught and most companies are tied very tightly to languages and tools that make it nearly impossible to do.

(That said, I'm still not 100% sold on the functional programming sales pitch. Wolfram alpha is the only place I really know of that uses it, and it makes a lot of sense there because of what Wolfram does. I don't know if it would make sense for something like a database.)

Cogman · Apr 18, 2013

Ajay said:
People need to be more aware of Leibniz calculus. One of my nephew's physic teacher is, and Leibniz calculus provides some conceptionally less difficult solutions to some basic differentiation and integration problems. It could help some Science/Engineering majors with at least their calculus 1 & 2 classes. I was only 1 of the 30% of incoming science and engineering students who were not required to take remedial math classes in there freshmen year (based on a test we took the fist week of classes). I was shocked, 70% were not adequately prepared for doing the math necessary to complete their major!.

Crazy. Makes sense though. For my school we had a HUGE attrition rate and I'm fairly certain it had to do with the entry level math requirements. Just doing Ohms law was enough that my classes went from 120->20 for the next level class. (With about 6 graduating in my major and related majors, CE and EE).

Admissions was concerned about this, but couldn't figure out what they could do to decrease the attrition rate (Without making the classes overly dumbed down)

tweakboy · Apr 18, 2013

Oh boy I feel like Im in class... good posts gurus.

Charles Kozierok · Apr 18, 2013

Idontcare said:
IMO it is a tenuous argument to posit that cache is not included in the microarchitecture footprint given that within the processor core the prefechters and circuitry are all optimized in terms of power, latency, clockspeed and so forth on the fundamental premise that the cache will be available to feed the cores.

You're right, and I wasn't trying to suggest that.

The context here is a new instruction set. And I'd be shocked if more than 5% of the transistors on a modern CPU have anything to do with that.

jhu · Apr 18, 2013

Cogman said:
Crazy. Makes sense though. For my school we had a HUGE attrition rate and I'm fairly certain it had to do with the entry level math requirements. Just doing Ohms law was enough that my classes went from 120->20 for the next level class. (With about 6 graduating in my major and related majors, CE and EE).

Admissions was concerned about this, but couldn't figure out what they could do to decrease the attrition rate (Without making the classes overly dumbed down)

Recruit better students?

Cerb · Apr 18, 2013

Cogman said:
Functional programming + the actor model is really want you are looking for if you want to get crazy threaded performance.

There's also CSP.

I'm sure that better methods will come along, but for now that really is the best we got.

Not really. It's as good as it needs to be. One thing to keep in mind is that message-passing and buffered systems don't need to be written in functional code (and, many out there aren't--Hadoop would be a good example of this being re-invented all over again). What they do need is to make data passing and sharing explicit, not be afraid to copy data, and for all work not being done to have queues to wait in. In addition, these sorts of behaviors can be implemented at lower levels, too, if needed or desired, rather than needing to bolt on yet another library (especially so with a program that is naturally cooperative, the principles can be applied with none of the overhead).

The big deal is that for a lot of work, parallelism needs to come from concurrency, and concurrency generally necessitates minimizing dependencies, including shared data, that aren't absolutely necessary for efficient operation. The lesser part is that concurrent parallelism can easily subsume non-concurrent parallelism (fork->join, FI). IE, once you have multiple mailboxes/queues/buffers, and some concept of dependency barriers, the old simplistic parallelizations still work as well within that system as they did previously, though they might suffer some very small wall time losses.

I doubt functional programming will, "take over," any time soon (I'd like if it did, but that's unrealistic). However, MPI-based Actor, or CSP, can generally be effectively used anyway, and often are. It just seems like it's all coming along so very slowly, in part because it wasn't a common concern until after 2005.

Idontcare · Apr 18, 2013

Charles Kozierok said:
You're right, and I wasn't trying to suggest that.

My bad then, I misunderstood

Charles Kozierok said:
The context here is a new instruction set. And I'd be shocked if more than 5% of the transistors on a modern CPU have anything to do with that.

Yeah I remember when Hans did a full breakdown analysis of one of AMD's processors (or it may have been an Intel diemap now that I try and think more about it) and he delineated the circuit blocks for some portion of instructions that were being added with the new CPU release (can't recall if it was for SSE2.1 or 4, but it was something like that) and the area-adder for the circuits was just a silly tiny sliver of the overall core size.

And you can see that just from the numbers. A modern "fat core" is somewhere around 20-25mm^2 and has all the circuitry necessary to support ~2000 instructions. The footprint to add another dozen instructions is pretty dang small.

Hulk · Apr 18, 2013

Ajay said:
People need to be more aware of Leibniz calculus. One of my nephew's physic teacher is, and Leibniz calculus provides some conceptionally less difficult solutions to some basic differentiation and integration problems. It could help some Science/Engineering majors with at least their calculus 1 & 2 classes. I was only 1 of the 30% of incoming science and engineering students who were not required to take remedial math classes in there freshmen year (based on a test we took the fist week of classes). I was shocked, 70% were not adequately prepared for doing the math necessary to complete their major!.

I graduated from Rutgers College of Engineering in 1988 with a degree in Mechanical Engineering. At our orientation the person speaking to us said, and I quote, "look to your left, look to your right, both of those people won't be there next year." And they were right, about 2/3 of the engineering students at RU would drop out. It wasn't that hard to get in but they didn't baby you once you got in. You either kept up and survived or you dropped out or more likely over to a less technical major. No remedial math or science classes offered. You took your 5 semesters of Calc, 4 of physics, 2 of chemistry, statics, dynamics, etc.. end of story.

Is it time to consider a new processor instruction set?

Member

Elite Member

Elite Member

Platinum Member

Elite Member

Platinum Member

Elite Member

Lifer

Lifer

Elite Member

Lifer

Elite Member

Lifer

Lifer

Lifer

Diamond Member

Elite Member

Lifer

Elite Member

Elite Member

Diamond Member