Let's Design A Super Modern Computing Platform


_Rick_

Diamond Member
Apr 20, 2012
3,981
74
91
While we're talking about physical changes:

I'm not sure integrating GPU and CPU will be doable for the gaming market, which drives GPU development. Sure, some shader-like units on the CPU are nice to have, but an external GPU is important.

What I would see as a positive change, though, is putting the GPU physically on the same plane as the CPU, so that cooling GPUs with big tower-style coolers becomes possible.
The place where the southbridge currently sits could be taken by a horizontal slot, while most of the southbridge functionality is integrated into the CPU package.

The downside would be that adding multiple graphics cards to a system could become more difficult -- SLI/CF may well become impossible, short of 2-GPU-on-a-card solutions or making the form factor compatible with vertically mounted cards.
Alternatively, some mainboards could drop a few vertical slots in favor of a secondary horizontal slot.

Of course, we all know what the non-OEM industry makes of new form factors...
 

serpretetsky

Senior member
Jan 7, 2012
642
26
101
Is it possible to design a chip to be cooled from both sides? I know older AMD and Pentium chips used to not have any pins directly under the die (although they usually did have some electronics there). Would it be possible to design a chip with no pins directly under the die, cut out that part of the package to give access to the die, and cool it with some sort of heatsink that clamps onto both sides of the chip?
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Is it possible to design a chip to be cooled from both sides? I know older AMD and Pentium chips used to not have any pins directly under the die (although they usually did have some electronics there). Would it be possible to design a chip with no pins directly under the die, cut out that part of the package to give access to the die, and cool it with some sort of heatsink that clamps onto both sides of the chip?
Not easily, if at all. The "bottom" of the chip is where its pads are mated to the wires that go to the pins/pads for the socket (or solder pad grid, if BGA).

Some simple ICs allow for just that, but of course they don't have many pins, and you solder directly to the die's pads. To cool a modern chip, you need a good bit of pressure on the CPU, so a cooler would have to act like a vise. Even if you moved the pins out, and assuming the pads on the die are all on the edges (I think they are, but am not 100% sure), that would mean additional strain where those pads connect to the pads for the socket or PCB, which also need some pressure applied for a socket, and either one-way pressure or no pressure for BGA.

But there's also not really a need to cool from both sides. The majority of non-mobile thermal problems come from the fact that as chips shrink, power density increases. So, for instance, a shrink might get you a 50% smaller chip that uses 66% of the power; over 4 such shrinks, that works out to a density increase of about 3x. Having that much power to dissipate over 1-2 in^2 is a problem, when it used to be that, even though more power was consumed, it was spread out across the system. There is no easy solution to this problem if you need high-performance CPUs/GPUs (if you don't, "disaggregation" should be a very good one).
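
To make that arithmetic concrete, here's a quick back-of-the-envelope sketch in Python, using only the illustrative 50%-area / 66%-power-per-shrink figures from above (not measured data):

```python
# Back-of-the-envelope sketch of the power-density argument above.
# Each shrink is assumed to give 50% of the area at 66% of the power
# (illustrative figures from the post, not measurements).
area, power = 1.0, 1.0
for shrink in range(1, 5):
    area *= 0.50
    power *= 0.66
    print(f"after shrink {shrink}: area={area:.4f}, power={power:.3f}, "
          f"density={power / area:.2f}x")
# After 4 shrinks: ~19% of the power squeezed into ~6% of the area,
# i.e. roughly a 3x increase in watts per unit area.
```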

My issue with ARM: none of their consumer solutions even come close to a Core i7 in both single-threaded and multi-threaded performance, let alone when you factor in overclocking.
So, your main problem with ARM is that they and their partners have been unable to do what AMD and IBM also haven't been able to do, and that everyone else has just given up on (such as NEC and Fujitsu)? Nobody has yet been able to make a CPU that can approach Sandy Bridge single-threaded performance, much less Ivy, Haswell, and beyond. Nobody. Intel is alone, up at the top.

It has little or nothing to do with ARM v. x86, and everything to do with money, talent, skill, and demand. Intel has among the best engineers, obviously good management, and boatloads of R&D money. And there's no demand for super-fast ARM CPUs. Faster than last year's, yes, but not for speed like our desktops have. Windows runs on x86, and 64-bit Linux will need a while to get stable and reliable on ARM. Now, if the manifold ARM server efforts work out, and 64-bit ARM Linux gets stable and fast, then the door will be wide open for speedy ARM designs, by anyone that can afford to try to make them.

Runs for 10 hours on battery, fits in my pocket.

...
They're not hard to get, but you often need to clean them up first. My Discover couldn't do that, with heavy usage, until after I rooted it and disabled all the carrier-added crapware. After doing that, and trying some different mail clients (Aquamail is what I settled on, since I could limit what folders get pushed, and the UI is decent), it has no problem running all day at or above 50% battery, except when stuck in low-signal areas (with light usage, >90% battery).

As to the main topic:

Wanted:

1. An ABI for shared-memory heterogeneous computing (in the works already for x86 and ARM).

2. ECC required.

3. NUMA support baked in (i.e., a standard for querying topology, and allowing software to measure its effects, and/or for hardware to do that and report it; a sketch of what such a query looks like today follows this list).

4. A common DSP IR, guaranteed supported by whatever DSPs actually may be implemented (including just using the main CPU). The DSP having its own fancy vendor-only features is fine, so long as it can be generically programmed as well. As new features become common, they should be standardized in said IR spec, even if it means forcing the slowpoke designers to add those features (IoW, make standards bodies like Khronos grow some balls, and tell the legacy-whiners to modernize).

5. A common GPGPU-like vector IR, as above.

6. Partitioning virtualization support baked in and required.
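
As a rough illustration of item 3 above: Linux already exposes NUMA node layout and relative access costs through sysfs, so a minimal topology query can be done with nothing but file reads (the paths are the usual /sys/devices/system/node/* ones; this is just a sketch of the kind of query a standard would formalize, not a proposed API):

```python
# Sketch: read NUMA topology the way Linux exposes it via sysfs.
# /sys/devices/system/node/nodeN/cpulist lists the CPUs on node N;
# /sys/devices/system/node/nodeN/distance lists relative access costs
# from node N to every node (10 = local). Illustrative only.
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    with open(os.path.join(node, "distance")) as f:
        distances = f.read().split()
    print(f"{os.path.basename(node)}: cpus {cpus}, distances {distances}")
```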

Not wanted:

1. Microkernel OS. Only a handful of good ones have ever existed, and QNX is the only one that has survived well. The OS kernel is an incestuous thing, needing to link together different systems in ways that basic messaging makes more difficult and slower than just having it all in the same memory space. Leave messaging to user-space applications, where it belongs.

2. Legacy API specs and language support requirements. How many years until the OP's list is outdated? Just let that go as it will. If vendors can't agree on their shit, that's too bad, but it's not a problem for a platform spec to solve, unless that platform spec is from a software middleman, such as Google, Apple, or MS. That kind of thinking was moved away from in the 80s, and that was a very good thing.

Even with that said, Python and HTML5 for the kernel or base system? The kernel needs to be small, fast, scalable, and secure. It also needs to do absolutely zero work where HTML technologies of any kind would be useful. As for replacing the likes of BASH and Perl, Python and PHP have been used for that in the past, and are actually OK at it, but there's not really a compelling need; HTML and CSS would be 100% pointless, and Javascript would be in no way helpful (also, systemd is keeping maintainers busy these days, anyway :)).

3. Storage: software went to SCSI ages ago, both in *n*x and Windows, regardless of what's actually targeted. Hardware that works should be allowed, because the user needs or wants it.

4. Tiered RAM: there is no magic pixie dust. If somebody could make RAM faster without being much more expensive and/or higher latency, they would. They do their best as new standards come about. While not including extra bits per chip for ECC makes me go grrr, issues of bandwidth and latency do matter to JEDEC members, and if you do some research into the history and reasoning behind what they choose, you'll find they generally do a good job of it, save for the RDRAM fiasco.

5. File system. Kind of like with the API and language specs. ZFS is now a solid file system, with age and cruft, and many, "we could have done that better," feature implementations. It also largely ignores single-disk setups, and is a RAM hog, being made by Unix server guys for Unix servers. It's good, yes, but it's the present, not the future. The future will have more aggressive CoW, tend to be logging, and/or implement log-like transaction stores within extents, now that much of the bad stuff has been figured out for LSFSes and block-level CoW, and should eventually provide some resiliency for single-drive systems.
 
Last edited:

serpretetsky

Senior member
Jan 7, 2012
642
26
101
Not easily, if at all. The "bottom" of the chip is where its pads are mated to the wires that go to the pins/pads for the socket (or solder pad grid, if BGA).
Ahh, I see. Yeah, I just assumed most of them looked something like this:

But there's also not really a need to cool from both sides. The majority of non-mobile thermal problems come from the fact that as chips shrink, power density increases. So, for instance, a shrink might get you a 50% smaller chip that uses 66% of the power; over 4 such shrinks, that works out to a density increase of about 3x. Having that much power to dissipate over 1-2 in^2 is a problem, when it used to be that, even though more power was consumed, it was spread out across the system. There is no easy solution to this problem if you need high-performance CPUs/GPUs (if you don't, "disaggregation" should be a very good one).
But this is the very problem I am trying to address! If you cool both sides of the die you are effectively doubling the area that you can cool. Twice the area means twice the heat flow for any given temperature.
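
The claim is just heat transfer scaling linearly with area; a minimal sketch with made-up numbers (the coefficient and temperature delta are purely illustrative):

```python
# Minimal sketch of "twice the area means twice the heat flow at a given
# temperature": Q = h * A * dT. h and dT are made-up, illustrative values.
h = 10000.0              # effective heat-transfer coefficient, W/(m^2*K)
dT = 40.0                # die-to-heatsink temperature difference, K
area_one_side = 2.0e-4   # ~2 cm^2 of die area, in m^2

q_one_side = h * area_one_side * dT
q_both_sides = h * (2 * area_one_side) * dT
print(q_one_side, q_both_sides, q_both_sides / q_one_side)   # ratio is exactly 2.0
```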
 
Last edited:

GWestphal

Golden Member
Jul 22, 2009
1,120
0
76
You could use micro-jet MEMS cooling. I was going to suggest that earlier but I forgot to add it. Pretty neat technology.

@ Cerb, could you spell out some of your acronyms? Not familiar with all of them.

I agree, HTML and Python aren't for kernel-level programming (you could use RPython or CPython because they bootstrap), but HTML and Python for the application level of the OS.

You can't be NUMA and hUMA at the same time, can you?

The Mach kernel underlying OS X is microkernel-ish, and it's done pretty well over the years. I think with a shared cache to hardware-accelerate much of the kernel operation, you could make a much more secure and performant kernel.

I was doing some looking and I can't find any SRAM chips over 1 Mb. So it looks like DRAM wins, though if someone could get the lead out on memristors, we could merge RAM and storage.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
@ Cerb, could you spell out some of your acronyms? Not familiar with all of them.
BGA: ball grid array.
ICs: integrated circuits. As used, to refer to fairly simple chips, like voltage regulators, battery chargers, and so on.
ECC: error checking and correction. Everything but your RAM has the ability to, at the least, tell if something went wrong. This feature has been artificially segmented into being server-only.
IR: intermediate representation, such as LLVM, Java, or .NET bytecode.
SCSI: small computer systems interface, which has become a server-centric hardware spec, due to adding complexity and features most of us don't care about, but the de facto software spec, due to the basic protocol being straightforward.

I agree, HTML and Python aren't for kernel-level programming (you could use RPython or CPython because they bootstrap), but HTML and Python for the application level of the OS.
At that point, though, there's no need to require it. C/C++ support in the OS will take care of it (already does). A newer scripting language for init would be nice, but not Python. It would need to be something that understood multiprocessing and/or concurrency, to be really worth it, which standard Python does not. I'm no fan of BASH, and even less of Perl, but most other popular scripting languages have all of their faults except for arcane syntax choices, IMO.

You can't be NUMA and hUMA at the same time, can you?
Sure you can. NUMA v. UMA is about memory buses, and locality. From CPU A, going from core 1 to core 3 is slower than staying in core 1. Going to CPU B's cache is slower than staying in your own cache (it may be as slow as going out to your own main memory). Going to CPU B's DRAM will be slower than accessing CPU A's DRAM. And so on. AMD's hUMA is a marketing term that has little to do with that, and mostly to do with task scheduling across different types of processors, with the main CPU having final say over how it works.

The Mach kernel underlying OS X is microkernel-ish, and it's done pretty well over the years.
The Mach kernel has not done well over the years, performance-wise. Its weaknesses are well known, and software that needs to has worked around them, much like developers typically do on Windows. It is stable, it works well, and it has a good enough security model that most vulnerabilities require enough work to exploit that they haven't even needed to patch them. But performance is not, and has not been, a strong point.

http://www.anandtech.com/show/1702/8
http://www.phoronix.com/scan.php?page=article&item=macosx_108dp1_ubuntu&num=3
http://www.phoronix.com/scan.php?page=article&item=macosx_108dp1_ubuntu&num=4
http://www.phoronix.com/scan.php?page=article&item=macosx_108dp1_ubuntu&num=5
Only a few wins.

But, what you see as a desktop or notebook user is this:
http://www.phoronix.com/scan.php?page=article&item=macosx_108dp1_ubuntu&num=2
Which has nothing to do with the kernel itself. Basic kernel performance is, outside of server-type uses, a minor factor in the user experience compared to driver optimizations, power-conservation implementations, and application user interfaces, once the kernel is "good enough." The inefficiencies in the kernel don't really affect you. But they're there, and cross-platform software devs don't go changing their software to meet OS X's specific needs, hence the death of the Xserve line.
I think with a shared cache to hardware-accelerate much of the kernel operation, you could make a much more secure and performant kernel.
First, more performant and secure than what? Second, how is what you're talking about in any way different from any modern CPU/SoC? They all share cache between CPU cores, and going forward, they'll all end up sharing cache between everything (Intel already does this, nVidia might, AMD will, and I don't know about others' near-term plans).

I was doing some looking and I can't find any SRAM chips over 1 Mb. So it looks like DRAM wins, though if someone could get the lead out on memristors, we could merge RAM and storage.
NAND is already going to do that (as in, Micron has pledged to release the DIMMs for sale--we'll just have to see, beyond that!). True NVRAM would be very nice for flash storage, though, and I hope it starts becoming common (it would greatly simplify being able to, without great added costs, offer reasonable guarantees against SSD data loss or bricking on unexpected power losses). As it is, DRAM is already the choice for big CPU caches. SRAM needs to be simple to access (figuring out what and where to access it can be complicated, though), and very close, to be a big boon, because otherwise, getting from point A (in the CPU) to point B (the actual data to be written or read) will eat up most, if not all, of the advantages SRAM can offer.
 

NTMBK

Lifer
Nov 14, 2011
10,452
5,839
136
I agree, HTML and Python aren't for kernel-level programming (you could use RPython or CPython because they bootstrap), but HTML and Python for the application level of the OS.

Only if you want your apps to be laggy as hell!

I can certainly get on board with using Python in app development, but only in certain places: wiring together logical components at a high level is a very good candidate for Python. But this is only because Python interfaces beautifully with C and C++, which is what you want to actually write the high-performance sections of an app in. Don't give me any "But it's hard!" nonsense. If you want the thing to run fast, write it in native code.
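
For what it's worth, the "glue Python to native code" pattern can be as simple as the standard library's ctypes; a minimal sketch calling the C math library (assumes a Linux system where it's named libm.so.6; the library name differs on other platforms):

```python
# Minimal sketch: calling a C library function from Python via ctypes.
# Assumes a Linux system where the C math library is libm.so.6; the name
# differs elsewhere (e.g. libSystem on OS X). Illustrative only.
import ctypes

libm = ctypes.CDLL("libm.so.6")
libm.cos.argtypes = [ctypes.c_double]   # declare the C signature
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))   # 1.0, computed by native code, not by the interpreter
```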
 

GWestphal

Golden Member
Jul 22, 2009
1,120
0
76
Those links are interesting, but they show Mach (microkernel) matching or exceeding Linux (monolithic kernel) in every test... the two where it was at 50% of the performance were noted to have used half the number of threads, so it would be expected to run 50% slower.

I'm not sure the approach I am talking about, dedicating certain cores to certain kernel tasks and using shared cache, is significantly different, but isn't completely shared UMA cache a relatively new thing? Prior to Haswell, didn't each core have its own independent cache, making message passing (calculate, write, transfer, write, read, calculate), or 6 steps per message passed, versus (calculate, write, read, calculate), 4 steps, with a UMA cache? That's roughly 30% faster message passing.
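
Spelling that step count out (treating every step as equal cost, which is a big simplification, since a cache-to-cache transfer costs far more than one instruction):

```python
# Sketch of the step-count argument above, with every step treated as equal
# cost (a big simplification: a cache-to-cache transfer is much more than
# one instruction's worth of time).
private_cache = ["calculate", "write", "transfer", "write", "read", "calculate"]
shared_cache = ["calculate", "write", "read", "calculate"]

saving = 1 - len(shared_cache) / len(private_cache)
print(f"{len(private_cache)} steps vs {len(shared_cache)} steps: "
      f"~{saving:.0%} fewer steps with a shared cache")
```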

Again, I could be way off base here; I'm not nearly as knowledgeable as some of you experts. I'm just trying to cobble together what I've pieced together to make a more efficient design.

Well, I don't know if it's there yet, but with LLVM and PyPy, Python can get pretty darn close to C++ speed.
 

velis

Senior member
Jul 28, 2005
600
14
81
http://stackoverflow.com/questions/4537850/what-is-difference-between-monolithic-and-micro-kernel
Basically, a microkernel will have performance issues with things that require lots of operations, such as 3D graphics.

Personally, I'd lean towards a hybrid of sorts, where a monolithic-like build would be allowed, with all the high-performance drivers compiled into the kernel itself and the less performance-hungry ones loaded as "traditional" drivers. Of course, the appropriate APIs would still be in place for the user to choose how they want to install a driver: in the same ring or in one of the lower (higher) ones. This way you could actually choose whether you favour performance over stability.
This solution does present a performance issue itself: since 3-4 drivers are already compiled in (standard VESA, your GFX-specific one, a same-ring API forwarder, a lower-ring API forwarder), it imposes a bit of a performance hit on each call just to choose one of the 4 available mode paths...

Python can get close for some operations, but far from all. Its basic fault is that it can't do multithreading on a platform that offers 16 cores. The BDFL will argue that in such a case multiprocessing should be used, but if multiprocessing were such a panacea, multithreading would never have been invented.
That said, Python is great for the final level of software (the user apps), but even there you'd probably need high performance outside of classic matrices every now and then, multithreading-provided or otherwise.
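
For reference, the workaround being referred to looks like this in standard CPython: multiprocessing sidesteps the GIL by spreading the work across worker processes (busy_sum here is just a made-up CPU-bound toy workload):

```python
# Sketch: CPU-bound work spread across cores with multiprocessing, since
# CPython's GIL keeps threads from running Python bytecode in parallel.
# busy_sum is a made-up toy workload, purely illustrative.
from multiprocessing import Pool

def busy_sum(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # one worker per core to keep busy
        results = pool.map(busy_sum, [10**6] * 4)
    print(results)
```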
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I'm not sure the approach I am talking about, dedicating certain cores to certain kernel tasks and using shared cache, is significantly different, but isn't completely shared UMA cache a relatively new thing? Prior to Haswell, didn't each core have its own independent cache (...)
Yes and no. Every line of cache exists in L3 (and now L4). Each CPU then has its own L1 and L2 cache, which are much smaller, and separate per core. If another CPU needs one of those lines, it must send it, or update L3 and have it retrieve from there (I can't recall which way it does it). They started that, with multiple cores, as of the Core Duo, and have been refining it since.

Dedicated processors may or may not use the CPU's cache, but they often will spill into it (DMA transfers will, on new Intel models, and Intel's IGP uses the CPU's cache). Going forward, since cache is so cheap for everybody now, that's going to become standard. In many cases, though, it's not well documented for us. FI, does Qualcomm's Hexagon use the CPU's cache, or go straight to the memory bus?

Those links are interesting, but they show Mach (microkernel) matching or exceeding Linux (monolithic kernel) in every test... the two where it was at 50% of the performance were noted to have used half the number of threads, so it would be expected to run 50% slower.
Timed MAFFT Alignment, 7-zip, Postmark, and everything AT tested don't have that issue.

OS X isn't bad, but, like Windows, it's got some issues, and its microkernel heritage is part of them. There is no good reason to use a microkernel if starting from scratch.
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
Actually, it could be that way too: in higher-performance systems you could have a card where you put your own chips, and there would be a chip standard too. I never really considered the distances; I guess when it comes to today's clock speeds and performance, distance actually does matter. So chances are what you'd end up with is that one card would be the "system" (what is today a motherboard), and the other slots would be expansion for other stuff like hard drives and so on.

If you want hyper-modular, you will need to find ways to make the components very small, or you sacrifice some speed in making them modular.

Your big modular system sounded like some of the new higher-end blade platforms, however. You should check out Nutanix (I think that's the spelling). They have something very similar to what you said.
 

Zodiark1593

Platinum Member
Oct 21, 2012
2,230
4
81
If you want hyper-modular, you will need to find ways to make the components very small, or you sacrifice some speed in making them modular.

Your big modular system sounded like some of the new higher-end blade platforms, however. You should check out Nutanix (I think that's the spelling). They have something very similar to what you said.
If only we could get a printer that can fab individual chips for customers on demand (for a set of standardized boards, perhaps). If I want a specific CPU, GPU, and memory interface, I could license the required blocks, select the wafer size I require, pay for the print, and now I have the SoC that fits my needs.
 

NTMBK

Lifer
Nov 14, 2011
10,452
5,839
136
If only we could get a printer that can fab individual chips for customers on demand (for a set of standardized boards, perhaps). If I want a specific CPU, GPU, and memory interface, I could license the required blocks, select the wafer size I require, pay for the print, and now I have the SoC that fits my needs.

How about faster and cheaper FPGAs? What if you had a single piece of silicon, and every time you wanted to run a piece of software the hardware would actually reconfigure itself to be absolutely optimal? Imagine a GPU which could rebalance the amounts of shaders, texture units, ROPs, and caches it had, based on what game you were about to run on it. Profiler guided hardware! ;)

It's a complete pipedream of course, and FPGAs are a nightmare to develop with. But it's a fun thought experiment.
 

GWestphal

Golden Member
Jul 22, 2009
1,120
0
76
That would be cool. Is it just the compiling of the Verilog that takes a long time on FPGAs, or is it flashing the design into the FPGA? If the flashing is fast, you could have several premade profiles for certain tasks, e.g. physics, image processing, signal processing, etc.

What about implementing a kernel as an ASIC/FPGA? Then at the bottom of the software level you'd have not a kernel anymore, but an operation coalescer or something like that.
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
How about faster and cheaper FPGAs? What if you had a single piece of silicon, and every time you wanted to run a piece of software the hardware would actually reconfigure itself to be absolutely optimal? Imagine a GPU which could rebalance the amounts of shaders, texture units, ROPs, and caches it had, based on what game you were about to run on it. Profiler guided hardware! ;)

It's a complete pipedream of course, and FPGAs are a nightmare to develop with. But it's a fun thought experiment.

Meh.

It will always suffer a 50% or greater loss, simply due to the nature of routing through the chip. Unless you designed an FPGA to exactly match the specs of something you were designing... but then... you just made an ASIC.

There will *always* be speed, power and complexity advantages in an ASIC. If, say, there are 100 million users and almost everyone only ever uses 3 designs, it will actually be cheaper, faster AND more efficient to simply include 3 ASIC chips, rather than one huge FPGA that can emulate all 3 of those ASICs.
 

NTMBK

Lifer
Nov 14, 2011
10,452
5,839
136
Meh.

It will always suffer a 50% or greater loss, simply due to the nature of routing through the chip. Unless you designed an FPGA to exactly match the specs of something you were designing... but then... you just made an ASIC.

There will *always* be speed, power and complexity advantages in an ASIC. If, say, there are 100 million users and almost everyone only ever uses 3 designs, it will actually be cheaper, faster AND more efficient to simply include 3 ASIC chips, rather than one huge FPGA that can emulate all 3 of those ASICs.

Oh yes, an ASIC will beat an FPGA hands down. But if you can reconfigure the FPGA on the fly to do 100 different things, now are we beating 100 ASICs? It's the sliding scale of what it is worth doing as fixed function, and what do you do as reusable silicon. We already see the same tradeoff in today's SoCs- general purpose CPU cores, quite generalised GPU shaders with a few specialised parts (e.g. texture units), and completely fixed function encode/decode blocks for things like video and audio. I have no idea where that balance lies, I just find the concept of FPGAs cool. :) (Never had the opportunity to work with one yet, though, and I hear they're a real pain in the arse.)
 

GWestphal

Golden Member
Jul 22, 2009
1,120
0
76
About i7 vs ARM

An i7 has 4 cores, 85 W of power, and a Geekbench 3 score of 13000.
An A7 has 2 cores and ~3.2 W of power (32 Wh battery with 10 hours of life, which includes everything; realistically it is ~1 W for the CPU/GPU) and a Geekbench 3 score of 2500.

that would mean

i7 has 38 geeks/wattcore
A7 has 416 geeks/wattcore

A7 being over 10x more power efficient.

If you were to assume doubling the power consumption could double the performance (maybe not, I'm just throwing out numbers here), then at about 24 W the ARM architecture would have equivalent performance to an i7 at ~30% of the power consumption.
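
Reproducing that arithmetic with the figures quoted above (the "geeks per watt-core" metric is this post's own construction; the exact A7 number depends on whether you take 3.2 W or round down to 3 W):

```python
# Sketch of the "geeks per watt-core" arithmetic above, using the quoted figures.
# The metric (Geekbench 3 score / (watts * cores)) is the post's own construction.
def geeks_per_wattcore(score, watts, cores):
    return score / (watts * cores)

i7 = geeks_per_wattcore(13000, 85, 4)    # ~38
a7 = geeks_per_wattcore(2500, 3.2, 2)    # ~390 (about 416 if the A7 is rounded to 3 W)
print(round(i7), round(a7), round(a7 / i7, 1))   # ratio is roughly 10x
```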
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
Oh yes, an ASIC will beat an FPGA hands down. But if you can reconfigure the FPGA on the fly to do 100 different things, now are we beating 100 ASICs? It's the sliding scale of what it is worth doing as fixed function, and what do you do as reusable silicon. We already see the same tradeoff in today's SoCs- general purpose CPU cores, quite generalised GPU shaders with a few specialised parts (e.g. texture units), and completely fixed function encode/decode blocks for things like video and audio. I have no idea where that balance lies, I just find the concept of FPGAs cool. :) (Never had the opportunity to work with one yet, though, and I hear they're a real pain in the arse.)

The question is... are there really 100 things that need doing that benefit from a hardware circuit? I would wager most problems solved by home computers can be narrowed down to just a few.

My hands-on with FPGA design was upwards of 10 years ago now, but my intuition is that there are very few situations where an FPGA is a practical replacement for a general-purpose CPU + software *OR* a handful of special-purpose ASICs (like GPU shaders, DSP coprocessors, etc.).

The primary reason to do it often isn't speed (most stuff is going to be just as fast on a comparably designed CPU), and if it is speed, you would go with an ASIC.

Sure there are a few handfuls of situations where an FPGA solves something with reduced complexity, or provides a useful speed boost on a low-volume operation for a laboratory or prototyping scenario (over a comparable software simulator, etc).

I don't know the break-even, given modern designs, etc, but I would think you could include 4-6 (maybe 10) really good quality complex ASICs of various flavors for the cost of a single comparable FPGA, when producing in quantities of 1 million or more.

On a similar note, even within a CPU, we see some of these ASIC design principles. Old K7 (Athlon) cores used to have several dedicated integer pipes, a dedicated FADD/FSUB pipe, a pipe for FMUL/FDIV, and another that only did FDIV. This type of single-purpose, ASIC-like design philosophy reduces the necessary complexity of the system while allowing increases in speed and flexibility.

This, in my eye, is like including half a dozen dedicated ASICs with a big complex switching engine to determine which ones are used at any given time.

Given the linear nature of x86 code, you only had a total of 5 or 6 execution pathways in the old K7, but when you're doing matrix operations or DSP calculations, you can include many more pipes and that's what a modern GPU does, including as many as... what is it now? 2048 per chip?

But on all that, there is the simple question: what other operations do you need done? And are they done frequently enough (and are they variable enough) to justify the FPGA over a series of ASICs?
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
About i7 vs ARM

An i7 has 4 cores, 85 W of power, and a Geekbench 3 score of 13000.
An A7 has 2 cores and ~3.2 W of power (32 Wh battery with 10 hours of life, which includes everything; realistically it is ~1 W for the CPU/GPU) and a Geekbench 3 score of 2500.

that would mean

i7 has 38 geeks/wattcore
A7 has 416 geeks/wattcore

A7 being over 10x more power efficient.

If you were to assume doubling the power consumption could double the performance (maybe not, I'm just throwing out numbers here), then at about 24 W the ARM architecture would have equivalent performance to an i7 at ~30% of the power consumption.

That's completely invented.

There is NOTHING inherent about the ARM architecture that makes it efficient. In fact, the internal uOP architecture in modern i7-era Intel processors is a marvel of RISC engineering.

Also, modern high-performance CPUs throw a ton of transistors at solving complex linearity problems inherent in code. That last 30% of performance costs 3x (or maybe even 8x) more energy. There is no way around this, and parallelization doesn't solve this particular type of problem.

If you want to compare x86 against ARM objectively, take the high-end in low-power chips.

Let's compare the ARM A7 vs. the Intel Atom 4770. The Intel Atom wins most benchmarks by a margin of about 30%. It consumes about 2.5-3 W including all SoC components, which puts it on par, power-wise, with the A7, but with slightly higher performance per watt.

I would put this up to the fact that Intel is half a generation (or more) ahead of everyone else in process design and layup technologies, rather than inherent architectural advantages.

Starting from scratch, ARM isn't a bad architecture, but x86 isn't really that bad, in the end. Now that the I-Cache is behind the CISC translators and only stores uOPs, it might actually be slightly more efficient than an ARM could be, because of the op-combinations and other optimizations that they use on-chip.

The only possible benefit would be a 1-cycle latency decrease (the decode phase of the pipe) for cache-miss operations. But then, 1 extra cycle when looking at RAM latency on a full cache-miss (which on a modern chip, is no less than 75 cycles) is pretty small.

In conclusion, all I'm saying is that ARM vs x86 (as an architecture) is not inherently that different, performance- and power-usage-wise. Nitpicking, ARM probably gets a 0.01% performance improvement by reducing certain cache-miss operations by 1 cycle in latency.
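
A rough sense of why that saving barely registers (the miss rate and base CPI here are assumed, illustrative numbers, not measurements; only the 75-cycle miss penalty comes from the post above):

```python
# Rough sketch of why a 1-cycle decode saving on cache misses barely registers.
# miss_rate and base_cpi are assumed, illustrative values; the 75-cycle miss
# penalty is the figure from the post above.
miss_rate = 0.001    # fraction of instructions that miss all the way to DRAM (assumed)
miss_penalty = 75    # cycles for a full cache miss
base_cpi = 1.0       # cycles per instruction when everything hits (assumed)

avg_cpi = base_cpi + miss_rate * miss_penalty      # ~1.075 cycles/instruction
saved = miss_rate * 1                              # 1 cycle saved per missing instruction
print(f"speedup ~ {saved / avg_cpi:.3%}")          # ~0.093%: a fraction of a percent
```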
 
Last edited:

GWestphal

Golden Member
Jul 22, 2009
1,120
0
76
So, ignoring previous software/compatibility issues, if you were to build the first computer after technology-hating zombies destroyed all the tech in the world, which architecture would you use as a jumping-off point?

If they are similar in power consumption/performance, is one significantly less complicated to fab or to add more cores to, or something like that? Which would move into 3D easier?
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
About i7 vs ARM

An i7 has 4 cores, 85 W of power, and a Geekbench 3 score of 13000.
An A7 has 2 cores and ~3.2 W of power (32 Wh battery with 10 hours of life, which includes everything; realistically it is ~1 W for the CPU/GPU) and a Geekbench 3 score of 2500.

If you were to assume doubling the power consumption could double the performance (maybe not, I'm just throwing out numbers here), then at about 24 W the ARM architecture would have equivalent performance to an i7 at ~30% of the power consumption.

Back of the hand math puts it at a square to cubic relationship if your goal was to increase frequency for performance while using the same architecture. So to double the frequency, you may end up with something like 8x the power.

Edit: In before the 0.5CV^2f folk come in. Doubling frequency instantly doubles your power. But to make it work at the new frequency, voltage needs to be increased and your devices made faster. You then play with the C and V terms of the equation.
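
Plugging numbers into that 0.5·C·V²·f relationship: the assumption that voltage has to rise roughly in step with frequency is what turns "2x" into "~8x" (the values are illustrative, not from any datasheet):

```python
# Sketch of the 0.5*C*V^2*f point above. Assuming voltage has to scale
# roughly in proportion to frequency turns "double the frequency, double
# the power" into roughly 8x the power. Values are illustrative.
def dynamic_power(c, v, f):
    return 0.5 * c * v * v * f

c = 1.0                                    # switched capacitance, arbitrary units
base = dynamic_power(c, v=1.0, f=1.0)
f_only = dynamic_power(c, v=1.0, f=2.0)    # double f at the same voltage: 2x
f_and_v = dynamic_power(c, v=2.0, f=2.0)   # double f with voltage scaled too: 8x
print(f_only / base, f_and_v / base)       # 2.0, 8.0
```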
 
Last edited:

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
So, ignoring previous software/compatibility issues, if you were to build the first computer after technology-hating zombies destroyed all the tech in the world, which architecture would you use as a jumping-off point?

If they are similar in power consumption/performance, is one significantly less complicated to fab or to add more cores to, or something like that? Which would move into 3D easier?

I'm not sure it matters that much. The complexity of the actual architecture itself is far more important than the instruction set.

To be fair, I think I would use a RISC. I'd favor one that was not encumbered by patents. There are a few theoretical instruction sets that are streamlined for multiprocessing and have some unique stuff for parallelization.

In fact, the Intel IA-64 was an interesting architecture that was ahead of its time, but actually is quite brilliant and wouldn't be awful. The inherent parallelism is probably better than a linear instruction set like ARM or MIPS or POWER or one of those (they are all very, very similar).
 

A5

Diamond Member
Jun 9, 2000
4,902
5
81
About i7 vs ARM

An i7 has 4 cores, 85 W of power, and a Geekbench 3 score of 13000.
An A7 has 2 cores and ~3.2 W of power (32 Wh battery with 10 hours of life, which includes everything; realistically it is ~1 W for the CPU/GPU) and a Geekbench 3 score of 2500.

that would mean

i7 has 38 geeks/wattcore
A7 has 416 geeks/wattcore

A7 being over 10x more power efficient.

If you were to assume doubling the power consumption could double the performance (maybe not, I'm just throwing out numbers here), then at about 24 W the ARM architecture would have equivalent performance to an i7 at ~30% of the power consumption.

1) Power and performance don't scale linearly

2) I doubt the iPad gets 10 hours of battery life if you loop benchmarks over and over. And it probably wouldn't get the same score at the end of the battery due to thermal throttling in the chassis. I seriously doubt the A7 is a 1W (or even 3W) chip. It's probably more like a 5W chip at full power.

3) You're comparing chips with completely different goals. As others said, extracting maximum performance requires much more power than "good enough" performance targeting low power. These kinds of edge cases exist in many places in computing - you can look at Amdahl's Law to get an idea of how it works with multi-threaded programs.