What I would really like to see is the underlying RISC operations of Intel and AMD processors exposed as ISAs. Newly compiled programs could then target the underlying ISAs directly, bypassing all the translation logic. Intel and AMD could introduce ways to ship multiple target executables in one binary, just like OS X fat binaries, plus something like Rosetta for when they start producing processors without the x86 translation facilities.
I know the exposed ISAs would probably be rife with IP, but I'm allowed to dream.
This would force them to freeze the underlying ISAs as they are. So far, Intel has changed theirs pretty much every tock, and as the falling price of transistors changes the costs and benefits of various tradeoffs, this is likely to continue. So, on a five-year outlook or more, revealing the underlying ISA would likely hurt performance.
Having your frontend decoupled from your backend is a good thing. I think stumbling into this by accident and necessity is probably one of the reasons x86 won. If only the frontend weren't as horrible as x86's.
If Intel wanted a new ISA, the best choice would probably be functionally almost equivalent to the parts of x86 that everyone uses, but with all the instruction encoding redone. Keep the variable instruction sizes, but make finding instruction boundaries cheaper by clamping the granularity at 2 bytes instead of 1, and either putting the length of the instruction into the first few bits of every instruction, or even putting a header into every group of 16 (or 32) bytes that marks the instruction boundaries.

Use no prefixes or other madness: have a very simple (length+opcode)(reg1)(reg2)[optional room for immediates/more complex addressing modes, in groups of 2 bytes] format that all the most widely used instructions fit into, with the less common ones taking 4 bytes. Remove all partial register updates (all non-64-bit operations zero- or sign-extend) and provide insert instructions in their place. Spend a bit to allow any instruction to skip updating the flags. Cut all the FPUs/SIMD other than AVX, and give it the same instruction-decoding overhaul (fit every AVX instruction into 4 bytes). At least initially, use microcode only to trap complex situations (like a page fault during an unaligned access) -- basically, if you can't implement the instruction in 3 uops or less, don't offer it at all.
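The boundary-finding idea can be sketched in a few lines. This is a toy model of a hypothetical encoding (not any real ISA): lengths come in 2-byte units, and the low two bits of each instruction's first byte declare its length, so a decoder can mark every boundary in a fetch block without parsing prefixes.

```python
# Toy sketch, assuming a hypothetical encoding where instructions are
# 2, 4, or 6 bytes long and the low two bits of the first byte give the
# length in 2-byte units (1..3). Because each instruction declares its
# own length, a decoder can walk a 16-byte fetch block and mark every
# instruction boundary with no prefix or modrm parsing.

def instruction_boundaries(block):
    """Return the byte offsets where instructions start in a fetch block."""
    boundaries = []
    offset = 0
    while offset < len(block):
        boundaries.append(offset)
        length_units = block[offset] & 0b11   # low 2 bits: 1..3
        assert 1 <= length_units <= 3, "reserved encoding"
        offset += 2 * length_units            # lengths are 2/4/6 bytes
    return boundaries

# A 16-byte block holding a 2-byte, a 4-byte, a 6-byte, and two more
# 2-byte instructions (first byte of each carries the length field):
block = bytes([0x01, 0xAA,                    # 1 unit  -> 2 bytes
               0x02, 0xBB, 0xCC, 0xDD,        # 2 units -> 4 bytes
               0x03, 0, 0, 0, 0, 0,           # 3 units -> 6 bytes
               0x01, 0xEE,                    # 2 bytes
               0x01, 0xFF])                   # 2 bytes
print(instruction_boundaries(block))          # [0, 2, 6, 12, 14]
```

In hardware the point is that each 2-byte slot can inspect its own length bits in parallel, then a quick prefix-sum-style combine yields all boundaries in a clock, instead of the serial scan this sequential sketch implies.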
That should fix the worst problems of x86 while actually remaining implementable. You should be able to decode a full block (either 16 or 32 bytes) with one-clock throughput (pipelined). This keeps the OOE hardware busy, and allows the frontend to shut itself down when the buffers are full, saving power. Code density should improve, as many instructions that now take 3 bytes or more would fit into 2. You can power-gate whichever frontend you are not currently using, and the new one should be so much smaller than the x86 one that it basically vanishes into the die. There'd be no need for a uop cache, as decode should more than keep up with execute. And of course, since it would be so much simpler, the new decode hardware would use much less power when it is in use.
Will they make something like this? As AMD64 has shown, the only thing that matters is performance on existing code, so probably not. We can dream.
OK, sorry for the misinformation. It sounds like that makes the compiler a hell of a lot more complicated than it already is, and means that a cache miss actually does stall the CPU in the meantime -- which doesn't sound sensible, considering that power efficiency isn't a forte of Itanium to start with.
IPF has the curious distinction of being perhaps the only still widely used ISA that is actually worse than x86. Shifting complexity from the processor to the compiler using VLIW is a fundamentally bad idea, and not just because compilers took a decade to catch up.
Most real code can be divided roughly into two categories. First, compute kernels, where you can get great IPC, and the challenge is decoding code fast enough for wide designs to take advantage of that IPC. Second, the rest, where you stall for memory and pointer-chase, where there are only occasional opportunities to execute more than one instruction per clock, and the challenge is using as few bits as possible to encode a program whose execution time is bounded by your memory system anyway. Most performance-critical programs spend most of their execution time in the first kind of segment, but by far most of their code consists of the second kind.
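The two categories differ in dependency structure, which a toy example (hypothetical code, just to make the shapes visible) shows clearly:

```python
# Toy illustration of the two kinds of code. The "kernel" is a pile of
# independent operations a wide machine could overlap; the "chase" is one
# long serial dependency chain where each load's address comes from the
# previous load, so width buys you nothing.

def compute_kernel(xs):
    # Every xs[i] * 2 is independent of the others: high ILP, easy to
    # pipeline or vectorize -- the bottleneck is feeding the core
    # instructions fast enough.
    return sum(x * 2 for x in xs)

def pointer_chase(next_index, start, steps):
    # Each step needs the result of the previous one: at most one useful
    # "load" per memory round-trip, no matter how wide the machine is.
    node = start
    for _ in range(steps):
        node = next_index[node]   # serial dependence: address <- prior load
    return node

ring = [1, 2, 3, 4, 0]            # a 5-node circular "linked list"
print(compute_kernel([1, 2, 3]))  # 12
print(pointer_chase(ring, 0, 7))  # 2
```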
IPF is truly great at the first kind -- give me any compute-related microbenchmark and otherwise equivalent IPF and x86 processors, and the IPF will win. It's the assembly tweaker's wet dream. But most real code is the second kind, at which IPF really, really sucks. Each VLIW bundle is 16 bytes, and the instructions within it cannot use each other's results. The most complex addressing mode is register-indirect, so for any memory operation fancier than that, every pointer chase you do costs 32 bytes, or even more. Sure, you'd have plenty of slots in those bundles to fit unrelated instructions into, but you don't have anything to put in them.
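The bloat arithmetic can be modeled in a few lines. This is a deliberate simplification of the argument above, not real IA-64 bundling rules: bundles are 16 bytes with up to 3 slots, and an instruction that depends on another in the current bundle has to start a new one.

```python
# Simplified model of the code-bloat argument (toy rules, not actual
# IA-64 templates/stops): bundles are 16 bytes and hold up to 3
# instructions, but an instruction that reads a result produced in the
# current bundle must start a new bundle.

BUNDLE_BYTES, SLOTS = 16, 3

def bundle_bytes(instrs):
    """instrs: list of sets naming which earlier instructions each one reads.
    Returns total code bytes under the toy bundling rule."""
    bundles = 1
    current = set()                      # instruction ids in the open bundle
    for i, deps in enumerate(instrs):
        if len(current) == SLOTS or deps & current:
            bundles += 1                 # full bundle or dependency: new one
            current = set()
        current.add(i)
    return bundles * BUNDLE_BYTES

# One pointer chase: insn 0 computes the address, insn 1 loads through it.
print(bundle_bytes([set(), {0}]))            # 32 bytes for two instructions
# Three independent instructions pack into a single 16-byte bundle.
print(bundle_bytes([set(), set(), set()]))   # 16
```

Under these toy rules a single add-then-load pointer chase eats 32 bytes of code; the same operation is a one-instruction `mov rax, [rax+8]` on x86, a few bytes.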
The end result is that real, normal code tends to swell up really badly and overflow any sensible instruction cache. Which is just great, because in the absence of OOE, on every miss you really wait. If you ever wondered why they put 1MB of L2 instruction cache per core on the 90nm Montecito, well, now you know. Any savings from the simpler decode and the lack of OOE, they paid back right there, with really heavy interest.
The only reason IPF even exists now is that Intel has segmented all the cool RAS features into Itanium servers -- expensive high-end servers that need reliability use Itanium despite the instruction set, not because of it.
Note that I do think there are places where VLIW shines -- notably, AMD GPUs use it, and it allows them much better compute density than nVidia. But that's because the workloads they get are basically tailor-made to their strengths: in AMD shaders, all memory operations are segregated into separate programs run individually from the shader ops, and pointer-chasing is more or less just banned. Unfortunately, you cannot really do that for general-purpose code. (Itanium should be pretty good at OpenCL, if somebody bothered to write a decent compiler for it.)