Is this hybrid-apu-thing chip design possible/feasible ?

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
I'm thinking a true hybrid apu, one that uses its components, or part of its components, to do both 3d render stuff and "regular cpu stuff" - unlike current apus, where the cpu and gpu are designed to do completely separate tasks.

I got the idea from the way multi core cpus handle multi threaded loads: with encoding, encrypting and other professional sorts of applications as exceptions, most multi threaded programs end up using 1~2 cores properly while barely dipping into the rest on occasion, which means the potential of those remaining cores is usually wasted.

So why not have a cpu design that incorporates more than one type/size of core - a few main cores that handle the grunt of the number crunching, and several "little" cores that pick off whatever non-intensive threads remain?

To expand on the above - since the little cores are, well, small, and could be crammed onto the die even by the dozens, could they act as graphical processors as well? (From what I can see, gpus look like massively parallel, low performance processors.)

^For example, these small cores could either

-Act as complementary cores to the main cores of the cpu, as described two paragraphs up.

or

-Complement the cores/stream processors on the actual GPU.


Like so:

[attached image: classifiedmicrochipplan.jpg]
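A rough Python sketch of the big-core/little-core split described above (all the names, core counts, and thresholds here are made up for illustration): a toy scheduler sends compute-heavy threads to the few big cores and scatters light ones across the many little cores.

```python
# Toy big/little thread placement. Everything here is invented for
# illustration: core counts, threshold, and the threads themselves.

from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    load: float  # fraction of a big core this thread would keep busy (0.0-1.0)

BIG_CORES = 2
LITTLE_CORES = 12
LOAD_THRESHOLD = 0.5  # above this, the thread earns a big core

def place(threads):
    """Greedy placement: heaviest threads first onto big cores, rest onto little ones."""
    placement = {}
    big_used = little_used = 0
    for t in sorted(threads, key=lambda t: t.load, reverse=True):
        if t.load > LOAD_THRESHOLD and big_used < BIG_CORES:
            placement[t.name] = f"big{big_used}"
            big_used += 1
        elif little_used < LITTLE_CORES:
            placement[t.name] = f"little{little_used}"
            little_used += 1
        else:
            placement[t.name] = "queued"
    return placement

threads = [Thread("render", 0.9), Thread("physics", 0.7),
           Thread("audio", 0.1), Thread("network", 0.05)]
print(place(threads))
```

The interesting part of the original idea is that the "little" slots here would double as shader lanes when no light threads need them - this sketch only covers the scheduling half.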



Well? :D


(No, you can't have whatever I'm smoking. Get your own.)
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
What we need is a 4-issue core with 4 integer clusters and an FPU that consists of 16 SPs. The scheduler should be able to process all gpu instructions the same way it handles cpu instructions. The ROPs can be somewhere else on the die, but the shaders definitely should be in the core. If AMD knew what they were doing we'd have 10 times the number of polygons per second we have now. It would be able to take 8 SSE4 instructions, fuse them, and process all of them in one clock cycle via the 16 SPs. It would be able to process four 256-bit AVX instructions in one clock cycle.
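For scale, here's some back-of-envelope lane arithmetic on that 16-SP FPU (my own figures, not sm625's), assuming each SP retires one single-precision lane per cycle:

```python
# Lane arithmetic for a hypothetical 16-SP FPU, assuming one
# single-precision float lane per SP per cycle.

SP_COUNT = 16          # stream processors in the hypothetical FPU
SSE_LANES = 128 // 32  # 4 single-precision lanes per SSE instruction
AVX_LANES = 256 // 32  # 8 single-precision lanes per AVX instruction

sse_per_clock = SP_COUNT // SSE_LANES  # packed SSE instructions per cycle
avx_per_clock = SP_COUNT // AVX_LANES  # packed 256-bit AVX instructions per cycle

print(sse_per_clock, avx_per_clock)  # 4 SSE, 2 AVX
```

By this count, 16 SPs give 4 SSE or 2 AVX-256 instructions per clock; the 8-SSE/4-AVX figures in the post would need each SP to retire two lanes per cycle (dual issue, or counting an FMA as two ops).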
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
I gotta admit, I'm not savvy enough to understand all that, but from the sounds of it, it seems better than my idea.

If I understand correctly, stream processor clusters can be used as floating point units?
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Sony, Toshiba, and IBM are currently preparing their lawsuit against you for copying Cell.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
If I understand correctly, stream processor clusters can be used as floating point units?

Sure. In fact, all AMD needs to do is stick a small amount of logic (scheduler, x86 INT cluster, etc) into a radeon gpu, and the thing could run as a full-blown x86 core, processing integer instructions normally and fusing SSE/MMX and blasting them through the SIMDs at rates that would make an intel shareholder jump off a bridge. This was what I expected from AMD five years ago, and they have proven to be quite adept at royally screwing things up.
 

Puppies04

Diamond Member
Apr 25, 2011
5,909
17
76
So why not have a cpu design that incorporates more than one type/size of core

Like this?...

How do you keep increasing performance in a power constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option available - put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

From here.. http://www.anandtech.com/show/4991/...dualcore-more-power-efficient-highend-devices
 

dguy6789

Diamond Member
Dec 9, 2002
8,558
3
76
If AMD knew what they were doing we'd have 10 times the number of polygons per second we have now

That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
Hmm, so, if ARM does this kind of stuff with their Cortex processors, then why hasn't AMD (or Intel, since the general consensus in the thread seems to be that AMD wastes opportunities) done this?


Sony, Toshiba, and IBM are currently preparing their lawsuit against you for copying Cell.

NOT IF I CROSS THE BORDER INTO MEXICO FIRST
 

ed29a

Senior member
Mar 15, 2011
212
0
0
That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.

You must be new here. Lots of armchair chip designers around here; they obviously would have designed chips that consume 1W or less and put out one bajillion polygons per second.

Pffft, obvious.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.

When a pixel's data goes from HDD to RAM to cache to cpu core, back to cache, over PCIe to gpu RAM to gpu cache to the SIMD, then back to cache(?) before going to the ROP, you can see that a single pixel requires way too much processing - by an order of magnitude. If you can cut all that down to just HDD to RAM to cache to cpu core to ROP, you get a theoretical order of magnitude increase in polygons per watt. Especially if the game textures are stored on NAND flash, the flash is stacked on the die, and there is a special NAND controller wholly dedicated to storing 16GB of game textures and other relevant data such as the game engine and other code it needs. I think 10x polygons per watt is extremely conservative. In fact I will be extremely disappointed if the next game console does not use such a design.
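Tallying the hops named in that post makes the shape of the argument concrete (this only counts stops, it does not measure real latencies):

```python
# Count the hops in the two data paths described in the post above.

discrete_path = ["HDD", "RAM", "cache", "CPU core", "cache", "PCIe",
                 "GPU RAM", "GPU cache", "SIMD", "GPU cache", "ROP"]
integrated_path = ["HDD", "RAM", "cache", "CPU core", "ROP"]

def hops(path):
    """Number of transfers between adjacent stops on the path."""
    return len(path) - 1

print(hops(discrete_path), hops(integrated_path))  # 10 vs 4
```

Hop count alone doesn't translate into a 10x speedup - per-hop bandwidth and latency differ wildly, and PCIe transfers are batched rather than per-pixel - but it does show where the "order of magnitude" intuition comes from.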
 
Last edited:

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
Sure. In fact, all AMD needs to do is stick a small amount of logic (scheduler, x86 INT cluster, etc) into a radeon gpu, and the thing could run as a full-blown x86 core, processing integer instructions normally and fusing SSE/MMX and blasting them through the SIMDs at rates that would make an intel shareholder jump off a bridge. This was what I expected from AMD five years ago, and they have proven to be quite adept at royally screwing things up.

Would kinda be funny if Bulldozer is waiting for GCN to make it complete and competitive with Intel. Not an AMD fanboy, but **** that Bulldozer does need a kick in the nutz!


Profanity is not allowed in the technical forums.
Do not use our Forums to post any material, or links to any material, which is knowingly false and/or defamatory, inaccurate, abusive, vulgar, hateful, harassing, obscene, profane, sexually oriented, threatening, invasive of a person's privacy, or otherwise violative of any law. Special exception to the restrictions on vulgarity and profanity are granted ONLY in the social forums.

AnandTech Forum Guidelines
Administrator Idontcare
 
Last edited by a moderator:

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,712
4,672
75
You forgot one thing.

Intel/AMD CPU: 3GHz, roughly.
nVIDIA SP: 1.5GHz, roughly.
AMD SP: 750MHz, roughly.

Therefore, using AMD SPs in place of a CPU FPU would only do one instruction per four CPU cycles. And you thought Bulldozer's IPC was bad!

Furthermore, I'm not sure about AMD, but nVIDIA's SPs need four cycles to get an instruction through their pipeline. If you have lots of FP instructions queued up, that's fine; but if not, that's 16 cycles/instruction. This could probably be worked around to some degree, but I'm not sure how much.
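Ken g6's clock-ratio argument as plain arithmetic (the round clock figures are from the post; applying his 4-cycle nVIDIA pipeline estimate to the AMD ratio is my own worst-case illustration, not his claim):

```python
# Clock-ratio arithmetic for using GPU SPs as a CPU's FPU.
# Clocks are the rough figures quoted in the post.

CPU_CLOCK_HZ = 3.0e9      # Intel/AMD CPU core, roughly
NV_SP_CLOCK_HZ = 1.5e9    # nVIDIA shader clock, roughly
AMD_SP_CLOCK_HZ = 750e6   # AMD shader clock, roughly
PIPELINE_DEPTH = 4        # SP cycles per instruction (the post's nVIDIA estimate)

amd_ratio = CPU_CLOCK_HZ / AMD_SP_CLOCK_HZ  # CPU cycles per AMD SP cycle
nv_ratio = CPU_CLOCK_HZ / NV_SP_CLOCK_HZ    # CPU cycles per nVIDIA SP cycle

# Worst case: a dependent chain with nothing queued, so each instruction
# waits out the full SP pipeline, measured in CPU cycles.
worst_case = amd_ratio * PIPELINE_DEPTH

print(amd_ratio, nv_ratio, worst_case)  # 4.0 2.0 16.0
```

So even before pipeline effects, an AMD SP standing in for the FPU issues once per 4 CPU cycles; with an unfilled 4-deep pipeline that stretches to 16.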
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
incorporates more than one type/size of core - a few main cores that handle the grunt of the number crunching, and several "little" cores that pick off whatever non-intensive threads that remain?

well, that is almost AMD's idea of heterogeneous computing.

but it is not "small cores to do small things that can gang up", it is more "dedicated stuff to do one thing".

think about it... just some examples:

Quick Sync, found on Sandy Bridge: very small, yet more powerful than a GTX 580 for video transcoding.

The UVD engine found on Bobcats: very small, yet more powerful for 3d video than any cpu alone.

try to imagine beasts like those for, let's say... a companion AI core engine? :cool:
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
You forgot one thing.

Intel/AMD CPU: 3GHz, roughly.
nVIDIA SP: 1.5GHz, roughly.
AMD SP: 750MHz, roughly.

Therefore, using AMD SPs in place of a CPU FPU would only do one instruction per four CPU cycles. And you thought Bulldozer's IPC was bad!

Furthermore, I'm not sure about AMD, but nVIDIA's SPs need four cycles to get an instruction through their pipeline. If you have lots of FP instructions queued up, that's fine; but if not, that's 16 cycles/instruction. This could probably be worked around to some degree, but I'm not sure how much.

Wouldn't it be possible for the hybrid, smaller cores to adjust clock speeds depending on what they're doing - assuming it would be possible for the main cores and said small cores to run at different frequencies?


At Olikan: Nice, I had no idea AMD had their own concept for a heterogeneous apu. I'll have to look into that.
 

dac7nco

Senior member
Jun 7, 2009
756
0
0
You guys seem to be forgetting that graphics SPs aren't x86.

THIS. It's not as rock hard as when I was taking CIS courses, but 6502 assembly was loads easier than 68000. The tools are so much better, and the understanding more widespread, but it's like the answer to Anand's continuing question: why isn't Quick Sync more widespread? Because people forget that Intel doesn't like to release IP... which limits how QS can be used in a production environment.

What Intel doesn't seem to realize is that if they released their IP to the public domain... they would still have a five-year advantage over any potential competitor.

Daimon