Is this hybrid-apu-thing chip design possible/feasible ?

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
I'm thinking a true hybrid apu, one that uses its components, or part of its components, to do both 3d render stuff and "regular cpu stuff" - unlike current apus, where the cpu and gpu are designed to do completely separate tasks.

I got the idea from the way multi core cpus handle multi threaded loads: with encoding, encrypting and other professional sorts of applications as exceptions, most multi threaded programs end up using 1~2 cores properly while barely dipping into the rest on occasion, which means the potential of those remaining cores is usually wasted.

So why not have a cpu design that incorporates more than one type/size of core - a few main cores that handle the grunt of the number crunching, and several "little" cores that pick off whatever non-intensive threads remain?

To expand on the above - since the little cores are, well, small, and could be crammed onto the die even by the dozens, could they act as graphical processors as well? (From what I can see, gpus look like massively parallel, low performance processors.)

^For example, these small cores could either

-Act as complementary cores to the main cores of the cpu, as described two paragraphs up.

or

-Complement the cores/stream processors on the actual GPU.


Like so:

[attached image: classifiedmicrochipplan.jpg]
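A rough Python sketch of the big-core/little-core split described above (all the names, core counts, and thresholds here are made up for illustration): a toy scheduler sends compute-heavy threads to the few big cores and scatters light ones across the many little cores.

```python
# Toy big/little thread placement. Everything here is invented for
# illustration: core counts, threshold, and the threads themselves.

from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    load: float  # fraction of a big core this thread would keep busy (0.0-1.0)

BIG_CORES = 2
LITTLE_CORES = 12
LOAD_THRESHOLD = 0.5  # above this, the thread earns a big core

def place(threads):
    """Greedy placement: heaviest threads first onto big cores, rest onto little ones."""
    placement = {}
    big_used = little_used = 0
    for t in sorted(threads, key=lambda t: t.load, reverse=True):
        if t.load > LOAD_THRESHOLD and big_used < BIG_CORES:
            placement[t.name] = f"big{big_used}"
            big_used += 1
        elif little_used < LITTLE_CORES:
            placement[t.name] = f"little{little_used}"
            little_used += 1
        else:
            placement[t.name] = "queued"
    return placement

threads = [Thread("render", 0.9), Thread("physics", 0.7),
           Thread("audio", 0.1), Thread("network", 0.05)]
print(place(threads))
```

The interesting part of the original idea is that the "little" slots here would double as shader lanes when no light threads need them - this sketch only covers the scheduling half.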



Well? :D


(No, you can't have whatever I'm smoking. Get your own.)
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
What we need is a 4-issue core with 4 integer clusters and an FPU that consists of 16 SPs. The scheduler should be able to process all gpu instructions the same way it handles cpu instructions. The ROPs can be somewhere else on the die, but the shaders definitely should be in the core. If AMD knew what they were doing we'd have 10 times the number of polygons per second we have now. It would be able to take 8 SSE4 instructions, fuse them, and process all of them in one clock cycle via the 16 SPs. It would be able to process four 256-bit AVX instructions in one clock cycle.
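For scale, here's some back-of-envelope lane arithmetic on that 16-SP FPU (my own figures, not sm625's), assuming each SP retires one single-precision lane per cycle:

```python
# Lane arithmetic for a hypothetical 16-SP FPU, assuming one
# single-precision float lane per SP per cycle.

SP_COUNT = 16          # stream processors in the hypothetical FPU
SSE_LANES = 128 // 32  # 4 single-precision lanes per SSE instruction
AVX_LANES = 256 // 32  # 8 single-precision lanes per AVX instruction

sse_per_clock = SP_COUNT // SSE_LANES  # packed SSE instructions per cycle
avx_per_clock = SP_COUNT // AVX_LANES  # packed 256-bit AVX instructions per cycle

print(sse_per_clock, avx_per_clock)  # 4 SSE, 2 AVX
```

By this count, 16 SPs give 4 SSE or 2 AVX-256 instructions per clock; the 8-SSE/4-AVX figures in the post would need each SP to retire two lanes per cycle (dual issue, or counting an FMA as two ops).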
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
I gotta admit, I'm not savvy enough to understand all that, but from the sounds of it, it seems better than my idea.

If I understand correctly, stream processor clusters can be used as floating point units?
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Sony, Toshiba, and IBM are currently preparing their lawsuit against you for copying Cell.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
If I understand correctly, stream processor clusters can be used as floating point units?

Sure. In fact, all AMD needs to do is stick a small amount of logic (scheduler, x86 INT cluster, etc) into a radeon gpu, and the thing could run as a full-blown x86 core, processing integer instructions normally and fusing SSE/MMX and blasting them through the SIMDs at rates that would make an intel shareholder jump off a bridge. This was what I expected from AMD five years ago, and they have proven to be quite adept at royally screwing things up.
 

Puppies04

Diamond Member
Apr 25, 2011
5,909
17
76
So why not have a cpu design that incorporates more than one type/size of core

Like this?...

How do you keep increasing performance in a power constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option available - put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

From here.. http://www.anandtech.com/show/4991/...dualcore-more-power-efficient-highend-devices
 

dguy6789

Diamond Member
Dec 9, 2002
8,558
3
76
If AMD knew what they were doing we'd have 10 times the number of polygons per second we have now

That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
Hmm, so, if ARM does this kind of stuff with their Cortex processors, then why hasn't AMD (or Intel, since the general consensus in the thread seems to be that AMD wastes opportunities) done this?


Sony, Toshiba, and IBM are currently preparing their lawsuit against you for copying Cell.

NOT IF I CROSS THE BORDER INTO MEXICO FIRST
 

ed29a

Senior member
Mar 15, 2011
212
0
0
That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.

You must be new here. Lots of armchair chip designers around here; they obviously would have designed chips that consume 1W or less and put out one bajillion polygons per second.

Pffft, obvious.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
That is quite the claim. There's no way I can just let a statement like that go without some extremely detailed information backing it up.

When a pixel's data goes from HDD to RAM to cache to cpu core, back to cache, over PCIe to gpu RAM to gpu cache to the SIMD, then back to cache(?) before going to the ROP, you can see that a single pixel requires way too much processing - by an order of magnitude. If you can cut all that down to just HDD to RAM to cache to cpu core to ROP, you get a theoretical order of magnitude increase in polygons per watt. Especially if the game textures are stored on NAND flash, the flash is stacked on the die, and there is a special NAND controller wholly dedicated to storing 16GB of game textures and other relevant data such as the game engine and other code it needs. I think 10x polygons per watt is extremely conservative. In fact I will be extremely disappointed if the next game console does not use such a design.
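Tallying the hops named in that post makes the shape of the argument concrete (this only counts stops, it does not measure real latencies):

```python
# Count the hops in the two data paths described in the post above.

discrete_path = ["HDD", "RAM", "cache", "CPU core", "cache", "PCIe",
                 "GPU RAM", "GPU cache", "SIMD", "GPU cache", "ROP"]
integrated_path = ["HDD", "RAM", "cache", "CPU core", "ROP"]

def hops(path):
    """Number of transfers between adjacent stops on the path."""
    return len(path) - 1

print(hops(discrete_path), hops(integrated_path))  # 10 vs 4
```

Hop count alone doesn't translate into a 10x speedup - per-hop bandwidth and latency differ wildly, and PCIe transfers are batched rather than per-pixel - but it does show where the "order of magnitude" intuition comes from.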
 
Last edited:

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
Sure. In fact, all AMD needs to do is stick a small amount of logic (scheduler, x86 INT cluster, etc) into a radeon gpu, and the thing could run as a full-blown x86 core, processing integer instructions normally and fusing SSE/MMX and blasting them through the SIMDs at rates that would make an intel shareholder jump off a bridge. This was what I expected from AMD five years ago, and they have proven to be quite adept at royally screwing things up.

Would kinda be funny if Bulldozer is waiting for GCN to make it complete and competitive with Intel. Not an AMD fanboy, but **** that Bulldozer does need a kick in the nutz!


Profanity is not allowed in the technical forums.
Do not use our Forums to post any material, or links to any material, which is knowingly false and/or defamatory, inaccurate, abusive, vulgar, hateful, harassing, obscene, profane, sexually oriented, threatening, invasive of a person's privacy, or otherwise violative of any law. Special exception to the restrictions on vulgarity and profanity are granted ONLY in the social forums.

AnandTech Forum Guidelines
Administrator Idontcare
 
Last edited by a moderator:

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,712
4,672
75
You forgot one thing.

Intel/AMD CPU: 3GHz, roughly.
nVIDIA SP: 1.5GHz, roughly.
AMD SP: 750MHz, roughly.

Therefore, using AMD SPs in place of a CPU FPU would only do one instruction per four CPU cycles. And you thought Bulldozer's IPC was bad!

Furthermore, I'm not sure about AMD, but nVIDIA's SPs need four cycles to get an instruction through their pipeline. If you have lots of FP instructions queued up, that's fine; but if not, that's 16 cycles/instruction. This could probably be worked around to some degree, but I'm not sure how much.
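Ken g6's clock-ratio argument as plain arithmetic (the round clock figures are from the post; applying his 4-cycle nVIDIA pipeline estimate to the AMD ratio is my own worst-case illustration, not his claim):

```python
# Clock-ratio arithmetic for using GPU SPs as a CPU's FPU.
# Clocks are the rough figures quoted in the post.

CPU_CLOCK_HZ = 3.0e9      # Intel/AMD CPU core, roughly
NV_SP_CLOCK_HZ = 1.5e9    # nVIDIA shader clock, roughly
AMD_SP_CLOCK_HZ = 750e6   # AMD shader clock, roughly
PIPELINE_DEPTH = 4        # SP cycles per instruction (the post's nVIDIA estimate)

amd_ratio = CPU_CLOCK_HZ / AMD_SP_CLOCK_HZ  # CPU cycles per AMD SP cycle
nv_ratio = CPU_CLOCK_HZ / NV_SP_CLOCK_HZ    # CPU cycles per nVIDIA SP cycle

# Worst case: a dependent chain with nothing queued, so each instruction
# waits out the full SP pipeline, measured in CPU cycles.
worst_case = amd_ratio * PIPELINE_DEPTH

print(amd_ratio, nv_ratio, worst_case)  # 4.0 2.0 16.0
```

So even before pipeline effects, an AMD SP standing in for the FPU issues once per 4 CPU cycles; with an unfilled 4-deep pipeline that stretches to 16.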
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
incorporates more than one type/size of core - a few main cores that handle the grunt of the number crunching, and several "little" cores that pick off whatever non-intensive threads that remain?

well, that is almost AMD's idea of heterogeneous computing.

but it is not "small cores to do small things that can gang up", it is more "dedicated stuff to do one thing".

think about it... just some examples:

Quick Sync, found on Sandy Bridge: very small, yet more powerful than a GTX 580 for video transcoding.

The UVD engine found on Bobcats: very small, yet more powerful for 3d video than any cpu alone.

try to imagine beasts like those for, let's say... a companion AI core engine? :cool:
 

-Slacker-

Golden Member
Feb 24, 2010
1,563
0
76
You forgot one thing.

Intel/AMD CPU: 3GHz, roughly.
nVIDIA SP: 1.5GHz, roughly.
AMD SP: 750MHz, roughly.

Therefore, using AMD SPs in place of a CPU FPU would only do one instruction per four CPU cycles. And you thought Bulldozer's IPC was bad!

Furthermore, I'm not sure about AMD, but nVIDIA's SPs need four cycles to get an instruction through their pipeline. If you have lots of FP instructions queued up, that's fine; but if not, that's 16 cycles/instruction. This could probably be worked around to some degree, but I'm not sure how much.

Wouldn't it be possible for the hybrid, smaller cores to adjust clock speeds depending on what they're doing - assuming it would be possible for the main cores and said small cores to run at different frequencies?


At Olikan: Nice, I had no idea AMD had their own concept for a heterogeneous apu. I'll have to look into that.
 

dac7nco

Senior member
Jun 7, 2009
756
0
0
You guys seem to be forgetting that graphics SPs aren't x86.

THIS. It's not as rock hard as when I was taking CIS courses, but 6502 assembly was loads easier than 68000. The tools are so much better, and the understanding more widespread, but it's like the answer to Anand's continuing question: why isn't Quick Sync more widespread? Because people forget that Intel doesn't like to release IP... which limits how QS can be used in a production environment.

What Intel doesn't seem to realize is that if they released their IP to the public domain... they would still have a five-year advantage over any potential competitor.

Daimon