Why not one A15 and 2-4 A7 companion cores?

Hubb1e · Jan 7, 2013

Why not build a chip with one really fast A15 and a handful of A7 CPUs? The A7 is compatible with the same instruction set as the A15, what would be the advantages or disadvantages of such a chip?

TuxDave · Jan 7, 2013

I thought that was the basis of big.LITTLE. What are you asking?

Exophase · Jan 7, 2013

He's asking why Tegra 4 is using a 5th Cortex-A15 for its "companion core." Frankly I don't really understand either. It doesn't make sense using all that area on a chip with a big performance cap if you can get better perf/W at lower frequencies (I'm assuming) using something much smaller..

TuxDave · Jan 7, 2013

Exophase said:
He's asking why Tegra 4 is using a 5th Cortex-A15 for its "companion core." Frankly I don't really understand either. It doesn't make sense using all that area on a chip with a big performance cap if you can get better perf/W at lower frequencies (I'm assuming) using something much smaller..

Oh that's an easy question. That way they can maintain one verilog code base for both the fast cores and slow cores. Save yourself time from coding and validation. After that, just synthesize the companion core for a slow frequency and BAM, you get a free low power core.

CTho9305 · Jan 7, 2013

That reason doesn't make sense. Unless they're staffed at startup levels, the management overhead isn't big enough to make that the right tradeoff. If they're using A15's, they bought the RTL from ARM and likely aren't even "maintaining" it themselves in any meaningful way.

A15 is a monster. A7 is tiny (ARM claims ~0.5mm^2 on 28nm). You're never going to get an A15 in remotely near the area of an A7. ARM also claims A15 is power-inefficient at the performance envelopes where A7 makes sense. Maybe nVidia determined that even the companion core needs A15-level performance to make a good product... or they're lying. It's obvious that they fake die photos, so I'd be careful about drawing conclusions from the tegra4 images out there. edit: Or ARM is lying about the relative efficiencies of their cores.

TuxDave · Jan 7, 2013

CTho9305 said:
That reason doesn't make sense. Unless they're staffed at startup levels, the management overhead isn't big enough to make that the right tradeoff. If they're using A15's, they bought the RTL from ARM and likely aren't even "maintaining" it themselves in any meaningful way.

A15 is a monster. A7 is tiny (ARM claims ~0.5mm^2 on 28nm). You're never going to get an A15 in remotely near the area of an A7. ARM also claims A15 is power-inefficient at the performance envelopes where A7 makes sense. Maybe nVidia determined that even the companion core needs A15-level performance to make a good product... or they're lying. It's obvious that they fake die photos, so I'd be careful about drawing conclusions from the tegra4 images out there. edit: Or ARM is lying about the relative efficiencies of their cores.

I don't know how large the Tegra CPU team is vs the Tegra GPU team or GFX card team but it would be an incredible waste of time and money to have another design team work on getting another part out where performance doesn't matter. Just because nVidia is big doesn't mean that the design team in charge of a specific product is very large. To me, getting a low power core for free is a pretty great idea to free up resources doing something else. I don't really get where the management overhead comes it.

I bolded a questionable statement. When nVidia buys an ARM license, do they just synthesize without any logic changes? So an ARM license is purely backend effort?

Exophase · Jan 8, 2013

I don't think anyone is changing the RTL when they buy ARM licenses. But they will typically need to supply macros to implement the design and not necessarily rely on auto-synthesis for all of it. ie, they will perform hardening using different approaches. ARM also sells hardened cores which AFAIK are used pretty commonly by small Chinese SoC vendors that probably have a lot less money to design to harden it themselves. I wondered before if nVidia was using these hardened cores in Tegra 2 and 3. This would explain why their LP companion core can only go up to 500MHz when everyone else is using LP for everything but can scale well past 1GHz. It could also explain why nVidia had such quick time to market with Tegra 2 despite being relatively inexperienced in low power SoC design.

Of course ARM also provides verification suites.

I really wonder what nVidia is doing here exactly. It'd be a lot of work to do a muxed core internal to an A15 cluster because the design doesn't work this way. It isn't like with Cortex-A9 because the L2 is tightly coupled. If die shot at least somewhat reflects the real floorplan (if not so much a real die shot) it would look like the companion core has its own L2 cache, making it different from Tegra 3's arrangement where a single core is literally muxed. ARM does officially support a configuration allowing two multicore Cortex-A15 clusters connected with a coherency link over the AXI bus. This is the standard way to get an 8-core Cortex-A15. It could be that nVidia is using this arrangement, making it more like "big.big" if you will. What doesn't make sense is how this would provide software transparent exclusive switching as opposed to just looking like a 5 core system.

CTho9305 · Jan 8, 2013

TuxDave said:
I don't know how large the Tegra CPU team is vs the Tegra GPU team or GFX card team but it would be an incredible waste of time and money to have another design team work on getting another part out where performance doesn't matter. Just because nVidia is big doesn't mean that the design team in charge of a specific product is very large. To me, getting a low power core for free is a pretty great idea to free up resources doing something else. I don't really get where the management overhead comes it.

they bought the RTL from ARM and likely aren't even "maintaining" it themselves in any meaningful way.

Click to expand...

I bolded a questionable statement. When nVidia buys an ARM license, do they just synthesize without any logic changes? So an ARM license is purely backend effort?

Sorry, when I said "management", I meant design/data management, not people management. Instead of tar -zxf'ing the RTL for n purchased IPs, you tar -zxf the RTL for n+1 IPs.

The actual incremental manpower to add an extra block (especially a non-critical one) is probably on the order of one engineer (less for smaller / more efficient companies). That said, if your companion core isn't completely identical to the main cores (maybe just with different transistor models--but the same library cells--so you can just treat it as an extra operating condition in a single design), the effort savings for using the same RTL versus different RTL plummets - most of the work you do (e.g. floorplanning, clock tree optimization, phyV/electrical cleanup) can't be reused, so you're almost doing two unrelated blocks anyway.

As for the ARM license stuff, my understanding is that there are multiple options you can buy, ranging from a hard IP to an architecture license (Krait/Swift/Armada/X-Gene/etc). I suspect most customers go for RTL-without-permission-to-modify-it, and I base that on die photos (they all look slightly different) + benchmarks (they all seem to perform the same) + expectations of the manpower tradeoffs (as soon as you make any change to processor RTL that you can't formally verify against the original, you need a pretty big verification effort to ensure you haven't introduced bugs).

Edit: From ARM's website:

TuxDave · Jan 8, 2013

CTho9305 said:
Sorry, when I said "management", I meant design/data management, not people management. Instead of tar -zxf'ing the RTL for n purchased IPs, you tar -zxf the RTL for n+1 IPs.

Even with data management, maybe I'm a little spoiled with what I get to work with because even that leaves me thinking "what's the big deal".

That said, I assumed that the companion core was pretty much identical to the core. If I was management, you start with your existing RTL, floorplan and preroutes and just synthesize it. That's probably the easiest and more inefficient method. The next option is to take the most uncongested areas and simple start chopping off area and repeat until you hit a point where the amount of design errors hit a specific number (representing how much resource you have to clean it up). Then you just stop and go work with what you got.

Thanks for the pointer on the different methods to handle an ARM license. I was always wondering about it on the back of my mind. I really kind of wonder what it would be like working for an ARM based company. No features to make? No RTL to code? Sounds like no fun to me!

djgandy · Jan 8, 2013

It seems like a big penalty to pay and counter intuitive when the goal is power efficiency. Then again so does sticking down 4 x A15's but Nvidia is unlikely to change that as it wold be admitting that they got it wrong on T3 by going quad core.

sontin · Jan 8, 2013

They are using A15 because of the performance. I agree that it makes no sense to use A15 if you want the better perf/watt. But that Tegra4 is a superphone, tablet and now a handheld SoC they think performance matters more than perf/watt.

Exophase said:
This would explain why their LP companion core can only go up to 500MHz when everyone else is using LP for everything but can scale well past 1GHz. It could also explain why nVidia had such quick time to market with Tegra 2 despite being relatively inexperienced in low power SoC design.

nVidia explained it:
On 40nm the G process has a better perf/watt beginning with >500MHz. That's the reason why the companion core is only used up to 500MHz and then the Core 0 from the quadcore kicks in.

djgandy · Jan 8, 2013

sontin said:
They are using A15 because of the performance. I agree that it makes no sense to use A15 if you want the better perf/watt. But that Tegra4 is a superphone, tablet and now a handheld SoC they think performance matters more than perf/watt.

nVidia explained it:
On 40nm the G process has a better perf/watt beginning with >500MHz. That's the reason why the companion core is only used up to 500MHz and then the Core 0 from the quadcore kicks in.

Yeah but they already have 4 x A15s.... Why add a power saving A15 as well? It contradicts the goal of saving power.

sontin · Jan 8, 2013

To save power?!

Cores only use power when they are active.

djgandy · Jan 8, 2013

Yeah but if you are saving power, add an A9 or A7. Once the workload increases it switches to the 4 x A15s anyway. The whole point of companion cores is to handle idle workloads and things like music playback / video decode. The CPU usage is very low here and you just need something ticking over in the background.

sontin · Jan 8, 2013

Music playback goes over the ARM7 when the display is off.

Use the same Core(s) - means the same performance character - make(s) is much easier to scale the clocks up and down.

It is the better way when your companion cores should not be visible to the OS.

djgandy · Jan 8, 2013

sontin said:
Music playback goes over the ARM7 when the display is off.

What ARM7?

sontin said:
Use the same Core(s) - means the same performance character - make(s) is much easier to scale the clocks up and down.

It is the better way when your companion cores should not be visible to the OS.

Well that is just nonsense. You are just trying to guess.

Exophase · Jan 8, 2013

sontin said:
They are using A15 because of the performance. I agree that it makes no sense to use A15 if you want the better perf/watt. But that Tegra4 is a superphone, tablet and now a handheld SoC they think performance matters more than perf/watt.

The companion core is a power consumption optimization no matter how you look at it. The other four cores will still offer the best in peak perf.

The companion core in Tegra 3 was always described as optimized towards the lower ends of the perf curve, hence why they did it on a lower leakage process. Tegra 4 seems to be optimizing in the opposite direction - they want better power consumption at significant percentages of peak perf. That could make sense, but here's my question - why do a separate fifth core for this? I don't know yet what the fifth core can clock to so this may be off base, but if it's similar to what the four cores can clock to when all active (and given power consumption seen so far I expect this clock speed to be significantly below 1.9GHz, maybe even below 1.5GHz) then maybe they should have just made one or two of the four cores power optimized and frequency limited.

But these are just guesses, I'm sure nVidia knows much better than I do, even if they do make weird decisions like no NEON on Tegra 2 and a single-channel memory controller on Tegra 3..

sontin said:
nVidia explained it:
On 40nm the G process has a better perf/watt beginning with >500MHz. That's the reason why the companion core is only used up to 500MHz and then the Core 0 from the quadcore kicks in.

I thought I read that the core can only handle 500MHz period, but that could be totally wrong.. To some extent you want to avoid switching right when it hits the threshold; if it really can run over 500MHz then nVidia shouldn't switch it until after it stays above that for a while because the switching overhead is pretty significant.

I wouldn't say the quadcore kicks in, nVidia has made it clear each core in the G set is power gated and has even described the performance flow as going from the 1 LP core to 1 core at G to > 1 cores at G.

djgandy said:
Yeah but if you are saving power, add an A9 or A7. Once the workload increases it switches to the 4 x A15s anyway. The whole point of companion cores is to handle idle workloads and things like music playback / video decode. The CPU usage is very low here and you just need something ticking over in the background.

A9's not actually an option, not the same instruction set.

Hubb1e · Jan 8, 2013

When I posted the thread I was actually thinking about single threaded vs multi threaded performance. Most of the workload is still on one thread so I was thinking that one A15 core would be great for that. Then it would have 2-4 or maybe even more little A7 cores that would handle both the idle workload and help the big core complete tasks.

Maybe this doesn't make any sense though because the A7 would be so much slower than the A15 that the A15 would be waiting on tasks from the A7s. The A7s are supposed to be 75% of the speed of an A9 so it isn't a slouch, but it's not close to the A15.

The way Nvidia has implemented Tegra is to turn off the other cores when not needed. This doesn't share the load across chips of different performance characteristics. Maybe the core management of a ONE BIG and many little cores isn't viable.

Exophase · Jan 8, 2013

Almost all multicore SoCs can gate individual cores. nVidia made a big deal about it in Tegra 3 precisely because Tegra 2 was one of the few examples of an SoC that couldn't. Releasing a dual core without this capability is bad enough, but a quad core without this capability would be suicidal.

It'd be great if the companion A15 could run in parallel with the others, while on a different voltage and clock speed. Then they'd have partial asynchronous DVFS like Qualcomm has. But I doubt this will be the case.

CTho9305 · Jan 8, 2013

TuxDave said:
Even with data management, maybe I'm a little spoiled with what I get to work with because even that leaves me thinking "what's the big deal".

I think we're agreeing? I don't think the design/data management cost of one extra block is enough to tip the scales in favor of reusing a big power-hungry A15.

If I was management, you start with your existing RTL, floorplan and preroutes and just synthesize it. That's probably the easiest and more inefficient method. The next option is to take the most uncongested areas and simple start chopping off area and repeat until you hit a point where the amount of design errors hit a specific number (representing how much resource you have to clean it up). Then you just stop and go work with what you got.

Have you tried that? I've tried timing-unconstrained synthesis, and the cell area scaling wasn't as exciting as I had expected.

Thanks for the pointer on the different methods to handle an ARM license. I was always wondering about it on the back of my mind. I really kind of wonder what it would be like working for an ARM based company. No features to make? No RTL to code? Sounds like no fun to me!

Yeah... it's different. But there can be interesting problems outside of a CPU core - consider that all AMD could talk about at the Atlas (Cortex A50-series) announcement was how great their Freedom Fabric is. They have to differentiate from other A57-licensees somehow, and it doesn't look like they're doing it with the core.

TuxDave · Jan 9, 2013

CTho9305 said:
Have you tried that? I've tried timing-unconstrained synthesis, and the cell area scaling wasn't as exciting as I had expected.

Actually yes. Timing relaxation in combination with area reduction gave good results, and the interesting part was that area reduction helped constrain the design better to get better results out of timing relaxation! Power was getting wasted not on timing but just sending signals across wasted space of nothingness.

But yeah slight miscommunication on the management statement. I wasn't saying duplicate rtl saves data management but if people actually touched rtl and had to validate it, duplicate rtl saves a ton if front end effort.

beginner99 · Jan 9, 2013

Hubb1e said:
Why not build a chip with one really fast A15 and a handful of A7 CPUs? The A7 is compatible with the same instruction set as the A15, what would be the advantages or disadvantages of such a chip?

I agree with you. Disadvantage would be on software side. How should the OS know on which core (fast, slow) to schedule?

It sure would be great to have 1 fast core like for rendering web pages and smaller ones for background tasks. Personally I don't get quad-core phone chips. Is there any real world scenario for needing 4 cores on a phone? You can't even multi-task properly so...IMHO a 1 fast core (faster than A15) + 1 "low power core" for background tasks should be enough.

However that would require "Tasks" to be somehow tagged for a certain core.

Hubb1e · Jan 9, 2013

Nvidia is betting that these quads are useful for gaming and since they are power gated they don't consume much power when off. But yeah, a quad is still overkill for a modern office use desktop so a quad in a phone seems like a waste of silicon. Gaming and heavy Apps like photo and video editing can take advantage of the quads but in normal use I can't see much use for them. That's why I was proposing one big core and lots of little ones. If it's multi-threaded then 3 small cores could perform better than the 1 big core and 3 A7s fit in less space than 1 A15. I'm not sure how that would match up in terms of speed but it's an interesting idea. The A7s are low power so under low load you can run one of them. Then as soon as you hit a single threaded workload that needs lot of horsepower switch to the A15. Then when it shifts to multithreaded run all 3 A7s.

Exophase · Jan 9, 2013

beginner99 said:
I agree with you. Disadvantage would be on software side. How should the OS know on which core (fast, slow) to schedule?

Same way it knows how to scale frequency/voltage in general - if a process is at 100% CPU time for any sustained period of time give it more headroom. If it runs far enough below 100% CPU time give it less. The longer the process runs the more statistical data you can gather on what it needs. You could perhaps save some statistics as well. Program performance needs tend to be pretty stable and the latency in adjusting these things is fast enough that it doesn't cause a huge problem for the user.

In the case of asynchronous cores it means migrating processes based on their measured needs. In a way I think asynchronous is easier to work with because you have more flexibility - you don't have to make a decision to clock everything based on the maximum demand and decide to run more cores at lower frequencies vs fewer cores at higher frequencies with more threads sharing a core. I mean, in big.LITTLE you still do have to pick frequencies fro the entire big and little clusters respectively, but that still at least gives you some extra leverage.

CTho9305 · Jan 9, 2013

TuxDave said:
Actually yes. Timing relaxation in combination with area reduction gave good results, and the interesting part was that area reduction helped constrain the design better to get better results out of timing relaxation! Power was getting wasted not on timing but just sending signals across wasted space of nothingness.

Interesting...cell sizes in the designs I've looked at were already skewed towards the minimums. Your design must look pretty different. Ever heard the term "<redacted> death spiral"?

But yeah slight miscommunication on the management statement. I wasn't saying duplicate rtl saves data management but if people actually touched rtl and had to validate it, duplicate rtl saves a ton of front end effort.

Yeah, I agree with that completely.

Why not one A15 and 2-4 A7 companion cores?

Senior member

Lifer

Diamond Member

Lifer

Elite Member

Lifer

Diamond Member

Elite Member

Lifer

Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Member

Diamond Member

Senior member

Diamond Member

Elite Member

Lifer

Diamond Member

Senior member

Diamond Member

Elite Member