He's asking why Tegra 4 uses a 5th Cortex-A15 as its "companion core." Frankly, I don't really understand it either. It doesn't make sense to spend all that area on a core with a hard performance cap if you can (I'm assuming) get better perf/W at lower frequencies from something much smaller.
That reason doesn't make sense. Unless they're staffed at startup levels, the management overhead isn't big enough to make that the right tradeoff. If they're using A15s, they bought the RTL from ARM and likely aren't even "maintaining" it themselves in any meaningful way.
A15 is a monster. A7 is tiny (ARM claims ~0.5mm^2 on 28nm). You're never going to get an A15 anywhere near the area of an A7. ARM also claims the A15 is power-inefficient at the performance envelopes where the A7 makes sense. Maybe nVidia determined that even the companion core needs A15-level performance to make a good product... or they're lying. It's obvious that they fake die photos, so I'd be careful about drawing conclusions from the Tegra 4 images out there. Edit: Or ARM is lying about the relative efficiencies of its cores.
I don't know how large the Tegra CPU team is vs. the Tegra GPU team or the graphics-card team, but it would be an incredible waste of time and money to have another design team work on getting out another part whose performance doesn't matter. Just because nVidia is big doesn't mean the design team in charge of a specific product is large. To me, getting a low-power core essentially for free, which frees up resources for something else, is a pretty great idea. I don't really get where the management overhead comes in.
> they bought the RTL from ARM and likely aren't even "maintaining" it themselves in any meaningful way.

I bolded a questionable statement. When nVidia buys an ARM license, do they just synthesize it without any logic changes? Is an ARM license purely back-end effort?
Sorry, when I said "management", I meant design/data management, not people management. Instead of tar -zxf'ing the RTL for n purchased IPs, you tar -zxf the RTL for n+1 IPs.
This would explain why their LP companion core can only go up to 500MHz when everyone else is using LP for everything but can scale well past 1GHz. It could also explain why nVidia had such quick time to market with Tegra 2 despite being relatively inexperienced in low power SoC design.
They are using the A15 because of the performance. I agree that it makes no sense to use an A15 if you want better perf/watt. But since Tegra 4 is a superphone, tablet, and now handheld SoC, they think performance matters more than perf/watt.
nVidia explained it:
On 40nm, the G process has better perf/watt above 500MHz. That's the reason why the companion core is only used up to 500MHz, and then Core 0 of the quad core kicks in.
Music playback runs on the ARM7 when the display is off.
Using the same core(s) (and thus the same performance characteristics) makes it much easier to scale the clocks up and down.
It's the better approach when your companion core shouldn't be visible to the OS.
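The perf/watt crossover claim above can be sketched numerically. This is a toy model, not nVidia's data: all constants (capacitance, V/f curves, leakage figures) are made-up assumptions chosen only to show the shape of the tradeoff, with dynamic power modeled as C·V²·f plus a fixed leakage term.

```python
# Illustrative toy model of the LP-vs-G process tradeoff described above.
# Assumption: the LP process leaks little but needs more voltage to clock
# high; the G process leaks more but clocks cheaply. All numbers invented.

def power_w(freq_mhz, cap_nf, volt, leak_w):
    """Dynamic power (C * V^2 * f) plus static leakage, in watts."""
    return cap_nf * 1e-9 * volt**2 * freq_mhz * 1e6 + leak_w

def perf_per_watt(freq_mhz, cap_nf, volt, leak_w):
    # Same core either way, so treat performance as proportional to frequency.
    return freq_mhz / power_w(freq_mhz, cap_nf, volt, leak_w)

def lp_volt(f):  # hypothetical V/f curve for the LP process
    return 0.8 + 0.0008 * f

def g_volt(f):   # hypothetical V/f curve for the G process
    return 0.9 + 0.0003 * f

for f in (300, 500, 700, 1000):
    lp = perf_per_watt(f, cap_nf=1.0, volt=lp_volt(f), leak_w=0.01)
    g = perf_per_watt(f, cap_nf=1.0, volt=g_volt(f), leak_w=0.15)
    print(f"{f:5d} MHz  LP {lp:7.1f} MHz/W  G {g:7.1f} MHz/W  ->",
          "LP wins" if lp > g else "G wins")
```

With these invented numbers the crossover lands between 300 and 500MHz: below it the LP process wins on perf/watt, above it the leaky-but-low-voltage G process takes over, which is the same shape as the 500MHz handoff described above.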
> They are using the A15 because of the performance. I agree that it makes no sense to use an A15 if you want better perf/watt. But since Tegra 4 is a superphone, tablet, and now handheld SoC, they think performance matters more than perf/watt.
> nVidia explained it:
> On 40nm, the G process has better perf/watt above 500MHz. That's the reason why the companion core is only used up to 500MHz, and then Core 0 of the quad core kicks in.
Yeah, but if you're trying to save power, add an A9 or A7. Once the workload increases, it switches to the four A15s anyway. The whole point of a companion core is to handle idle workloads and things like music playback and video decode; CPU usage is very low there, and you just need something ticking over in the background.
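The switching behavior being debated above can be sketched as a toy cluster-switch governor: one OS-invisible companion core below a demand threshold, the main cluster above it. The threshold and hysteresis margin here are my own assumptions for illustration; a real governor keys off DVFS demand and more state than a single number.

```python
# Toy sketch of Tegra-style cluster switching: the whole workload lives on
# either the companion core or the main cluster, never both. Hysteresis
# (switch up at 500MHz demand, back down only below 400MHz) avoids
# ping-ponging right at the boundary. Thresholds are invented.

class ClusterSwitcher:
    def __init__(self, up_mhz=500, down_mhz=400):
        self.up_mhz = up_mhz      # companion core ceiling mentioned above
        self.down_mhz = down_mhz  # assumed hysteresis floor
        self.on_main_cluster = False

    def tick(self, demanded_mhz):
        if not self.on_main_cluster and demanded_mhz > self.up_mhz:
            self.on_main_cluster = True   # hand off to Core 0 of the quad
        elif self.on_main_cluster and demanded_mhz < self.down_mhz:
            self.on_main_cluster = False  # drop back to the companion core
        return "main" if self.on_main_cluster else "companion"

sw = ClusterSwitcher()
trace = [200, 450, 600, 550, 450, 350, 200]
print([sw.tick(mhz) for mhz in trace])
# -> ['companion', 'companion', 'main', 'main', 'main', 'companion', 'companion']
```

Note how the 450MHz sample behaves differently on the way up (stays on the companion core) vs. on the way down (stays on the main cluster); that asymmetry is the whole point of the hysteresis band.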
Even with data management, maybe I'm a little spoiled by what I get to work with, because even that leaves me thinking "what's the big deal?"
> If I were management: you start with your existing RTL, floorplan, and preroutes and just synthesize. That's probably the easiest and most inefficient method. The next option is to take the most uncongested areas, simply start chopping off area, and repeat until the number of design errors hits a specific threshold (representing how much resource you have to clean them up). Then you just stop and work with what you've got.

Have you tried that? I've tried timing-unconstrained synthesis, and the cell-area scaling wasn't as exciting as I had expected.
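The "chop area until the error count hits your cleanup budget" flow quoted above is easy to sketch as a loop. `synthesize()` here is a hypothetical stand-in for a real tool run; its made-up cost model just encodes the assumed trend that violations blow up as the floorplan shrinks.

```python
# Runnable sketch of the iterative area-chopping flow from the quote above.
# synthesize() is NOT a real tool API; it's a stand-in returning a
# violation count for a given floorplan area, using an invented model.

def synthesize(area_mm2):
    """Hypothetical stand-in: violations grow as area drops below ~4 mm^2."""
    return max(0, int((4.0 - area_mm2) * 100))

ERROR_BUDGET = 50   # violations the team can afford to clean up by hand
STEP = 0.25         # area chopped off per iteration, in mm^2

area = 6.0          # arbitrary starting floorplan area
while True:
    violations = synthesize(area - STEP)
    if violations > ERROR_BUDGET:
        break       # the next chop would exceed the cleanup budget: stop
    area -= STEP

print(f"settled at {area:.2f} mm^2 with {synthesize(area)} violations")
# -> settled at 3.50 mm^2 with 50 violations
```

The loop stops one step short of blowing the budget, which matches the quoted flow: chop, re-run, and quit the moment the error count would outrun the resources available to fix it.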
> Thanks for the pointer on the different methods of handling an ARM license. I'd always wondered about it in the back of my mind. I really wonder what it would be like working for an ARM-based company. No features to make? No RTL to code? Sounds like no fun to me!

Yeah... it's different. But there can be interesting problems outside of a CPU core - consider that all AMD could talk about at the Atlas (Cortex-A50-series) announcement was how great their Freedom Fabric is. They have to differentiate from other A57 licensees somehow, and it doesn't look like they're doing it with the core.
Why not build a chip with one really fast A15 and a handful of A7 CPUs? The A7 is compatible with the same instruction set as the A15; what would be the advantages or disadvantages of such a chip?
I agree with you. The disadvantage would be on the software side: how would the OS know which core (fast or slow) to schedule a task on?
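One common answer to the scheduling question above, sketched as a toy: track each task's recent utilization and place only demanding tasks on the fast core, leaving everything else on the small cores. This is the general idea behind heterogeneous ("big.LITTLE MP") scheduling; the threshold, task names, and core names here are my own illustrative assumptions, not anything from the thread.

```python
# Toy utilization-based placement for one fast core plus several slow ones.
# Real schedulers track per-task load continuously and balance the slow
# cores properly; this just shows the placement decision itself.

FAST_CORE = "A15"
SLOW_CORES = ["A7-0", "A7-1", "A7-2", "A7-3"]
UTIL_THRESHOLD = 0.6  # assumed cutoff for "needs the big core"

def place(tasks):
    """tasks: dict of task name -> recent utilization in [0, 1]."""
    placement = {}
    slow = iter(SLOW_CORES * 100)  # crude round-robin over the small cores
    for name, util in sorted(tasks.items(), key=lambda t: -t[1]):
        placement[name] = FAST_CORE if util > UTIL_THRESHOLD else next(slow)
    return placement

print(place({"browser": 0.9, "mp3_decode": 0.1, "sync": 0.05}))
# -> {'browser': 'A15', 'mp3_decode': 'A7-0', 'sync': 'A7-1'}
```

The interesting contrast with the cluster-switch approach discussed earlier: here both core types are visible to the OS at once, which is exactly the software burden this post is pointing at.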
> Actually, yes. Timing relaxation in combination with area reduction gave good results, and the interesting part was that the area reduction helped constrain the design better, which got better results out of the timing relaxation! Power was being wasted not on timing but just on sending signals across empty, wasted space.

Interesting... cell sizes in the designs I've looked at were already skewed towards the minimums. Your design must look pretty different. Ever heard the term "<redacted> death spiral"?
But yeah, slight miscommunication on the management statement. I wasn't saying duplicate RTL saves on data management; rather, if people actually touched the RTL and had to validate it, duplicate RTL would save a ton of front-end effort.
