nvidia tegra K1

bullzz · Jan 6, 2014

http://www.theverge.com/2014/1/5/5278206/nvidia-debuts-tegra-k1-192-core-processor

192 CUDA cores.. interesting

Quick note: if you want to talk about the CPU specifically, that's fine. Otherwise if this is a general SoC discussion this gets thunked to Mobile Devices
-ViRGE

BallaTheFeared · Jan 6, 2014

What's more interesting is the project Denver dual core, especially since as noted this is the cpu forum :|

http://www.anandtech.com/show/7621/nvidia-reveals-first-details-about-project-denver-cpu-core

192 kepler cores (vs pre g80 in T4)

monstercameron · Jan 6, 2014

opinions on 5W?

Spawne32 · Jan 6, 2014

im more interesting about the automotive applications and what advantages that is supposed to have.

sontin · Jan 6, 2014

I'm interested in seeing what A15r3 delivers. With 2,3GHz and better IPC it should run cycles around Jaguar.

BTW: Tegra K1 supports up to 8GB system ram with a 40-bit address.

jdubs03 · Jan 6, 2014

monstercameron said:
opinions on 5W?

That interests me too, is there some separation in energy consumption for the quad A15's and dual Denver's? From what I got from it there isn't but we'll have to wait to find out.

I can't wait to see the CPU performance for Denver. It should be interesting when compared against an upcoming Cherry Trail/Broadwell-Y and A8s.

I hope it'll be 20nm, but there was no mention of the process. The timeframe release does lead me to lean more toward 20nm which the A8 will likely be, so they'll need that.

Hopefully Qualcomm releases some tidbits on a new Krait based off A57 tomorrow, and I'm sure Samsung will have Exynos stuff today.

sontin · Jan 6, 2014

Tegra K1v1 (what a name^^) is on 28nm.

Tegra K1v2 with Denver cores could be on 20nm. The timeframe is reachable.

The A15 of Tegra 4 uses with 1,9GHz around 3W or so. But that is a limitation of the process. All four cores running at 1400MHz use only 4,5-5W.

Selenium_Glow · Jan 6, 2014

One question if someone is willing to fill me in on...

Does the CPU and GPU in Tegra have shared memory, or do they require separate memory spaces? For example, the onboard graphics on most of the old motherboards allocated some of the RAM as graphics memory, but that allocated memory was not available to the CPU in that case. Is something similar happening with the K1 here?

And if it is, I'm kinda interested to know how much memory does the CPU in K1 would actually required? I for once understand/assume that the GPU benefits from more addressable memory, so the more the better.

sontin · Jan 6, 2014

It's still separate memory for Tegra K1.

nVidia will introduce their virtual memory concept with Parker, using Denver CPU und Maxwell GPU cores.

Exophase · Jan 6, 2014

Very surprised to hear Denver is hitting an SoC before Parker. This is a pretty aggressive move for nVidia, at least suggesting a six month cadence with genuinely different SoCs that are both targeting the high end. With the first one kind of laughably following not far off the heels of Tegra 4i.

These "K1v2" figures do seem... weird. 7-way superscalar? I can't think of a single even remotely general purpose CPU that's so wide at the decode level, not even IBM's POWER processors. It might be technically feasible, especially if they can only handle that throughput in AArch64 mode. But the cost of being able to actually rename the operands of 7 instructions is high, finding enough parallelism to actually even come close to using that decode bandwidth even a small percentage of the time is slim, and they'd need a much wider than 7 ports backend to facilitate all that execution which means a lot in terms of register file ports, forwarding network complexity, and so on. I also don't think they'd get terribly far without quite a bit of L1 cache parallelism which isn't cheap. I could possibly see them sort of reaching this width if it involves SMT, but even then it seems pretty overboard. Maybe not if it's not full SMT and there are limits to what a single thread can utilize.

What I'm suspicious of is that they're counting the width of A15 and Denver at different parts of the pipeline. It makes sense to have 7 execution ports/pipelines. In that case, A15 has 8. I've seen some (scarce, unfortunately) mention that A57 is consolidating the number of ports, which is almost certainly a power consumption optimization. So 7 seems like more than enough even for a pretty high-end aggressive design - afterall, Ivy Bridge only had 6 (while Haswell extended it to 8).

I know Cyclone was purported to be capable of decoding 6 instructions per cycle (and sustaining 6 IPC execution) but until I see the exact methodology of this test I'm skeptical of it as well.

One other consideration is that some of that number may be accounted for by instruction fusion. This could include x86-style branch fusion but possibly other classes of instructions as well, although none immediately spring to mind.

The 128KB L1, presumably instruction cache figure is also far out there. The only place I recall seeing such a large L1 instruction cache was Itanium where the VLIW-ish nature of the instructions led to some relatively low density. A possible consideration here is that some of the frontend, including the L1 icache, is shared between the two cores, Bulldozer style. Would be interesting, to say the least, although even Steamroller still doesn't hit such a big shared L1 icache. I hope they're not actually storing decoded instructions in some wider format, that seems like it'd be pretty wasteful even for a strategy to support AArch32 + AArch64..

With such big caches and such a wide execution at such a (relatively) high clock speed we could be looking at some long pipeline lengths and long L1 latencies, coupled with some really deep OoOE buffering to try to keep up with it. We could be looking at some relatively gigantic cores, which is more or less what you'd expect with nVidia only offering two of them. Unlike Apple, they have the most to stand behind since they offered the first mobile quad core SoC with Tegra 3 and defended it pretty aggressively. I don't think they'd be going dual core here unless the cost difference was huge; I think they'd go quad core even if it meant they could only run all four at a greatly reduced frequency (which Tegra 4 is basically doing that anyway).

Also, no mention of a companion core for Denver. And I don't think we'll see one. Pairing an A7 cluster with that would be very interesting, and would mean quite a particular design investment on nVidia's part, which I don't see it happening. But who knows, we didn't learn about Tegra 3's companion core until pretty late in the game.

So two things nVidia has to eat some crow on.. which I doubt we'll see much actual discourse on, but that'd be pretty fun...

Two final thoughts: I wonder if the Denver part is legitimately meant to replace the A15 part, or if the former is going to be targeting phones while the latter targets tablets or even beyond that. IF that's the case then it's possible nVidia will continue to license ARM cores for some parts, and this isn't just a time to market feasibility thing. Lastly, I noticed that nVidia had quite good documentation in its anti-Qualcomm propagada.. er.. technical white papers, where they went into a fair amount of details of how A15 operated. Now that they're using their own core I sure hope we get something even more thorough. Great take that to Qualcomm who says nary a thing about their tech. Fingers crossed.

Exophase · Jan 6, 2014

Selenium_Glow said:
One question if someone is willing to fill me in on...

Does the CPU and GPU in Tegra have shared memory, or do they require separate memory spaces? For example, the onboard graphics on most of the old motherboards allocated some of the RAM as graphics memory, but that allocated memory was not available to the CPU in that case. Is something similar happening with the K1 here?

And if it is, I'm kinda interested to know how much memory does the CPU in K1 would actually required? I for once understand/assume that the GPU benefits from more addressable memory, so the more the better.

I don't think any SoC with integrated GPU works with static memory partitioning like that.

SlimFan · Jan 6, 2014

http://www.theregister.co.uk/2009/11/04/nvidia_transmeta_x86/

The wide Instruction cache + "superscalar" width sound like they've followed through on this. It'll be really interesting to see the breakdown.

Excellent point about the "5th core" and quad core both going by the wayside.

NTMBK · Jan 6, 2014

SlimFan said:
The wide Instruction cache + "superscalar" width sound like they've followed through on this. It'll be really interesting to see the breakdown.

The original Pentium was "superscalar", along with every x86 chip since. NVidia just put the word on the marketing slide because it sounds badass, not because it actually means anything in terms of performance.

sontin · Jan 6, 2014

Exophase said:
We could be looking at some relatively gigantic cores, which is more or less what you'd expect with nVidia only offering two of them. Unlike Apple, they have the most to stand behind since they offered the first mobile quad core SoC with Tegra 3 and defended it pretty aggressively. I don't think they'd be going dual core here unless the cost difference was huge; I think they'd go quad core even if it meant they could only run all four at a greatly reduced frequency (which Tegra 4 is basically doing that anyway).
Also, no mention of a companion core for Denver. And I don't think we'll see one. Pairing an A7 cluster with that would be very interesting, and would mean quite a particular design investment on nVidia's part, which I don't see it happening. But who knows, we didn't learn about Tegra 3's companion core until pretty late in the game.

I think they use K1v2 as a test drive for Denver and Android 64bit and stay pin compartible to K1v1. So they have silicon to write drivers and updates and can make a little money without staying behind the competition.

So two things nVidia has to eat some crow on.. which I doubt we'll see much actual discourse on, but that'd be pretty fun...

We should wait for Parker.

Two final thoughts: I wonder if the Denver part is legitimately meant to replace the A15 part, or if the former is going to be targeting phones while the latter targets tablets or even beyond that. IF that's the case then it's possible nVidia will continue to license ARM cores for some parts, and this isn't just a time to market feasibility thing. Lastly, I noticed that nVidia had quite good documentation in its anti-Qualcomm propagada.. er.. technical white papers, where they went into a fair amount of details of how A15 operated. Now that they're using their own core I sure hope we get something even more thorough. Great take that to Qualcomm who says nary a thing about their tech. Fingers crossed.

For the fastest Tegra SoC they will switch to their own architecture. For the mobilephone version i think they will use a ARM design likely something in the mid-range.

Homeles · Jan 6, 2014

Exophase said:
V
The 128KB L1, presumably instruction cache figure is also far out there.

Right!? That's a massive L1. If we take their "7-way" and "128KB+64KB caches" at face value, one would presume that Nvidia's trying to take on Haswell at the high end. Which they certainly could do, but there's not a chance in hell of them competing in that market.

NTMBK · Jan 6, 2014

Homeles said:
Right!? That's a massive L1. If we take their "7-way" and "128KB+64KB caches" at face value, one would presume that Nvidia's trying to take on Haswell at the high end. Which they certainly could do, but there's not a chance in hell of them competing in that market.

Hey, the "tablets that need a fan" market is ripe for competition! :thumbsup:

Ajay · Jan 6, 2014

Quick note: if you want to talk about the CPU specifically, that's fine. Otherwise if this is a general SoC discussion this gets thunked to Mobile Devices
-ViRGE

Except that the technical folks tend to be in CPUs or Video Cards. Consumer Electronics doesn't lend itself well to the discussion of chip level features of the CPU or GPU. We've also had discussions of SoCs in this forum for some time now. SoCs are just an evolution of CPUs with higher levels of integration (and there is more to come), YMMV.

TuxDave · Jan 6, 2014

Exophase said:
These "K1v2" figures do seem... weird. 7-way superscalar? I can't think of a single even remotely general purpose CPU that's so wide at the decode level, not even IBM's POWER processors. It might be technically feasible, especially if they can only handle that throughput in AArch64 mode. But the cost of being able to actually rename the operands of 7 instructions is high

...snip...

Yeah I mostly agree that a 7-wide instruction decode is kind of out of this world. I highly suspect that the method of counting is not we I would call "standard". I don't know much about ARM's ISA or nVidia's CUDA ISA but if I had to use an Intel analogy.

Option #1: They take the post-decoded micro ops as the decode width. So for example the load-op is a single instruction but several micro ops. So I would count that as 1 but you can argue that it could count as 2 as you get two micro ops as a result. As you mentioned, getting 7 independent instructions to all get renamed at once is a little crazy.

Option #2: They simply mean the backend execution width and not the decode width. Same thing with the cyclone. 6-wide decode for a tablet is a little silly. I suspect that metric is a little weird as well.

tviceman · Jan 6, 2014

Confusion around whether the Denver-based TK1 will come out on 28nm or 20nm. If it's coming 9 months from now, it would be disappointing to see it on 28nm considering that TSMC is supposedly ramping up 20nm either this quarter or next. Regardless, it's nice to see Denver coming to fruition a little sooner than what was expected (all the expectations had Denver's release coinciding with FinFets and using Maxwell GPU IP). Qualcomm and mediatek can't be allowed to run the entire non-Apple ARM market so I hope TK1 is every bit as good as it's supposed to be. I also hope other CPU makers start licensing Kepler/Maxwell GPU. Getting performance-GPU proliferation to the entire market would be awesome.

Homeles · Jan 6, 2014

Anand and Brian Klug have a new article that should clear things up:
http://www.anandtech.com/show/7622/nvidia-tegra-k1

tviceman · Jan 6, 2014

Homeles said:
Anand and Brian Klug have a new article that should clear things up:
http://www.anandtech.com/show/7622/nvidia-tegra-k1

Anand is deducing that Denver TK1 is still on 28nm... I'm still holding out hope. Otherwise I'd like to see Nvidia just skip vanilla 20nm and go to TSMC's 20nm finfets for Tegra 6.

Homeles · Jan 6, 2014

You mean 16nm?

bullzz · Jan 6, 2014

i think intel and qcomm will be worried if nvidia executes on time. denver CPUs should put in the leagues of silvermont or better. and I dont think even cherrytrail GPU would be big enough to beat this. qcomms 805 looks like a re-spin of last years chip

NTMBK · Jan 6, 2014

bullzz said:
i think intel and qcomm will be worried if nvidia executes on time. denver CPUs should put in the leagues of silvermont or better. and I dont think even cherrytrail GPU would be big enough to beat this. qcomms 805 looks like a re-spin of last years chip

I think that Intel's 14nm Cherry Trail won't be worried by a 28nm part...

tviceman · Jan 6, 2014

Homeles said:
You mean 16nm?

Whatever they're calling it. Pretty sure it's technically still just 20nm but with finfet.

nvidia tegra K1

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Golden Member

Diamond Member

Member

Diamond Member

Diamond Member

Diamond Member

Member

Lifer

Diamond Member

Platinum Member

Lifer

Lifer

Lifer

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Senior member

Lifer

Diamond Member