External Level 3 Cache


Pudgygiant

Senior member
May 13, 2003
784
0
0
Eh, SRAM is what I meant. DRAM would be slow.

Why would HT need more than 16 pins? I can understand 17 (for some sort of control pin) or 32 (since it's dual-pumped 16-bit), but I don't get any other number...
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
OK, here goes: why does a 16-bit HT bus use 74 pins? First of all, HT uses differential signaling, so a high is the difference between, say, pin 1 and pin 2; every data signal therefore needs 2 pins. The HT bus is also bi-directional: there is one 16-bit bus going one way and another going the other, so you could think of it as a 32-bit bus. That gives 64 data pins, plus 4 clock pins and 4 control pins, making the total of 74 pins for a 16-bit bus.
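(As a quick back-of-the-envelope check, here is that pin arithmetic written out, using only the assumptions in the post above: differential signaling doubles every signal, and the link is really two unidirectional 16-bit paths. The quoted counts sum to 72, so the 74-pin figure presumably includes a couple of signals not broken out here.)

```c
#include <stdio.h>

int main(void)
{
    int width        = 16;   /* bits per direction                      */
    int differential = 2;    /* two pins per signal (differential pair) */
    int directions   = 2;    /* one 16-bit path each way                */

    int data_pins  = width * differential * directions;   /* = 64 */
    int clock_pins = 4;      /* as quoted in the post above             */
    int ctl_pins   = 4;      /* as quoted in the post above             */

    printf("data pins   : %d\n", data_pins);
    printf("data+clk+ctl: %d\n", data_pins + clock_pins + ctl_pins);
    return 0;
}
```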
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
So each 16-bit pathway is unidirectional? Interesting.

It would be interesting to see a no-holds-barred video card using twin memory controllers, each driving an octet bank of 400MHz DDR, tied to the GPU with a quad set of 16-bit HT links. That would be something like 50 GB/sec of memory bandwidth if I'm figuring correctly, probably enough to run 2048x1536 at speeds comparable to what the best cards today manage at half the resolution...
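(For what it's worth, here is one rough way to land near that 50 GB/sec figure. The 32-bit-per-device width is my guess at what the octet bank works out to; it isn't stated above.)

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical reading of the setup above: two memory controllers,
     * each driving eight DDR devices assumed to be 32 bits wide and
     * clocked at 400 MHz (800 MT/s with double data rate). */
    double controllers   = 2;
    double devices       = 8;
    double bytes_wide    = 32 / 8.0;       /* assumed device width      */
    double transfers_sec = 400e6 * 2;      /* 400 MHz, double data rate */

    double gb_per_sec = controllers * devices * bytes_wide * transfers_sec / 1e9;
    printf("theoretical memory bandwidth: %.1f GB/s\n", gb_per_sec);   /* ~51.2 */
    return 0;
}
```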
 

lexxmac

Member
Nov 25, 2003
85
0
0
HyperTransport is packet based; you wouldn't want to use it for memory. Also, a quad set of 16-bit HT links is going to have an enormous pin count, and at 16-bit x4 @ 800 MHz, that's only 12.8 GB/s each way. High-speed DDR and DDR II would be much better suited for the job.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I hope you mean 25.6GB/s EACH WAY. You can read and write simultaneously over the HT bus, so max throughput could reach 51.2GB/s, which is respectable.
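(Here is the arithmetic both figures seem to come from, assuming the 800 MHz is the link clock and the data is double-pumped to 1600 MT/s. Whether you call 12.8 or 25.6 GB/s the "each way" number depends on whether you count one direction or both.)

```c
#include <stdio.h>

int main(void)
{
    double clock_hz   = 800e6;          /* HT link clock as quoted above     */
    double transfers  = clock_hz * 2;   /* double-pumped: 1600 MT/s          */
    double bytes_wide = 16 / 8.0;       /* 16-bit link = 2 bytes per transfer */
    int    links      = 4;

    double per_dir  = transfers * bytes_wide * links / 1e9;  /* one direction */
    double both_dir = per_dir * 2;                           /* read + write  */

    printf("per direction  : %.1f GB/s\n", per_dir);    /* 12.8 */
    printf("both directions: %.1f GB/s\n", both_dir);   /* 25.6 */
    return 0;
}
```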
 

Cashmoney995

Senior member
Jul 12, 2002
695
0
0
There was that IBM Power 8 thingy with 144 megs of cache or whatever. That was a multi-cored processor, though. The P4 Extremely Idiotic Edition has 2 megs of L3 cache...

I think the real difference in performance comes from the L1 cache. Honestly, the P4 line is slow because of its 16k L1 cache, which is why the Athlon with its 128k L1 cache kicks its butt.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
The L1 cache of the Prescott may be 16k, but not its predecessors'. The Northwood and Willamette were more like 12k.

The P4's HyperThreading and long pipeline are its main hindrances, not the cache design. The P4's L1 cache is meant to catch the easy-to-predict tasks, and the L2 cache handles the majority of the L1 misses by using locality for its guessing. The L2 cache of the P4 is equivalent in raw bandwidth to the L1 cache of the AMD processors; AMD's L2 cache is horrid in comparison. The P4 was also the first x86 processor with hardware prefetch, which was worth a several-percent boost to its performance.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
The P4's L1 cache is IMHO better than the Athlon/Hammer lines'. It is a trace cache, which allows any instructions found in the cache to skip the entire decode portion of the pipeline, saving quite a few clock cycles.
 

lexxmac

Member
Nov 25, 2003
85
0
0
What is this? Bash AMD day? By the way, the Hammer-core CPUs kick the P4's ass and then some. I call for revolution! Down with the front side bus! All hail HyperTransport! Uh, sorry, I'm getting worked up.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I was not bashing the CPU, just the L1 cache. IMHO the trace cache is the way forward. The on-die memory controller is a massive step forward as well. Hmm, this thread seems to have gone miles off topic.
 

lexxmac

Member
Nov 25, 2003
85
0
0
Well, we're still talking about caches, aren't we? Anyway, I've been thinking about something. It has been shown that the Athlon 64 3000+ (512 KB cache) and the 3200+ (1 MB cache) are almost identical in performance. Does it not make sense that, since the DRAM controller is on the CPU die, the cache matters less because access times to DRAM have improved so much? Would a CPU with an on-die memory controller and 256 KB of L2 cache perform almost as well? Or have I lost my marbles?
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
If your working set happens to fall between 512K and 1024K, there will be a very noticeable difference. With larger working sets, the effect of the larger cache diminishes with size, and with smaller task sets it doesn't matter at all.
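(A rough way to see that effect for yourself, as a minimal sketch rather than a rigorous benchmark: chase pointers through working sets of different sizes and watch the time per access jump once the set stops fitting in the L2 cache. The buffer sizes here are just illustrative.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t sizes_kb[] = { 256, 512, 1024, 2048, 4096 };
    for (int s = 0; s < 5; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Build one big random cycle (Sattolo's algorithm) so the
         * hardware prefetcher can't simply stream ahead of us. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)((double)rand() / ((double)RAND_MAX + 1.0) * i);
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        /* Chase the pointer chain and time it. */
        clock_t start = clock();
        size_t idx = 0;
        for (long hop = 0; hop < 10000000L; hop++)
            idx = next[idx];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* idx is printed so the compiler can't optimize the loop away. */
        printf("%5zu KB working set: %.1f ns/access (idx=%zu)\n",
               sizes_kb[s], secs * 100.0, idx);
        free(next);
    }
    return 0;
}
```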
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I still don't understand how an L3 cache at 3GHz isn't going to outperform 200MHz system memory. It just seems that people want to throw an FSB roadblock in the way when it's unnecessary to think in those terms. The trace length to main memory is basically what limits current memory to a sub-250MHz front-side bus, right? HT can traverse the same distance at 800MHz, though, which means the 250MHz limit is more or less specific to the interface.

Like someone said before, HT probably isn't the way to go. A 16-bit HT link provides 6.4GB/sec of theoretical bandwidth, and a 32-bit HT link provides 12.8GB/sec. The problem is that the latency of translating cache addresses and memory packets across the serial bus is prohibitive. By the same token, we see $80 video cards with 256-bit links in the same theoretical bandwidth range, but in their case driving memory, not caches. So perhaps we don't want an L3 at all, but rather NUMA (non-uniform memory access) on board the CPU sockets.

So instead of an L3 cache, it's too bad they can't make interchangeable CPU sockets with integrated stage-1 NUMA memory. The stage-2 NUMA memory (system memory) would still be there, running much slower. We already see memory speeds in the 500MHz+ range on the top-end cards, so mating that memory to a 256-bit controller should be a cinch. In previous threads someone already said the OS needs to be NUMA-aware too, or else the raw horsepower goes largely wasted. I can just see it now: in a few years people will be slaving a mediocre 256MB of 600MHz DDR RAM to their stage-1 memory add-in card and another 2-3GB of plain ole PC2700 into their NUMA-aware Windows XP 7.x system... because you know by then they just might need it. And with this architecture the average customer probably wouldn't ever mind the performance of integrated graphics anymore.

CPU (internal)
...L1 cache <----Fastest
...L2 cache <----Current 1MB L2 caches would still suffice for most consumer apps

Socket (external)
...Stage-1 memory (integrated into a CPU socket) <----Also could be used to run integrated graphics
...Stage-2 memory (expansion slots on motherboard) <----Slowest
 

Varun

Golden Member
Aug 18, 2002
1,161
0
0
It is necessary to think of the FSB as a roadblock, even if it isn't only 200MHz. If you want a 256-bit memory controller you need a lot of pins for it, and maybe you haven't seen an Opteron yet, but there isn't a lot of room for more pins.

In theory, yes, a high-speed L3 cache would work, but do you really even need it? DRAM is running pretty fast, and it's cheap. Cache that runs at 3GHz would be very, very expensive for everyone involved, and the cost would be prohibitive compared to the performance you'd gain. With a decent-sized L2 and good prefetching, I don't think a large, expensive, difficult-to-manufacture, high-speed L3 cache that is still tiny in relation to DRAM would give enough of a performance improvement to justify the expense.