External Level 3 Cache


Pudgygiant

Senior member
May 13, 2003
784
0
0
Eh, SRAM is what I meant. DRAM would be slow.

Why would HT need more than 16 pins? I can understand 17 (for some sort of control pin) or 32 (since it's dual-pumped 16-bit), but I don't get any other number...
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
OK, here goes: why does a 16-bit HT bus use 74 pins? First of all, HT uses differential signaling, so a high is the difference between, say, pin 1 and pin 2; every data signal therefore needs 2 pins. The HT bus is also bi-directional: there is one 16-bit bus going one way and another going the other, so you could think of it as a 32-bit bus. That gives 64 data pins, plus 4 clock pins and 4 control pins, making the total of 74 pins for a 16-bit bus.
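(As a quick back-of-the-envelope check, here is that pin arithmetic written out, using only the assumptions in the post above: differential signaling doubles every signal, and the link is really two unidirectional 16-bit paths. The quoted counts sum to 72, so the 74-pin figure presumably includes a couple of signals not broken out here.)

```c
#include <stdio.h>

int main(void)
{
    int width        = 16;   /* bits per direction                      */
    int differential = 2;    /* two pins per signal (differential pair) */
    int directions   = 2;    /* one 16-bit path each way                */

    int data_pins  = width * differential * directions;   /* = 64 */
    int clock_pins = 4;      /* as quoted in the post above             */
    int ctl_pins   = 4;      /* as quoted in the post above             */

    printf("data pins   : %d\n", data_pins);
    printf("data+clk+ctl: %d\n", data_pins + clock_pins + ctl_pins);
    return 0;
}
```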
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
So each 16-bit pathway is unidirectional? Interesting.

It would be interesting to see a no-holds-barred video card using twin memory controllers, each driving an octet bank of 400MHz DDR, tied to the GPU with a quad set of 16-bit HT links. That would be something like 50 GB/sec of memory bandwidth if I'm figuring correctly, probably enough to run 2048x1536 at speeds comparable to what the best cards today manage at half the resolution...
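(For what it's worth, here is one rough way to land near that 50 GB/sec figure. The 32-bit-per-device width is my guess at what the octet bank works out to; it isn't stated above.)

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical reading of the setup above: two memory controllers,
     * each driving eight DDR devices assumed to be 32 bits wide and
     * clocked at 400 MHz (800 MT/s with double data rate). */
    double controllers   = 2;
    double devices       = 8;
    double bytes_wide    = 32 / 8.0;       /* assumed device width      */
    double transfers_sec = 400e6 * 2;      /* 400 MHz, double data rate */

    double gb_per_sec = controllers * devices * bytes_wide * transfers_sec / 1e9;
    printf("theoretical memory bandwidth: %.1f GB/s\n", gb_per_sec);   /* ~51.2 */
    return 0;
}
```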
 

lexxmac

Member
Nov 25, 2003
85
0
0
HyperTransport is packet based; you wouldn't want to use it for memory. Also, a quad set of 16-bit HT links is going to have an enormous pin count, and at 16-bit x4 @ 800 MHz, that's only 12.8 GB/s each way. High-speed DDR and DDR II would be much better suited for the job.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I hope you mean 25.6GB/s EACH WAY. You can read and write simultaneously over the HT bus, so max throughput could reach 51.2GB/s, which is respectable.
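(Here is the arithmetic both figures seem to come from, assuming the 800 MHz is the link clock and the data is double-pumped to 1600 MT/s. Whether you call 12.8 or 25.6 GB/s the "each way" number depends on whether you count one direction or both.)

```c
#include <stdio.h>

int main(void)
{
    double clock_hz   = 800e6;          /* HT link clock as quoted above     */
    double transfers  = clock_hz * 2;   /* double-pumped: 1600 MT/s          */
    double bytes_wide = 16 / 8.0;       /* 16-bit link = 2 bytes per transfer */
    int    links      = 4;

    double per_dir  = transfers * bytes_wide * links / 1e9;  /* one direction */
    double both_dir = per_dir * 2;                           /* read + write  */

    printf("per direction  : %.1f GB/s\n", per_dir);    /* 12.8 */
    printf("both directions: %.1f GB/s\n", both_dir);   /* 25.6 */
    return 0;
}
```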
 

Cashmoney995

Senior member
Jul 12, 2002
695
0
0
There was that IBM Power 8 thingy with 144 megs of cache or whatever. That was a multi-cored processor, though. The P4 Extremely Idiotic Edition has 2 megs of L3 cache...

I think the real difference in performance comes from the L1 cache. Honestly, the P4 line is slow because of its 16k L1 cache, which is why the Athlon with its 128k L1 cache kicks its butt.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
The L1 cache of the Prescott may be 16k, but not its predecessors'. The Northwood and Willamette were more like 12k.

The P4's HyperThreading and long pipeline are its main hindrances, not the cache design. The P4's L1 cache is meant to catch the easy-to-predict tasks, and the L2 cache handles the majority of the L1 misses by using locality for its guessing. The L2 cache of the P4 is equivalent in raw bandwidth to the L1 cache of the AMD processors; AMD's L2 cache is horrid in comparison. The P4 was also the first x86 processor with hardware prefetch, which was worth a several-percent boost to its performance.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
The P4's L1 cache is IMHO better than the Athlon/Hammer lines'. It is a trace cache, which allows any instructions found in the cache to skip the entire decode portion of the pipeline, saving quite a few clock cycles.
 

lexxmac

Member
Nov 25, 2003
85
0
0
What is this? Bash AMD day? By the way, the Hammer-core CPUs kick the P4's ass and then some. I call for revolution! Down with the front side bus! All hail HyperTransport! Uh, sorry, I'm getting worked up.
 

Lynx516

Senior member
Apr 20, 2003
272
0
0
I was not bashing the CPU, just the L1 cache. IMHO the trace cache is the way forward. The on-die memory controller is a massive step forward as well. Hmm, this thread seems to have gone miles off topic.
 

lexxmac

Member
Nov 25, 2003
85
0
0
Well, we're still talking about caches, aren't we? Anyway, I've been thinking about something. It has been shown that the Athlon 64 3000+ (512 KB cache) and the 3200+ (1 MB cache) are almost identical in performance. Does it not make sense that, since the DRAM controller is on the CPU die, the cache matters less because access times to DRAM have improved so much? Would a CPU with an on-die memory controller and 256 KB of L2 cache perform almost as well? Or have I lost my marbles?
 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
If your working set happens to fall between 512K and 1024K, there will be a very noticeable difference. With larger working sets, the effect of the larger cache diminishes with size, and with smaller task sets it doesn't matter at all.
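(A rough way to see that effect for yourself, as a minimal sketch rather than a rigorous benchmark: chase pointers through working sets of different sizes and watch the time per access jump once the set stops fitting in the L2 cache. The buffer sizes here are just illustrative.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t sizes_kb[] = { 256, 512, 1024, 2048, 4096 };
    for (int s = 0; s < 5; s++) {
        size_t n = sizes_kb[s] * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Build one big random cycle (Sattolo's algorithm) so the
         * hardware prefetcher can't simply stream ahead of us. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)((double)rand() / ((double)RAND_MAX + 1.0) * i);
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        /* Chase the pointer chain and time it. */
        clock_t start = clock();
        size_t idx = 0;
        for (long hop = 0; hop < 10000000L; hop++)
            idx = next[idx];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* idx is printed so the compiler can't optimize the loop away. */
        printf("%5zu KB working set: %.1f ns/access (idx=%zu)\n",
               sizes_kb[s], secs * 100.0, idx);
        free(next);
    }
    return 0;
}
```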
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I still don't understand how an L3 cache at 3GHz isn't going to outperform 200MHz system memory. It just seems that people want to throw an FSB roadblock in the way when it's unnecessary to think in those terms. The trace length to main memory is basically what limits current memory to a sub-250MHz front-side bus, right? HT can traverse the same distance at 800MHz, though, which means the 250MHz limit is more or less specific to the interface.

Like someone said before, HT probably isn't the way to go. A 16-bit HT link provides 6.4GB/sec of theoretical bandwidth, and a 32-bit HT link provides 12.8GB/sec. The problem is that the latency of translating cache addresses and memory packets across the serial bus is prohibitive. By the same token, we see $80 video cards with 256-bit links in the same theoretical bandwidth range, but in their case driving memory, not caches. So perhaps we don't want an L3 at all, but rather NUMA (non-uniform memory access) on board the CPU sockets.

So instead of an L3 cache, it's too bad they can't make interchangeable CPU sockets with integrated stage-1 NUMA memory. The stage-2 NUMA memory (system memory) would still be there, running much slower. We already see memory speeds in the 500MHz+ range on the top-end cards, so mating that memory to a 256-bit controller should be a cinch. In previous threads someone already said the OS needs to be NUMA-aware too, or else the raw horsepower goes largely wasted. I can just see it now: in a few years people will be slaving a mediocre 256MB of 600MHz DDR RAM to their stage-1 memory add-in card and another 2-3GB of plain ole PC2700 into their NUMA-aware Windows XP 7.x system... because you know by then they just might need it. And with this architecture the average customer probably wouldn't ever mind the performance of integrated graphics anymore.

CPU (internal)
...L1 cache <----Fastest
...L2 cache <----Current 1MB L2 caches would still suffice for most consumer apps

Socket (external)
...Stage-1 memory (integrated into a CPU socket) <----Also could be used to run integrated graphics
...Stage-2 memory (expansion slots on motherboard) <----Slowest
 

Varun

Golden Member
Aug 18, 2002
1,161
0
0
It is necessary to think of the FSB as a roadblock, even if it isn't only 200MHz. If you want a 256-bit memory controller you need a lot of pins for it, and maybe you haven't seen an Opteron yet, but there isn't a lot of room for more pins.

In theory, yes, a high-speed L3 cache would work, but do you really even need it? DRAM is running pretty fast, and it's cheap. Cache that runs at 3GHz would be very, very expensive for everyone involved, and the cost would be prohibitive compared to the performance you'd gain. With a decent-sized L2 and good prefetching, I don't think a large, expensive, difficult-to-manufacture, high-speed L3 cache that is still tiny in relation to DRAM would give enough of a performance improvement to justify the expense.