
Asynchronous RAM

velis

Senior member
I was thinking about all the latency trouble chips have with RAM the other day and came up with this crazy idea:

Currently the processor determines it needs some data. It checks its caches, potentially checks other processors for the latest copies, and finally goes to DRAM to get what it needs. This of course is quite time consuming even with all the smart prefetch algorithms and similar stuff implemented on processors.

Would it actually be useful to have this implementation:
Fully buffered RAM: it takes requests for data, puts them in the appropriate buffer, and transmits "when the time is right".
The processor in question would have its prefetch algorithms optimized in such a way that it would extract all the required RAM addresses from the instruction cache at the moment the cache was being filled. That way all the RAM addresses could be sorted and checked against the caches ahead of the code actually being executed. I suppose this functionality partly already exists in today's prefetch implementations.
Anyway, the addresses that were not already in the caches would be sent to the RAM modules (along with some priority data) for retrieval, which would in turn eventually fill the caches with all the potentially relevant data.
Since L1 I-caches are rather small (a few microseconds' worth of instructions), perhaps a processor could have a separate interface through which the OS would send it the code that will be executed in the next time slice. Analyzing this code would achieve the desired effect, plus it could break the instructions into micro-ops in advance and prepare the whole lot for insertion into the I-cache at the instant the code was scheduled by the OS.

The goal of this implementation is to have all the data in the caches by the time the code actually requires it. Hopefully the implementation would work so well that a thread would request the data before being swapped out by the OS scheduler - finding all the data in the cache when it was scheduled for execution again.

Of course, the processor has to have large enough caches not to dump some of the prefetched data in favor of currently running threads, thus defeating the purpose of such prefetch algorithms. It also kinda makes sense to have shared L2/L3 caches, or at least some super fast method of moving data between caches.

I hope I was clear enough in my rambling. Could this approach pay off in any scenario or are processors simply too fast nowadays and this method would have no tangible effect because the predecoded data set is too small / processors already do that?
 
ill admit i didnt really read your post fully, but why dont we make dual core, dedicated ram per core?? but if one core needs more it may use the other's
 
Originally posted by: velis
Currently the processor determines it needs some data. It checks its caches, potentially checks other processors for the latest copies, and finally goes to DRAM to get what it needs. This of course is quite time consuming even with all the smart prefetch algorithms and similar stuff implemented on processors.

This depends somewhat on whether the cache is read-through or not, but basically, yes. There's only so much you can do to 'prefetch' from RAM; the overhead of trying to guess what data will be used next is almost always not worth it, especially since you are probably kicking other data out of the L2 cache.

The processor in question would have its prefetch algorithms optimized in such a way that it would extract all the required RAM addresses from the instruction cache at the moment the cache was being filled.

Except that you don't know exactly which code is going to be executed next; it may not even be in your L1 instruction cache yet! You'd have to do something like branch prediction that tries to guess which data you will need. And at the kind of time scales we're talking about (tens of nanoseconds), you won't be able to guess far enough ahead, accurately enough, for it to work the way you want most of the time.

Not to mention that in some architectures, it's either impossible or slow (well, relative to instruction execution) to check if an address is already in your cache. In an architecture with read-through L2 cache, you generally *can't* do this; the only way to access the cache is to try to read the memory, and if it's already in cache the data will be returned from there.

Since L1 I-caches are rather small (a few microseconds' worth of instructions), perhaps a processor could have a separate interface through which the OS would send it the code that will be executed in the next time slice.

Um... and how is the OS going to know which code will be run next? The thread you're going to swap in is probably elsewhere in RAM, so you would usually have to go to the RAM to find it!

Analyzing this code would achieve the desired effect, plus it could break the instructions into micro-ops in advance and prepare the whole lot for insertion into the I-cache at the instant the code was scheduled by the OS.

What's going to do this 'analysis' for you? The CPU? Why not just execute the code in the first place?

Hopefully the implementation would work so well that a thread would request the data before being swapped out by the OS scheduler - finding all the data in the cache when it was scheduled for execution again.

But then you're taking cache away from the stuff being run in the meantime -- and what if *that* thread needs to access the RAM while it is running?

I hope I was clear enough in my rambling. Could this approach pay off in any scenario or are processors simply too fast nowadays and this method would have no tangible effect because the predecoded data set is too small / processors already do that?

CPUs are already pipelined such that you can execute instructions out-of-order (when possible) while other operations wait on memory (or the ALU/FPU, or whatever).

I think the basic problem here is that the work you would have to do to look at the upcoming code from the L1 instruction cache and predict what data you will need next from RAM takes a significant amount of time relative to how long it would take to just execute the code. And if your next instruction is 'jump to an address not in the L1 cache', the CPU has no possible way to know what comes after that! There are also cases where you know you will do a read, but you can't know what the exact address is until you actually execute the instructions leading up to it (think about code that calculates an offset into an array and then reads some value from it).

I can't see this as being worth the absurd amount of complexity it would add given current processor speeds and cache sizes. Now, if you do something like putting 1MB of instruction cache on the die rather than 64K... then maybe you can prefetch instructions far enough ahead that you could have time to then prefetch referenced data along with it and sometimes make it worthwhile. But my gut feeling is that in most code, you'll hit too many scenarios where you don't know what the exact address is for it to ever break even, and it would be more effective to just have more L2/L3 cache (thus improving your cache hit rates).
 
While this could be debated further, I see that the implementation would be very hard and would at best bring negligible improvements. After all, today's caches have >90% hit rates. The remaining misses do hurt the processor, but I guess this method would not be the solution to that problem. Best to let it die 😉
 
you've basically described what modern caches do. although the OS can prefetch parts of code by fetching linear addresses that sit on cachelines (brings in a cacheline at a time), this is usually best left to the hardware unless you are guaranteed to be needing that code and jumping to it later on. that's the only instance i've ever used this method.

ill admit i didnt really read your post fully, but why dont we make dual core, dedicated ram per core?? but if one core needs more it may use the other's

this is really dependent on the memory needs of the architecture. if you're only using 50% of your memory throughput with 2 cores behind one memory controller, chances are you're not going to see much benefit with 2 memory controllers, 1 per core.

perhaps a processor could have a separate interface through which the OS would send it the code that will be executed in the next time slice. Analyzing this code would achieve the desired effect, plus it could break the instructions into micro-ops in advance and prepare the whole lot for insertion into the I-cache at the instant the code was scheduled by the OS.

the memory controller is the only real interface the cpu has to the code. you're describing a microcode sequencer and entire front-end decoder/fetcher that would dispatch uops to a microcode execution engine. you've just described a modern cpu.
 
This may be something similar.

http://www.theregister.co.uk/2005/11/04/sun_rock_scout/
In the Rock chips due out in 2008, Sun will employ a technique called a "hardware scout" to boost performance.

"We launch a hardware thread that has its own register file and that runs hundreds of cycles ahead of the main thread," Tremblay said. "It looks for land mines as a scout does and brings in all the interesting data."

The scout works while a main software thread is stalled and, via pre-fetching, helps bring data to the cache. Then when the main thread catches up, much of the data it needs is already in memory.
 