
A few questions on CPUs and their functions


SViscusi
I have a project for my O/S class on CPUs, and there are a few things that confuse me.

1. During fetching, am I correct to assume that the data is taken from the source (i.e. the hard drive), then moves up the memory chain: RAM to L3 cache to L2 to L1 to registers?

2. L1 instruction cache: the CPU loads the instruction to be used into the registers, correct? Does this happen in parallel with the fetching of data?

3. Branch prediction. AFAIK branch prediction is all about trying to grab the next piece of data (instruction?) to be used. Where is that data stored? And, on a related subject, where does data reside after it exits one stage of the pipeline while it waits to enter the next?

4. Are the cache and registers emptied the same way system RAM is emptied?
 
Here's a question: how does the CPU know what data to load? How does it know what data is going to be used, and whether it's going to be used again?
 
Oh boy...

1. I can't remember off-hand how the multilevel cache is handled. When the data is requested and you have to go all the way to disk to get it, I can imagine a few ways it could be loaded in a multilevel memory system, but I can't recall what is generally done in practice. The simplest thing to do is to load a large block from disk into RAM, then take a smaller portion of that block from RAM and put it into the cache, and so forth. (There's a toy sketch of this scheme after this list.)

2. Yes and no; it depends on the chip. The most basic chip without pipelining will do the instruction fetch (IF) first and then process the instruction, which may or may not then load data. But in a pipelined CPU, the IF and other fetches can be done in parallel. The instruction is loaded into its own register, separate from the data registers. In a pipelined design, the sequence of steps the CPU performs to process an instruction is divided into stages that can execute independently (with the exception of memory operations). This still depends on how the CPU is structured: the IF may share the same memory space as the data, in which case there can be a race condition and you cannot do both simultaneously.

3. Branch prediction is not about the data but about the instructions. Branch prediction tries to predict the next instruction when the code could go to different instructions (as in a for loop, if statement, case statement, while loop, etc.). The predictor takes a guess at the next instruction, and the pipeline starts processing the guess. At the same time, the actual branch instruction (the if statement, let's say) is being processed in the pipeline; when it finishes, the result is compared against the prediction. If they match, nothing happens, but if they differ, the pipeline is flushed and the correct instruction is loaded in. (See the branch-timing sketch after this list.)

4. What do you mean by emptied? The contents of memory and registers are never really deleted; at best, the address space is just noted as being in use or not, and that has more to do with the code. It's been a while, though, and I don't have my resources with me anymore.
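On point 1, here's the toy C sketch promised above of that simplest scheme. The sizes and the staging arrays are invented for illustration; real hardware and operating systems organize this very differently:

#include <stdio.h>
#include <string.h>

/* Illustrative sizes only; real hardware differs. */
#define DISK_BLOCK 4096   /* bytes staged from disk per read      */
#define CACHE_LINE   64   /* bytes staged from RAM into the cache */

static unsigned char disk[DISK_BLOCK * 4];  /* pretend backing store */
static unsigned char ram[DISK_BLOCK];       /* one staged disk block */
static unsigned char cache[CACHE_LINE];     /* one cache line        */

/* On a miss that goes all the way to disk: stage a large block into
   RAM, then copy a smaller line from that block into the cache, then
   pull the wanted byte out of the line. */
static unsigned char load_byte(long addr)
{
    long block = addr & ~(long)(DISK_BLOCK - 1);  /* aligned disk block */
    memcpy(ram, disk + block, DISK_BLOCK);        /* disk -> RAM        */

    long off  = addr - block;
    long line = off & ~(long)(CACHE_LINE - 1);    /* aligned cache line */
    memcpy(cache, ram + line, CACHE_LINE);        /* RAM -> cache       */

    return cache[off - line];                     /* cache -> register  */
}

int main(void)
{
    disk[5000] = 42;
    printf("%d\n", load_byte(5000));  /* prints 42 */
    return 0;
}

The point is just the staging: a large block comes off the disk into RAM, a smaller line goes from RAM into the cache, and one byte finally lands in a register (the return value).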
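And on point 3, you can see the predictor at work from software. In the sketch below, both loops take the branch the same number of times; only the pattern differs, which is what the predictor keys on. (Timings vary by machine, and an optimizing compiler may turn the branch into a conditional move, so the effect shows best at low optimization levels.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

int main(void)
{
    static int even[N], coin[N];
    for (int i = 0; i < N; i++) {
        even[i] = i & 1;        /* alternating: easy to predict     */
        coin[i] = rand() & 1;   /* random: roughly 50% mispredicted */
    }

    long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        if (even[i]) sum++;     /* predictable branch */
    clock_t t1 = clock();
    for (int i = 0; i < N; i++)
        if (coin[i]) sum++;     /* unpredictable branch */
    clock_t t2 = clock();

    printf("predictable:   %ld ticks\n", (long)(t1 - t0));
    printf("unpredictable: %ld ticks\n", (long)(t2 - t1));
    return sum ? 0 : 1;         /* keep sum live so the loops stay */
}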

LKBP:
The instructions given to the CPU specifically state where to load and get the data from. Depending on the instruction set, a single operation can require multiple CPU instructions. For example, say I have an instruction ADD(R1,R2)->R3. This explicitly tells the CPU to add registers R1 and R2 and store the result in R3. But I could come up with an instruction set that allows ADD(R1,MEM(XXXX))->R3, where we add R1 and the data at some memory address and put the result in R3. As far as I can recall, this isn't generally used; generally we first have to tell the CPU to load the data at address XXXX into R2 and then do the ADD between the registers. If I were to allow the register-memory addition, it would generate CPU instructions of the same nature.
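In C terms the contrast looks something like this; the mnemonics in the comments are invented to match the notation above, not taken from any real instruction set:

/* One line of C... */
int add_from_memory(int r1, const int *xxxx)
{
    return r1 + *xxxx;   /* the result plays the role of R3 */
}

/* ...on a load/store machine becomes two instructions, e.g.
       LOAD MEM(XXXX) -> R2
       ADD  R1, R2    -> R3
   whereas a register-memory instruction set could express it as one:
       ADD  R1, MEM(XXXX) -> R3
   Either way, the work done is of the same nature. */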

So the first step is to load the data into a register. This is done by going through the memory hierarchy (cache to RAM to disk) looking for the requested address. For example, if we start off fresh, the cache and RAM will probably not have any of our data, so we eventually go to the disk, load the data at the requested address into RAM and the cache, and then into our register. But note that the memory management does not just grab the data at that address; it grabs a block of memory. So instead of grabbing 2 bytes it may grab 8 bytes: the 2 bytes we want plus the 6 bytes that follow them. This block is written into RAM and the caches, and then we load the part we want from the block into the register.

The block is used because we usually access data that is spatially close together within a small timeframe. That is, if I am iterating through an array, I will generally grab successive data in memory, and request that data again and again within small time increments. So by loading in a block of memory, we have a good chance of loading data we will want in the near future. The next time we want to load from memory and check the cache, we may find the data is already there and save a lot of time. We don't know this until we execute the instruction, but these access patterns follow trends we can take advantage of.
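A small C illustration of why the block helps: both loops below touch every element exactly once, but the sequential walk reuses each cache line it pulls in, while the strided walk (stride picked to match a typical 64-byte line) wastes most of every line:

#include <stdio.h>
#include <stddef.h>

#define N (1 << 20)

static int a[N];

/* Sequential walk: consecutive elements share cache lines, so most
   accesses hit a line brought in by an earlier one. */
static long sum_sequential(const int *p)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += p[i];
    return s;
}

/* Strided walk: with a stride of 16 ints (64 bytes, a typical line),
   nearly every access lands on a fresh line and likely misses. */
static long sum_strided(const int *p)
{
    long s = 0;
    for (size_t j = 0; j < 16; j++)
        for (size_t i = j; i < N; i += 16)
            s += p[i];
    return s;
}

int main(void)
{
    for (size_t i = 0; i < N; i++) a[i] = 1;
    printf("%ld %ld\n", sum_sequential(a), sum_strided(a));  /* equal sums */
    return 0;
}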
 

Yes, but won't the way the CPU makes use of the caches, and the techniques deployed, differ from one API to another? Some use cache-blocking techniques, some use scattering, etc.
 

Thanks, this clears a few things up for me.

I guess what I meant by "emptied" was: when does an area get reused?

Thanks for your help.
 
It differs from API to API how the cache is used. Caches are very good at hiding RAM latency, so you'll have to look first at what API you are using to see how it uses the cache.
 

API has nothing to do with how the CPU runs; an API is a programming interface. An interface is a means by which one program can interact with another program or body of code. For example, Java has a serial communications API, which is an interface that lets you use Java to send commands or bitstreams over the serial port. The API can be a simple wrapper or something more complicated, but that does not mean it affects memory management. The memory management on the CPU is generally all hardware. When you compile your program, the compiler recreates the program's code using the CPU's instruction set, leaving how the cache is handled to the CPU. Software can do some high-level memory management in the guise of things like virtual memory, the page cache, and the disk cache, but all of these sit at levels above the CPU.
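A trivial sketch of what "an API is just an interface" means; the function here is invented for illustration:

#include <stdio.h>

/* A minimal "API": callers interact with this logging facility only
   through log_message(). Nothing about it touches caches or memory
   management; the compiler just turns it into ordinary CPU
   instructions like any other code. */
void log_message(const char *msg)
{
    fprintf(stderr, "[log] %s\n", msg);
}

int main(void)
{
    log_message("hello");   /* using the interface, not its internals */
    return 0;
}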

There are various schemes for handling the cache on the CPU. For example, if you have filled up the cache and a miss means you need to load a new block, which one do you replace? The simplest policy is to replace the oldest block, but you could also replace the least recently used block, among others. These algorithms are generally hard-coded into the on-chip memory management. I'm sure people can find exceptions where software can choose the algorithm, but it is far more typical for a programmer to write cache-aware programs instead.

That is, you structure your program so that it works well with a cache. A simple example: if you have a very large contiguous block of memory that represents a two-dimensional matrix, you want to traverse the array in the order it is laid out in memory. Depending on the layout, this could mean traversing along the columns rather than the rows in your program (or vice versa), so that when you move to the next element you make use of the spatial locality of the data. Even with today's compilers this can make a decent impact.

Or you can use cache-oblivious algorithms, which are structured to perform well independently of the cache size. For example, with our aforementioned matrix we could split the matrix into a set of submatrices that we operate on independently, dividing it up so that an entire submatrix (or two) resides in the cache. That way we can operate on a whole block without having to worry about a miss.
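Here's a rough C sketch of the blocking idea for the matrix example; the tile size is arbitrary and would be tuned to the cache in practice (choosing it recursively instead is essentially the cache-oblivious version):

#include <stdio.h>

#define N     1024
#define BLOCK   32   /* illustrative tile size; tune to the cache */

static double a[N][N], b[N][N];

/* Naive transpose: writes to b stride a whole row apart, so each one
   likely touches a different cache line. */
static void transpose_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[j][i] = a[i][j];
}

/* Blocked transpose: work on BLOCK x BLOCK tiles small enough that
   the lines for both tiles stay resident while we use them. */
static void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}

int main(void)
{
    a[3][7] = 1.0;
    transpose_naive();
    transpose_blocked();
    printf("%.1f\n", b[7][3]);  /* 1.0 either way */
    return 0;
}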
 
I have a project for my O/S class on CPUs, and there are a few things that confuse me.

1. During fetching, am I correct to assume that the data is taken from the source (i.e. the hard drive), then moves up the memory chain: RAM to L3 cache to L2 to L1 to registers?

From the program's point of view, memory is separate from storage. The application has to specifically allocate memory (malloc), open an I/O channel to the storage device, and copy the data into memory. Only then can the CPU access it. But yes, from there it moves up the chain.
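In C that separation looks roughly like this (the file name is made up for the example):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1. Allocate memory: the CPU can only operate on data in memory. */
    size_t size = 4096;
    char *buf = malloc(size);
    if (!buf) return 1;

    /* 2. Open the storage device (here via the filesystem) and copy
       the data into that memory. "data.bin" is a made-up name. */
    FILE *f = fopen("data.bin", "rb");
    if (!f) { free(buf); return 1; }
    size_t n = fread(buf, 1, size, f);
    fclose(f);

    /* 3. Only now can the CPU touch it; as it does, the bytes move up
       through RAM and the caches into registers. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    printf("read %zu bytes, checksum %ld\n", n, sum);

    free(buf);
    return 0;
}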

2. L1 instruction cache: the CPU loads the instruction to be used into the registers, correct? Does this happen in parallel with the fetching of data?

Usually, yes. But it depends on what you mean by data. If you mean the data to be operated on, then on most architectures with separate data and instruction caches, yes, those fetches happen in parallel.

3. Branch prediction. AFAIK branch prediction is all about trying to grab the next piece of data (instruction?) to be used. Where is that data stored? And, on a related subject, where does data reside after it exits one stage of the pipeline while it waits to enter the next?

Branch prediction occurs on branches, that is, when instead of going to the next instruction you jump to some other instruction. When running a program, the processor's default assumption is to increment the program counter (the address of the current instruction) by one instruction length (4 bytes on most RISC architectures). When a branch occurs, the target address, as well as whether to go to that address or not, is predicted. For example, say the program has:

0x00000004 - Load instruction
0x00000008 - Add instruction
0x0000000C - Branch to 0x00000014 if result was 0
0x00000010 - Add instruction
0x00000014 - Store instruction

When the CPU is executing 0x0000000C, it will guess whether to fetch 0x00000010 (next instruction) or 0x00000014 (instruction to jump to if the result of the add was 0).
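For reference, the listing above is roughly what a compiler might emit for a C fragment like this (names invented; a real compiler's output will differ in detail):

int example(const int *p, int y)
{
    int x = *p;      /* 0x00000004: load                        */
    int r = x + y;   /* 0x00000008: add                         */
    if (r != 0)      /* 0x0000000C: branch to 0x14 when r == 0  */
        r = r + 1;   /* 0x00000010: add (skipped when r == 0)   */
    return r;        /* 0x00000014: store the result            */
}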

Pipeline stages are separated by pipeline registers. At the end of a cycle, the pipeline registers are clocked to capture the results of their stage; those results then drive the inputs of the next stage.
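One way to picture a pipeline register is how a software simulator would model it; this is a toy sketch with invented field names, not any real core:

#include <stdio.h>
#include <stdint.h>

/* Toy model of one pipeline register: the latch between the fetch
   stage and the decode stage. */
struct if_id {
    uint32_t pc;      /* address of the fetched instruction */
    uint32_t instr;   /* the instruction bits themselves    */
};

int main(void)
{
    struct if_id latch = {0, 0};

    /* Each iteration is one clock cycle: fetch computes its result,
       and "clocking" the register means decode sees that result on
       the *next* cycle. */
    for (uint32_t cycle = 1; cycle <= 3; cycle++) {
        struct if_id next = { cycle * 4, 0xA0 + cycle };

        printf("cycle %u: decode sees pc=%u, fetch produced pc=%u\n",
               (unsigned)cycle, (unsigned)latch.pc, (unsigned)next.pc);

        latch = next;   /* clock edge: capture the stage's output */
    }
    return 0;
}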

4. Are the cache and registers emptied the same way system RAM is emptied?

I'm not sure what you mean by this.
 
Did I say API? I meant kernels. This is great info. The CPU provides the caches and hardware prefetchers to help manage the data, so the caches are transparent to the programmer. What I meant about kernels is that most are compute-bound and have working sets that can be tuned to fit inside the cache, while some kernels can't be tuned that way without a loss of performance. Radix is an example of that. Is this correct?
 

Yeah, like I mentioned above, programmers can write code that is cache-aware, structured to minimize the limitations of the cache. This is done by trying to reduce the size of the program, the size of the data blocks that are accessed in sequence or within short time periods, and so forth. Although I'm not sure Radix is an example of this: my recollection is that radix was a data structure, a tree. Most of the time when we talk about trees, we talk about linked memory, which is difficult to optimize for the cache since you can easily jump around in the memory space from one access to the next. I'm sure one could find ways to address this; I just don't recall anything offhand when it comes to trees and caches.
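The contrast between linked and contiguous layouts, in miniature (a C sketch with an invented node type):

#include <stdlib.h>

/* Linked node: each 'next' can point anywhere in the heap, so a
   traversal may take a cache miss on every hop. */
struct node {
    int value;
    struct node *next;
};

static long sum_list(const struct node *n)
{
    long s = 0;
    for (; n != NULL; n = n->next)   /* pointer chase: poor locality */
        s += n->value;
    return s;
}

/* Array walk: consecutive addresses, so cache lines (and hardware
   prefetchers) do most of the work. */
static long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

int main(void)
{
    struct node tail = { 2, NULL };
    struct node head = { 1, &tail };
    int arr[2] = { 1, 2 };
    return (int)(sum_list(&head) - sum_array(arr, 2));  /* 0: same sums */
}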
 