Born2bwire

Diamond Member
Oct 28, 2005
9,840
6
71
Originally posted by: Vegitto
Well, how does a CPU work? I don't get it, and I can't find it.. Help?
Here's a quick synopsis: http://www.pcworld.com/howto/article/0,aid,16682,pg,2,00.asp

The instructions can be thought of as being held in an instruction cache. The registers and your data cache are treated as separate entities in a CPU block diagram. The two caches (instruction and data, that is) can be one shared cache or they can be separate (it's easier to think about if they're separate). The instruction pointer is quite simply a register that holds the address of your next instruction. So what happens is that the CPU looks at the instruction pointer and retrieves the instruction at that address from the instruction cache. If the instruction is not in the cache, then we need to check higher levels of memory. In your home PC this means going from L1 cache -> L2 cache -> RAM -> hard disk. What happens inside the caches and memory is a whole other story, but for now it's easy to think of them as a black box that will retrieve your instructions (and data).

The instruction itself is just an encoded bit pattern (in microcoded designs it expands into microinstructions), so there is a decoding unit that decodes the bitstream to generate all the control signals the CPU needs. Once we have fetched our instruction and decoded it, we increment the pointer to the next instruction (the simplest case is to assume the instructions are laid out linearly in memory). We then proceed to the registers. The registers are not the data cache; they are just a small number of storage locations that hold data for fast access, while the data cache is part of the CPU's general memory hierarchy. From the registers we retrieve any data our instruction needs, like the two numbers for an ADD. The register outputs are sent to an ALU (Arithmetic Logic Unit), which performs whatever arithmetic or binary logic operation is required: an addition, a subtraction, a bitwise AND, and so on. The ALU output then goes to the memory stage (your data cache): if needed, we write the ALU result to memory, or we read data from memory. Finally, the result goes back to the registers, in case we want to write it there too (whether it came from the ALU or from memory).
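To make that loop concrete, here is a rough C sketch of a toy machine that fetches, decodes, reads registers, runs the ALU, and writes back. The instruction encoding (4-bit opcode plus two 2-bit register fields) and the opcode names are made up purely for illustration, not any real ISA:

```c
#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0, OP_SUB = 1, OP_HALT = 15 };   /* invented opcodes */

int main(void) {
    uint8_t imem[] = {                  /* "instruction cache": opcode<<4 | rd<<2 | rs */
        (OP_ADD << 4) | (0 << 2) | 1,   /* R0 = R0 + R1 */
        (OP_SUB << 4) | (2 << 2) | 0,   /* R2 = R2 - R0 */
        (OP_HALT << 4)
    };
    int32_t reg[4] = { 3, 4, 10, 0 };   /* register file */
    unsigned pc = 0;                    /* instruction pointer */

    for (;;) {
        uint8_t inst = imem[pc++];      /* fetch, then bump the pointer */
        unsigned op = inst >> 4;        /* decode the bit fields */
        unsigned rd = (inst >> 2) & 3;
        unsigned rs = inst & 3;
        if (op == OP_HALT) break;
        int32_t a = reg[rd], b = reg[rs];              /* read the registers */
        int32_t alu = (op == OP_ADD) ? a + b : a - b;  /* ALU */
        reg[rd] = alu;                                 /* writeback */
    }
    printf("R0=%d R2=%d\n", reg[0], reg[2]);   /* prints R0=7 R2=3 */
    return 0;
}
```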

Memory is always going to be very slow compared to the CPU, so we do not access it on every instruction. Ideally, we work only with the registers and write the results out to memory as needed. We can also update the instruction pointer if we are doing a branch (for loop, jump, if statement, etc.). In a simple CPU, the branch condition is determined by the ALU. For example, for a greater-than, equal-to, or less-than test, we subtract the two operands in the ALU and look at the result. After the ALU stage, we can then go back and change the instruction pointer to the required memory location (the branch target is encoded in, or referenced by, the instruction).
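Here is a hedged sketch of how a branch can reuse the ALU: subtract in the ALU, then overwrite the instruction pointer with a target encoded in the instruction if the result says so. The tiny instruction format ('d' = decrement, 'b' = branch if not zero, 'h' = halt) is invented for illustration:

```c
#include <stdio.h>

struct inst { char op; int target; };   /* made-up format: operation + branch target */

int main(void) {
    struct inst prog[] = { { 'd', 0 }, { 'b', 0 }, { 'h', 0 } };
    int r0 = 3;                 /* loop counter held in a register */
    unsigned pc = 0;            /* instruction pointer */

    while (prog[pc].op != 'h') {
        struct inst i = prog[pc++];           /* fetch and advance */
        if (i.op == 'd') {
            r0 = r0 - 1;                      /* the ALU does the subtraction */
        } else if (i.op == 'b' && r0 != 0) {  /* the ALU result decides the branch */
            pc = i.target;                    /* redirect the instruction pointer */
        }
    }
    printf("r0=%d\n", r0);      /* prints r0=0 after looping three times */
    return 0;
}
```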

So we go from Instruction Fetch -> Registers -> ALU -> Memory -> Registers -> Instruction Fetch. There is a vast amount of complexity added on top of this in today's CPUs: branch prediction, forwarding, pipelining, scoreboarding, all of which are ways of increasing the speed and throughput of the processor. We also do not always use every single stage. Sometimes we do not need to read the registers, sometimes we do not need to access memory, and other times we do not need an ALU operation at all, just a pass-through. From this, you can already guess that a better way of doing things is to look at the instruction and skip the stages that are unnecessary. This is where pipelining and forwarding become relevant.

A pipeline is like doing the laundry (oh god, I'm using an analogy). You wash, dry, and fold your laundry. But nobody washes, dries, and folds one load before starting the next. You wash the first load; then you dry the first load while washing the second; then you fold the first, dry the second, and wash the third; and so on. You operate in parallel. Since we have structured the CPU in stages, why not run the stages in parallel? That is, while we are waiting for the first instruction to finish with the register unit, we can retrieve the next instruction from the instruction cache. This is called pipelining. A simple way to do it is to add a bank of registers between each pair of stages to hold the previous stage's output and act as the input to the next stage. This way we can hold the old output and keep the earlier stage busy without losing data. It also allows us to bypass unneeded stages. For example, memory access is extremely time consuming, and we do not need it every time. Instead of waiting for the memory stage to finish before advancing the pipeline, we can take the result from the ALU and have it bypass memory, going straight to the output (register writeback). Then the only time we need to stop the pipeline (stall) is when we have to wait for memory to finish so that the ALU output can actually interact with the memory stage.
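Here's a rough sketch of the laundry idea in C: five instructions moving through a five-stage pipe (fetch, register read, ALU, memory, writeback). Each cycle, every instruction in flight advances one stage, so a new instruction can enter fetch while older ones are still working their way through. Stalls, forwarding, and branches are all ignored in this toy model:

```c
#include <stdio.h>

int main(void) {
    const char *stage[] = { "IF", "RD", "EX", "MEM", "WB" };
    const int n_stages = 5, n_insts = 5;

    /* Print which stage each instruction occupies on each cycle. */
    for (int cycle = 0; cycle < n_insts + n_stages - 1; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int i = 0; i < n_insts; i++) {
            int s = cycle - i;              /* stage occupied by instruction i */
            if (s >= 0 && s < n_stages)
                printf("  I%d:%s", i + 1, stage[s]);
        }
        printf("\n");
    }
    return 0;
}
```

Once the pipe is full (cycle 5), one instruction finishes every cycle instead of every five.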

All that I have explained is off the top of my head from what I remember so I'm sure there are corrections and elaborations in order.
 

Titan

Golden Member
Oct 15, 1999
1,819
0
0
I'm bored enough to reply to this very vague question. But don't trust me too much; Google, Wikipedia, and HowStuffWorks are handy. I'm a software guy who has a taste for hardware, so here's what I know.

1) Electricity. They require voltage and current, like any circuit. Electricity is just electrons flowing through a conductor; in the modern case, the juice flows through doped regions of silicon.

2) Transistors. My first C++ class prof, who was an EE guy, told us on day one that a transistor is basically a switch that turns on and off. The interesting thing here is that Power = Voltage * Current, and in an ideal switch one of the two is always zero: when the switch is off, no current flows, and when it is fully on, there is essentially no voltage drop across it. So, under textbook conditions, a transistor sitting in either state consumes no power. In reality, there is resistance and other electrical factors that cause them to waste power as heat. But if it were all ideal, it would drain no power.

Transistors form the basis for binary logic: 1 or 0, on or off. They are grouped together to form logic "gates," which implement logical operations like "and," "or," "not," and combinations like "nand." By linking such gates together you can do things like add 2 binary numbers.
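As a sketch of how gates turn into arithmetic, here is a one-bit full adder written only in terms of AND, OR, and XOR, then chained bit by bit (a ripple-carry adder) to add two small numbers. Real ALUs use faster carry schemes, but the principle is the same:

```c
#include <stdio.h>

/* One-bit full adder expressed purely as gate operations. */
static void full_adder(int a, int b, int cin, int *sum, int *cout) {
    *sum  = a ^ b ^ cin;                    /* XOR gates produce the sum bit */
    *cout = (a & b) | (cin & (a ^ b));      /* AND/OR gates produce the carry */
}

int main(void) {
    int a = 5, b = 3;                       /* add 5 + 3, four bits at a time */
    int sum = 0, carry = 0;
    for (int i = 0; i < 4; i++) {           /* ripple the carry from bit to bit */
        int s;
        full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        sum |= s << i;
    }
    printf("%d + %d = %d\n", a, b, sum);    /* prints 5 + 3 = 8 */
    return 0;
}
```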

3) Clock frequency. Everything in a computer runs off a high-speed clock. The CPU syncs to this clock, while other components run slower. Each tick of the CPU clock is a "cycle," and some things take more than one cycle for the CPU to do.

4) Computer architecture. This is a whole senior-level class in college. The main thing a CPU does is interface with memory and process the data it gets from memory. Nowadays the ALU is built onto the same chip, so the math is done there too. The most essential pieces are the registers. A register is where a number loaded from RAM goes. Registers are X bits wide: a Pentium 4 has 32-bit registers and an Opteron has 64-bit registers. That width is the biggest number the hardware can represent in one register, and it becomes the hardware limitation. You can't count to infinity with 32 bits; you top out at around 4 billion (2^32). With 64 bits, the limit is roughly 4 billion squared. If you want to count higher, you have to get clever in software, by using more data and, ultimately, more registers.
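A small sketch of the "you top out at around 4 billion" point: a 32-bit register holds 2^32 distinct values, unsigned arithmetic wraps around when it exceeds that, and going bigger in software means spreading a number across more storage (which is exactly what bignum libraries do):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t max32 = UINT32_MAX;                      /* 4,294,967,295 = 2^32 - 1 */
    printf("32-bit max: %u\n", max32);
    printf("wraps to:   %u\n", (uint32_t)(max32 + 1));/* overflow wraps back to 0 */

    uint64_t max64 = UINT64_MAX;                      /* about 1.8 * 10^19, i.e. ~(4 billion)^2 */
    printf("64-bit max: %llu\n", (unsigned long long)max64);
    return 0;
}
```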

Most modern CPUs have at least a dozen registers. An Itanium has 128 registers, IIRC. Some are used for very specific things. One may hold the address in RAM of the program it is running, and that register tells the CPU where the next instruction is coming from. Your computer works because the software that is loaded into RAM (technically, I should say "memory") is executed by the CPU. The CPU reads the instruction code, decodes it, and performs the task the code specifies, like adding two numbers. Depending on how the ISA (Instruction Set Architecture) is designed, this could take 3 instructions: load number 1 into register 1, load number 2 into register 2, add them and put the result into register 1. Because you can add, you can multiply; you can subtract if you have a way to represent negative numbers; and because you can multiply, you can divide. At bottom, a CPU is just a complicated adding machine that directs traffic into and out of a memory system.
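As a hedged illustration of that three-step sequence, here is how one line of C decomposes into load / load / add / store on a load-store style machine; the "registers" are just variables standing in for register 1 and register 2:

```c
#include <stdio.h>

int main(void) {
    int memory[3] = { 40, 2, 0 };   /* number 1, number 2, and a result slot */

    int r1 = memory[0];             /* load number 1 into register 1 */
    int r2 = memory[1];             /* load number 2 into register 2 */
    r1 = r1 + r2;                   /* add them, result into register 1 */
    memory[2] = r1;                 /* store the result back to memory */

    printf("result = %d\n", memory[2]);   /* prints result = 42 */
    return 0;
}
```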

4a) Pipelines. My favorite part of comp arch was learning how pipelines work. It's just like an assembly line. Say you need to perform one task, like build one car, or execute one add instruction. Both of these things take a set amount of time; adding 2 numbers may take 4 steps: fetch the instruction, decode the instruction, execute the instruction, write out the final result. That would mean every add takes 4 CPU cycles. But if we break the work into those 4 steps and overlap them, and the CPU has lots of add instructions, we can have 4 adds in flight in the time it used to take to do one. If building a car takes 100 steps, we can keep 100 cars in progress at once, and once the line is full we finish one car per step instead of one per 100 steps, provided we have enough resources and actually need that many cars.

But computer programs cause a problem because their control flow is not known in advance. The programmer can tell the CPU: if register 1 = 0, branch to instruction X; otherwise, don't branch. This changes the flow of program execution. So branching can mean your pipeline has the wrong code queued up, and it has to waste cycles to get the right code. If you have a 4-stage pipe, you could waste 3 cycles because you didn't have the right code in the pipe.

Electrically and logically, it is easier to increase clock speed (MHz) by lengthening the pipeline. If you break your car assembly line from 100 steps into 1000, each step is simpler, so each one can be done faster. But with a pipeline that long, if you, say, change car models on the line, you have to stall, and the other 999 stages wait for their work to come down the pipe. A Pentium 4 Northwood has a 20-stage pipeline and a Prescott has 31; Intel did this to push the GHz as far as they could go. In some benchmarks they pay a penalty because their pipeline stalls cost more than on a 12- or 13-stage pipeline, which is more common for things like AMD chips. CPUs are made fast by predicting whether or not code will branch, to avoid these kinds of stalls. The hardware that does this is called a branch predictor.

Most modern CPUs have multiple pipelines to increase parallelism. If you know you are going to have a lot of data, like in graphics, adding more pipelines to a GPU to split up the work increases overall performance. CPUs also have separate pipelines for crunching floating-point numbers, which can take multiple cycles to multiply or divide.

I just felt like spilling what I know. That's the basics. Intel guys feel free to hammer on me and point out where I went wrong or misled.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Born2bwire
A pipeline is like doing the laundry (oh god I'm using an analogy). You wash, dry, and fold your laundry.

Haha... but it's the classical analogy!
 

icarus4586

Senior member
Jun 10, 2004
219
0
0
Hannibal at Ars Technica has written some very good articles about CPU architectures. I remember he did one about the G4 and the K7, one about the G4e and the P4, one about the Pentium line, one about Apple and the PowerPC line... lots and lots of stuff. Most of the stuff I've learned about how CPUs work has been from him. He's actually got a book coming out sometime soon on the topic too.
Anyhow, go to arstechnica.com and search for some of those articles, they're good.

EDIT: Anand and Johann have also written some good stuff on the topic. There's plenty of information if you look for it.
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
CPUs are (in general) von Neumann machines that execute instructions of a predetermined format. Any given ISA (instruction set architecture) has endless possible implementations, depending on the required performance and power envelope. I'd be happy to try to answer specific questions, since this is such a broad topic.
 

Vegitto

Diamond Member
May 3, 2005
5,234
1
0
Well, okay, thanks :). But, what I still don't get is, how do they make it work and all? I just can't see why such a tiny piece of silicon could 'do' in seconds what'd take me years.

Also, just for the fun of it, could you build your own CPU? Just a tiny one? I heard from someone you could buy an IBM kit for $8000 and build a computer (a whole one) yourself from scratch. And I don't mean; put CPU in socket, add heatsink, remove heatsink, add thermal paste, add heatsink etc. etc., but I mean get a few wires and solder.

 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Uh... lots of money and man-hours? From concept to silicon takes as long as 7-8 years. The architecture isn't really "hard". Trying to design sane circuitry to implement architectural ideas is hard.

You can build a simple CPU and burn it into an FPGA... try the MIPS R2000. The tricky part is getting the thing to interface with the outside world....
 

Vegitto

Diamond Member
May 3, 2005
5,234
1
0
Well, I'm trying to grasp the concept of this. Is there any way I could home-build some kind of logic circuit myself?
 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
Originally posted by: Vegitto
Well, okay, thanks :). But, what I still don't get is, how do they make it work and all? I just can't see why such a tiny piece of silicon could 'do' in seconds what'd take me years.

Also, just for the fun of it, could you build your own CPU? Just a tiny one? I heard from someone you could buy an IBM kit for $8000 and build a computer (a whole one) yourself from scratch. And I don't mean; put CPU in socket, add heatsink, remove heatsink, add thermal paste, add heatsink etc. etc., but I mean get a few wires and solder.

The CPU is a piece of programmable logic.
Let's take it in steps:
Let's assume you have four 8-bit registers, an arithmetic unit (add/subtract) and a shift unit.
Now, every one of the registers is connected to the two inputs and output of the arithmetic unit by some wires, and to the input and output of the shift unit by some other wires.
You now have 4 registers, which means two bits are enough to select one of them.
So, if you want to add the first and second registers and put the result in the third register, you would have an instruction like ADD R1, R2, R3. In binary this would be 0000 00 01 10.
The first four zeroes are the instruction code (let's say add is the first instruction in an instruction set of 16 instructions). The next group is the first register, then the second, then the third. So ADD R2, R1, R3 would be 0000 01 00 10.
How do you build that in silicon? Well, the arithmetic unit knows to add its two inputs when a certain pin is set, and to subtract them when a different pin is set. So you use a decoder (demultiplexer) that takes the instruction code as input: for 0000 it activates the adder's line, for 0001 (let's say) it activates the subtractor's line. Then you could have a "move" instruction - let's say 0010. The move command only needs two register fields, but to simplify things it keeps the same width. The decoder's 0010 line activates a different "path" that connects the output of the source register to the input of the destination register.
Note that this simple CPU would be best served by a 10-bit instruction word (in order to accommodate a full instruction). x86 has variable-length instructions (memory was extraordinarily expensive in those days), but that adds a lot of complexity.
You could also use non-encoded (one-hot) instruction codes, where exactly one bit is set for each instruction. This simplifies the design even more, but increases the instruction size (instead of the 4 opcode bits in my example, you would need 16).
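Here is a small C sketch of decoding that 10-bit format, [4-bit opcode][2-bit src1][2-bit src2][2-bit dest], with R1..R4 encoded as 00..11. The opcodes follow the post (0000 = add, 0001 = subtract); everything else is just for illustration:

```c
#include <stdio.h>

int main(void) {
    int reg[4] = { 7, 5, 0, 0 };            /* R1..R4 */
    unsigned inst = 0x006;                  /* 0000 00 01 10 = ADD R1, R2, R3 */

    unsigned opcode = (inst >> 6) & 0xF;    /* top 4 bits choose the unit (the decoder's job) */
    unsigned src1   = (inst >> 4) & 0x3;
    unsigned src2   = (inst >> 2) & 0x3;
    unsigned dest   =  inst       & 0x3;

    if (opcode == 0)      reg[dest] = reg[src1] + reg[src2];   /* adder path */
    else if (opcode == 1) reg[dest] = reg[src1] - reg[src2];   /* subtractor path */

    printf("R3 = %d\n", reg[2]);            /* prints R3 = 12 */
    return 0;
}
```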

Now the "clock cycle" come into effect. The registers generate on output their value all the time, but they load a new value from their input lines only on clock cycle (I think on the rising edge). This way, you can use simple logic that change their outputs to wrong values during the operation (as long as the values are good when the cycle hits, it's ok). An example: you want to get an 1 on exit. You put it as a value OR its negated. When you move from 1 to 0, the OR will get at start a 1 and a 1 negated. Then, it will get an 0 (the signal changed) and a 1 negated (the change didn't moved thru the negation block yet), and then a 0 and a 0 negated. While at first and at end the exit is good, there is a time when the exit is wrong.
This happens all the time in the circuitry, so the clock is there to only load the values when they are stabilised at the good value.
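A tiny C sketch of that glitch, modeling the inverter as lagging one time step behind its input; ideally (signal OR NOT signal) is always 1, but right after the input changes the output briefly drops to 0. A clocked register only samples the output at the clock edge, after everything has settled, so the glitch never reaches stored state:

```c
#include <stdio.h>

int main(void) {
    int a = 1;
    int not_a = !a;              /* inverter output, lags its input by one step */

    for (int t = 0; t < 4; t++) {
        if (t == 1) a = 0;       /* the input changes from 1 to 0 at t = 1 */
        int out = a | not_a;     /* combinational OR, evaluated "right now" */
        printf("t=%d a=%d not_a=%d out=%d\n", t, a, not_a, out);  /* out=0 at t=1: the glitch */
        not_a = !a;              /* the inverter catches up one step later */
    }
    return 0;
}
```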
 

Born2bwire

Diamond Member
Oct 28, 2005
9,840
6
71
Originally posted by: Vegitto
Well, okay, thanks :). But, what I still don't get is, how do they make it work and all? I just can't see why such a tiny piece of silicon could 'do' in seconds what'd take me years.

Also, just for the fun of it, could you build your own CPU? Just a tiny one? I heard from someone you could buy an IBM kit for $8000 and build a computer (a whole one) yourself from scratch. And I don't mean; put CPU in socket, add heatsink, remove heatsink, add thermal paste, add heatsink etc. etc., but I mean get a few wires and solder.
Do a search; there are many examples of home-built computers made from discrete parts - simple logic ICs, or even individual transistors. Generally they are controlled via manual switches, so these aren't Pentiums we're talking about, but they are very impressive in my book. For my computer engineering undergrad design course, we built a pipelined CPU in VHDL, along with the caches for it, so we actually had the logic necessary for a CPU and memory. We could have (if we were insane) gone from that directly to putting together a computer from discrete parts. If you want to understand the very specifics of how a computer works, then start reading up on digital logic. A computer is nothing more than a collection of digital logic units: boolean logic (AND, OR, NOT), multiplexers, decoders, latches, and registers. From those building blocks you can construct a simple computer.
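A hedged sketch of two of those building blocks modeled as plain C functions: a 2-to-1 multiplexer built from AND/OR/NOT, and a D flip-flop that only updates its stored bit when the clock "edge" arrives:

```c
#include <stdio.h>

/* 2-to-1 multiplexer: out = (a AND NOT sel) OR (b AND sel) */
static int mux2(int sel, int a, int b) {
    return (a & !sel) | (b & sel);
}

/* One bit of register state: a D flip-flop captures its input on the clock edge. */
typedef struct { int q; } dff_t;

static void dff_clock(dff_t *ff, int d) {
    ff->q = d;                              /* capture the input at the edge */
}

int main(void) {
    dff_t bit = { 0 };
    dff_clock(&bit, mux2(1, 0, 1));         /* select input b (= 1) and latch it */
    printf("mux2(0,0,1)=%d stored=%d\n", mux2(0, 0, 1), bit.q);  /* prints 0 and 1 */
    return 0;
}
```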

 

icarus4586

Senior member
Jun 10, 2004
219
0
0
The smallest functional unit of a CPU is a transistor. By grouping transistors together you get various logic gates for simple stuff like AND, OR, NOR, NOT, XOR, NAND, etc. Grouping these together gives you bigger units like flip-flops, multiplexers, decoders... Further grouping those, you can make things like adders and subtractors.
The easiest way to get this kind of knowledge is to take a digital systems course.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Vegitto
Well, I'm trying to grasp the concept of this, is there any way I could (home)build some kind of logic circuit by myself?

Yes you can. Go get a breadboard, power supplies, IC logic gates, wires, switches, wire cutters and LEDs, and you're set to build basic logic circuits. Most logic ICs come packaged with maybe 3-4 gates of whatever variety you want. The next part is for you to go read any logic design textbook and build the 'elevator controller,' which I think every EE eventually builds at some point.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
The architecture isn't really "hard". Trying to design sane circuitry to implement architectural ideas is hard.
The architecture nowadays is most definitely hard.

Designing a CPU from parts is going to be hard - even once you've got the CPU built, you need to write code to run on it. Do you know any programming languages?
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Yeah, the architecture is hard, haha. It only seems less complicated when some team has spent the last 5 years hashing out the high-level spec and all you have to do is read a bunch of documents. :)