How exactly does a CPU work?

BigToque

Lifer
Oct 10, 1999
11,700
0
76
Are there any websites that explain the process?

I understand a little bit about programming, but I have absolutely no understanding of software that interacts with hardware.

Eg. I type 1+1 into calculator, something gets sent to the CPU and the CPU returns the result of 2. How does that work?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Since it's not clear what your background is, this is probably a decent starting point. If you want a detailed understanding, this book will get you there.
 

Cogman

Lifer
Sep 19, 2000
10,283
135
106
The links that CTho gave are fairly good for explaining it. Basically, when you push 1 + 1 = on the calculator you are initializing different registers. When you push 1, that value is held in a temp register (in case you want to tack another digit onto it); when you push +, the 1 is sent to a register to wait to be processed. The calculator also stores somewhere that a + was pushed, to be applied when the next operator arrives. When the next operator is pushed, the calculator routes the data through the specific gates for the intended operation. For addition, that is a full adder per bit (two XOR gates, two AND gates, and an OR gate). For multiplication, the simplest scheme just repeats the addition x number of times. For division, I believe they use a bit-shift operation (I haven't done division yet).

But that is the basic principle: push data into a register, then push the data from the registers through the specific instruction that you want to perform on it. x86 CPUs have hundreds of instructions, so it falls to the compiler to arrange your code according to the available instruction set.
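In case it helps to see the gate level concretely: a full adder is conventionally built from two XOR gates, two AND gates, and an OR gate, and chaining one per bit gives a ripple-carry adder. A minimal sketch in C++ (the function names are just for illustration, not anything a real calculator uses):

```cpp
#include <cstdint>

// One full adder: two XORs produce the sum bit; two ANDs and an OR produce carry.
void full_adder(int a, int b, int cin, int& sum, int& cout) {
    sum  = (a ^ b) ^ cin;
    cout = (a & b) | ((a ^ b) & cin);
}

// Ripple-carry addition of two 8-bit values: one full adder per bit,
// with each carry "rippling" into the next bit position.
uint8_t ripple_add(uint8_t x, uint8_t y) {
    int carry = 0;
    uint8_t result = 0;
    for (int i = 0; i < 8; ++i) {
        int s, c;
        full_adder((x >> i) & 1, (y >> i) & 1, carry, s, c);
        result |= static_cast<uint8_t>(s) << i;
        carry = c;
    }
    return result;  // wraps modulo 256, like real 8-bit hardware
}
```

So 1 + 1 really is just bits flowing through a handful of gates.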
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Cogman
The links that CTho gave are fairly good for explaining it. Basically, when you push 1 + 1 = on the calculator you are initializing different registers. When you push 1, that value is held in a temp register (in case you want to tack another digit onto it); when you push +, the 1 is sent to a register to wait to be processed. The calculator also stores somewhere that a + was pushed, to be applied when the next operator arrives. When the next operator is pushed, the calculator routes the data through the specific gates for the intended operation. For addition, that is a full adder per bit (two XOR gates, two AND gates, and an OR gate). For multiplication, the simplest scheme just repeats the addition x number of times. For division, I believe they use a bit-shift operation (I haven't done division yet).

But that is the basic principle: push data into a register, then push the data from the registers through the specific instruction that you want to perform on it. x86 CPUs have hundreds of instructions, so it falls to the compiler to arrange your code according to the available instruction set.

When you're designing for performance, the multiplier isn't done with repeated additions. Can you imagine a million cycles to compute the product of something times a million? Instead we build multipliers that use Booth encoders, followed by a Wallace tree, and actually only ONE stage of full addition.
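For contrast with that hardware approach, even the simple shift-and-add method takes one step per multiplier *bit*, not one step per multiplier *value*. A software sketch (illustrative only; real multipliers compute all partial products in parallel):

```cpp
#include <cstdint>

// Shift-and-add multiplication: one conditional add per bit of the multiplier,
// so at most 32 steps for 32-bit operands. Hardware goes further still:
// Booth encoding roughly halves the partial products, and a Wallace tree
// compresses them in O(log n) gate delays before one final full addition.
uint32_t shift_add_mul(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)         // low multiplier bit set:
            product += a;  // add the shifted multiplicand
        a <<= 1;           // next partial product is twice as big
        b >>= 1;           // consume one multiplier bit
    }
    return product;
}
```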

Division by bit-shifting only works for the convenient case of dividing by a power of 2. Instead, the usual method is an iterative process where each cycle yields some number of bits of the quotient.
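A software sketch of that iterative idea is restoring division, where each pass produces exactly one quotient bit (illustrative only; it assumes a nonzero divisor and ignores overflow for very large divisors):

```cpp
#include <cstdint>

// Restoring division: walk the dividend from its top bit down, shifting one
// bit at a time into a running remainder; whenever the remainder reaches the
// divisor, subtract and record a 1 in that quotient position. One bit of the
// quotient is produced per iteration, mirroring an iterative hardware divider.
void restoring_div(uint32_t dividend, uint32_t divisor,
                   uint32_t& quotient, uint32_t& remainder) {
    quotient = 0;
    remainder = 0;
    for (int i = 31; i >= 0; --i) {
        remainder = (remainder << 1) | ((dividend >> i) & 1); // bring down a bit
        if (remainder >= divisor) {   // trial subtraction succeeds
            remainder -= divisor;
            quotient |= 1u << i;      // set this quotient bit
        }
    }
}
```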
 

Born2bwire

Diamond Member
Oct 28, 2005
9,840
6
71
Originally posted by: TuxDave
Originally posted by: Cogman
The links that CTho gave are fairly good for explaining it. Basically, when you push 1 + 1 = on the calculator you are initializing different registers. When you push 1, that value is held in a temp register (in case you want to tack another digit onto it); when you push +, the 1 is sent to a register to wait to be processed. The calculator also stores somewhere that a + was pushed, to be applied when the next operator arrives. When the next operator is pushed, the calculator routes the data through the specific gates for the intended operation. For addition, that is a full adder per bit (two XOR gates, two AND gates, and an OR gate). For multiplication, the simplest scheme just repeats the addition x number of times. For division, I believe they use a bit-shift operation (I haven't done division yet).

But that is the basic principle: push data into a register, then push the data from the registers through the specific instruction that you want to perform on it. x86 CPUs have hundreds of instructions, so it falls to the compiler to arrange your code according to the available instruction set.

When you're designing for performance, the multiplier isn't done with repeated additions. Can you imagine a million cycles to compute the product of something times a million? Instead we build multipliers that use Booth encoders, followed by a Wallace tree, and actually only ONE stage of full addition.

Division by bit-shifting only works for the convenient case of dividing by a power of 2. Instead, the usual method is an iterative process where each cycle yields some number of bits of the quotient.

Which is something to keep in mind when programming. If you're going to divide by the same number a few times or more, it's better to store its inverse and multiply by that instead of dividing (same with constants: multiply by 0.1 instead of dividing by 10). Another trick is to use nested polynomial evaluation (Horner's method), which reduces the number of multiplications.
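Those two tricks, a stored reciprocal and a nested (Horner) polynomial, might look like this in C++; the particular constant and polynomial here are just examples:

```cpp
// Trick 1: pay for one division up front, then multiply on every call.
double scale_by_tenth(double x) {
    static const double INV_10 = 1.0 / 10.0;  // one division, done once
    return x * INV_10;                        // cheap multiply per call
}

// Trick 2: Horner's method. Evaluating 2x^3 + 3x^2 + 5x + 7 term by term
// costs 6 multiplications; the nested form ((2x + 3)x + 5)x + 7 costs 3.
double poly_horner(double x) {
    return ((2.0 * x + 3.0) * x + 5.0) * x + 7.0;
}
```

Note the reciprocal trick trades a tiny bit of accuracy for speed, since 1/10 is not exactly representable in binary floating point.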

I wouldn't add anything more to what CTho9305 posted. Wikipedia is always a good source for preliminary material, and I have a copy of "Computer Organization and Design" right next to me on my bookshelf. With a topic as wide and complex as a CPU, it would be better if the OP could ask specific questions after reviewing the offered materials.
 

Cogman

Lifer
Sep 19, 2000
10,283
135
106
Does that really make a difference? I would think there would be a significant penalty for using floats over ints (especially when you aren't converting between floats and ints). I guess I'll have to pull out the good ol' profiler.

#include <cstdlib>
#include <iostream>

const double PI = 3.14159265358979;
const double INV_PI = 1 / PI;

using namespace std;

void WasteOTimeInt()
{
    for (int i = 0; i < 999999999; ++i)
        i / 3;       // result discarded; int / int
}

void WasteOTimeFloat()
{
    for (int i = 0; i < 999999999; ++i)
        i * INV_PI;  // result discarded; int * double
}

void WasteOTimeConv()
{
    for (int i = 0; i < 999999999; ++i)
        i / PI;      // result discarded; int / double
}

int main(int argc, char *argv[])
{
    WasteOTimeInt();
    WasteOTimeFloat();
    WasteOTimeConv();

    system("PAUSE");
    return EXIT_SUCCESS;
}

Here is my program, for all who are interested. It turns out that an integer / float is faster than an integer * float (usually it is about the same), which is faster than an integer / integer (that surprised me). I used MinGW GCC 3.4.2. I tried it both with and without optimizations.

So I guess you were somewhat right: an integer / integer is slow, so don't do it, but an integer / float is just as fast as an integer * float.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Cogman
Does that really make a difference? I would think there would be a significant penalty for using floats over ints (especially when you aren't converting between floats and ints). I guess I'll have to pull out the good ol' profiler.

#include <cstdlib>
#include <iostream>

using namespace std;

void WasteOTimeInt()
{
    for (int i = 0; i < 99999999; ++i)
        i / 10;    // result discarded; int / int
}

void WasteOTimeFloat()
{
    for (int i = 0; i < 99999999; ++i)
        i * 0.1;   // result discarded; int * double
}

void WasteOTimeConv()
{
    for (int i = 0; i < 99999999; ++i)
        i / 10.0;  // result discarded; int / double
}

int main(int argc, char *argv[])
{
    WasteOTimeInt();
    WasteOTimeFloat();
    WasteOTimeConv();

    system("PAUSE");
    return EXIT_SUCCESS;
}

Here is my program, for all who are interested. It turns out that an integer / float is faster than an integer * float (usually it is about the same), which is faster than an integer / integer (that surprised me). I used MinGW GCC 3.4.2. I tried it both with and without optimizations.

So I guess you were somewhat right: an integer / integer is slow, so don't do it, but an integer / float is just as fast as an integer * float.

I like what you're attempting, but I want you to change the i/10.0 to i/PI and the i*0.1 to i*(whatever 1/PI is). Floating-point division takes a varying number of cycles to complete. A simple divide by 1 or divide by 2 has early-out conditions that terminate the division process in practically a single-cycle op.

Integer/integer I would expect to be slow too, but I didn't think it would be slower than float/float. Can you also make sure that you are NOT using a single-precision float? I want to make sure the integer and floating-point numbers you are using have roughly the same bit-width, and that the operations return similar bit-width results.
 

Cogman

Lifer
Sep 19, 2000
10,283
135
106
Originally posted by: TuxDave

I like what you're attempting, but I want you to change the i/10.0 to i/PI and the i*0.1 to i*(whatever 1/PI is). Floating-point division takes a varying number of cycles to complete. A simple divide by 1 or divide by 2 has early-out conditions that terminate the division process in practically a single-cycle op.

Integer/integer I would expect to be slow too, but I didn't think it would be slower than float/float. Can you also make sure that you are NOT using a single-precision float? I want to make sure the integer and floating-point numbers you are using have roughly the same bit-width, and that the operations return similar bit-width results.

Edited to reflect what you said, but the results are almost exactly the same: doing the inverse multiplication is SLIGHTLY faster than doing the straight division. Integer / integer is still the slowest. I didn't do a float / float comparison (I hope I didn't claim I did), as it wouldn't quite work with this method of testing.

Overall, my conclusion is that unless you are doing VAST amounts of data processing, there is very little penalty for using the division operator; you would be much better off optimizing something else (at least with the GCC C++ compiler).
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Cogman
Originally posted by: TuxDave

I like what you're attempting, but I want you to change the i/10.0 to i/PI and the i*0.1 to i*(whatever 1/PI is). Floating-point division takes a varying number of cycles to complete. A simple divide by 1 or divide by 2 has early-out conditions that terminate the division process in practically a single-cycle op.

Integer/integer I would expect to be slow too, but I didn't think it would be slower than float/float. Can you also make sure that you are NOT using a single-precision float? I want to make sure the integer and floating-point numbers you are using have roughly the same bit-width, and that the operations return similar bit-width results.

Edited to reflect what you said, but the results are almost exactly the same: doing the inverse multiplication is SLIGHTLY faster than doing the straight division. Integer / integer is still the slowest. I didn't do a float / float comparison (I hope I didn't claim I did), as it wouldn't quite work with this method of testing.

Overall, my conclusion is that unless you are doing VAST amounts of data processing, there is very little penalty for using the division operator; you would be much better off optimizing something else (at least with the GCC C++ compiler).

My original thought before this thread was (from fastest to slowest)
INT Addition, INT Multiplication, FP Mult, INT Divide, FP Divide, FP Square Roots.

I just can't recall if there's anything special about FP multiplication. But yeah, this is very interesting stuff, and I wish I knew why your numbers show otherwise. I also wish I knew what your compiler was doing, to make sure it doesn't change your divisions into multiplications. But then again, I'm a programming noob, so I have no idea what compilers will or will not do.

One last request! Can you create two very long FP arrays with random FP numbers, then divide them entry-wise and multiply them entry-wise? I'm just wondering whether dividing by the same constant over and over again is doing anything.
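A possible sketch of that entry-wise test (the array length, seed, and clock()-based timing are arbitrary choices here, not anything from the thread):

```cpp
#include <cstdlib>
#include <ctime>
#include <utility>
#include <vector>

// Entry-wise multiply vs entry-wise divide over random data, so the divisor
// differs every iteration and the compiler can't hoist a reciprocal.
// Returns {multiply_seconds, divide_seconds}.
std::pair<double, double> time_entrywise(std::size_t n) {
    std::vector<double> a(n), b(n), out(n);
    std::srand(42);
    for (std::size_t i = 0; i < n; ++i) {
        // Keep values in [0.5, 1.5] so we never divide by (nearly) zero.
        a[i] = std::rand() / static_cast<double>(RAND_MAX) + 0.5;
        b[i] = std::rand() / static_cast<double>(RAND_MAX) + 0.5;
    }

    std::clock_t t0 = std::clock();
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i];
    std::clock_t t1 = std::clock();
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] / b[i];
    std::clock_t t2 = std::clock();

    return std::make_pair((t1 - t0) / static_cast<double>(CLOCKS_PER_SEC),
                          (t2 - t1) / static_cast<double>(CLOCKS_PER_SEC));
}
```

Writing into `out` also gives the loops a visible side effect, so an optimizer is less likely to delete them outright.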
 

Cogman

Lifer
Sep 19, 2000
10,283
135
106
Originally posted by: TuxDave
My original thought before this thread was (from fastest to slowest)
INT Addition, INT Multiplication, FP Mult, INT Divide, FP Divide, FP Square Roots.

I just can't recall if there's anything special about FP multiplication. But yeah, this is very interesting stuff, and I wish I knew why your numbers show otherwise. I also wish I knew what your compiler was doing, to make sure it doesn't change your divisions into multiplications. But then again, I'm a programming noob, so I have no idea what compilers will or will not do.

:) Well, it just goes to show that you can't claim one programming style is "faster" than another without first profiling it. Optimization is a tricky thing because you don't know how much of it is already being done by the compiler. Chances are that another compiler could give very different results than what I got.

Anyways, that was a fun segue from the OP. So, yeah, CPUs just work :)

[edit]GCC is open source if you ever want to really learn what the compiler is doing :D[/edit]
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Cogman
<lots of code>

Did you disassemble the resulting binary to make sure the compiler is doing what you think it's doing? For simple loops like these, you have to be very careful to check that you're measuring what you think you're measuring. When your loop bodies are as short as yours, you also have to pay attention to microarchitectural constraints: every mainstream x86 processor other than the Core 2 line takes a 1-cycle penalty on taken branches, so for a single-instruction loop body the loop overhead dominates.
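Short of disassembling, one cheap safeguard is to route each result through a volatile, which obliges the compiler to actually perform the operation. A sketch (not the code as posted in this thread):

```cpp
// A statement like `i / PI;` on its own is dead code and may be deleted
// entirely, leaving you timing an empty loop. A store to a volatile cannot
// be optimized away, so each division must really happen.
volatile double sink;

void waste_time_div(int iters, double divisor) {
    for (int i = 0; i < iters; ++i)
        sink = i / divisor;  // volatile store: the divide survives optimization
}
```

Even then, the loop overhead and branch penalties still color the measurement, which is where reading the disassembly pays off.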

When it comes to cycle counts of instructions, you can look them up in AMD's optimization guides: K7 (appendix F), K8 (appendix C), Family 10h (appendix C).
 

Cogman

Lifer
Sep 19, 2000
10,283
135
106
Originally posted by: CTho9305
Originally posted by: Cogman
<lots of code>
When it comes to cycle counts of instructions, you can look them up in AMD's optimization guides: K7 (appendix F), K8 (appendix C), Family 10h (appendix C).

True, but when it comes right down to it, each function receives the same penalty as the last. The purpose wasn't to get the fastest loop possible but rather to measure which operation was faster. The only reason I had a loop in there is that performing the operation just once would be way too fast to measure. (I don't know how I would measure clock cycles.)