Why/how does the CPU run at 3.5 GHz, yet the iGPU runs at 1.2 GHz or less?

SOFTengCOMPelec

Platinum Member
May 9, 2013
Example taken: Intel Core i7-4771 (non-K)

Clock Speed: 3.5 GHz
Graphics Max Dynamic Frequency: 1.2 GHz
Graphics Base Frequency: 350 MHz

So taking the Intel Core i7-4771 (non-K) as a reference for a current high-end consumer (non-extreme) CPU (sorry AMD/ARM fans), its maximum integrated graphics frequency is roughly a third of the CPU clock.

3.5 GHz / 1.2 GHz ≈ 2.9, i.e. roughly 3.

Why?

i.e. why is it not 3.5 GHz CPU / 3.5 GHz integrated-graphics compute?

I can understand that it would probably run a lot hotter, but commodity CPUs have previously shipped with 125 W TDPs, so I can't see that being a huge problem.
If it saved buying an expensive graphics card (I'm not sure how valuable that would be), some users would prefer it.
I also understand that it might then need some kind of special "extra" RAM to cope with the video bandwidth, but there would be various solutions to that problem, such as motherboard VRAM or some kind of multi-chip CPU/RAM package.

These days both the CPU and the compute units of the iGPU can be used as general processing resources, so the overall capability of the CPU package would be enhanced if this were possible.

I'm sure there must be some good technical reasons why the clock frequencies are so different. I just can't understand why, at the moment.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
Example

I could do lots of mathematical calculations using AVX2, at 256 bits now, and at 512 bits when AVX-512/AVX3 comes out (probably next year, Skylake etc.).
Which is similar to using a graphics processor as a "compute" unit (SIMD-like = Single Instruction, Multiple Data).
These are done at 3.5 GHz.

So why the big drop down to <=1.2 GHz when they are done in the graphics compute section of the chip?
(Stating the obvious, but the process technology, transistors etc. of the CPU/iGPU are potentially exactly the same.)
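Just to make concrete what I mean by using AVX2 as a "compute" unit, here is a minimal sketch in C (illustrative only; it assumes a CPU and compiler with AVX support, e.g. built with gcc -mavx2):

```c
#include <immintrin.h> /* AVX/AVX2 intrinsics */
#include <stdio.h>

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    /* load 8 floats = 256 bits */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_mul_ps(va, vb); /* 8 multiplies in one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```

One 256-bit instruction does 8 single precision multiplies at once, and it runs at the full ~3.5 GHz core clock, which is why comparing it to the iGPU's compute units seemed natural to me.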
 

Lonyo

Lifer
Aug 10, 2002
Clock speed is partially dictated by the complexity of the pipeline you are running, and how you choose to design your core.
Making a higher-frequency core can be more difficult than making a lower-frequency but wider one (as Intel has already experienced).

GPUs aren't designed for high frequency because it's not particularly beneficial. You can just make them wider since they do incredibly parallel operations. For desktop CPUs, frequency helps because things aren't generally as parallel. A 12 core 1GHz CPU would be slower than a 4 core 3GHz CPU in 99% of workloads.
A 192 "core" GPU at 1GHz is pretty much going to be the same speed as a 64 core 3GHz GPU, but it's easier to make 192 slow cores than 64 fast ones, so that's what they do.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
Clock speed is partially dictated by the complexity of the pipeline you are running, and how you choose to design your core.
Making a higher-frequency core can be more difficult than making a lower-frequency but wider one (as Intel has already experienced).

GPUs aren't designed for high frequency because it's not particularly beneficial. You can just make them wider since they do incredibly parallel operations. For desktop CPUs, frequency helps because things aren't generally as parallel. A 12 core 1GHz CPU would be slower than a 4 core 3GHz CPU in 99% of workloads.
A 192 "core" GPU at 1GHz is pretty much going to be the same speed as a 64 core 3GHz GPU, but it's easier to make 192 slow cores than 64 fast ones, so that's what they do.

Thanks.

Your explanation is making sense to me.

So the GPU section is NOT (generally) putting in the caches, carry-look-ahead logic, register renaming and all the other techniques that advanced CPUs use to squeeze out the last drop of single-thread execution speed.
(I am NOT an expert on GPUs, so some of what I just said may be WRONG; it is just to illustrate a point rather than being 100% accurate.)

Hence, with the given number of transistors available to the GPU (as you have just said), the most parallel set of GPU compute/ALU units can be made (i.e. maximizing the number of available compute nodes), giving the best possible performance per unit of chip area (per transistor), at the expense of needing considerably more parallelism in the workload.

So the simpler logic used in the GPU (compute) sections reduces its maximum clock frequency.
A bit like in the old days of TTL logic: simple/cheap ripple binary counters were relatively slow, as you had to wait for the (possible) carry to "ripple" through.
But the more expensive synchronous binary counters (more transistors/gates, hence more silicon area needed) could achieve a much higher maximum clock rate, with no need to wait for possible carries to propagate.
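A toy timing model of that counter analogy (the delays below are invented, just to show the shape of the trade-off):

```c
#include <stdio.h>

int main(void)
{
    const int    n_bits      = 16;
    const double t_stage_ns  = 2.0; /* assumed delay per counter stage      */
    const double t_lookahead = 3.0; /* assumed extra carry-look-ahead delay */

    /* Ripple counter: worst case the carry passes through every stage,
       so the clock period must cover n_bits stage delays.              */
    double t_ripple = n_bits * t_stage_ns;

    /* Synchronous counter: the carry is worked out in parallel look-ahead
       logic, so the period is roughly one stage plus the look-ahead gates. */
    double t_sync = t_stage_ns + t_lookahead;

    printf("ripple:      %4.0f ns period -> %6.1f MHz max\n", t_ripple, 1000.0 / t_ripple);
    printf("synchronous: %4.0f ns period -> %6.1f MHz max\n", t_sync,   1000.0 / t_sync);
    return 0;
}
```

Same transistors, very different maximum clock rate, purely because of how much logic sits between clock edges, which seems to be essentially the CPU vs GPU situation.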
 

zir_blazer

Golden Member
Jun 6, 2013
Check the Kaveri reviews; many of them make comments regarding that fact, like this. Basically, for GPUs they prefer to use very high density to put more compute units in place rather than to run fewer of them at higher frequencies, and that can also be reflected in manufacturing process decisions, as with Kaveri (though I don't recall Intel saying a lot about their case).
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
Check the Kaveri reviews; many of them make comments regarding that fact, like this. Basically, for GPUs they prefer to use very high density to put more compute units in place rather than to run fewer of them at higher frequencies, and that can also be reflected in manufacturing process decisions, as with Kaveri (though I don't recall Intel saying a lot about their case).

Thanks, yes, the Kaveri reviews (including the one you linked to above) are also shedding light on the CPU/iGPU balance.
I'm amazed that the new Kaveri APUs manage:

to integrate a whopping 2.41 billion transistors into a 245 mm2 Kaveri die, which is an 85% higher transistor count compared to the Trinity/Richland design (246 mm2, 1.303 billion transistors, 32nm SOI).

2.41 billion transistors is a crazily high number of transistors for just one chip.

The article(s) explaining why AMD decided Kaveri would NOT be faster (or not much faster) as regards x86, because the actual IC production node was NOT optimized for CPU frequency, have been a real eye opener for me.
For AMD's sake I hope their HSA (or whatever it's called) works out, but I am somewhat sceptical.
E.g. Intel's/AMD's newer instruction sets, some of which were introduced ten or more years ago, are still not used that much. Microsoft Windows only relatively recently introduced the requirement that the CPU MUST have SSE2 (I think), even though it has been around for ten or more years.
 

Lorne

Senior member
Feb 5, 2001
And it's just a matter of time; then it will work its way up in frequency.
I remember the same argument back when the L2 cache was off-board, then it moved to a daughter board, then on chip.
The argument then moved to a similar discussion with L3, like with AMD's K6, where the off-board L2 effectively became an L3, and claims that L3s would never be needed.
Now both manufacturers' CPUs have them.
Also the on-die MMU.

It just takes time. Also, who knows, this little thread may have flicked on a light in someone's head.
 

witeken

Diamond Member
Dec 25, 2013
i.e. why is it not 3.5 GHz CPU / 3.5 GHz integrated-graphics compute?


I'm sure there must be some good technical reasons why the clock frequencies are so different. I just can't understand why, at the moment.
Because that's how GPUs work: they're for massively multithreaded workloads. It's more power efficient to have 100 cores running at 100 MHz than 1 at 10 GHz. You can also see that in the Haswell MacBook Air: with the same TDP (actually 5 W lower, from 20 W to 15 W), the IGP is about 16% more powerful with 40 EUs than the HD 4000 with 16 EUs.

Similar to the CPU discussion, on the GPU front Haswell has to operate under more serious thermal limits than with Ivy Bridge. Previously the GPU could take the lion’s share of a 17W TDP with 16 EUs, now it has 15W to share with the PCH as well as the CPU and 2.5x the number of EUs to boot. As both chips are built on the same 22nm (P1270) process, power either has to go up or clocks have to come down. Intel rationally chose the latter. What you get from all of this is a much larger GPU, that can deliver similar performance at much lower frequencies. Lower frequencies require lower voltage, which in turn has a dramatic impact on power consumption.

Take the power savings you get from all of this machine width, frequency and voltage tuning and you can actually end up with a GPU that uses less power than before, while still delivering incrementally higher performance. It’s a pretty neat idea. Lower cost GPUs tend to be smaller, but here Intel is trading off die area for power - building a larger GPU so it can be lower power, instead of just being higher performance.

The GPU: Intel HD 5000 (Haswell GT3)
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
Because that's how GPUs work: they're for massively multithreaded workloads. It's more power efficient to have 100 cores running at 100 MHz than 1 at 10 GHz. You can also see that in the Haswell MacBook Air: with the same TDP (actually 5 W lower, from 20 W to 15 W), the IGP is about 16% more powerful with 40 EUs than the HD 4000 with 16 EUs.



The GPU: Intel HD 5000 (Haswell GT3)

For mobile applications especially, what you have just said makes a lot of sense: increase the size of the iGPU, making it more powerful (faster overall), and yet lower its clock frequency to bring it down to the desired TDP.
The result is a part which is both powerful (fast) and yet very power efficient and relatively cool running.
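A rough sketch of why the "wider but slower" trade works out on power, using the usual dynamic power relation P ~ C x V^2 x f (all the numbers below are invented, just to show the shape of it):

```c
#include <stdio.h>

int main(void)
{
    /* relative_power = relative_capacitance * voltage^2 * relative_frequency */

    /* narrow/fast GPU: baseline capacitance, 1.00 V, 1.15 GHz (assumed) */
    double p_fast = 1.0 * (1.00 * 1.00) * 1.15;

    /* wide/slow GPU: 2.5x the EUs (~2.5x capacitance), 0.80 V, 0.55 GHz (assumed) */
    double p_slow = 2.5 * (0.80 * 0.80) * 0.55;

    printf("narrow/fast relative power:    %.2f\n", p_fast);            /* ~1.15 */
    printf("wide/slow   relative power:    %.2f\n", p_slow);            /* ~0.88 */
    printf("wide/slow relative throughput: %.2f\n", 2.5 * 0.55 / 1.15); /* ~1.20 */
    return 0;
}
```

So the wider, lower-clocked GPU can end up both faster in aggregate and lower power, at the cost of die area, which appears to be exactly the trade Intel made with GT3.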
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
And it's just a matter of time; then it will work its way up in frequency.
I remember the same argument back when the L2 cache was off-board, then it moved to a daughter board, then on chip.
The argument then moved to a similar discussion with L3, like with AMD's K6, where the off-board L2 effectively became an L3, and claims that L3s would never be needed.
Now both manufacturers' CPUs have them.
Also the on-die MMU.

It just takes time. Also, who knows, this little thread may have flicked on a light in someone's head.

I know what you mean.
I still remember when computers routinely came WITHOUT any kind of floating point hardware/accelerators built in, and when monitors were all monochrome.

Even apparently really simple stuff like serial RS232 ports used to be done with plug-in cards, i.e. not even the chipset(s) or anything on the motherboard would handle it, and it is difficult to think of something simpler than an RS232 port, ALL 1 bit of it!

Presumably one day all the DRAM will be built into the CPU chip.
Eventually followed by the entire PC being put into a single chip, probably including the SSD as well (I guess, since really it is just more chips).

Someone famous (possibly the top man at IBM) said something like: there will probably never be a need for more than 5 or 10 computers in the world.
Christmas/birthday cards have been available with microchips that play tunes for ages, and I refuse to guess how many microcontrollers there are in a mid-range road car.

Back on topic.
I'm amazed; I did not realize that the transistors on an Intel CPU/iGPU and/or AMD APU might be different depending on whether they are part of the CPU (very high frequency, powerful current drive to minimise capacitive delays, fastest rise/fall times) or part of the GPU (optimized for very high density, with much less current drive capability, i.e. higher on-resistance, smaller, less silicon, but fine at the lower frequencies of the GPU unit). (I'm speculating about the specific transistor variations, as I don't know what really happens.)
 

witeken

Diamond Member
Dec 25, 2013
Back on topic.
I'm amazed; I did not realize that the transistors on an Intel CPU/iGPU and/or AMD APU might be different depending on whether they are part of the CPU (very high frequency, powerful current drive to minimise capacitive delays, fastest rise/fall times) or part of the GPU (optimized for very high density, with much less current drive capability, i.e. higher on-resistance, smaller, less silicon, but fine at the lower frequencies of the GPU unit). (I'm speculating about the specific transistor variations, as I don't know what really happens.)

Is that true? AMD had to lower the CPU clock speeds of Kaveri because it was made on a high-density process optimized for GPUs.
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
Is that true? AMD had to lower the CPU clock speeds of Kaveri because it was made on a high-density process optimized for GPUs.

My understanding (but I will happily stand corrected by any IC process engineers, or anyone else really, as I am not 100% sure) is that part of it is indeed the process, which is the same for all of the IC's transistors (unless usually impracticably expensive techniques are used, which I think means adding extra process steps to allow the chemical composition to vary between the CPU and GPU areas).

But there is still wiggle room for the masks/layers to differ between the CPU and GPU transistors, e.g. the shape/profile of each transistor.
To a lesser extent that CAN change a transistor's characteristics, but the overriding factor is still the overall process.

N.B. I am NOT an IC process engineer, so will happily stand corrected.
 

sefsefsefsef

Senior member
Jun 21, 2007
I don't think there's anything mysterious going on here. All individual transistors with the same dimensions and on the same process, whether part of a CPU or GPU, are able to switch at the same frequency. However, switching individual transistors isn't what determines clockspeed. Clock speed is determined by how quickly a signal can propagate through the longest contiguous series of gates in the circuit in a single clock cycle.

This is where pipelining comes in. Pipelining reduces the number of gates a signal has to go through on each cycle (by allowing for operations to take multiple cycles, often using latches to hold intermediate values between clock cycles). This lets each cycle be a shorter amount of time, and therefore increases maximum clock speed.

Pipelining takes die area to pull off, and burns extra power compared to no-pipelining, and it therefore makes sense to use it on 4 CPU cores, but not on 2048 GPU cores. Hence, CPUs have higher clockspeeds than GPUs.
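A toy model of that point (the delays are assumed, not real Haswell numbers): the clock period has to cover the longest gate chain between latches, so splitting the work into more stages raises the maximum clock, at the cost of extra latches and power.

```c
#include <stdio.h>

int main(void)
{
    const double logic_depth_ns    = 10.0; /* assumed unpipelined critical path */
    const double latch_overhead_ns = 0.5;  /* assumed per-stage latch delay     */

    for (int stages = 1; stages <= 5; stages++) {
        /* each stage carries 1/stages of the logic plus one latch delay */
        double period_ns = logic_depth_ns / stages + latch_overhead_ns;
        printf("%d stage(s): %.2f ns period -> %.2f GHz max clock\n",
               stages, period_ns, 1.0 / period_ns);
    }
    return 0;
}
```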
 

SOFTengCOMPelec

Platinum Member
May 9, 2013
I don't think there's anything mysterious going on here. All individual transistors with the same dimensions and on the same process, whether part of a CPU or GPU, are able to switch at the same frequency. However, switching individual transistors isn't what determines clockspeed. Clock speed is determined by how quickly a signal can propagate through the longest contiguous series of gates in the circuit in a single clock cycle.

This is where pipelining comes in. Pipelining reduces the number of gates a signal has to go through on each cycle (by allowing for operations to take multiple cycles, often using latches to hold intermediate values between clock cycles). This lets each cycle be a shorter amount of time, and therefore increases maximum clock speed.

Pipelining takes die area to pull off, and burns extra power compared to no-pipelining, and it therefore makes sense to use it on 4 CPU cores, but not on 2048 GPU cores. Hence, CPUs have higher clockspeeds than GPUs.

Thanks.
I see.
So lots of things (like deep, speed-optimized pipelines) have been heavily stripped down or even removed entirely, so that as many GPU processors as possible can be fitted into the available room. Hence the much lower maximum clock speed.

After making this thread, reading people's replies and doing some internet reading, it dawned on me that I was being a bit silly.
Take the floating point units (obviously present in both the CPU and GPU these days) as an example.

The CPU's floating point units have huge amounts of "extra" logic so that they can do a multiply in very few cycles, plus extensive divide speed-up logic, but this (presumably) uses up a lot of space on the silicon.
So in order for the GPU to have a huge number of (individually much slower) floating point units, they have to (I assume) do without much of that speed-up logic, such as the super-fast multiply/divide hardware.
Hence their lower clock frequencies and/or performance on a per-core (CPU or GPU) basis.

My understanding of graphics processors is that double precision (64 bit) floating point units are present at a ratio of only about 1 in 4 (for expensive high-end video cards) and only 1 in 16 for the rest (and some have no double precision hardware at all, emulating it in software instead, I think).
i.e. most of the floating point units on GPUs are single precision ONLY.

This shows the extent to which iGPU/GPU processors are VERY stripped-down cores, so that a huge number can fit on a silicon chip.
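To put some purely hypothetical numbers on that double precision ratio (assuming one operation per unit per cycle):

```c
#include <stdio.h>

int main(void)
{
    const int    sp_units  = 2048; /* hypothetical single precision units */
    const double clock_ghz = 1.0;

    printf("SP peak:        %.0f Gflop/s\n", sp_units * clock_ghz);
    printf("DP peak (1:4):  %.0f Gflop/s\n", (sp_units / 4)  * clock_ghz);
    printf("DP peak (1:16): %.0f Gflop/s\n", (sp_units / 16) * clock_ghz);
    return 0;
}
```

So even before clock speed comes into it, the DP ratio alone caps double precision throughput at a small fraction of the single precision figure.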
 

Roland00Address

Platinum Member
Dec 17, 2008
Is that true? AMD had to lower the CPU clock speeds of Kaveri because it was made on a high-density process optimized for GPUs.

True to a degree.

32nm for AMD/GF was SOI, which works better for high clock speeds than a bulk process; 28nm for AMD/GF is a bulk process.

Second, there is some trade-off between density and higher clock speeds; to some extent the 28nm GF process focuses on density, as they are moving to a foundry-type business.