Could a 12.8ghz cpu equal a quad core 3.2ghz?

Rubycon · Mar 5, 2010

Tsavo said:
Wow, that'd be 144 GHz!!

Yes hence the gross reference.

classy · Mar 5, 2010

I'll take the quad for many reasons. A multi core chip is far more efficient and make better use of resources than a single core. Software is becoming more multi threaded to boot. I can execute several different programs, with all of them operating smoothly with a multicore chip. A single core no matter how fast, will always be single tasked.

Jovec · Mar 5, 2010

This is easily testable, barring any unknowns that may arise from extreme clockspeeds.

iCyborg · Mar 6, 2010

aigomorla said:
Put your i7 @ 133x20 vs 186x15 and run a few benchmarks.
You'll see first hand the 186 bclk is faster.

133x20 = 2.66GHz
186x15 = 2.79GHz
You suck at math

IntelUser2000 · Mar 6, 2010

It's striking how accurate Intel's predictions(or projections if you'd like it that way) that's shown back 5 years ago was: http://anandtech.com/showdoc.aspx?i=2368&p=2

(3rd pic)

Looks like they see ~10 large monolithic die CPUs as the limit before we go fully into heterogenous computing. After that we should see 4 large core CPUs with many small core attached.

Sounds familiar?

The rumor is that Haswell will integrate Larrabee, and being full x86, will bring forth a true vision they sought back in 2005.

Why heterogenous? As you will see with Gulftown, trying to scale to multi cores will become a REAL problem. Aside from performance scaling(or lack there of), there will be problems that arise from it. Few people with multi core processors already report problems in games, be it stutter, or crashing. Prepare to see worse things happening.

Quad cores for everyone? Fine.

How about 4 x 12.8GHz cores vs 16 x 3.2GHz cores as a comparison instead?

Ben90 · Mar 6, 2010

I really don't want to actually benchmark this but if someone blindly protests this, I will.

For single socket systems, increasing QPI is near pointless. Lets refer to a diagram:

So if you completely 100% max out your northbridge you will gain a ~4.1% increase in performance by overclocking it to allow another 800MB/s through. How many setups are out there that utilize all 16x16x4 PCI-e lanes off the northbridge? Pretty close to zero in the consumer space. I do not remember the formula for performance increase from the latency reduction, but it won't be much.

DrMrLordX · Mar 6, 2010

aigomorla said:
And once again a 12ghz cpu will dominate everything and everything, as long as your cache speed raitos and ram timings were in sync.

Ah, there's the rub, isn't it?

Ramping up CPU speed without increasing cache size or reducing memory latency leads to poor CPU scaling with clock speed. LN2-cooled Clarksdales run into that problem which is why the Clarksdale-floggers are all fighting to increase BCLK to improve QPI and memory speeds.

It's easy to achieve an acceptable level of memory latency when you have multiple cores that are running at clockspeeds that we've been seeing for years. It can be a bit taxing on the memory controller when it has to field requests from four or more cores all attempting to utilize the same banks of memory, but both Intel and AMD have figured out how to do it. You can easily get 100-150 cycle memory latency on LGA1366, LGA1156, or AM3.

Good luck doing that on a 12.8 ghz CPU resembling any modern x86 uarch while using dual or triple-channel DDR3. Ain't gonna happen!

The amount of L2 you would need on a chip like that would be absurd . . . somewhere along the lines of 48-60 megs, or more if you could get away with it.

The reason why we went multi cores and not faster speed was because 12ghz silicon as someone stated is close to impossible.
The heat output alone would tear apart a LN2 POT.

It could probably be done, but it would be very silly.

But if we ignore the heat, and do just raw numbers... 12ghz clocked machine is not something to laugh at.
Even @ 6ghz+ territory we see insane numbers being pulled in.

I would love to see a single-core Penryn-alike at 12 ghz with a huge bank of L2.

Ben90 · Mar 6, 2010

I think everyone here can agree that any current processor (over)clocked to 12.8ghz on one core would suck hard compared to a 3.2ghz quad using raw throughput as the metric. I took the OP meaning as slightly different. I was imaging comparing our current 3.2ghz quad cores to an imaginary processor that was designed to scale completely up to 12.8ghz.

It may be my foggy memory, but I remember discussions like these rampant when dual cores first started emerging. All the benchmarks at the time pointed to single cores being more efficient as duals created a lot of overhead. This is not to say that duals didn't have their advantages as they could handle program hangs and such a lot more smoothly, however they could not keep up with single cores in regards to pure throughput when clocked at half speed. This is where I realize my tests earlier in this thread were flawed. The dual core setup was running the same cache/memory speeds as the single core. This gave an edge to the dual core setup that would not exist in our theoretical situation.

Therefore I will not retest anymore configurations (unless I really need to prove that a faster QPI doesn't do anything realworld, its the Uncore/RAM that increase the clock efficiency) as there have been hundreds of similar tests years ago and they all came out with the same conclusion:

In realworld usage single cores have more throughput than a half clocked dual setup due to the overhead of managing multiple cores. This assumes that everything else stays proportional. Since our scenario already has a 12.8ghz core, I think its safe to assume they could push the uncore to the same ratio. Something we can all agree on however is no current processor will reach 12.8ghz and the guy on Ebay is attempting to make his product sound more impressive using simple marketing that is simply not true.

Voo · Mar 6, 2010

DrMrLordX said:
Good luck doing that on a 12.8 ghz CPU resembling any modern x86 uarch while using dual or triple-channel DDR3. Ain't gonna happen!

The amount of L2 you would need on a chip like that would be absurd . . . somewhere along the lines of 48-60 megs, or more if you could get away with it.

That doesn't make any sense. After all you still have to get the exact same amount of information, independent of you're using a multicore or not. So if both of them run the programm equally fast they would both need the same amount of data from the RAM at the exact same time.

And the L1/L2 caches of multicores are included in the L3 cache so that's wasted cache that you wouldn't need for the single core. So I really can't understand your point, because you could quadruple the L1/2 caches for the single core compared to one core of the quad. And you wouldn't have to include the smaller caches in the L3 cache which would save some logic and space.

The best argument for the multicore so far was the overhead for context switching and I'd agree with that.

Other than that it's impossible to do that, so we can't test it at all. At least as long as Intel doesn't think it'd be fun to optimize a platform for single core performance - well they tried that with netburst and we see how that ended. Why the possible comparisions (disable X cores on a quad and over/underclock) are useless shouldn't be that hard to understand.

DrMrLordX · Mar 6, 2010

Voo said:
That doesn't make any sense. After all you still have to get the exact same amount of information, independent of you're using a multicore or not. So if both of them run the programm equally fast they would both need the same amount of data from the RAM at the exact same time.

The whole point to cache is to keep as much of the working set in cache as possible under the assumption that cache will provide better latency and bandwidth than the system memory. The larger the cache, the less often the processor has to hit system memory . . . so if the chip scales upward in clockspeed without any increase in memory speed or improvement in memory timings, then larger and larger caches are needed to compensate. That was my point.

Yes, you do have to fill the cache, and yes, this is done by accessing system memory, but provided you don't have to flush the cache too often, larger cache should equal fewer system memory accesses overall.

I do suppose that there would be a point beyond which the speed of the core could be so much greater than that of the system memory that prefetch operations would stall frequently, calling into question whether or not cache would be a solution to much of anything (or simply necessitating the need for even greater amounts of cache). At that point you might be able to say "to hell with it" and go with eDRAM cache ala Power7 and try to run your system without any system memory at all (I do not mean to imply that Power7 attempts this feat). But that would be even sillier than trying to run your processor at 12.8 ghz.

And the L1/L2 caches of multicores are included in the L3 cache so that's wasted cache that you wouldn't need for the single core.

That assumes that the fictional 12.8 ghz single-core CPU even had L3 to begin with. The last time Intel promised clockspeeds anywhere near 12.8 ghz, it was with Netburst, and the desktop Netburst chips generally did not have L3. However, as clockspeeds kept going up, so did L2.

So I really can't understand your point, because you could quadruple the L1/2 caches for the single core compared to one core of the quad.

If you have the die space available for it, sure. But that was pretty much what I was saying you'd have to do anyway. Maybe I was overestimating how much L2 you'd really need? If you look at something like a Wolfdale, you get 3 mb of L2 per core, so if you were to quadruple the clockspeed of one high-end Wolfdale core (E8400-E8700), presumably you'd at least want to quadruple the L2. So, 12mb, rather than 48.

Still, cache in that quantity won't solve all your problems with every app. Anything that was sufficiently memory-intensive would cause all kinds of stalls assuming you were running something like . . . DDR2-800, so you'd want more cache if possible.

Voo · Mar 6, 2010

The only reason to distinguish between L2/L3 cache is because of multicore processors. Though it'd probably make sense to keep the L3 cache, but just in the sense of larger and slower and not with the special cases you need at the moment.
Also the reason for more caches has more to do with smaller processes - we can pack xMB of SRAM into a feasible diesize, so why not do it? More cache is always better.

And I'd say it's logical to say that memory access times would be the same for the multicore and singlecore CPU. If they both run the programm equally fast, they'd both have to stall the CPU comparably long for memory access, I don't see any reason why that'd be any different.

Also if you reserve x% space per core for L2 cache and you have 4 cores, than it's not surprisingly that your single core would get four times as much L2 cache in the same area (in theory, in practice larger caches are slower so you'd probably want to make the cache hiearchy a bit longer)

PlasmaBomb · Mar 6, 2010

IntelUser2000 said:
Either he owns a liquid nitrogen processing plant or he loves cake.

Did you just call Ruby a he?

DrMrLordX · Mar 6, 2010

Voo said:
Also the reason for more caches has more to do with smaller processes - we can pack xMB of SRAM into a feasible diesize, so why not do it? More cache is always better.

Intel seems to have taken this philosophy to heart up through their LGA775 processors (and in all fairness, they're still using a lot of cache on their LGA1156/1366 procs, just in the form of L3). AMD has been farting around with 512k L2 on many of their processors for years (at least now they have L3 to compensate). Of course Intel is ahead on process tech so go figure.

And I'd say it's logical to say that memory access times would be the same for the multicore and singlecore CPU. If they both run the programm equally fast, they'd both have to stall the CPU comparably long for memory access, I don't see any reason why that'd be any different.

An interesting point. You might have 1/4 the rated system memory latency on the quad running at 1/4 the clockspeed of the single-core CPU, but even if you get 1/4 the stalls per core, you've got four cores stalling . . . it would be interesting to see how prefetch would work on the quad in this scenario as compared to the faster single-core CPU.

Nevertheless, Ben90 seems to have intended for us to think of the fictional 12.8 ghz CPU as having scaled perfectly from 3.2 ghz to 12.8 ghz, so that implies that it would have enough cache and/or sufficiently fast system RAM to compensate.

dalauder · Mar 6, 2010

So are we only discussing a theoretical processor that comes out tomorrow at 12.8GHz? Because if we want to consider when clock rates like that will be possible, probably using carbon instead of silicon, then arguments like how Netburst processors lacked a L3 cache don't really mean much because all the top processors have L3 caches now.

I've heard Phenom II x2's having one benefit compared to x4's that they split the cache less ways. But I guess x2's run at the same clock as x4's, not double or quadruple like the 12.8GHz one being suggested. Still, if you gave a quad a 6MB L3 cache and 4x faster single core a 24MB cache, which isn't inconcievable, the single core would win every benchmark if you don't multitask.

But the best system, if we are limited by a magic 12.8 GHz would be dual core. Gimme a 6.4GHz x2 w/ 12MB L3 over anything I've heard here. 2 cores = minimal hangs. Also, I want it to be made by AMD and come with coupons to Jack in the Box, cuz they serve breakfast any time of the day.

dalauder · Mar 6, 2010

DrMrLordX said:
assuming you were running something like . . . DDR2-800, so you'd want more cache if possible.

Yeah, if I'm using a 12.8GHz CPU, I'm running theoretical DDR3-2000 at like 5-5-5 timings and having aigomorla watercool my RAM and everything else.

dalauder · Mar 6, 2010

Ben90 said:
I really don't want to actually benchmark this but if someone blindly protests this, I will.

For single socket systems, increasing QPI is near pointless. Lets refer to a diagram:

So if you completely 100% max out your northbridge you will gain a ~4.1% increase in performance by overclocking it to allow another 800MB/s through. How many setups are out there that utilize all 16x16x4 PCI-e lanes off the northbridge? Pretty close to zero in the consumer space. I do not remember the formula for performance increase from the latency reduction, but it won't be much.

Wouldn't PCIe 3.0 double that max bandwidth to 36GB/s? Then what if this processor could handle quadruple channel ram at 8.5GB/s x 4 = 34GB/s? Then QPI's the new bottleneck, right?

When's everyone else gonna wake up? I'm waiting for a reply.

Ben90 · Mar 6, 2010

Nevertheless, Ben90 seems to have intended for us to think of the fictional 12.8 ghz CPU as having scaled perfectly from 3.2 ghz to 12.8 ghz, so that implies that it would have enough cache and/or sufficiently fast system RAM to compensate.

Yea I was debating about mentioning RAM in there, but I took the easy way out and decided RAM would scale with the CPU. As clock speeds go up, the latency of RAM becomes more and more of a factor as there are more cache misses, therefore more RAM access. I actually do not know how well a single core would handle hundreds of cycles waiting for the RAM to respond as I'm no engineer, but unless its OOO worked absolutely perfectly I guess we can give another +1 to multicore.

Wouldn't PCIe 3.0 double that max bandwidth to 36GB/s? Then what if this processor could handle quadruple channel ram at 8.5GB/s x 4 = 34GB/s? Then QPI's the new bottleneck, right?

Then we would be getting a new chipset, and most likely a QPI v2.0. As far as I know, the IMC does not go through QPI.

I do have a question for you guys though. When disabling cores on an i7, it still keeps the whole 8mb of l3$ right? And if so does it still leave the area for the unused processors l2$ used or does the single core get the full 8MB to itself?

IntelUser2000 · Mar 6, 2010

PlasmaBomb said:
Did you just call Ruby a he?

Yes. I think enough people pointed out that Ruby isn't a he though.

Wouldn't PCIe 3.0 double that max bandwidth to 36GB/s? Then what if this processor could handle quadruple channel ram at 8.5GB/s x 4 = 34GB/s? Then QPI's the new bottleneck, right?

I haven't seen a benchmark that showed GPUs taking advantage of PCI Express 2.0, it would be even more true with 3.0. Video cards have so much memory that's also super fast to care much about PCI Express link speeds. Remember we are also theorizing that there would be greater than 25GB/s communication happening between CPU and GPU. I don't think that happens.

Besides, the PCI Express controller will be right on the X58 chipset, and won't have to go through QPI. If the PCI Express controller was on the CPU and had to go through QPI to get to the video card I could understand, but its not.

I do have a question for you guys though. When disabling cores on an i7, it still keeps the whole 8mb of l3$ right? And if so does it still leave the area for the unused processors l2$ used or does the single core get the full 8MB to itself?

If it was designed properly it would act just like an single core processor with L3 cache.

2x clocked with x cores vs 1x clocked with 2x cores comparison:

2x clocked with x cores scaling limiters:
-Memory latency. 100 cycles at 3.2GHz equals 200 cycles at 6.4GHz
-Memory bandwidth
-Large caches that clock near/identical to core clock speeds to counteract having higher memory latency. It's likely you still need L3 cache

1x clocked with 2x cores scaling limiters:
-Memory bandwidth. Latency won't be much of a problem as bandwidth since relative latency will be lower
-Still need large caches to counteract having 2x more cores but with similar bandwidth.
-Programming limitations. You'd want a well threaded program to perform well. There's no such thing on a 2x clocked core with 1x cores
-Cache coherency. The dedicated caches per core and the last level cache would need to be synchronized well. Again, no problem with 2x clocked core with 1x cores

The intercore communication, cache coherency, and programming limitations will hamper a 1x clocked, 2x core, while retaining all the problems of scaling on a 2x clocked, 1x core CPU.

Personally, I'd think the ideal amount of cores lie around 4. A single core would always run full load with single app, so it'd sacrifice responsiveness. Dual cores will mitigate that a lot. Quad yet might help further still. Beyond that is completely overkill.

DrMrLordX · Mar 6, 2010

IntelUser2000 said:
Beyond that is completely overkill.

Once Thuban and Gulftown are in the wild and running in people's systems, we'll know whether or not that is true (for desktop workloads).

Could a 12.8ghz cpu equal a quad core 3.2ghz?

Madame President

Lifer

Senior member

Golden Member

Elite Member

Platinum Member

Lifer

Platinum Member

Golden Member

Lifer

Golden Member

Lifer

Lifer

Junior Member

Junior Member

Junior Member

Platinum Member

Elite Member

Lifer