Why are more slower cores better than one fast core?

Page 2 - AnandTech Forums

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I don't quite follow the part in bold. Processing in parallel always results in more work needing to be done, specifically the overhead associated with creating and feeding data to the parallel threads as well as the aggregation of thread results and finalization of the parallelized task's output.

[Image: Amdahl's Law augmented by Almasi and Gottlieb. Roughly: Speedup(N) = 1 / (s + (1 - s)/N + o(N)), where s is the serial fraction and o(N) is the communication/overhead term.]


At best the overhead (signified by Almasi and Gottlieb's interprocessor-communication term in the equation above) is negligible and the compute time asymptotically approaches that of Amdahl's Law... and only then can you start to speak of the code being so perfectly parallelized that Amdahl's Law asymptotically approaches that of flat-out serial code on a commensurately faster processor.
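To make the shape of that argument concrete, here's a toy calculation of Amdahl-style speedup with an added overhead term; the function name and the linear per-core overhead model are illustrative assumptions, not Almasi and Gottlieb's exact formulation:

```python
def speedup(n_cores, serial_frac, overhead_per_core=0.0):
    """Amdahl's Law plus a simple overhead term.

    serial_frac: fraction of the job that cannot be parallelized.
    overhead_per_core: extra work added per core (thread creation,
    feeding data, aggregating results), as a fraction of the serial
    runtime. The linear model is an illustrative assumption.
    """
    parallel_frac = 1.0 - serial_frac
    relative_time = serial_frac + parallel_frac / n_cores + overhead_per_core * n_cores
    return 1.0 / relative_time

# With zero overhead this is plain Amdahl's Law:
print(speedup(4, 0.10))             # ~3.08x for a 90%-parallel job
# A small per-core overhead caps, and eventually reverses, the gains:
print(speedup(16, 0.10, 0.005))     # ~4.2x
print(speedup(64, 0.10, 0.005))     # ~2.3x: more cores, less speedup
```

Only when the overhead term is negligible does the curve approach the pure Amdahl limit of 1/serial_frac (10x here).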

I think there's some legitimacy to the claim, in that in the same way a lot of tasks are inherently serial (or have a dominating serial component) some tasks have inherent parallelism and are hard to write as a single task. Multithreading (within a program, as opposed to multiprocessing) was used on processors long before there were multiple hardware cores or threads to benefit.

To take a really low-level example - say you have a system that continually reads in data from some sensor, processes it (filtering, etc.) and sends it out via some port. Let's also say there's no ability to handle this stuff in hardware using something like DMA. To handle all this happening at the same time, you either need threads/interrupts to help you switch context behind the program's back, or the program has to have state machines that multiplex all this stuff into timeslices. Point is, you're going to absorb some overhead from handling a bunch of stuff at the same time regardless of whether you have multiple processors or not.
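A minimal sketch of that single-core alternative, with the three concurrent jobs hand-multiplexed through one loop; the sample data and the toy "filter" are invented for illustration:

```python
import collections

def sensor_samples():
    """Stand-in for polling a hardware sensor register (hypothetical data)."""
    yield from [3, 1, 4, 1, 5]

def run_pipeline():
    # Three logical tasks (read, filter, transmit) multiplexed by hand
    # into timeslices on one core, since there is no DMA and no second
    # core to run them concurrently.
    reader = sensor_samples()
    to_filter = collections.deque()
    to_send = collections.deque()
    sent = []
    reading = True
    while reading or to_filter or to_send:
        if reading:                      # timeslice 1: poll the sensor
            try:
                to_filter.append(next(reader))
            except StopIteration:
                reading = False
        if to_filter:                    # timeslice 2: run the (toy) filter
            to_send.append(to_filter.popleft() * 10)
        if to_send:                      # timeslice 3: push to the output port
            sent.append(to_send.popleft())
    return sent

print(run_pipeline())  # [30, 10, 40, 10, 50]
```

The bookkeeping (queues, flags, the multiplexing loop itself) is exactly the overhead you pay for juggling concurrent work on one core, interrupts or not.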

Having multiple processors may involve more overhead, but it could actually involve less instead. Because the code can be running on all cores simultaneously there could be no switching overhead on individual cores, if you lock each thread to a different core.

There are actually some processors without interrupts where you have to leverage multiple cores to make it happen. They're more on the toy side, like the Parallax Propeller, but you could argue that leaving interrupts out of the design makes for a more economical processor. I wouldn't want to program on one of those, though.

I'd argue that generally, unless you're on something very slow or where realtime/fast response time is very critical, the performance benefit from multicore isn't that big of a deal. I can't think of any cases where it'd offer a huge advantage like SunnyD was saying, but there could be a useful application I'm ignoring. One place where it is beneficial is if you have asymmetric cores, or if your cores are asleep a lot of the time and it takes less power and wakeup latency to have one core soak up all the background stuff vs pushing an active core harder. But that's more power-oriented than performance.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Would a 12ghz single core necessarily multitask better than a 3ghz quad core? Or even equally as well?

In effect it would multitask equally as well as the quad with software that can use a quad (the single core will never actually multitask, it will just go so fast it seems like it, but that's beside the point). Overall the 12GHz single core is better, though, because it can run single-threaded stuff much faster than the quad would be capable of.

Maybe I have a different concept of what "multi-tasking" is meant to capture in terms of the user's experience but when I think of multi-tasking it takes me back to the archaic days of Windows 3.11 and being able to set your timeslice increment for processor thread switching when running multiple applications simultaneously.

And in those days, as is true now, a single-core CPU at 12GHz would suck balls, metaphorically speaking, compared to a quad-core at 3GHz when it came to system usability and multi-tasking.

Take a single 12GHz core and try running something processor-demanding, like transcoding a movie in the background, while attempting to prepare a PowerPoint presentation or work with a spreadsheet in Excel. It would push you to the edge of insanity.

Do that on a dual-core though where the foreground activity isn't a halting stuttering mess of user frustration and you'll be much happier with how well the dual-core multi-tasks compared to the single core.

Multi-core CPUs are to multi-tasking what SSDs were to spindle-based drives and 4K random IOPs. Leagues apart.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I think there's some legitimacy to the claim, in that in the same way a lot of tasks are inherently serial (or have a dominating serial component) some tasks have inherent parallelism and are hard to write as a single task. Multithreading (within a program, as opposed to multiprocessing) was used on processors long before there were multiple hardware cores or threads to benefit.

To take a really low-level example - say you have a system that continually reads in data from some sensor, processes it (filtering, etc.) and sends it out via some port. Let's also say there's no ability to handle this stuff in hardware using something like DMA. To handle all this happening at the same time, you either need threads/interrupts to help you switch context behind the program's back, or the program has to have state machines that multiplex all this stuff into timeslices. Point is, you're going to absorb some overhead from handling a bunch of stuff at the same time regardless of whether you have multiple processors or not.

Having multiple processors may involve more overhead, but it could actually involve less instead. Because the code can be running on all cores simultaneously there could be no switching overhead on individual cores, if you lock each thread to a different core.

There are actually some processors without interrupts where you have to leverage multiple cores to make it happen. They're more on the toy side, like the Parallax Propeller, but you could argue that leaving interrupts out of the design makes for a more economical processor. I wouldn't want to program on one of those, though.

I'd argue that generally, unless you're on something very slow or where realtime/fast response time is very critical, the performance benefit from multicore isn't that big of a deal. I can't think of any cases where it'd offer a huge advantage like SunnyD was saying, but there could be a useful application I'm ignoring. One place where it is beneficial is if you have asymmetric cores, or if your cores are asleep a lot of the time and it takes less power and wakeup latency to have one core soak up all the background stuff vs pushing an active core harder. But that's more power-oriented than performance.

Oh yeah, completely agree there. I was thinking we were talking about just a single unified program, say povray, that was parsing to threads.

You are describing something more akin to an entire operating system, which will have hundreds if not thousands of low-utilization background threads; there, latency trumps absolute processor utilization, such that more cores will always get more work done than fewer, faster cores.

I have to admit I am only superficially in tune with the entire breadth of this specific thread. I skimmed it and read what I thought was an oddity... but the oddity has been explained, so I am thoroughly satisfied with accepting the merits of the original claim :thumbsup:

edit: that's ironic, I posted my 6pm post above without knowing that your 5:55pm post existed, but we both managed to touch on timeslices in our respective posts :p
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I don't quite follow the part in bold.

I'm not sure this is what UaVaj has in mind when talking about lower latency, but one thing is for sure: if the memory subsystem is held constant, the processor with 4x the frequency will endure 4x more stall cycles at each last-level cache miss. Much as increasing the core count typically provides sublinear speedups, increasing the frequency provides sublinear speedup, so well-threaded code may indeed be faster on the 4-core at 1/4 the frequency.

not to mention the latency-hiding effect of SMT, where a 4-core with 8 threads can enjoy nearly a 5x speedup with well-balanced threading (actual Core i7 3770K measurements)
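The fixed-DRAM-latency point can be put into rough numbers. All figures below (instruction count, base IPC, miss count, 70 ns latency) are illustrative assumptions, not measurements:

```python
def runtime_s(instructions, ipc, freq_ghz, llc_misses, mem_latency_ns):
    # Stall cycles grow linearly with frequency, because DRAM latency is
    # fixed in nanoseconds: at freq_ghz GHz, one nanosecond is freq_ghz cycles.
    compute_cycles = instructions / ipc
    stall_cycles = llc_misses * mem_latency_ns * freq_ghz
    return (compute_cycles + stall_cycles) / (freq_ghz * 1e9)

# 1e9 instructions, base IPC of 2, 10M last-level-cache misses, 70 ns DRAM:
slow = runtime_s(1e9, 2.0, 3.0, 1e7, 70.0)    # ~0.87 s at 3 GHz
fast = runtime_s(1e9, 2.0, 12.0, 1e7, 70.0)   # ~0.74 s at 12 GHz
print(slow / fast)                            # ~1.17x: 4x the clock, ~17% faster
```

With these assumptions the memory stalls dominate, so quadrupling the frequency barely moves total runtime, which is exactly the sublinear scaling described above.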
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
Even if physics let you have a faster single-core CPU, you still wouldn't like it. Main memory doesn't care how fast your super-fast single-core CPU is; its latency will be the same (think about it in terms of nanoseconds). This will eventually be the bottleneck for every application when you try to scale up only the core clockspeed, so past that point a faster CPU gets you essentially no performance improvement. For this reason, using more slow cores is a nicer solution.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Unfortunately there are always going to be problems with multithreading. It's not magically going to get better. The tools are really primitive, the problems are really hard to debug, and the general difficulty of doing it can't be overstated. It's made worse by the fact that we know a large set of problems can never utilise many cores; their algorithms scale only with the log of the core count, or not at all.

Games are the things everyone wants to scale well today, but games are very difficult to multithread. The core of their algorithm for updating the world state is very complicated, everything is interacting with everything else, and this makes it very hard to get running well on multiple cores.

I don't anticipate this changing any time soon: the tools for multithreading are not getting any better, there are no breakthroughs in languages that really solve the problem well, and we are struggling with the complexity of our software as it is, while parallel algorithms are always more complicated and harder to debug. Quite often it takes 3 cores to get back the performance of a single core for a problem, so scaling to 4 cores is of marginal benefit anyway, and if you can't break the big stuff up then it's often of no benefit at all.

Multithreading of software is going to be a slow process that takes a long time, and it's unlikely in the near future (years) that we will see anything dramatically change on this front. It might, I could be wrong, we could have a breakthrough that I don't see coming, but I doubt it.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I'm not sure this is what UaVaj has in mind when talking about lower latency, but one thing is for sure: if the memory subsystem is held constant, the processor with 4x the frequency will endure 4x more stall cycles at each last-level cache miss. Much as increasing the core count typically provides sublinear speedups, increasing the frequency provides sublinear speedup, so well-threaded code may indeed be faster on the 4-core at 1/4 the frequency.

not to mention the latency-hiding effect of SMT, where a 4-core with 8 threads can enjoy nearly a 5x speedup with well-balanced threading (actual Core i7 3770K measurements)

Well that is certainly true but I thought he said (or maybe I just assumed) this was under identical IPC conditions.

The quad-core you are describing does not achieve the same IPC as the 4x-faster-clockspeed single core; the single core you describe will have lower IPC owing to the bottlenecks you describe.

Which is a truth in reality, but not a truth within the confines of the thought experiment where IPC is the same for both processor configurations.
 

WhoBeDaPlaya

Diamond Member
Sep 15, 2000
7,414
402
126
Could be, in odd scenarios, e.g. for my use case at work.
Swapped machines with a coworker - got a 2.4GHz quad-core Westmere Xeon in trade for a 3GHz dual-core Woodcrest Xeon with 4GB more RAM.

We usually run jobs with single licenses (each supports 2-3 logical cores), so he's happy that his jobs are running faster.

I need to run jobs with two different licenses simultaneously, so I'm happy too :)
 

parvadomus

Senior member
Dec 11, 2012
685
14
81
Take a single 12GHz core and try running something processor-demanding, like transcoding a movie in the background, while attempting to prepare a PowerPoint presentation or work with a spreadsheet in Excel. It would push you to the edge of insanity.

Do that on a dual-core though where the foreground activity isn't a halting stuttering mess of user frustration and you'll be much happier with how well the dual-core multi-tasks compared to the single core.

This is incorrect. OSes have supported multiprogramming for a long time; you wouldn't notice the difference between a 12GHz CPU and a 4GHz quad-core CPU running transcoding while doing some PowerPoint.

OSes use some type of process scheduling, like round-robin, giving each process a time slice to work in. The only way to maybe notice something is to run the transcoding work on 3 cores while the extra core is tied exclusively to the PowerPoint process (it may reduce user-input latency in PowerPoint by a few milliseconds :awe:, and the same thing can be done by giving priority to the PowerPoint process on one 12GHz CPU, or by running the transcoding work on all cores of the quad-core CPU).

The only advantages of multicore CPUs are that they avoid context switching between threads, and that they may make better use of the cache hierarchy (if you have a ton of processes fighting over only 1 CPU, they can easily evict useful data belonging to the other concurrent processes from the cache; these kinds of workloads kill locality of reference).
 

VirtualLarry

No Lifer
Aug 25, 2001
56,571
10,206
126
What about Memristors? I've read some info suggesting that they are like storage technology (like NAND), but capable of massive parallel computation.

What would things be like, if you could execute an opcode in parallel for every bit or byte of memory available? An MPM - Massively Parallel Machine.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Well that is certainly true but I thought he said (or maybe I just assumed) this was under identical IPC conditions.

The quad-core you are describing does not achieve the same IPC as the 4x-faster-clockspeed single core; the single core you describe will have lower IPC owing to the bottlenecks you describe.

Which is a truth in reality, but not a truth within the confines of the thought experiment where IPC is the same for both processor configurations.

indeed, UaVaj mentioned "assuming ipc is the same [...]", which I assumed to be constant at the core level (from his "lower latency" remark), not at the whole-system level

as you know, system memory bandwidth has increased way faster than latency has improved over the years; a single LLC miss can cost you 500+ instructions on a modern system, so this is IMO, besides the power argument, a very important point to raise in answering the OP's question "Why are more slower cores better than one fast core"
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
Should CPUs have stayed single core?

I don't see any difference, but then I'm a noobster so that's why I'm asking.


Like others have said... power consumption/heat, and the physical laws of the universe, are what's keeping us from running 1 CPU core at 12GHz instead of 4 cores @ 3GHz (4x3=12).

It's much, much, muuuuuuch more power-efficient to go multi-core.


Basically it all boils down to research.
IBM did a lot of it, for supercomputers, testing various ways of efficiently increasing performance.

Now everyone is copying them, or everyone reached the same conclusions.
 

cytg111

Lifer
Mar 17, 2008
25,652
15,155
136
Maybe I have a different concept of what "multi-tasking" is meant to capture in terms of the user's experience but when I think of multi-tasking it takes me back to the archaic days of Windows 3.11 and being able to set your timeslice increment for processor thread switching when running multiple applications simultaneously.

And in those days, as is true now, a single-core CPU at 12GHz would suck balls, metaphorically speaking, compared to a quad-core at 3GHz when it came to system usability and multi-tasking.

Take a single 12GHz core and try running something processor-demanding, like transcoding a movie in the background, while attempting to prepare a PowerPoint presentation or work with a spreadsheet in Excel. It would push you to the edge of insanity.

Do that on a dual-core though where the foreground activity isn't a halting stuttering mess of user frustration and you'll be much happier with how well the dual-core multi-tasks compared to the single core.

Multi-core CPUs are to multi-tasking what SSDs were to spindle-based drives and 4K random IOPs. Leagues apart.

OSes have evolved since the early days of Windows 3.0. What you are talking about is cooperative multitasking, which means that every god-damned process the OS is running needs to give up its timeslice for things to move on; if one does not... happy reboot. Today we enjoy preemptive multitasking, which means the OS gets the final say in the matter, and to be honest that works MUCH better in a more-than-one-core environment (because one realtime thread can sort of DoS (denial-of-service) the rest of the OS).
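The cooperative model can be sketched in a few lines; this toy scheduler (all names invented for illustration) only ever runs a task until it voluntarily yields, which is exactly why one greedy task could hang a cooperative system:

```python
def cooperative_run(tasks, max_steps=100):
    """Round-robin over generator tasks. Each next() runs a task until it
    voluntarily yields; the scheduler has no way to preempt a task that
    never yields, which is the failure mode of cooperative multitasking."""
    trace, pending, steps = [], list(tasks), 0
    while pending and steps < max_steps:
        task = pending.pop(0)
        try:
            trace.append(next(task))   # task runs until its next yield
            pending.append(task)       # well-behaved: goes to back of queue
        except StopIteration:
            pass                       # task finished, drop it
        steps += 1
    return trace

def worker(name, units):
    for i in range(units):
        yield f"{name}{i}"             # yields the CPU after each work unit

print(cooperative_run([worker("A", 2), worker("B", 2)]))  # ['A0', 'B0', 'A1', 'B1']
```

A preemptive OS replaces the voluntary yield with a timer interrupt, so the interleaving no longer depends on every process behaving.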
 

SlowSpyder

Lifer
Jan 12, 2005
17,305
1,002
126
I wonder which I'd want more, if it could be made. If CPU makers could focus on clockspeed and IPC increases like they used to in the single-core days, and magically get those types of increases, vs. continuing to add cores: what would you rather have, a next-gen 8-core i7 (16 HT threads) with say 40-50% more IPC at 6+GHz, or something like a next-gen 32-core i7 @ 2.6GHz (that's 64 HT threads!) with IPC similar to today's, assuming the same TDP and such? Both would be pretty cool. :D
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I wonder which I'd want more, if it could be made. If CPU makers could focus on clockspeed and IPC increases like they used to in the single-core days, and magically get those types of increases, vs. continuing to add cores: what would you rather have, a next-gen 8-core i7 (16 HT threads) with say 40-50% more IPC at 6+GHz, or something like a next-gen 32-core i7 @ 2.6GHz (that's 64 HT threads!) with IPC similar to today's, assuming the same TDP and such? Both would be pretty cool. :D

I'd rather have a 1GHz twelve-core with single-cycle-latency cache and a 1-stage pipeline than a single-core 12GHz microarchitecture with a 50-stage pipeline and 6-cycle-latency cache.

You give up a lot of IPC in going for clockspeed.

Lower clockspeed saves you crazy amounts of power, both dynamic and static.

Dynamic power will be lower because you can use less voltage to operate at the targeted clockspeeds, and static losses will be lower because of the lowered voltage and lower operating temperatures.
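That voltage/frequency trade-off follows from the classic CMOS dynamic-power relation P = a * C * V^2 * f; the capacitance and voltage figures below are purely illustrative assumptions:

```python
def dynamic_power_w(switched_cap_f, volts, freq_hz, activity=1.0):
    # P_dynamic = activity * C * V^2 * f (classic CMOS switching power)
    return activity * switched_cap_f * volts ** 2 * freq_hz

# Hypothetical numbers: one core pushed to 12 GHz needs a much higher
# voltage than four cores cruising at 3 GHz for the same total cycles/sec.
one_fast = dynamic_power_w(1e-9, 1.4, 12e9)       # ~23.5 W
four_slow = 4 * dynamic_power_w(1e-9, 0.9, 3e9)   # ~9.7 W
print(one_fast / four_slow)                       # the V^2 term dominates
```

Same aggregate clock throughput, but the quad at lower voltage burns a fraction of the power, which is the whole point of trading one fast core for several slow ones.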
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Seeing as how we are on 4 cores today and there isn't much to move us past it, I think it's fair to say even Intel has realised that more cores don't bring any more performance to the average user. 1 core to 2 cores was a big deal in Windows, because the jumping of the mouse on hard-drive access disappeared; overall the experience was smoother. From 2 to 4 cores I think most people noticed a lot less benefit. There are still relatively few programs that utilise more than 2 cores well (although plenty of games show performance improvements, their utilisation is quite low). Scaling out with more cores isn't really a good way to improve performance today, which is why we are getting these small IPC improvements instead.

I don't know where we go next for performance: clock speed seems to be done, algorithmic improvements in FP and ALU and such seem to have topped out, the benefits of deeper pipelines have been done to death, and multiple cores have topped out at just 4 on the desktop. Not many more options for performance improvements remaining.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Not many more options for performance improvements remaining.

If they continue to get power consumption down, eventually they will have plenty of power budget to start doing silly, inefficient stuff like speculative processing and so forth.
 

desura

Diamond Member
Mar 22, 2013
4,627
129
101
I've always wondered why they don't just devote the entire chip space to one core and, if need be, space it out so that heat dissipates more easily. Instead, with each process shrink they seem determined to reduce the size of the die.

But anyway, a 12GHz core would melt pretty fast. That's higher than even liquid-nitrogen-cooled setups are capable of.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,042
3,522
126
it really depends on how the program is written

We had a very large single-core vs dual-core war on this forum when the AMD X2s first came out.

Then we had an insane war over C2D vs C2Q.

What we learned from those wars is that it's all at the mercy of the programmers.

If you think of a quad-core as a truck and a dual-core as a sedan: if the road isn't made for a truck, the sedan can get from point A to B faster.
If the road was made for a truck, the truck can carry more cargo from point A to B faster.

The road being how the programmer writes the program.


Now if you want me to throw a wrench in the works... let's talk about Hyper-Threading. :p
Because those aren't real cores, they just act like real cores.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I've always wondered why they don't just devote the entire chip space to one core and, if need be, space it out so that heat dissipates more easily. Instead, with each process shrink they seem determined to reduce the size of the die.

Money. It raises the production cost considerably.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
You give up a lot of IPC in going for clockspeed.

This is true only because of the main-memory latency issue. Unless the application is entirely cache-resident, IPC must go down as clockspeed goes up.

It is not, however, a fundamental truth of CPU design. If you can have an on-die cache that is large enough for your workload, you can have both high IPC and extremely high clockspeeds; absolute performance will scale linearly as clockspeed increases. Furthermore, from an architectural perspective, you generally don't have to give up features which improve IPC in order to make room for features which allow for a higher clockspeed ceiling; you can have both.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
This is true only because of the main-memory latency issue. Unless the application is entirely cache-resident, IPC must go down as clockspeed goes up.

It is not, however, a fundamental truth of CPU design. If you can have an on-die cache that is large enough for your workload, you can have both high IPC and extremely high clockspeeds; absolute performance will scale linearly as clockspeed increases. Furthermore, from an architectural perspective, you generally don't have to give up features which improve IPC in order to make room for features which allow for a higher clockspeed ceiling; you can have both.

Internal cache latency, pipeline length, and the latency of various operations (all in terms of clock cycles) all go up with clock speed too. There are physical limits to how much time it takes to perform those things, just like there are with accessing an external DRAM. For example, no one has single-cycle 32KB 8-way set-associative L1 caches at 4GHz. L2 cache latency in cycles is impacted as well. No one has single-cycle FMADDs at 4GHz. Fetch and decode take several pipeline stages, with several more dedicated to scheduling, and sometimes even multiple stages dedicated to the execution of simple integer operations. Very-high-clocked designs for their time (Netburst, Cell) have even had pipeline stages present just to give the signals time to propagate.

It may be hard to notice because high clock processors today also tend to do a lot of other things to try to keep IPC as high as possible. But if they were designed around a substantially lower clock speed the IPC would be higher or there'd be less design sophistication necessary to achieve the same IPC.

It's true that performance will scale linearly as clock speed increases, but that's only because the timing characteristics are mostly locked to fixed numbers of cycles that are specified for whatever maximum clock speed the design can support. Trying to put variable cycle times all over the internal design severely compromises things.

Generally, the increase in peak clock speed outweighs the IPC penalty of reaching that clock speed, or it wouldn't be chosen. It's arguable that in some cases this didn't really work out to be true (Netburst again), perhaps due to a combination of practical clock scaling not going as high as anticipated and software not playing as nicely as hoped.

But the real issue is whether you want to claim that a CPU designed to run at both 4GHz and 1GHz is every bit as efficient as two CPUs designed to run at 4GHz and 1GHz respectively (basically the big.LITTLE scenario). The CPU designed only for 1GHz will have the potential to do it more efficiently than the 4GHz CPU scaled down to 1GHz.
 

2is

Diamond Member
Apr 8, 2012
4,281
131
106
If more slower cores were better than a few faster ones, AMD would be killing Intel right now.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
If more slower cores were better than a few faster ones, AMD would be killing Intel right now.

But then you have the problem that an i7 is not just a quad-core and an FX-8350 is not quite an octa-core, making it hardly a cut-and-dried comparison.