Setting performance expectations for Bulldozer (client)


Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Because Apache, MySQL, DB2, Oracle whatever-the-name-is-now, and Postgres can all make great use of more cores almost as well as faster cores, and are all very low IPC, throwing a wrench in your assumptions. In fact, most server software that isn't HPC is low IPC. Performance per watt can be quite good.

Agreed. But we already have options (multi-socket) for having 16-32-64 cores on a server. Granted, they are (relatively) expensive, but they do exist. It gets to the point where, with 20 cores per socket and 4 sockets, we start to bottleneck in other places of the system. Scalability of the x86 design comes into play here. One can argue that if you need this horsepower, chips like Power7 and Itanium may be better options.

Personally, I would much rather see cores have more stages, greater SMT, better IPC, faster clocks, lower wattage, more instruction sets like FMA, etc., even if we have fewer of them. Yes, that means applications will need to be recompiled to take advantage of this, but they generally need to be re-designed to take advantage of more threads as well (usually).

I am more upset with Intel and AMD as they just throw more cores at us and charge more $$, instead of trying to make advances in IPC. The truth is that more cores do not help 90% of their customers. But they want you to think they do, and market it as such.
 
Last edited:

Voo

Golden Member
Feb 27, 2009
1,684
0
76
Because Apache, MySQL, DB2, Oracle whatever-the-name-is-now, and Postgres can all make great use of more cores almost as well as faster cores, and are all very low IPC, throwing a wrench in your assumptions. In fact, most server software that isn't HPC is low IPC. Performance per watt can be quite good.
And the majority of businesses still want to keep latency down and aren't all that interested in throughput. Who cares if the server can handle 10k users at once if every one of those has to wait 1 second for a response?

But then that shouldn't really be a problem for AMD since x86 is still quite strong per core, but just thinking that per-core performance is completely uninteresting is a bit too simple. There's a Google paper about that.
 

Concillian

Diamond Member
May 26, 2004
3,751
8
81
i7 975 vs i7 2600: http://www.anandtech.com/bench/Product/287?vs=99

http://www.tomshardware.com/reviews/sandy-bridge-core-i7-2600k-core-i5-2500k,2833-16.html

Some single threaded benchmarks:
-LAME
-iTunes conversion
-Windows Media Encoder

1) Cinebench single-threaded shows a much larger gap.

2) The tests you cite are all TIME-based tests. Keep in mind that in a time-based test, 100% faster means it completes in 50% of the time. If you're looking at a speed comparison you need to roughly double the % difference between them, or calculate it properly (faster number / (1 + x) = slower number, then use algebra to find x).
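To put numbers on that conversion, here's a quick sketch (the times are made up, purely to show the algebra):

Code:
# Illustrative only: invented completion times, showing how a time
# difference maps onto a speed difference.
time_slow = 60.0   # slower chip finishes the job in 60 s
time_fast = 45.0   # faster chip finishes the same job in 45 s

time_saved = (time_slow - time_fast) / time_slow   # 0.25 -> "25% less time"
speedup    = time_slow / time_fast - 1             # 0.333 -> "33% faster"
print(f"{time_saved:.0%} less time == {speedup:.0%} higher speed")
# At the extreme, 50% less time corresponds to 100% faster (2x the speed).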

Current SB single-threaded performance is about 30-50% faster than AMD Phenom II in most benchmarks. In some it's even more than this. The best comparison to do this with is the i3 2100, since there's no turbo. Unfortunately there's no Phenom at 3.1 GHz, so we have to add 3.33% to the 3.0 GHz Phenom.

http://www.anandtech.com/bench/Product/289?vs=80

Worst case for the Intel, WinRAR shows the X4 as a little less than 40% faster (4 cores vs. 2 cores; it should be close to 100% faster in the best multithreaded cases). If you give the X4 an extra 10% for an efficiency penalty of 4 cores vs. 2 cores, you still come out with the Intel being 30% faster per core... in the test that gives the highest margin of victory to AMD.
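Roughly, the per-core arithmetic works out like this (the 40% and 10% figures come from the numbers above; the rest is just normalization):

Code:
# Sketch of normalizing a 4-core vs. 2-core WinRAR result to per-core speed.
x4_overall     = 1.40   # X4 ~40% faster overall than the i3 (figure from above)
scaling_credit = 1.10   # credit the X4 ~10% for imperfect 4-core scaling

x4_per_core = (x4_overall * scaling_credit) / 4   # quad's throughput per core
i3_per_core = 1.00 / 2                            # dual's throughput per core (i3 = 1.0)
print(f"i3 per-core lead: {i3_per_core / x4_per_core - 1:.0%}")   # ~30%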

You can't dismiss single core performance when most of the benchmarks are showing AMD running about 60% of Intel clock for clock.

IPC delta is larger than ever, and it REALLY hurts in certain games that don't really make use of multicore very well (Starcraft II and WoW, most notably). In situations like this, IPC is everything.

I think your original assumption of a 20% spread in IPC is very, very wrong. Compare the Stars-core CPUs to the i3 2100: it is over 50% faster in most benchmarks.
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The tests you cite are all TIME-based tests. Keep in mind that in a time-based test, 100% faster means it completes in 50% of the time. If you're looking at a speed comparison you need to roughly double the % difference between them.

Who doesn't know that? Are you trying to catch me off guard here?

LAME: 20% faster with 10% higher clock (3.8 GHz vs. 3.46 GHz Turbo for the 2600 and 870)
WinZIP: 13.4% faster
iTunes: 15.4% faster

Cinebench ST is about the highest gain among the single-threaded tests.

Current SB single-threaded performance is about 50% faster than AMD Phenom II. The best comparison to do this with is the i3 2100, since there's no turbo. Unfortunately there's no Phenom at 3.1 GHz, so we have to add 3.33% to the 3.0 GHz Phenom.

http://www.anandtech.com/bench/Product/289?vs=80

Don't need to do that. You can compare it against the Phenom II 550 BE, a dual-core chip.

Good article that showed a comparison between Nehalem and Penryn with Turbo and Hyper-Threading off:
http://www.xbitlabs.com/articles/cpu/display/intel-core-i7.html

Phenom II vs. Penryn: http://www.anandtech.com/bench/Product/80?vs=49

5-10%

5-10% from Deneb to Penryn, Penryn to Nehalem, Nehalem to Sandy Bridge. At best, you are looking at 30%. Actually, Nehalem gains more like 0-5% if you look at a wider array of benchmarks. The big gains, like here (http://www.anandtech.com/bench/Product/48?vs=45), result from multi-threaded advancements like SMT and memory bandwidth improvements.
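Compounding those per-generation figures gives the same ballpark (illustrative arithmetic only):

Code:
# Compounding a 5-10% per-generation IPC gain over three generations
# (Deneb -> Penryn -> Nehalem -> Sandy Bridge).
low, high = 1.05, 1.10
print(f"{low**3 - 1:.0%} to {high**3 - 1:.0%} cumulative")   # ~16% to ~33%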

IPC delta is larger than ever, and it REALLY hurts in certain games that don't really make use of multicore very well (Starcraft II and WoW, most notably). In situations like this, IPC is everything.

No one disagrees with you, but assuming a 50% single-thread performance advantage for Intel is simply a false assumption. Neither game is explicitly single-threaded either, and both will take advantage of the specific multi-threaded gains Intel put into the Nehalem generation.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Agreed. But we already have options (multi-socket) for having 16-32-64 cores on a server. Granted, they are (relatively) expensive, but they do exist. It gets to the point where, with 20 cores per socket and 4 sockets, we start to bottleneck in other places of the system. Scalability of the x86 design comes into play here. One can argue that if you need this horsepower, chips like Power7 and Itanium may be better options.
Depends, but there is clearly that demand. AMD is not making 12-core CPUs for your desktop; the 16-core BDs won't be desktop chips, nor will the 20-core parts to follow. They are to be flagship server CPUs, not what everyone in the world buys.
Personally, I would much rather see cores have more stages, greater SMT, better IPC, faster clocks, lower wattage, more instruction sets like FMA, etc., even if we have fewer of them. Yes, that means applications will need to be recompiled to take advantage of this, but they generally need to be re-designed to take advantage of more threads as well (usually).
Higher clocks mean higher TDP, and Intel already learned their lesson about trying to push the clock speed limits. Clock speeds will not jump up.

Higher clocks with higher IPC will only raise the TDP faster, and increase the chances of not being able to meet higher clock speed targets. Neither AMD nor Intel are willing to have a single socket exceed 150W, either.

Lower wattage necessitates lower clocks, fewer transistors switching (less work done per clock), and/or LVDS and/or low power manufacturing (which won't reach very high clocks).

You are asking for the impossible, unless you want to pay $500+ for low-end CPUs, with a fridge unit for cooling.
I am more upset with Intel and AMD as they just throw more cores at us and charge more $$, instead of trying to make advances in IPC. The truth is that more cores do not help 90% of their customers. But they want you to think they do, and market it as such.
Intel has been improving both IPC and total performance per thread per generation since the Pentium-M reset. AMD has been since the K6. Both companies have been increasing total per-thread performance, as they have also been adding cores, and will continue to do so. Nobody in their right mind would intentionally create weaker cores for the sake of having a ton of them. If nothing else, Amdahl's law would bite them in the ass, for CPU work. The reality, though, is that we are well into diminishing returns of IPC improvements.
And the majority of businesses still want to keep latency down and aren't all that interested in throughput. Who cares if the server can handle 10k users at once if every one of those has to wait 1 second for a response?
Nobody, if latency matters, and latency is sufficiently low without thousands of requests. However, if latency became a problem because the server was chosen with too little memory bandwidth, then that was poor planning. If latency is low for a small number of users, it's generally not hard to keep it low for a large number of users, so long as there is a maximum number planned for. After that, either you let it get slow, or you deny new connections. What CPU config is best is a more specific problem than just that, and many times it's easier and cheaper to buy some headroom than to actually figure out the answer (so long as you can figure out a good minimum estimate). If everyone were able to know exactly what was needed, and buy just that, server markets would become unrecognizable compared to today.

But then that shouldn't really be a problem for AMD since x86 is still quite strong per core, but just thinking that per-core performance is completely uninteresting is a bit too simple. There's a Google paper about that.
But there's a point--generally, wherever Intel is at with their fastest CPUs--where whether it is interesting or not doesn't matter, because you can't buy a faster one, and they haven't figured out how to produce a faster one that enough people will want to buy. It's not that people don't want faster single cores. It's that a single core is not fast enough. For a problem that can use many cores, the answer to this is simple: do the extra work needed to use more cores.

If there were CPUs that would stay within reasonable power budgets, available at low cost, with much higher performance per core, such that you would not need to scale out to more cores, obviously people would buy more of those. In fact, they do, because that's pretty much a comparison of the current Xeons against Opterons. OTOH, if the per-thread performance of the fastest clock speed Xeon isn't enough, and it's not an IO constraint, there is nowhere to go. It's not because anyone really wishes this were the case, but because real world constraints make it the best option, and we've got to make the best of that.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Higher clocks mean higher TDP, and Intel already learned their lesson about trying to push the clock speed limits. Clock speeds will not jump up.

Higher clocks with higher IPC will only raise the TDP faster, and increase the chances of not being able to meet higher clock speed targets. Neither AMD nor Intel are willing to have a single socket exceed 150W, either.

Lower wattage necessitates lower clocks, fewer transistors switching (less work done per clock), and/or LVDS and/or low power manufacturing (which won't reach very high clocks).

You are asking for the impossible, unless you want to pay $500+ for low-end CPUs, with a fridge unit for cooling.

Smaller processes fix most of that. IPC improvements can actually help TDP, as we see with SB. So yes, it is possible. But no one is rushing to get out the next process node since there is a lack of competition at this point in time.

But throwing more cores into the mix really bumps up TDP as well. So my argument of fewer cores and better IPC/faster cores is not impossible. It comes down to more cores = more sales, since most consumers think more cores are always better.

I would take a 4-core SB running at 4.0 GHz over a 6-core Westmere running at 3.0 GHz any day. (I know the numbers are not exact, just making a point.) And as we move into the 8-core world, this will only get worse. How many of us really need 8 cores/16 threads today? Granted, a few years from now more programs will utilize it, sure. But I am talking about right now.
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Smaller processes fix most of that. IPC improvements can actually help TDP, as we see with SB. So yes, it is possible. But no one is rushing to get out the next process node since there is a lack of competition at this point in time.
Ah, so Intel, if they really wanted to, could go ahead and create an 11nm CPU? Process shrinks only get a certain amount of speed and power reduction, and they can only be done so quickly. It takes money, time, and good engineers.
But throwing more cores into the mix really bumps up TDP as well. So my argument of fewer cores and better IPC/faster cores is not impossible.
Do you think that if Intel completely got rid of 3 cores on a 4-core die, that one core left would be able to be clocked much faster, within TDP?
It comes down to more cores = more sales, since most consumers think more cores are always better.
Duallies are still what most people buy, though.

I just don't see how you expect them to work miracles. As far as I can tell, unless you want to go way over currently-accepted power budgets, they are getting you the most single- and multi-core performance possible.
I would take a 4-core SB running at 4.0 GHz over a 6-core Westmere running at 3.0 GHz any day. (I know the numbers are not exact, just making a point.) And as we move into the 8-core world, this will only get worse. How many of us really need 8 cores/16 threads today? Granted, a few years from now more programs will utilize it, sure. But I am talking about right now.
I'll give you one better: how about running an n-core SB (n>3), but when only using one taxing thread, having the other n-1 cores entirely powered off, and running that one core as fast as it would be clocked if it had been designed as a single-core CPU to begin with?
 

PreferLinux

Senior member
Dec 29, 2010
420
0
0
I'll give you one better: how about running an n-core SB (n>3), but when only using one taxing thread, having the other n-1 cores entirely powered off, and running that one core as fast as it would be clocked if it had been designed as a single-core CPU to begin with?
Simple: it is NOT as fast as if it had been designed as a single core.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Simple: it is NOT as fast as if it had been designed as a single core.
How much faster could it be if it had been? Has Intel given any indication that they would clock them faster if fewer cores existed on the die?
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
8,492
7,751
136
Yeah, it could clock higher. That's why you see larger amounts of turbo when only a single core is being used, for both Intel and AMD. Eventually you'll hit a ceiling in terms of clock speed where additional gains consume an unreasonable amount of TDP headroom. If you have a multi-threaded workload, you'll probably be able to get better returns from more, slower cores than a single fast core.

Having a good turbo when only one core is being used helps to balance the design by getting back some of the performance lost due to having multiple cores. It's probably not going to be able to boost the clock rate as high as a chip designed to have a fast, single core, but having multiple cores is generally more useful than only one.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Ah, so Intel, if they really wanted to, could go ahead and create an 11nm CPU? Process shrinks only get a certain amount of speed and power reduction, and they can only be done so quickly. It takes money, time, and good engineers.

If Intel had been pushed by AMD or any other company in the past 5 years, then I would be willing to bet we would be on 22nm now and not in 2012.

Do you think that if Intel completely got rid of 3 cores on a 4-core die, that one core left would be able to be clocked much faster, within TDP?
Duallies are still what most people buy, though.

Yes I do. Not 4 times faster, but still much faster. And if it was designed as a single, then sure....why not.

I just don't see how you expect them to work miracles. As far as I can tell, unless you want to go way over currently-accepted power budgets, they are getting you the most single- and multi-core performance possible.

Not miracles, just less marketing. You really doubt Intel could have released a 3.8 GHz i7 950, for example? You really doubt Intel could have released an SB clocked at 4 GHz? OCers are well past that.


I'll give you one better: how about running an n-core SB (n>3), but when only using one taxing thread, having the other n-1 cores entirely powered off, and running that one core as fast as it would be clocked if it had been designed as a single-core CPU to begin with?

Sure, why not. Turbo 3.0.

If I run 1 thread or 4 threads, I am running within a range of just a few hundred MHz thanks to Turbo 2.0. Before that, I was running the same speed no matter how many threads.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
If Intel had been pushed by AMD or any other company in the past 5 years, then I would be willing to bet we would be on 22nm now and not in 2012.

Mmm, this part isn't at all likely. Process tech has always been 2 years for Intel, never less, never more (you know, aside from plus or minus 1-2 months).

You won't see more than a 10-20% increase in clocks by having a single core, because diminishing returns hit hard if you push one metric, and advanced power management on multi-core processors makes sure the power usage isn't n x a single core's.

Power gating isn't everything, which is why different binned processors have different lowest power usage figures.
 
Last edited:

OneEng1

Junior Member
Apr 3, 2010
9
0
0
While single-threaded performance is still moderately important today, it is becoming more and more irrelevant. In designing BD, AMD has aimed for a more parallel processing design within a minimum of die space.

SMT adds more die space, and requires quite a bit more validation for the design. It is essentially a way to share core resources.

CMT is a different approach to sharing of resources on the CPU die. It is arguably a more efficient design giving more performance per area than SMT while needing less validation.

As for the assumption that 20c is not needed, that is absurd. The idea that more sockets would be a good alternative to more cores is disingenuous. 10 cores clocked at 2 times the speed would consume 2^3 (8) times the power while 20c would consume 2 times the power of a 10 core to get the same performance result in highly threaded server and HPC applications.

Thermally limited, core-by-core overclocking gives you the best of both worlds. If you are only using 2 cores, then clock those higher while utilizing the entire cache for just 2 cores.

SB does not feel much like a "Tock" to me. It is certainly not a radical departure from Core 2/Nehalem architecture, but rather an enhancement.

SB still has cores that each handle both FPU and INT. AMD has had different pipelines for INT and FPU for some time; however, BD takes this to an entirely different level. This allows sharing at a completely different level within the chip. In the long run, this architecture also allows for varying numbers of specialized computing units, in order to create specialized processors for different workloads.

A BD created for heavy FPU work could contain multiple FPUs with fewer INT units, for instance. A BD created for graphics could contain several GPU units, etc., etc.

As for relative performance of BD to SB, we don't have long to wait I think to just read the benchmarks. I suspect that the price of Intel processors will be coming down shortly after the BD launch ;)
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
As for the assumption that 20c is not needed, that is absurd. The idea that more sockets would be a good alternative to more cores is disingenuous. 10 cores clocked at 2 times the speed would consume 2^3 (8) times the power while 20c would consume 2 times the power of a 10 core to get the same performance result in highly threaded server and HPC applications.

No, it wouldn't, because 2x clock is faster everywhere, while cores are subject to diminishing returns; 1.5x clock should do it. A cubed increase in power only results when voltage increases as well; otherwise it'll be linear, and then only for the cores, not the whole chip. Though, you are right in general.

SB does not feel much like a "Tock" to me. It is certainly not a radical departure from Core 2/Nehalem architecture, but rather an enhancement.

It's not a Tick either. It's hard to draw a line between what's a new architecture and what's an enhancement. But almost no part of the chip is unchanged, so I disagree with you. The change from reorder buffers to a physical register file is enough to call it a new OoO architecture, though.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
If Intel had been pushed by AMD or any other company in the past 5 years, then I would be willing to bet we would be on 22nm now and not in 2012.

If Intel had been pushed, and AMD had more than 20% market share... Intel would not have made as much money as they do now with 80% of the market.

So Intel would have had less money... putting up fabs is expensive, as is the R&D.

I don't think that if the market were split 50% AMD / 50% Intel we would be further along than we are now; I don't think we'd be running 22nm CPUs yet.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
You really doubt Intel could have released a 3.8 GHz i7 950, for example? You really doubt Intel could have released an SB clocked at 4 GHz? OCers are well past that.
Yes, I do doubt that they could (1) get enough that fast to be worth it for them (2) within their current TDP limits. OCers are always ahead, but they can freely ignore limitations like power envelope.

If AMD comes out with something truly competitive, then those samples which aren't currently worth binning differently would magically become new, expensive, low-volume CPUs. However, that behavior would be the same regardless of the number of cores implemented.

If the Pentium 4 had never happened, and per-thread performance were job #1, by current measures, we might be up to around 10-15% better performance per thread. Multiple cores would still be the best option to get the most out of Moore's Law.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Multiple cores would still be the best option to get the most out of Moore's Law.

I disagree with this. I am a developer for a large financial institution, and I know how difficult it is to add more threads to applications. All these threads need to be managed properly within the code or else you can actually cause slowdowns (wait states). Sure 2,3,4 threads can be done with little effort. And 5,6,7,8 can be done with much more effort. But when we start hitting 16,20,32 threads, that is where it can get messy. And in fact, some applications would hit a thread limitation well before that. There is a reason games are lagging behind the hardware (and they are spending millions in development).

So sure, for servers, more cores will always be better. Simply because we can run more applications on them. But for the desktop, how many 2-4 threaded applications would you want to run at the same time? It is going to get to a point real soon where more cores will mean almost nothing. When you get a 16-core CPU (32 threads) in a few years, let me know what it takes to utilize it all. And let me know if you would do that all the time (DC apps not included).

One reason dual cores are still being sold today is that faster-clocked dual cores are still better than slower-clocked quads for many people. And faster quads are better than slower hexes for many people (myself included). We will see a point where Intel and AMD will have to stop relying on more cores as a way to improve the CPU. I know some may not agree with me, and I could be wrong, but this is what I believe.
 
Last edited:

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Mmm, this part isn't at all likely. Process tech has always been 2 years for Intel, never less, never more (you know, aside from plus or minus 1-2 months).

Ok, and if Intel shaved 1-2 months off each process since 90nm, then we would be on 22nm as of now (or next month). Bottom line is they didn't have to. They have been able to milk the most out of each process. And they are a business so who can blame them.

I do not think each process is dependent on the previous. Yes, each one costs billions, and there is no doubting that. But if Intel chose to skip 32nm and go directly to 22nm, they would have saved the time and money of developing 32nm. And then 22nm could have been released much sooner. I believe this to be true.

But, as a business, why would they? Why give up all the revenue in 32nm? They wouldn't. But that does not mean they couldn't. They have a business model and they follow it. And business models change based on market conditions and competition. Being top dog allows you to be stagnant. I don't think you can disagree with that statement. I am not saying that Intel has been really stagnant, but on the other hand I don't think they have been "working around the clock" so to speak.
 

OneEng1

Junior Member
Apr 3, 2010
9
0
0
No, it wouldn't, because 2x clock is faster everywhere, while cores are subject to diminishing returns; 1.5x clock should do it. A cubed increase in power only results when voltage increases as well; otherwise it'll be linear, and then only for the cores, not the whole chip. Though, you are right in general.
Actually, the theoretical formula says that frequency is exactly linear with power consumption and that voltage is only squared, I believe... so not only did I have that backwards, I also got it wrong.

Having said that, it still isn't true in real life. The equation P=C*F*V^2 is simplistic in its assumptions. In real life, factors change as frequency is increased. This can be seen with empirical measurements. Still, your point is more correct than my original assumption of f^3.

It is still true that having more cores and fewer sockets is much more power efficient than fewer cores and more sockets. It is also much less expensive. More cores not only gives you more performance per watt, it also gives you more performance per $.
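For a rough feel of the trade-off, here's a minimal sketch using the same simplified model, P = C*f*V^2 (illustrative numbers only; real chips also have leakage, uncore power, binning, etc.):

Code:
# Simplified dynamic-power model: P ~ C * f * V^2 (ignores leakage, uncore, binning).
def rel_power(freq_scale, volt_scale, core_scale):
    return core_scale * freq_scale * volt_scale ** 2

# Doubling cores at the same clock and voltage: roughly 2x the core power.
print("2x cores:", rel_power(1.0, 1.0, 2.0))           # 2.0

# Doubling frequency at constant voltage would also be ~2x, but in practice a
# big clock jump needs more voltage; e.g. +30% voltage pushes it to ~3.4x.
print("2x clock, +30% V:", rel_power(2.0, 1.3, 1.0))    # ~3.38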

It's not a Tick either. It's hard to draw a line between what's a new architecture and what's an enhancement. But almost no part of the chip is unchanged, so I disagree with you. The change from reorder buffers to a physical register file is enough to call it a new OoO architecture, though.
From a performance perspective, it is a Tick IMHO. The only possible rationale for a "Tock" would be the inclusion of a graphics core.

I guess I relate a "Tock" to an architectural change. Making tweaks to the various buffers seems like "Tick" work to me ;) Still, it is a silly argument on either side. It is a pretty weak "Tock" in terms of performance and impact on the market.

Edrick said:
So sure, for servers, more cores will always be better.
Absolutely. Let's keep in mind here that the 20c part in the future, as well as the soon-to-be-released 16c part, are only for server/workstation duty and only run in Socket G34, having 2 banks of 2-channel memory (4-channel if you will).

I would argue that even on the desktop, the future is more cores. The potential scaling is simply impossible to match with clock speed. The only really long tasks on PCs are fortunately also the ones that are easily threaded.

I agree that threaded code is more difficult than non-threaded code; however, come on guys. It isn't THAT hard in general. Now some tasks simply don't lend themselves to being easily threaded.... but most really CPU intensive tasks do. I don't say this without any background either. I have written automotive test system software that operates hundreds of threads simultaneously. It wasn't without difficulty, but it also isn't "years" of work either.

The fact of the matter is that any task that needs high performance is going to have to move to a highly threaded architecture, since clock speed is not going anywhere fast.

If my argument is true, then BD has a better architecture moving forward since it has the ability to put more smaller cores into a single chip along with specialized cores (like the FPU and GPU) to enhance its performance vertically and horizontally.

I completely expect Intel to follow suit in a few years (after IB perhaps?).
 

Mopetar

Diamond Member
Jan 31, 2011
8,492
7,751
136
I disagree with this. I am a developer for a large financial institution, and I know how difficult it is to add more threads to applications. All these threads need to be managed properly within the code or else you can actually cause slowdowns (wait states). Sure 2,3,4 threads can be done with little effort. And 5,6,7,8 can be done with much more effort. But when we start hitting 16,20,32 threads, that is where it can get messy. And in fact, some applications would hit a thread limitation well before that. There is a reason games are lagging behind the hardware (and they are spending millions in development).

There are ways to get around these difficulties, it's just that for the longest time programmers didn't receive much education about multi-threaded programming and in some ways academia still hasn't caught up to the changes in the industry. Another problem is that if you have a large legacy codebase, it's already a nightmare to maintain and making changes in it can result in all manner of strange bugs.

You could use a thread pool similar to Apple's GCD where you can easily find segments of the code that parallelize well in order to take advantage of more cores. I don't know much about financial software, but I assume there are some operations that need to be applied to numerous members of a data set. These are the cases where having huge numbers of threads can significantly improve performance. Of course, if there's not good support for something like that in the language that you use, it's not going to help you much.
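As a rough illustration of that pattern (a made-up example, not real financial code; Python's standard pool stands in for something like GCD here):

Code:
# Hand independent per-record work to a pool instead of managing threads by hand.
# The "adjust" function and the data are invented purely for illustration.
from concurrent.futures import ThreadPoolExecutor

def adjust(record):
    # some independent computation on a single record
    return record * 1.05

records = list(range(100_000))

# The pool spreads independent items across workers; the calling code stays the
# same whether the machine has 2 cores or 16. (For CPU-bound pure-Python work
# you'd use ProcessPoolExecutor instead, because of the GIL, but the pattern is identical.)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(adjust, records))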

Only within the last half decade have consumer devices had more than a single core. It's no surprise that a lot of developers haven't spent a lot of time learning how to program for them. I feel that the amount of information covered on these topics when I got my CS degree was woefully inadequate compared to the growing modern need.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I disagree with this.
You'd rather have a few percent more performance for one thread, instead of all the performance of a whole other CPU? The trade-off is far from linear.

Even Intel can only get so much performance per thread per generation. Within any given time frame, there is very little room to grow. The R&D needed to make improvements that really work takes time, as future efforts build on previous efforts.
When you get a 16-core CPU (32 threads) in a few years,
I'll probably be on 8c/8t then. With luck, it will take a few years to take advantage of it, so that I might avoid upgrade cycles, like I have been with my C2D.

It sucks that you can't use those extra resources, but if you can't, then you have to accept lower performance. There aren't other good options available, yet (unless you could squeeze good performance out of an Itanium, but that would be its own bottomless can of worms).
One reason dual cores are still being sold today is that faster-clocked dual cores are still better than slower-clocked quads for many people.
That doesn't work out very well, since the duals and quads run at similar speeds. Most people really don't need, nor have any uses for, more than two CPU cores. Hex and up hit high TDPs a bit too readily, at similar speeds, though.
And faster quads are better than slower hexes for many people (myself included). We will see a point where Intel and AMD will have to stop relying on more cores as a way to improve the CPU. I know some may not agree with me, and I could be wrong, but this is what I believe.
We will continue to get both clock speed and IPC improvements from both Intel and AMD, as well. They aren't relying on more cores to improve the CPU--there are people demanding more cores, because they can't make each single core much faster (within real-world constraints), yet manufacturing advancements allow them to cram more and more xtors on a cheap chip.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
There are ways to get around these difficulties, it's just that for the longest time programmers didn't receive much education about multi-threaded programming and in some ways academia still hasn't caught up to the changes in the industry. Another problem is that if you have a large legacy codebase, it's already a nightmare to maintain and making changes in it can result in all manner of strange bugs.

You could use a thread pool similar to Apple's GCD where you can easily find segments of the code that parallelize well in order to take advantage of more cores. I don't know much about financial software, but I assume there are some operations that need to be applied to numerous members of a data set. These are the cases where having huge numbers of threads can significantly improve performance. Of course, if there's not good support for something like that in the language that you use, it's not going to help you much.

Only within the last half decade have consumer devices had more than a single core. It's no surprise that a lot of developers haven't spent a lot of time learning how to program for them. I feel that the amount of information covered on these topics when I got my CS degree was woefully inadequate compared to the growing modern need.

There are fundamental limitations to thread-scaling, and there really is a fundamental maximum scaling potential before the performance does in fact suffer and scaling reverses (it begins to take longer for the job to finish).

Nothing can be done to avoid these aspects of parallelized workloads on realized hardware; it is fundamental.

[Image: Impact of broadcast protocol on scaling]


^ that's from my thesis on the topic of thread-scaling limitations, but here's the same message from a source you might put more credibility towards:
https://share.sandia.gov/news/resou...lower-supercomputing-sandia-simulation-shows/
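For intuition, here's a toy model of that reversal (not the thesis data or the Sandia results; the constants are invented): a serial fraction plus a per-core coordination cost eventually makes adding cores counterproductive.

Code:
# Toy speedup model: serial fraction + parallel portion + per-core coordination cost.
# The constants are made up; they only demonstrate the shape of the curve.
def speedup(n, serial=0.05, comm=0.002):
    run_time = serial + (1 - serial) / n + comm * n   # normalized time on n cores
    return 1.0 / run_time

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(n, round(speedup(n), 2))
# Speedup rises, flattens, and then declines once the comm*n term dominates.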
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
There are fundamental limitations to thread-scaling, and there really is a fundamental maximum scaling potential before the performance does in fact suffer and scaling reverses (it begins to take longer for the job to finish).

Nothing can be done to avoid these aspects of parallelized workloads on realized hardware; it is fundamental.
That's a law, isn't it? I remember seeing similar slides in one lecture or another, but hell if I remember the name~

Well, but it's only true if you limit yourself to a cache-coherent SMP, no? If you remove the cache coherency problems the world looks a bit better already, and then nobody says every core has to have access to the whole memory... granted, then we end up with a CPU that's more or less a small cluster, but I'd think it shows it's not really "fundamental", just not feasible today (well, or at all).

Though message passing makes programming a whole lot more complicated, and the majority of programmers already has problems with comparatively "simple" mechanisms.

But yep, I'm quite interested in what Intel/AMD are planning for when it turns out that manycores aren't feasible in reality.
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
8,492
7,751
136
There's nothing wrong with having a small cluster. Google finished the TeraSort benchmark in 68 seconds using MapReduce.

As long as the workloads are largely independent, the entire application scales well. The problem is that not all parts of an application can be broken up.

Take a fairly naive problem where there is an array of size X with integers in each element and some variable initialized to 0. Assume we have a program that first adds or subtracts 1 from every element of the array depending on whether it's odd or even. Next, the program will start at the beginning of the array and either add or subtract each element to or from the variable, depending on whether the variable is currently odd or even.

The first part can be easily parallelized and can take advantage of multiple cores. The second part must be done sequentially so there's no way to take advantage of the extra cores. The two parts of the program are fairly similar in terms of operations used, but even with a CPU with tons of cores, the program will never run more than twice as fast as if a single-core CPU were used instead.
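A direct sketch of that toy program (hypothetical code; the actual array contents don't matter):

Code:
data = [3, 8, 1, 4, 7, 2]

# Part 1: each element is handled independently -- trivially parallelizable.
adjusted = [x + 1 if x % 2 else x - 1 for x in data]

# Part 2: the add/subtract decision depends on the accumulator's current parity,
# so element i cannot be processed before element i-1 -- inherently sequential.
total = 0
for x in adjusted:
    total = total + x if total % 2 else total - x
print(total)
# Only part 1 benefits from extra cores, so the whole program tops out near 2x.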

Many cores are feasible, but just not feasible for all workloads. There are still things that they can be used for even if any single application doesn't need them. For example, all web browsers could be run in a VM so that even if they are hit by some kind of attack, the host machine isn't compromised.

If we hit a saturation point (and some would say for a lot of people we already have), Intel and AMD can simply stop adding more cores and focus on reducing the amount of power necessary and the cost of the chip.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
There's nothing wrong with having a small cluster.
Oh sure, I didn't say that and I've written one or two algorithms in MPI, so I think I get the gist of it ;)

But still, writing efficient message-passing programs is way more complicated than writing efficient programs using threads, and that's way more complicated than writing sequential programs. Also, with message passing you can forget fine-grained parallelism; the overhead is way too large, so stuff that works fine on an SMP won't necessarily work in a cluster, and so on.

And then nobody disputes that there are purely sequential workloads out there, and a whole lot more algorithms for which parallel versions exist but which aren't work-efficient.
 
Last edited: