[] IBM unveils Power8 and OpenPower pincer attack on Intels x86 server monopoly

Idontcare · May 4, 2014

thunng8 said:
No .. hard data on Haswell vs Ivy Bridge hyperthreading performance. You claimed Haswell hyperthreads much better.

Given the very reason why the opportunity even exists for such a thing as hyperthreading to deliver non-zero performance boost in the first place (unoptimal pipeline, cache misses and stalls, etc) one would hope that in its continued optimizations of its Core microarchitecture that Intel would be reducing, not increasing, the inefficiencies of the pipeline and as such we should see less and less performance boosts from hyperthreading as each iteration of the microarchitecture comes to pass.

The only thing that should reverse this expected trend is if Intel were to suddenly increase clockspeeds thereby widening the opportunity for hyperthreading to reduce inefficiencies in the pipeline.

(I'm not posting this to say anything you don't already know, am merely posting it to add to the ongoing discussion on the topic)

Phynaz · May 4, 2014

Didn't Haswell go wider, so it has more execution units to fill with a second thread?

Idontcare · May 4, 2014

Phynaz said:
Didn't Haswell go wider, so it has more execution units to fill with a second thread?

Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

By being wider, IPC increases but only on a per-thread basis. It doesn't improve the ability of the core to handle more concurrent threads as that is dependent on the opportunity for switching between threads (which generally requires a stall in the pipeline somewhere).

Remember that hyperthreading is basically a way to keep the pipe loaded in the event of a stall (cache miss, branch mispredict, etc). Anything that is done to the core to improve pipeline efficiency (better cache, better branch prediction, etc) will actually result in a net reduction in opportunities for the second thread to be activated since the first thread is effectively stalling less often.

Increasing the width of the processor should not result in an increase in the prevalence for thread stalls, unless something went awry in the resource balance during design.

tarlinian · May 5, 2014

Idontcare said:
Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

What? This is the whole point of SMT...to share execution units between multiple threads. Any old processor can switch contexts to run one thread at a time. A smart implementation may decide to fill all execution units from one thread until a specific pipeline stalls, but it certainly doesn't switch context between different threads.

AtenRa · May 5, 2014

Idontcare said:
Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

By being wider, IPC increases but only on a per-thread basis. It doesn't improve the ability of the core to handle more concurrent threads as that is dependent on the opportunity for switching between threads (which generally requires a stall in the pipeline somewhere).

Remember that hyperthreading is basically a way to keep the pipe loaded in the event of a stall (cache miss, branch mispredict, etc). Anything that is done to the core to improve pipeline efficiency (better cache, better branch prediction, etc) will actually result in a net reduction in opportunities for the second thread to be activated since the first thread is effectively stalling less often.

Increasing the width of the processor should not result in an increase in the prevalence for thread stalls, unless something went awry in the resource balance during design.

Pipeline stalling is only one reason for the SMT. You can use all the execution resources between two threads simultaneously. But because you dont have double the resources(execution units) you dont always have two threads per cycle.
The second thread will only get the execution resources available at the time, so if you dont have any execution resources that the second thread needs to use, for example an ALU, then only a single thread will be used.

jj109 · May 5, 2014

As of Ivy Bridge, it strategy seems to be decode from both threads, shove them into the buffer, and let the OoOE engine sort them out.

On one hand, the effective throughput increases since two threads can use more ports than one thread (usually). On the other hand, the execution engine doesn't distinguish which thread is which... micro-ops are dispatched based on readiness and order of arrival. This causes some weird situations where applications can decrease in performance because a critical thread has another thread competing for the same core resources.

Phynaz · May 5, 2014

Idontcare said:
Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

By being wider, IPC increases but only on a per-thread basis. It doesn't improve the ability of the core to handle more concurrent threads as that is dependent on the opportunity for switching between threads (which generally requires a stall in the pipeline somewhere).

Remember that hyperthreading is basically a way to keep the pipe loaded in the event of a stall (cache miss, branch mispredict, etc). Anything that is done to the core to improve pipeline efficiency (better cache, better branch prediction, etc) will actually result in a net reduction in opportunities for the second thread to be activated since the first thread is effectively stalling less often.

Increasing the width of the processor should not result in an increase in the prevalence for thread stalls, unless something went awry in the resource balance during design.

I think this was true back in the P4 days. I believe the current versions of HT are more advanced and do take advantage of wider execution widths.

Lepton87 · May 5, 2014

IDC is right, every architecture since Nehalem gains less and less from HT. A site did IPC comparison at 2.8GHz and it clearly shows that, I can find a link if you are willing to change your misconceptions but you'll probably discard that evidence for a number of reasons.

Phynaz · May 5, 2014

Lepton87 said:
I can find a link if you are willing to change your misconceptions but you'll probably discard that evidence for a number of reasons.

Is this directed at me?

Lepton87 · May 5, 2014

Phynaz said:
Is this directed at me?

At everyone who disagrees with IDC, anyway I found what I've been looking for.

http://ixbtlabs.com/articles3/cpu/intel-ci7-123gen-p3.html

Overall gains from enabling HT are:

Nehalem 11.2%
SB 10.5%
IB 9.4%

They didn't test HW. If you have any evidence that the trend was reversed with HW please link, so far I haven't seen anything that would prove that.

Phynaz · May 5, 2014

As we have known, the improvements from Hyperthreading are workload dependent. For example in the Office and Multitasking test summaries in your link the gains are increasing with each generation.

What's with the accusatory attitude?

tarlinian · May 5, 2014

Lepton87 said:
At everyone who disagrees with IDC, anyway I found what I've been looking for.

http://ixbtlabs.com/articles3/cpu/intel-ci7-123gen-p3.html

Overall gains from enabling HT are:

Nehalem 11.2%
SB 10.5%
IB 9.4%

They didn't test HW. If you have any evidence that the trend was reversed with HW please link, so far I haven't seen anything that would prove that.

Generally the architectural changes being made in each generation of Intel chips do two things. Some improve the ability of a thread to take advantage of the execution resources available to it (e.g., increase OoO window). Others actually provide more execution resources, which would help results on threads with a large amount of easily extractable ILP. The first type of improvement will result in less of a benefit from SMT as there are fewer resources available to the second thread. The second type of improvement will help SMT. It may be the case that the first type of improvement has had more of an effect the second resulting in less of a benefit from hyperthreading, but it doesn't mean that any architectural change resulting in single-threaded improvement will result in a reduction in the benefits of hyper-threading.

TuxDave · May 5, 2014

Idontcare said:
Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

Maybe that's some legacy SMT implementation but SMT today is shifting to make more of the uArch pipeline a "free-for-all". But modern SMT implementations, some stages still remain mutually exclusive to one thread at a time but at least the execution units just let the out of order engine figure thing out.

But everything else you said was correct. As we improve single threaded performance to improve utilization, SMT starts to benefit less and the obvious decision would be to decrease the thread count of SMT to improve perf/watt. The other approach is sort of the opposite where you just drop single threaded performance because in MT, SMT will backfill and you will not incur a significant impact in overall IPC but now you get to make some efficiency/clock speed improvements.

Ahmdahl's law can guide your ST/MT tradeoff.

Edit: Oh, one extra point. Some traces are so bad that it exceeds any reasonable reorder buffer depth to find stuff to do. SMT really wins there.

Ajay · May 5, 2014

Lepton87 said:
Overall gains from enabling HT are:

Nehalem 11.2%
SB 10.5%
IB 9.4%

Best guess is that Intel is keeping xtors for HT to some set minimum (say 5% or less, for example). If cache hits and branch prediction improve, then there will be fewer multitasking IPC gains. Modern OSes manage the threads so that both still get run, but without much of an IPC gain. Intel have kept HT in place because it's still pretty cheap in terms of xtor and thermal budget compared to adding additional cores.

Enigmoid · May 5, 2014

Lepton87 said:
At everyone who disagrees with IDC, anyway I found what I've been looking for.

http://ixbtlabs.com/articles3/cpu/intel-ci7-123gen-p3.html

Overall gains from enabling HT are:

Nehalem 11.2%
SB 10.5%
IB 9.4%

They didn't test HW. If you have any evidence that the trend was reversed with HW please link, so far I haven't seen anything that would prove that.

The problem is that they tested a number of singlethread apps.

For example, straight modelling with solidworks is purely singlethreaded. You can't see any improvements or problems from HT.

Crysis warhead (and a lot of the games specifically, don't use more than 4 threads).

Lepton87 · May 6, 2014

Enigmoid said:
The problem is that they tested a number of singlethread apps.

For example, straight modelling with solidworks is purely singlethreaded. You can't see any improvements or problems from HT.

Crysis warhead (and a lot of the games specifically, don't use more than 4 threads).

How do single-threaded apps invalidate the results? They just make overall performance gains from enabling HT smaller.

Ajay · May 6, 2014

TuxDave said:
Maybe that's some legacy SMT implementation but SMT today is shifting to make more of the uArch pipeline a "free-for-all". But modern SMT implementations, some stages still remain mutually exclusive to one thread at a time but at least the execution units just let the out of order engine figure thing out.

So, do you mean that that current SMT implementations are approaching the ideal of true hardware level threading - where extra IPC is extracted by various means such as executing speculative branches, hardware level extraction parallelism from prefetched code based on pattern recognition heuristics, etc.?

I thought SMT was moving in the direction better execution of software threads. Real hardware level multi-threading is expensive in terms of xtors.

TuxDave · May 6, 2014

Ajay said:
So, do you mean that that current SMT implementations are approaching the ideal of true hardware level threading - where extra IPC is extracted by various means such as executing speculative branches, hardware level extraction parallelism from prefetched code based on pattern recognition heuristics, etc.?

I thought SMT was moving in the direction better execution of software threads. Real hardware level multi-threading is expensive in terms of xtors.

I'm not really sure where how you jumped to that conclusion. I'm just saying that the proposal for 100% of a CPU pipeline being mutually exclusive to one software thread until it stalls, and then switching over may be some old proposal of SMT, but that % is dropping as multiple threads can simultaneously occupy the same "region" of the CPU pipeline. Yes, it takes up XTORs, but no, we're not doubling everything.

Oh and speculative execution has always been ok even in non-SMT implementations. You just have to make sure your prediction hardware is good enough or the speculative-ness of it low enough to limit the amount of time it's wrong.

Hope that answers your question.

Ajay · May 6, 2014

TuxDave said:
I'm not really sure where how you jumped to that conclusion. I'm just saying that the proposal for 100% of a CPU pipeline being mutually exclusive to one software thread until it stalls, and then switching over may be some old proposal of SMT, but that % is dropping as multiple threads can simultaneously occupy the same "region" of the CPU pipeline. Yes, it takes up XTORs, but no, we're not doubling everything.

Oh and speculative execution has always been ok even in non-SMT implementations. You just have to make sure your prediction hardware is good enough or the speculative-ness of it low enough to limit the amount of time it's wrong.

Hope that answers your question.

Yes that does. I jumped to that conclusion based on some old research I read back in the mid 1990's - it was on the problem of using more sophisticated trace profiling to find parallel code in the executing stream and the cost in xtors of a larger prediction block and prefetch buffer. Given today's xtor budgets compared to then, it's probably not a problem anymore.

The one thing I'm not sure about is how aggressively HW profiles instruction prefetch for parallelism. Maybe it doesn't need to be more aggressive - maybe the compilers and OSes are doing the bulk of the optimization now-a-days.

Headfoot · May 6, 2014

I didn't realize that Haswell HT being better than Ivy's wasn't common knowledge. http://forums.anandtech.com/showthread.php?t=2376155&highlight=haswell+hyperthreading

It shows up primarily in games. It showed up in a few BF4 CPU performance reviews as well. Bigger gains for Haswell i3s over Ivy i3s vs Haswell i5s from Ivy i5s

Ill_take_Power · May 10, 2014

Phynaz said:
You're new here so I'll be nice and let you know that we expect facts around here. Pulling some made up stuff that 3 power8 cores is the equivalent of 24 Intel cores is going to require some proof on your part.

I've spent years moving off power onto x86 because it's so much cheaper to run. As in no contest. And believe me I'm one of the biggest power fans around.

Let's take your example of the shared processor pool. If you do that you are soft partitioned and then must license the entire server, at twice the price per core as x86.

So if your going to continue down this path, be prepared to back up your statements. I've done the TCO, and I've spent millions to move off power.

And btw, if you're paying $47K a core for Oracle I'm gonna quit my job and come be your rep. Man he's gotta be making a pile from you.

Since you are so well versed with how Power works you can tune me out for a minute while I set the record straight and make sure the readers know the facts.

I didn't say 3 x P8 cores were equivalent to 24 x86 cores. I was comparing a 24 core 2 socket P8 to a 2 socket 24 core x86 server or 2 socket to 2 socket server. It could be a 2 socket 16 core P8 server - doesn't matter. If the P8 server only required 3 cores for Oracle, I then stated the cost. The point of the 24 x86 cores isn't that it takes 24 x86 cores to match P8 - hardly. It doesn't matter how many cores are required to match the P8 as it will license 12 cores regardless if it needs 1, 8, 16 or 24 cores.

You are wrong on how Oracle licenses on Power. It considers Power and it's Power Hypervisor to be a Hard Partition unlike VMware. Dedicated is the easiest as you license the number of cores in the dedicated VM. If a Shared Processor Pool (SPP), each VM looks at the desired capped value and rounds up. If uncapped, it looks at the number of desired virtual processors not to exceed the number of cores in the SPP. If I had 8 cores in a SPP, I could have 160 VM's in there all running Oracle and the most number of licenses required would be 8. If I had that 160 VM's and only one ran Oracle and it's desired virtual processors was 3 in this 8 co SPP, you would only license 3 oracle licenses. A 24 co P8 server can have from 1 - 24 pools - limit is 1 per core up to the max number of cores in a single pool.

The reason I use "list price" for Oracle is because it is a constant. Every customer gets a different discount. Take list and apply what you get to it. It scales with the discount as it is the same for x86 or Power. If you have spent millions moving off of Power then you did so misinformed. Feel free to show a sample TCO on your reply as nothing beats plain text to tell the truth. I'll then reply with my TCO data.

Many x86 vendors say they can beat Power with TCA and TCO but I call BS.

Phynaz · May 10, 2014

Not gonna argue with a sock puppet that makes up numbers to prove a point.

Ill_take_Power · May 10, 2014

You won't argue but you will resort to name calling. I see. You claim I am making up numbers to prove a point. Instead of you possibly being wrong because you did say you've spent years and millions moving off of Power to x86 (oh, but you love Power) and anybody who challenges you must be a liar. You don't have to admit you are wrong but I don't have to let you go unchallenged making statements as if you are an authority on them. Because I am actually an authority on them. I worked for Sun for 10 years then IBM for 4 and now for a business partner focusing on competitive takeouts with Power servers. 4+ years ago it was 70/30 against SPARC/Itanium/PA-RISC. Today, it is 70/30 against x86/VMware. One thing as a technical architect is I have to not only know the facts but be able to show them to customers, back them up with facts and ensure they can pass software audits. There is quite a bit at stake for both customers and my firm.

Let's start here. Oracle "supports" VMware but if there is a problem they can require you to reproduce the problem on a physical server (ie no VMware) or use OVM (ie Xen).
https://blogs.oracle.com/UPGRADE/entry/is_oracle_certified_to_run_on

The support statement is not the same as a licensing statement. That must adhere to partitioning rules. Oracle publishes that in their "Oracle Partitioning Policy" available here http://www.oracle.com/us/corporate/pricing/partitioning-070609.pdf
If that isn't enough, then watch this video at VMworld 2012 http://www.youtube.com/watch?v=dZ5Qip29Yt8 as they talk about licensing with a Director from Oracle. What "Richard" states is that not only do you license all cores on a x86 server (times the licensing factor of .5 - don't want to be accused of making things up again) but you must license all of the other servers in a vMotion cluster that could possibly ever run that Oracle workload. If there are 5 servers in that "farm" with only 1 server running Oracle you would have to license all 5 times their core count times .5.

My point with using 3 cores on the Power server is not to compare 3 cores to 24 or any other number. My point is that with Power, I only have to license the cores require for Oracle times its licensing factor of 1.0 because Oracle views PowerVM as a hard partition compliant product (it can separate what cores would run Oracle).

Since you didn't want to play the TCA / TCO game I will so everybody else will benefit.

Typical x86 solution to host an Oracle Enterprise Edition database. Most customers would not deploy Oracle on x86 in production without some form of increased availability so I will use Oracle RAC. If users disagree with this they are free to state so but I am just sharing my experience. With the Power server, additional clustering is not required to address or overcome the inherent deficiencies in the server platform like it is with x86. If increased availability is required, customers may choose to go with a traditional cluster product like VCS (I still call it this) or IBM's own PowerHA which I prefer as being less expensive and more robust. They may also choose to use RAC.

I'll pick a random x86 vendor that will let me get pricing from their website - I'll use list pricing for everything since discounts vary like the weather in New England - all over the place.

Solution 1
HP DL380 Gen8 - qty 1

2 x 6 co @ 2.4 GHz E5-2440
128 GB Ram
vSphere Enterprise Plus
3 year support
No internal HDD - assuming USB boot
2 x dual port 10 GbE
2 x dual port Fibre
All power cores, rail kit, misc
$25,183 each server List price

Oracle cost
Enterprise Edition - $47,500 per core
EE maintenance @ 22% per year - $10,450
RAC - $23,000 per core
RAC maintenance @ 22% per year - $5,060

Power8 solution
S824 Power8 server - qty 1

8 x 4.15 Ghz Power8 cores
256 GB Ram
DVD
Split backplane
4 x SSD (building the way I would built it and not just to lower the price which I could do by using HDD)
2 x dual port 10 Gbe adapters
2 x dual port Fibre adapters
AIX v7.1

PowerVM Enterprise Edition
3 year 24 x 7 maintenance
$79,807 server list price

Now the math!

Server: HP DL380
Cost: $25,183
qty of servers: 2
Server cost: $50,366

# of cores in solution: 24
Oracle Licensing Factor: .5
# of cores needed for Oracle (actual): 5

Total Oracle Licenses required 12

Oracle EE Lic cost: $570,000
Oracle maint cost (3 yr): $376,200
Oracle RAC Lic cost: $276,000
Oracle maint cost (3 yr): $60,720
Total x86 server + Oracle cost over 3 years: $1,333,286

Server: S824
Cost: $79.807
qty of servers: 1
Server cost: $79,807
# of cores in solution: 8
Oracle Licensing Factor: 1.0
# of cores needed for Oracle (actual): 3
Total Oracle Licenses: 3
Oracle EE Lic cost: $142,500
Oracle maint cost (3 yr): $94,050
Oracle RAC Lic cost: $ Not required
Oracle maint cost (3 yr): NA
Total Power server + Oracle cost over 3 years: $316,357

The 3 year total cost of ownership for the x86 solution shown is $1 Million dollars more than a Power8 solution. The Power solution is very typical for what we might see or use with customers. We would also consolidate the app servers and other workloads onto the Power server whereas customers typically would put the App servers on separate servers - which means even more cost.

Somebody may question or say it isn’t fair or that it is convenient of me to just use 1 Power8 server whereas I am comparing it to 2 x HP x86 servers. Just in case, here are those numbers. Don’t want somebody to accuse me of making things up (now I am just having fun with you @Phynaz. Hope you take this all in the spirit in which it is meant which is to set the record straight and make sure customers are properly informed.

Server: S824
Cost: $79.807
qty of servers: 2
Server cost: $159,614
# of cores in solution: 16
Oracle Licensing Factor: 1.0
# of cores needed for Oracle (actual): 6
Total Oracle Licenses: 6
Oracle EE Lic cost: $285,000
Oracle maint cost (3 yr): $188,100
Oracle RAC Lic cost: $138,000
Oracle maint cost (3 yr): $91,080
Total Power server + Oracle cost over 3 years: $861,794

For those who want everything equal the Power solution is still $470K less than the x86 and everything else I have said remains true. If the Oracle workload grows and needs more resources, the Power server can dynamically add a single core at a time and any increment of memory to the VM. You just add the appropriate Oracle licensing. Likewise, if the workload were to decrease you could dynamically remove cores and memory as well and even redeploy Oracle licenses to other workloads or other servers - a license at a time. All about flexibility.

It is because of the reliability and efficiency of the Power server and it’s hypervisor which delivers this benefit. The more workloads consolidated onto Power the greater the savings. For those who have worked with Power, they know they can drive utilization very high without sacrificing performance or response times. VMware tends to manage up to the 30 - 35% level and then add another server. If you add another RAC node the cost go up dramatically. If you are just relying on vMotion then you also add the cluster farm costs. Heaven forbid you had a Oracle RAC environment Plus VMware with more servers than are configured with RAC where you would license Oracle at $70,500 per core across all of the cores in that vMotion cluster farm. Hope this shows and settles the pricing discussion, at least if running Oracle on x86 vs Power. Customers can choose to run on x86 with VMware instead of Power for any reason they want and pay more for it just like people choose everybody to buy a Ford over a Chevy. You don’t need a reason as it is your money. In the example I have shown above though, it will cost more to run Oracle on x86 with or without VMware vs running it on Power. Cheers!

PPB · May 10, 2014

Idontcare said:
Yes, but the execution units aren't shared by the threads. Whichever thread is active (thread A or B), it gets all the execution units until it stalls and the other thread is activated.

By being wider, IPC increases but only on a per-thread basis. It doesn't improve the ability of the core to handle more concurrent threads as that is dependent on the opportunity for switching between threads (which generally requires a stall in the pipeline somewhere).

Remember that hyperthreading is basically a way to keep the pipe loaded in the event of a stall (cache miss, branch mispredict, etc). Anything that is done to the core to improve pipeline efficiency (better cache, better branch prediction, etc) will actually result in a net reduction in opportunities for the second thread to be activated since the first thread is effectively stalling less often.

Increasing the width of the processor should not result in an increase in the prevalence for thread stalls, unless something went awry in the resource balance during design.

Which actually adds to the fact that HT performance has been going down since it's reintroduction with Nehalem. The more efficient is the core, the less performance HT can squeeze out of it.

That's why I even stopped recommending i7's to the likes of renderers, and video editers and instead I have been telling them to go for i5's. The perf/price ratio is even less favorable to the i7 every time Intel makes a tock (or a tick+, like IB was) and the issues with some software running like garbage because of having HT on hasnt been addressed yet.

Rakehellion · May 14, 2014

rtsurfer said:
Even if the Power CPUs consume more power, they would finish the task faster than Intel CPU & race to Idle faster.

It should even out in the end.

Twice the power for twice the performance isn't a bargain, isn't an advantage, and isn't an accomplishment.

Server CPUs typically remain at 100% load all the time, so there's no race to finish.

[] IBM unveils Power8 and OpenPower pincer attack on Intels x86 server monopoly

Elite Member

Lifer

Elite Member

Member

Lifer

Senior member

Lifer

Platinum Member

Lifer

Platinum Member

Lifer

Member

Lifer

Lifer

Platinum Member

Platinum Member

Lifer

Lifer

Lifer

Diamond Member

Junior Member

Lifer

Junior Member

Golden Member

Lifer

[] IBM unveils Power8 and OpenPower pincer attack on Intels x86 server monopoly