Core i7-4770K is performance crippled


mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
But just assume the socketed version was already here. Then why not give it a desktop SKU number and sell it for $500? You don't lose anything, and if you get a couple of people to buy it who would otherwise have bought a 4770K, you profit.

But Intel *does* lose something. Intel would have a mainstream platform with prices of a premium platform competing against their own premium platform. And the chips for the premium platform are manufactured in a *very* mature process, while 4770 is manufactured in a more recent process. I'd bet that ROI for SNB-E would be bigger than for this SKU.

What you are seeing here is the limits of Intel's "one core fits all" policy. While they can get big economies of scale in R&D, it's obvious that they have to compromise somewhere. First they had to compromise on low power, so we had notebooks with 2-hour batteries. Now the desktop has to compromise somewhere. Add to that the fact that most of the money is in mobile and you have the current landscape for desktop.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Fine-grained locking with low overhead per lock gets you the highest performance. But you need TSX for that.
No you don't; see my previous post in this same thread, which you conveniently overlooked. Java has offered this functionality in its standard library since Java 5, i.e. 2004, almost 10 years ago. And you could "roll your own" before that.
First of all, and with all due respect, mentioning Java in the context of TSX is ridiculous. Performance critical applications such as games are not written in Java, for good reason. Secondly, no, Java does not feature efficient fine-grained locking at all. The API you mention is only suitable for coarse-grained locking. Implementations exist which are over 10 times faster, and even those are much slower than TSX (which can even be free). So I seriously think you have to adjust your definition of fine-grained locking.
And in that previous post I showed you a link to an AnandTech article explaining this, where you can clearly see that HLE has very limited benefit if you already use fine-grained locking. So you are plain wrong.
The graphs in that article are for illustration purposes only. Note that they have no scale! Also, just because it is theoretically possible to achieve almost the same performance without TSX, doesn't mean developers will go the length to achieve that. You need non-blocking algorithms for that, which are insanely hard to get right. Also, it requires assembly-level programming, so again any mention of Java is plain ridiculous. TSX helps a great deal to achieve true fine-grained performance with little developer effort.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Any multi-threaded application benefits from TSX.

Not correct. Here you go -
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell

If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization...Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.

Your arguments are invalid because your understanding of what TSX is is incorrect. TSX only applies in the above case. It's more a programmer's tool than a performance tool.
Intel TSX provides hardware-supported transactional-execution extensions to ease the development and improve the performance of existing programming models.

Do you have any proof the above happens in a graphics driver?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Haswell isn't changing the situation much; only packed shifts with an independent per-element count may help a bit, but that's something rarely used in practice, IMO

gather seems way too slow (the optimization guide [1], bottom of page C-5, reports around 10 clocks reciprocal throughput for the 256-bit 4-element and 8-element gathers) to really make a difference; if you aren't careful it can even *slow down your code*. That was my first experience with gather

in fact the main "optimization" for my existing AVX2 code was to disable all the gather code paths (!)

[1] https://www-ssl.intel.com/content/d...4-ia-32-architectures-optimization-manual.pdf"

Well it's not my rhetoric, but something BenchPress (and others) has claimed several times before, so I was curious as to how he could hold two seemingly contradictory positions.

Opening that discussion up to why AVX2 in Haswell is not a game changer that enables vectorization on a large scale was the real bait of the post, which you took instead of BenchPress ;)
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
Are you kidding? Everywhere. All the dependencies between draw calls have to be respected, while also allowing the application to update resources, from multiple threads, simultaneously, and servicing some asynchronous queries. Also, graphics drivers have only milliseconds to get things done to achieve a high frame rate and minimal lag. So a lot of things happen in a short time frame. Hence fine-grained locking is something that would certainly benefit graphics drivers.

It is no more data center specific than Hyper-Threading. Any multi-threaded application benefits from TSX.

First of all, and with all due respect, mentioning Java in the context of TSX is ridiculous. Performance critical applications such as games are not written in Java, for good reason. Secondly, no, Java does not feature efficient fine-grained locking at all. The API you mention is only suitable for coarse-grained locking. Implementations exist which are over 10 times faster, and even those are much slower than TSX (which can even be free). So I seriously think you have to adjust your definition of fine-grained locking.

The notion that Java is slow was disproved a decade ago. In fact, some things are faster than in C++ nowadays... so that's another misguided belief.

ReadWriteLock is fine-grained because you can read in parallel. Only writes block, and they do with TSX as well.

Link to said 10x faster implementation, with performance data to match? Or are you again just coming up with random stuff?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Well it's not my rhetoric

sure, I know, I got the joke; posts like the ones referenced below are truly hilarious in retrospect (incredible hype for gather, the great vectorization enabler):

http://forums.anandtech.com/showpost.php?p=33607136&postcount=93

CPUArchitect, June 23, 2012 "Gather replaces 18 legacy instructions, so in the best case we're looking at an 18x speedup"

http://forums.anandtech.com/showpost.php?p=33657446&postcount=183

bronxzv, July 6, 2012 "it looks like it will be possible to reach the best throughput for most of the cases where all the lines are available in the L1D cache so that we were in fact too conservative assuming the best case with all elements in a single cache line"
 

Concillian

Diamond Member
May 26, 2004
3,751
8
81
AVX2 and TSX are its main innovations relevant for a performance desktop system.

This is the primary point. Since the architecture has proven to be very similar to IB in terms of performance per watt and ultimate performance, AVX2 and TSX are the entire reason for the existence of desktop Haswell. Excluding these from any Haswell chip ensures a delay in software written to take advantage of them.

If you're leaving it out from ANY Haswell SKU, you may as well just not have it in the first place.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Since the architecture has proven to be very similar to IB in terms of performance per watt and ultimate performance, AVX2 and TSX are the entire reason for the existence of desktop Haswell. Excluding these from any Haswell chip ensures a delay in software written to take advantage of them.

I wouldn't say they're the entire reason. The modest performance improvement at stock is appealing to those who don't overclock, which is probably the majority. Integrated VRM could bring down motherboard costs slightly, at least to the extent that the OEMs building their own motherboards would prefer it. New chipset offers more features. Improved IGP performance is useful for some class of users even for a high-end desktop part.

It's hard to believe that ISA features that not a lot of software will support for a while are a big motivator to go out and buy the CPU today. If that's all that's compelling, many users will be of the mindset that they may as well stick with what they have until the software is there, at which point the CPU may have gone down in price or something better may be out.
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
When was the last time Intel added a new feature and didn't use it for segmentation purposes? They've been doing it as far back as Nehalem from what I can remember, not sure why anyone is still surprised by it.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
When was the last time Intel added a new feature and didn't use it for segmentation purposes? They've been doing it as far back as Nehalem from what I can remember, not sure why anyone is still surprised by it.

The problem is that with IVB and every previous Intel lineup, the top desktop part was the most complete SKU they had on the market; now the top desktop part isn't the most complete part. As this is a desktop-oriented forum, people are obviously complaining.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
sure, I know, I got the joke, posts like the ones referenced below are truly hilarious (incredible hype for gather, the great vectorization enabler) in retrospect:

http://forums.anandtech.com/showpost.php?p=33607136&postcount=93

CPUArchitect, June 23, 2012 "Gather replaces 18 legacy instructions, so in the best case we're looking at an 18x speedup"

http://forums.anandtech.com/showpost.php?p=33657446&postcount=183

bronxzv, July 6, 2012 "it looks like it will be possible to reach the best throughput for most of the cases where all the lines are available in the L1D cache so that we were in fact too conservative assuming the best case with all elements in a single cache line"

Good stuff, but I like this one even more:

The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles, allowing Haswell's gather performance to keep up with the doubling of the arithmetic throughput. So the worst option is still good enough.

http://forums.anandtech.com/showpost.php?p=33622545&postcount=116

Must have been lacking in imagination :D
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
You're the one stating that the 4770k is performance crippled. Since you obviously have zero quantifiable data showing the benefits of TSX, the burden of proof is on us to prove you wrong, yeah? Is that it? Am I understanding you here?
TSX is a performance feature, so the 4770K is performance crippled compared to the 4770, by definition. There's no two ways about it. Just how much it is crippled is up for debate, but not whether or not it is.

Next, quantifiable data has already been posted before: http://www.sisoftware.co.uk/?d=qa&f=ben_mem_hle
"HLE allows 4x more transactions than the basic classic lock - a massive increase! There is no question that HLE greatly improves transactional performance."
"HLE allows the rate of modify-only transactions increase by a massive 5x - blowing both classic and even R/W locks out of the water! Applications that use many threads and locks will see a huge increase in performance when changing their locking to HLE."

But even if we ignore that, absence of proof (for now) is not proof of absence! If you want to claim that the i7-4770K will not suffer from the lack of TSX, you'll have to prove that just as much as I have to provide additional proof that it will. There's a subtle but important difference between telling me I'm not right (because it's not conclusively proven yet, which falls on me), and telling me I'm wrong (for which the burden would be on you).
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
+1 to what BenchPress just said.

I will see if I can post some examples soon.
 

willomz

Senior member
Sep 12, 2012
334
0
0
I don't think you understand the standard definition of the word 'crippled'; it implies a gigantic disadvantage.

Until we know the magnitude of the effect we cannot say whether the use of the word 'crippled' is justified.
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
if you have some code to contribute, I'll be interested

I would love that, but 1. I don't have a Haswell part and 2. I'm swamped at work atm; all my spare time is being consumed by a black hole of work crap (Java). But looking forward, Encog (http://www.heatonresearch.com/encog) has a C/C++ port that I'd like to mess with (if Jeff doesn't beat me to it).
What are you specifically coding that might benefit from TSX/AVXn?
This is where it really gets interesting, getting the new ISAs down and dirty... numbers, numbers, numbers :)
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
TSX is a performance feature, so the 4770K is performance crippled compared to the 4770, by definition. There's no two ways about it. Just how much it is crippled is up for debate, but not whether or not it is.

Next, quantifiable data has already been posted before: http://www.sisoftware.co.uk/?d=qa&f=ben_mem_hle
"HLE allows 4x more transactions than the basic classic lock - a massive increase! There is no question that HLE greatly improves transactional performance."
"HLE allows the rate of modify-only transactions increase by a massive 5x - blowing both classic and even R/W locks out of the water! Applications that use many threads and locks will see a huge increase in performance when changing their locking to HLE."

But even if we ignore that, absence of proof (for now) is not proof of absence! If you want to claim that the i7-4770K will not suffer from the lack of TSX, you'll have to prove that just as much as I have to provide additional proof that it will. There's a subtle but important difference between telling me I'm not right (because it's not conclusively proven yet, which falls on me), and telling me I'm wrong (for which the burden would be on you).

There is no amount of forum meta-debate that is going to settle this... we need more numbers, and that is going to take some time.
I will say this: I had not predicted that you would assume the voice of a critic... I had you pretty much pegged as an Intel die-hard. Nice to be mistaken from time to time.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I would love that, but 1. I don't have a Haswell part and 2. I'm swamped at work atm; all my spare time is being consumed by a black hole of work crap (Java). But looking forward, Encog (http://www.heatonresearch.com/encog) has a C/C++ port that I'd like to mess with (if Jeff doesn't beat me to it).
What are you specifically coding that might benefit from TSX/AVXn?
This is where it really gets interesting, getting the new ISAs down and dirty... numbers, numbers, numbers :)

I'll have a Haswell system in 2-3 days but no time for a well-designed microbenchmark; now that the optimization guide is out it's also less interesting to do one

at the very least I'll post raw numbers I get from my code base (a 3D renderer which uses gather for texture mapping, among other things), with/without hardware gather. I expect regressions on the production CPU for the hardware-gather path
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
I'll have a Haswell system in 2-3 days but no time for a well-designed microbenchmark; now that the optimization guide is out it's also less interesting to do one

at the very least I'll post raw numbers I get from my code base (a 3D renderer which uses gather for texture mapping, among other things), with/without hardware gather. I expect regressions on the production CPU for the hardware-gather path

Nice. As you can see there is some debate as to whether games/engines will/can benefit from AVXn and TSX; what's your take on that?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Nice. As you can see there is some debate as to whether games/engines will/can benefit from AVXn and TSX; what's your take on that?

they can clearly benefit from AVX and AVX2 (incl. FMA) IMO, since I'm seeing good speedups; again I'll wait for a production CPU to post numbers. I can say that the AVX vs. SSE2 speedup is up on Haswell vs. Ivy, and that there is an extra speedup on top of that from AVX to AVX2

as for TSX, I won't comment since I have zero experience with it so far, but BenchPress has a very good point: it makes it less appealing to learn and adopt if your top code path can't run on the top CPU
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Sure, AVX2 is a major feature on its own but when you don't have efficient multi-threading it can't be put to good use.
I thought the popular rhetoric was that AVX2 enables all loops with independent iterations to be vectorized now, versus usually not vectorized before, and that most software spends a large amount of time in such loops. What about that would be contingent on multi-threading, or TSX in particular?
Applications which benefit from AVX2 have parallel workloads, so they typically also benefit from multi-threading. But when you use fine-grained tasks to evenly distribute the workload among the cores, the overhead of the locks can become significant. I've seen cases where each thread spends up to 30% of its time trying to get something to work on (before another thread steals it), acquiring exclusive access to resources, and making sure everything is processed/delivered in the right order. The alternative is more coarse-grained locking, but then the tasks don't get evenly distributed and some cores sit idle until another thread finishes (with smaller tasks they could already have started working on the portions that thread had finished).

So the holy grail is fine-grained locking with low overhead per lock operation. That is precisely what TSX offers. Without TSX a lot of cycles go to waste, and hence fewer cycles are available for AVX2 to make a difference. So while they are in theory orthogonal features, in practice they can go hand-in-hand since they're both about parallelism.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Nice. As you can see there is some debate as to if games/engines will/can benefit from AVXn and TSX, whats your take on that?

'Can' benefit and 'will' benefit are two very different things.

Can they? I say yes, no question.

Will they? Not for a while, anyway.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Any multi-threaded application benefits from TSX.
Not correct. Here you go -
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell
If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization...Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.
Your arguments are invalid because your understanding of what TSX is is incorrect. TSX only applies in the above case. It's more a programmer's tool than a performance tool.
We've had this discussion before. That's a very rough and very vague explanation of what TSX is for. My understanding of TSX is a lot more in-depth than that. It can be used as a fundamental building block for any kind of synchronization between threads. But that explanation from Intel isn't wrong: practically every multi-threaded application is a shared-memory one which actively shares data.

There's also nothing bad about it being a programmer's tool. Equally sharp tools already existed before TSX, but they were incredibly hard to use (easily resulting in hard to find race conditions). TSX is much easier to use and thus will result in more software to actually become optimized with efficient fine-grained locks.
Do you have any proof the above happens in a graphics driver?
Yes, look for 'mutex' in the Mesa source code. And that's probably very straightforward compared to NVIDIA and AMD's drivers.