Core i7-4770K is performance crippled


mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
But just assume the socketed version was already here. Then why not give it a desktop SKU number and sell it for $500? You don't lose anything, and if you get a couple of people to buy it who would otherwise have bought a 4770K, you profit.

But Intel *does* lose something. Intel would have a mainstream platform with prices of a premium platform competing against their own premium platform. And the chips for the premium platform are manufactured in a *very* mature process, while 4770 is manufactured in a more recent process. I'd bet that ROI for SNB-E would be bigger than for this SKU.

What you are seeing here is the limits of Intel's "one core fits all" policy. While they can get big economies of scale in R&D, it's obvious that they have to compromise somewhere. First they had to compromise on low power, so we had notebooks with 2-hour batteries. Now the desktop has to compromise somewhere. Add to that the fact that most of the money is in mobile and you have the current landscape for desktop.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Fine-grained locking with low overhead per lock gets you the highest performance. But you need TSX for that.
No you don't; see my previous post in this same thread, which you conveniently overlooked. Java has offered this functionality in its standard library since Java 5, i.e. 2004, almost 10 years ago. And you could "roll your own" before that.
First of all, and with all due respect, mentioning Java in the context of TSX is ridiculous. Performance critical applications such as games are not written in Java, for good reason. Secondly, no, Java does not feature efficient fine-grained locking at all. The API you mention is only suitable for coarse-grained locking. Implementations exist which are over 10 times faster, and even those are much slower than TSX (which can even be free). So I seriously think you have to adjust your definition of fine-grained locking.
And in that previous post I showed you a link to an AnandTech article explaining this, where you can clearly see that HLE has very limited benefit if you already use fine-grained locking. So you are plain wrong.
The graphs in that article are for illustration purposes only. Note that they have no scale! Also, just because it is theoretically possible to achieve almost the same performance without TSX, doesn't mean developers will go the length to achieve that. You need non-blocking algorithms for that, which are insanely hard to get right. Also, it requires assembly-level programming, so again any mention of Java is plain ridiculous. TSX helps a great deal to achieve true fine-grained performance with little developer effort.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Any multi-threaded application benefits from TSX.

Not correct. Here you go -
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell

If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization...Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.

Your arguments are invalid because your understanding of what TSX is is incorrect. TSX only applies in the above case. It's more a programmer's tool than a performance tool.
Intel TSX provides hardware-supported transactional-execution extensions to ease the development and improve the performance of existing programming models.

Do you have any proof the above happens in a graphics driver?
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Haswell isn't changing the situation much; only packed shifts with an independent per-element count may help a bit, but that's something rarely used in practice, IMO

gather seems way too slow (the optimization guide [1], bottom of page C-5, reports around 10 clocks reciprocal throughput for the 256-bit 4-element and 8-element gathers) to really make a difference; if you aren't careful it can even *slow down your code*. That was my first experience with gather

in fact the main "optimization" for my existing AVX2 code was to disable all the gather code paths (!)

[1] https://www-ssl.intel.com/content/d...4-ia-32-architectures-optimization-manual.pdf"

Well it's not my rhetoric, but something BenchPress (and others) has claimed several times before, so I was curious as to how he could hold two seemingly contradictory positions.

Opening that discussion up to why AVX2 in Haswell is not a game changer that enables vectorization on a large scale was the real bait of the post, which you took instead of BenchPress ;)
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
Are you kidding? Everywhere. All the dependencies between draw calls have to be respected, while also allowing the application to update resources, from multiple threads, simultaneously, and servicing some asynchronous queries. Also, graphics drivers have only milliseconds to get things done to achieve a high frame rate and minimal lag. So a lot of things happen in a short time frame. Hence fine-grained locking is something that would certainly benefit graphics drivers.

It is no more data center specific than Hyper-Threading. Any multi-threaded application benefits from TSX.

First of all, and with all due respect, mentioning Java in the context of TSX is ridiculous. Performance critical applications such as games are not written in Java, for good reason. Secondly, no, Java does not feature efficient fine-grained locking at all. The API you mention is only suitable for coarse-grained locking. Implementations exist which are over 10 times faster, and even those are much slower than TSX (which can even be free). So I seriously think you have to adjust your definition of fine-grained locking.

The notion that Java is slow was disproved a decade ago. In fact, some things are faster than in C++ nowadays... so that's another misguided belief.

ReadWriteLock is fine-grained because you can read in parallel. Only writes block, and they do with TSX as well.

Link to said 10x faster implementation, with performance data to match? Or are you again just coming up with random stuff?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Well it's not my rhetoric

sure, I know, I got the joke; posts like the ones referenced below are truly hilarious in retrospect (incredible hype for gather, the great vectorization enabler):

http://forums.anandtech.com/showpost.php?p=33607136&postcount=93

CPUArchitect, June 23, 2012 "Gather replaces 18 legacy instructions, so in the best case we're looking at an 18x speedup"

http://forums.anandtech.com/showpost.php?p=33657446&postcount=183

bronxzv, July 6, 2012 "it looks like it will be possible to reach the best throughput for most of the cases where all the lines are available in the L1D cache so that we were in fact too conservative assuming the best case with all elements in a single cache line"
 

Concillian

Diamond Member
May 26, 2004
3,751
8
81
AVX2 and TSX are its main innovations relevant for a performance desktop system.

This is the primary point. Since the architecture has proven to be very similar to IB in terms of performance per watt and ultimate performance, AVX2 and TSX are the entire reason for the existence of desktop Haswell. Excluding these from any Haswell chip ensures a delay in software written to take advantage of them.

If you're leaving it out from ANY Haswell SKU, you may as well just not have it in the first place.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Since the architecture has proven to be very similar to IB in terms of performance per watt and ultimate performance, AVX2 and TSX are the entire reason for the existence of desktop Haswell. Excluding these from any Haswell chip ensures a delay in software written to take advantage of them.

I wouldn't say they're the entire reason. The modest performance improvement at stock is appealing to those who don't overclock, which is probably the majority. Integrated VRM could bring down motherboard costs slightly, at least to the extent that the OEMs building their own motherboards would prefer it. New chipset offers more features. Improved IGP performance is useful for some class of users even for a high-end desktop part.

It's hard to believe that ISA features that not a lot of software will support for a while are a big motivator to go out and buy the CPU today. If that's all that's compelling, many users will be of the mindset that they may as well stick with what they have until the software is there, at which point the CPU may have gone down in price or something better may be out.
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
When was the last time Intel added a new feature and didn't use it for segmentation purposes? They've been doing it as far back as Nehalem from what I can remember, not sure why anyone is still surprised by it.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
When was the last time Intel added a new feature and didn't use it for segmentation purposes? They've been doing it as far back as Nehalem from what I can remember, not sure why anyone is still surprised by it.

The problem is that with IVB and every previous Intel lineup, the top desktop part was the most complete SKU they had on the market; now the top desktop part isn't the most complete part. As this is a desktop-oriented forum, people are obviously complaining.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
sure, I know, I got the joke, posts like the ones referenced below are truly hilarious (incredible hype for gather, the great vectorization enabler) in retrospect:

http://forums.anandtech.com/showpost.php?p=33607136&postcount=93

CPUArchitect, June 23, 2012 "Gather replaces 18 legacy instructions, so in the best case we're looking at an 18x speedup"

http://forums.anandtech.com/showpost.php?p=33657446&postcount=183

bronxzv, July 6, 2012 "it looks like it will be possible to reach the best throughput for most of the cases where all the lines are available in the L1D cache so that we were in fact too conservative assuming the best case with all elements in a single cache line"

Good stuff, but I like this one even more:

The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles, allowing Haswell's gather performance to keep up with the doubling of the arithmetic throughput. So the worst option is still good enough.

http://forums.anandtech.com/showpost.php?p=33622545&postcount=116

Must have been lacking in imagination :D
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
You're the one stating that the 4770k is performance crippled. Since you obviously have zero quantifiable data showing the benefits of TSX, the burden of proof is on us to prove you wrong, yeah? Is that it? Am I understanding you here?
TSX is a performance feature, so the 4770K is performance crippled compared to the 4770, by definition. There's no two ways about it. Just how much it is crippled is up for debate, but not whether or not it is.

Next, quantifiable data has already been posted before: http://www.sisoftware.co.uk/?d=qa&f=ben_mem_hle
"HLE allows 4x more transactions than the basic classic lock - a massive increase! There is no question that HLE greatly improves transactional performance."
"HLE allows the rate of modify-only transactions increase by a massive 5x - blowing both classic and even R/W locks out of the water! Applications that use many threads and locks will see a huge increase in performance when changing their locking to HLE."

But even if we ignore that, absence of proof (for now) is not proof of absence! If you want to claim that the i7-4770K will not suffer from the lack of TSX, you'll have to prove that just as much as I have to provide additional proof that it will. There's a subtle but important difference between telling me I'm not right (because it's not conclusively proven yet, which falls on me), and telling me I'm wrong (for which the burden would be on you).
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
+1 to what BenchPress just said.

I will see if I can post some examples soon.
 

willomz

Senior member
Sep 12, 2012
334
0
0
I don't think you understand the standard definition of the word 'crippled'; it implies a gigantic disadvantage.

Until we know the magnitude of the effect we cannot say whether the use of the word 'crippled' is justified.
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
if you have some code to contribute, I'll be interested

I would love that, but 1. I don't have a Haswell part and 2. I'm swamped at work atm; all my spare time is being consumed by a black hole of work crap (Java). But looking forward, Encog (http://www.heatonresearch.com/encog) has a C/C++ port that I'd like to mess with (if Jeff doesn't beat me to it).
What are you specifically coding that might benefit from TSX/AVXn?
This is where it really gets interesting, getting the new ISAs down and dirty... numbers, numbers, numbers :)
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
TSX is a performance feature, so the 4770K is performance crippled compared to the 4770, by definition. There's no two ways about it. Just how much it is crippled is up for debate, but not whether or not it is.

Next, quantifiable data has already been posted before: http://www.sisoftware.co.uk/?d=qa&f=ben_mem_hle
"HLE allows 4x more transactions than the basic classic lock - a massive increase! There is no question that HLE greatly improves transactional performance."
"HLE allows the rate of modify-only transactions increase by a massive 5x - blowing both classic and even R/W locks out of the water! Applications that use many threads and locks will see a huge increase in performance when changing their locking to HLE."

But even if we ignore that, absence of proof (for now) is not proof of absence! If you want to claim that the i7-4770K will not suffer from the lack of TSX, you'll have to prove that just as much as I have to provide additional proof that it will. There's a subtle but important difference between telling me I'm not right (because it's not conclusively proven yet, which falls on me), and telling me I'm wrong (for which the burden would be on you).

There is no amount of forum meta-debate that is going to settle this... we need more numbers, and that is going to take some time.
I will say this: I had not predicted that you would assume the voice of a critic... I had you pretty much pegged as an Intel die-hard. Nice to be mistaken from time to time.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I would love that, but 1. I don't have a Haswell part and 2. I'm swamped at work atm; all my spare time is being consumed by a black hole of work crap (Java). But looking forward, Encog (http://www.heatonresearch.com/encog) has a C/C++ port that I'd like to mess with (if Jeff doesn't beat me to it).
What are you specifically coding that might benefit from TSX/AVXn?
This is where it really gets interesting, getting the new ISAs down and dirty... numbers, numbers, numbers :)

I'll have a Haswell system in 2-3 days but no time for a well-designed microbenchmark; now that the optimization guide is out it's also less interesting to do one

at the very least I'll post raw numbers I get from my code base (a 3D renderer which uses gather for texture mapping, among other things), with/without hardware gather. I expect regressions on the production CPU for the hardware-gather path
 

cytg111

Lifer
Mar 17, 2008
23,174
12,835
136
I'll have a Haswell system in 2-3 days but no time for a well-designed microbenchmark; now that the optimization guide is out it's also less interesting to do one

at the very least I'll post raw numbers I get from my code base (a 3D renderer which uses gather for texture mapping, among other things), with/without hardware gather. I expect regressions on the production CPU for the hardware-gather path

Nice. As you can see there is some debate as to whether games/engines will/can benefit from AVXn and TSX; what's your take on that?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Nice. As you can see there is some debate as to whether games/engines will/can benefit from AVXn and TSX; what's your take on that?

they can clearly benefit from AVX and AVX2 (incl. FMA) IMO, since I'm seeing good speedups; again I'll wait for a production CPU to post numbers. I can say that the AVX vs. SSE2 speedup is up on Haswell vs. Ivy, and that there is an extra speedup on top of that from AVX to AVX2

as for TSX, I won't comment since I have zero experience with it so far, but BenchPress has a very good point: it makes it less appealing to learn and adopt if your top code path can't run on the top CPU
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Sure, AVX2 is a major feature on its own but when you don't have efficient multi-threading it can't be put to good use.
I thought the popular rhetoric was that AVX2 enables all loops with independent iterations to be vectorized now, versus usually not vectorized before, and that most software spends a large amount of time in such loops. What about that would be contingent on multi-threading, or TSX in particular?
Applications which benefit from AVX2 have parallel workloads, so they typically also benefit from multi-threading. But when you use fine-grained tasks to evenly distribute the workload among the cores, the overhead of the locks can become significant. I've seen cases where each thread spends up to 30% of its time trying to get something to work on (before another thread steals it), acquiring exclusive access to resources, and making sure everything is processed/delivered in the right order. The alternative is more coarse-grained locking, but then the tasks don't get evenly distributed and some cores sit idle until another thread finishes (with smaller tasks they could already have started working on the portions that thread had finished).

So the holy grail is fine-grained locking with low overhead per lock operation. That is precisely what TSX offers. Without TSX a lot of cycles go to waste, and hence fewer cycles are available for AVX2 to make a difference. So while they are in theory orthogonal features, in practice they can go hand-in-hand since they're both about parallelism.
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
Nice. As you can see there is some debate as to if games/engines will/can benefit from AVXn and TSX, whats your take on that?

'Can' benefit and 'will' benefit are two very different things.

Can they? I say yes, no question.

Will they? Not for a while, anyway.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Any multi-threaded application benefits from TSX.
Not correct. Here you go -
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell
If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization...Intel TSX targets a certain class of shared-memory multi-threaded applications; specifically multi-threaded applications that actively share data. Intel TSX is about allowing programs to achieve fine-grain lock performance without requiring the complexity of reasoning about fine-grain locking.
Your arguments are invalid because your understanding of what TSX is is incorrect. TSX only applies in the above case. It's more a programmer's tool than a performance tool.
We've had this discussion before. That's a very rough and very vague explanation of what TSX is for. My understanding of TSX is a lot more in-depth than that. It can be used as a fundamental building block for any kind of synchronization between threads. But that explanation from Intel isn't wrong: practically every multi-threaded application is a shared-memory one which actively shares data.

There's also nothing bad about it being a programmer's tool. Equally sharp tools already existed before TSX, but they were incredibly hard to use (easily resulting in hard to find race conditions). TSX is much easier to use and thus will result in more software to actually become optimized with efficient fine-grained locks.
Do you have any proof the above happens in a graphics driver?
Yes, look for 'mutex' in the Mesa source code. And that's probably very straightforward compared to NVIDIA and AMD's drivers.