When will Intel bring four threads per core to the Core architecture?

cbn · Apr 15, 2015

The Xeon Phi accelerators have four threads per Atom core, so I was wondering when Intel will bring four threads per core to the big cores. (I have to imagine that having more threads per core could allow Intel to widen the uarch without losing MT performance per watt.)

P.S. I also think it would be ideal if a single big core with four-thread SMT could replace quad-core Atom and 2C/2T big core as the lowest common denominator consumer processor. This would also allow all the major instruction sets (e.g., advanced vector extensions) to be enabled on the 1C/4T big core processor without interfering with product segmentation.

NTMBK · Apr 16, 2015

They won't. Phi is throughput oriented (like a GPU), while the Core architecture is latency oriented with a few throughput tricks. They have fundamentally different goals.

ShintaiDK · Apr 16, 2015

NTMBK said:
They won't. Phi is throughput oriented (like a GPU), while the Core architecture is latency oriented with a few throughput tricks. They have fundamentally different goals.

^^ This.

cbn · Apr 16, 2015

NTMBK said:
They won't. Phi is throughput oriented (like a GPU), while the Core architecture is latency oriented with a few throughput tricks. They have fundamentally different goals.

I'm not exactly sure what you mean by latency?

Are you referring to the relative processing delay of running four threads thru one Airmont Atom core (Knights Landing) vs running four threads thru four Airmont cores (Braswell, etc)?

If that is what you mean, then yes, it does make sense to me that four threads thru such a narrow core is going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going thru a narrow Airmont core?

NTMBK · Apr 16, 2015

cbn said:
I'm not exactly sure what you mean by latency?

Are you referring to the relative processing delay of running four threads thru one Airmont Atom core (Knights Landing) vs running four threads thru four Airmont cores (Braswell, etc)?

If that is what you mean, then yes, it does make sense to me that four threads thru such a narrow core is going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going thru a narrow Airmont core?

Adding the extra circuitry to support four threads without increasing single threaded latency will burn a lot of transistors and power, transistors and power which could instead have been used to reduce single threaded latency even further.

It would be similar to AMD's Bulldozer gamble: they used the extra transistors and power to beef up multithreaded performance in a core by adding a second integer cluster, instead of improving their single threaded performance. And we all know the results from that.

itsmydamnation · Apr 16, 2015

cbn said:
I'm not exactly sure what you mean by latency?

Are you referring to the relative processing delay of running four threads thru one Airmont Atom core (Knights Landing) vs running four threads thru four Airmont cores (Braswell, etc)?

If that is what you mean, then yes, it does make sense to me that four threads thru such a narrow core is going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going thru a narrow Airmont core?

No, you have what he means wrong.

Knights Landing has 4 threads because it's going to be working on massively wide, very high throughput workloads where cache won't be able to cover the amount of data needed by a single thread. The 4 threads are there so cache misses don't stall the core. Just like GCN, for example: memory access is slow, execution is fast.

Now the Core arch is all about having the data in the cache so you don't stall: lots of transistors for fetch, decode, predictors, and the L/S system. If you share that between 4 threads vs say 1, you now have 1/4th of all those resources per thread. You can't just magically make all these things wider either, or you pay in other metrics like power or max clocks.
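To put rough numbers on the latency-hiding argument, here is a minimal sketch of a textbook-style multithreading model. It assumes each thread alternates between a burst of compute and a cache-miss stall, and that the core can switch to another ready thread for free. The function name and all cycle counts are hypothetical, picked only for illustration, and are not measurements of any real chip.

```python
def core_utilization(threads: int, compute_cycles: float, miss_latency: float) -> float:
    """Fraction of cycles the core does useful work in a simple stall-hiding model.

    Assumption: each thread computes for `compute_cycles`, then waits
    `miss_latency` cycles on a cache miss, and other threads can fill the gap.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    busy_fraction_per_thread = compute_cycles / (compute_cycles + miss_latency)
    # The core cannot be more than 100% busy, however many threads it has.
    return min(1.0, threads * busy_fraction_per_thread)


if __name__ == "__main__":
    # Hypothetical throughput-style workload: misses often, stalls a long time.
    for n in (1, 2, 4):
        print("miss-heavy", n, core_utilization(n, compute_cycles=50, miss_latency=150))
    # Hypothetical cache-friendly workload: rarely stalls.
    for n in (1, 2, 4):
        print("cache-friendly", n, core_utilization(n, compute_cycles=950, miss_latency=50))
```

In this toy model the miss-heavy workload keeps gaining all the way to four threads, while the cache-friendly one is already nearly saturated with one thread, so extra threads mostly just split the core's resources.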

cbn · Apr 16, 2015

NTMBK said:
It would be similar to AMD's Bulldozer gamble: they used the extra transistors and power to beef up multithreaded performance in a core by adding a second integer cluster, instead of improving their single threaded performance. And we all know the results from that.

I don't think this is the same as Bulldozer CMT because we are talking about an even more powerful core (using additional SMT to help gain back efficiency lost due to the increased core width), not one core split up into two weaker cores.

cbn · Apr 16, 2015

itsmydamnation said:
Now the Core arch is all about having the data in the cache so you don't stall: lots of transistors for fetch, decode, predictors, and the L/S system. If you share that between 4 threads vs say 1, you now have 1/4th of all those resources per thread. You can't just magically make all these things wider either, or you pay in other metrics like power or max clocks.

I see your point about making things wider affecting power and max clocks (see Pollack's rule above).

However, Core is already at 2 threads per core and I don't think Intel can push clocks up any further without inducing other types of efficiency penalties.

As far as xtor budget goes for each core, yes that would increase.....but then adding additional cores also increases xtor budget.

Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?
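A back-of-the-envelope way to frame that question is Pollack's rule, the rule of thumb that single-thread performance grows roughly with the square root of core area or complexity. The sketch below applies it to a fixed area budget. The budget units and the function name are made up for illustration, and the model deliberately ignores SMT gains, clocks, power and caches, so treat it as a starting point and not an answer.

```python
import math
from typing import Tuple


def pollack_estimate(total_area: float, cores: int) -> Tuple[float, float]:
    """Return (single_thread_perf, aggregate_perf) under Pollack's rule.

    Assumption: per-core performance ~ sqrt(area per core), in arbitrary units.
    SMT, frequency and memory effects are not modeled.
    """
    if cores < 1 or total_area <= 0:
        raise ValueError("need at least one core and a positive area budget")
    single = math.sqrt(total_area / cores)
    return single, cores * single


if __name__ == "__main__":
    budget = 8.0  # hypothetical area budget, arbitrary units
    for cores in (8, 4):
        single, aggregate = pollack_estimate(budget, cores)
        print(cores, "cores: single-thread", round(single, 2), "aggregate", round(aggregate, 2))
```

Under these assumptions the four bigger cores win on single-thread performance and lose on aggregate throughput before SMT is counted, so the real question is how much of that gap four-way SMT could win back.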

itsmydamnation · Apr 16, 2015

You're ignoring things like bypass and forwarding networks, things that grow as n*(n-1)/2. The circuits used to move data start exploding. Have a look at the way the really high performance core that is doing SMT4 looks......
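As a quick illustration of how fast n*(n-1)/2 grows, here is a small sketch that counts the distinct pairs among n endpoints. Treating every pair of execution units as needing its own forwarding path is a simplifying assumption (real bypass networks are organized differently), and the unit counts are hypothetical.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n endpoints, i.e. n*(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical execution-unit counts; the link count grows quadratically.
    for units in (4, 8, 12, 16):
        print(units, "units ->", pairwise_links(units), "pairs")
```

Doubling the number of units from 8 to 16 more than quadruples the pair count (28 to 120), which is the kind of growth being pointed at here.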

sm625 · Apr 16, 2015

I don't know about 4 threads but 3 makes a lot of sense. The Core M core is wide enough to support 3 threads already with its 7 execution units. They could push that up to 9 and then add a second HT thread. That would give us a 2C6T die which would only be fractionally larger than a 2C4T die. Mobile could force this if 4 threads cease to be sufficient, whereas a 4C8T die would definitely be overkill. It probably depends on how thread-hungry mobile workloads become over the next couple years.

TuxDave · Apr 16, 2015

cbn said:
Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?

4 cores x 4T will probably consume fewer transistors than 8 cores x 2T (no comment on frequency of the two). However, I suspect a large set of programs will either not benefit in performance or hit a lower performance/watt. We already have a set of programs that complete a workload faster in 1T vs 2T; imagine how that list grows when you have 1T/2T vs 4T.

sm625 said:
I don't know about 4 threads but 3 makes a lot of sense. The Core M core is wide enough to support 3 threads already with its 7 execution units. They could push that up to 9 and then add a second HT thread. That would give us a 2C6T die which would only be fractionally larger than a 2C4T die. Mobile could force this if 4 threads cease to be sufficient, whereas a 4C8T die would definitely be overkill. It probably depends on how thread-hungry mobile workloads become over the next couple years.

Engineers dislike non-powers of 2.

tynopik · Apr 16, 2015

NTMBK said:
Adding the extra circuitry to support four threads without increasing single threaded latency will burn a lot of transistors and power, transistors and power which could instead have been used to reduce single threaded latency even further.

since we seem to be reaching the limits of improving single-thread performance, I don't see the problem

a better argument is that the transistors could be used for graphics

PhIlLy ChEeSe · Apr 16, 2015

NEVER!
Till a new CPU competitor comes along you won't see much new coming; they are in business to make money. They're gonna sop up all the market first. Why produce something when no one is challenging them to begin with?

DrMrLordX · Apr 16, 2015

TuxDave said:
4 cores x 4T will probably consume fewer transistors than 8 cores x 2T (no comment on frequency of the two). However, I suspect a large set of programs will either not benefit in performance or hit a lower performance/watt. We already have a set of programs that complete a workload faster in 1T vs 2T; imagine how that list grows when you have 1T/2T vs 4T.

It might make more sense for Intel to explore 4-way SMT for chips at the level of their current i3s. Why sell 2C/4T chips when you could do 1C/4T with a smaller die size? The modern software climate has already demonstrated that it will throw at least four "high priority" threads at the CPU on a fairly regular basis. There's no reason for them to go crazy and start putting 4C/16T chips onto consumer sockets.

Then they could do the classic Intel product segmentation, and move their Celeron and Pentium lines onto 1C/2T chips.

PhIlLy ChEeSe said:
NEVER!
Till a new CPU competitor comes along you won't see much new coming; they are in business to make money. They're gonna sop up all the market first. Why produce something when no one is challenging them to begin with?

See above, I think Intel might be willing to do it if it helped them to sell fewer transistors at the same price.

TuxDave · Apr 16, 2015

DrMrLordX said:
It might make more sense for Intel to explore 4-way SMT for chips at the level of their current i3s. Why sell 2C/4T chips when you could do 1C/4T with a smaller die size? The modern software climate has already demonstrated that it will throw at least four "high priority" threads at the CPU on a fairly regular basis. There's no reason for them to go crazy and start putting 4C/16T chips onto consumer sockets.

Then they could do the classic Intel product segmentation, and move their Celeron and Pentium lines onto 1C/2T chips.

See above, I think Intel might be willing to do it if it helped them to sell fewer transistors at the same price.

For the same price you'd better have a performance improvement, or a power reduction at close enough to iso-performance. I'm thinking that a large set of traces will have a big performance regression moving from 2C4T to 1C4T.

cbn · Apr 17, 2015

TuxDave said:
We already have a set of programs that complete a workload faster in 1T vs 2T; imagine how that list grows when you have 1T/2T vs 4T.

Can you give me an example of that?

Originally I was thinking you meant something along the lines of a 3.6 GHz Haswell 4C/4T being faster than a 3.6 GHz Haswell 2C/4T, but I am wondering if you mean something different?

P.S. The way I see things, a 4C/16T (with bigger cores) would have better single thread than an 8C/16T.....but I don't know where the MT would lie?

IntelUser2000 · Apr 17, 2015

cbn said:
Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?

You can see what Intel is thinking in their MorphCore presentation.

You can get better multi-thread performance with SMT-4, but there's a sacrifice in area and single-thread performance.

If they do it, they are probably better off going SMT-8 with MorphCore.

TuxDave · Apr 17, 2015

cbn said:
Can you give me an example of that?

Originally I was thinking you meant something along the lines of a 3.6 GHz Haswell 4C/4T being faster than a 3.6 GHz Haswell 2C/4T, but I am wondering if you mean something different?

P.S. The way I see things, a 4C/16T (with bigger cores) would have better single thread than an 8C/16T.....but I don't know where the MT would lie?

Linpack is probably the first thing that comes to mind. I'm predicting that 4c/4t outperforms 4c/8t. And it will demolish 2c/4t.

DrMrLordX · Apr 17, 2015

Right, Linpack is just one example of a workload that really does not benefit from HT at all. I'm pretty sure it's been documented that turning off HT actually improves Linpack performance, and it's still like that on Haswell.

ShintaiDK · Apr 17, 2015

DrMrLordX said:
Right, Linpack is just one example of a workload that really does not benefit from HT at all. I'm pretty sure it's been documented that turning off HT actually improves Linpack performance, and it's still like that on Haswell.

The throughput in Linpack is higher with HT on if you avoid old CPUs.

TuxDave · Apr 17, 2015

ShintaiDK said:
The throughput in Linpack is higher with HT on if you avoid old CPUs.

Haswell isn't THAT old.

IntelUser2000 · Apr 19, 2015

ShintaiDK said:
The throughput in Linpack is higher with HT on if you avoid old CPUs.

That scenario is rare, if not almost nonexistent, for the vast majority of users. One article I was reading stated how enabling Hyper-Threading on the Pentium 4-based architectures was so bad, but on Nehalem-based chips it ranged from negligible (-1%) to a great gain (20%+).

Linpack is practically a synthetic benchmark to lots of people anyway. Who knows of another benchmark that gets so close to its theoretical FLOPS?

The reality is that CPUs are usually a lot less utilized, and that is where 2 threads come into play. Of course, adding 2 more diminishes the gain greatly.

Lorne · Apr 20, 2015

It's not that Linpack suffers from it. It uses averaged readings, i.e. 4C at 100% = 100%, and 4C with HT (8 threads running at 50%, depending on how you look at it) = 50%, even though the same work is done.
Don't forget that the more working threads are open, the more they need to be fed. So if there is not enough memory bandwidth to dedicate, you will choke a core, which will also give a bad result in Linpack.

Suggestion: dynamic Hyper-Threading. Either have it adjustable in BIOS or in software, by percentage and max thread count, depending on how the system is used.
