When will Intel bring four threads per core to the Core architecture?

cbn

Lifer
Mar 27, 2009
12,968
221
106
The Xeon Phi accelerators have four threads per Atom core, so I was wondering when Intel will bring four threads per core to the big cores. (I imagine having more threads per core could allow Intel to widen the uarch without losing MT performance per watt.)

P.S. I also think it would be ideal if a single big core with four-thread SMT could replace quad-core Atom and 2C/2T big core as the lowest common denominator consumer processor. This would also allow all the major instruction sets (e.g., Advanced Vector Extensions) to be enabled on the 1C/4T big-core processor without interfering with product segmentation.
 

NTMBK

Lifer
Nov 14, 2011
10,411
5,677
136
They won't. Phi is throughput oriented (like a GPU), while the Core architecture is latency oriented with a few throughput tricks. They have fundamentally different goals.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
They won't. Phi is throughput oriented (like a GPU), while the Core architecture is latency oriented with a few throughput tricks. They have fundamentally different goals.

I'm not exactly sure what you mean by latency.

Are you referring to the relative processing delay of running four threads through one Airmont Atom core (Knights Landing) vs. running four threads through four Airmont cores (Braswell, etc.)?

If that is what you mean, then yes, it does make sense to me that four threads through such a narrow core are going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going through a narrow Airmont core?
 

NTMBK

Lifer
Nov 14, 2011
10,411
5,677
136
I'm not exactly sure what you mean by latency.

Are you referring to the relative processing delay of running four threads through one Airmont Atom core (Knights Landing) vs. running four threads through four Airmont cores (Braswell, etc.)?

If that is what you mean, then yes, it does make sense to me that four threads through such a narrow core are going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going through a narrow Airmont core?

Adding the extra circuitry to support four threads without increasing single-threaded latency will burn a lot of transistors and power, transistors and power that could instead have been used to reduce single-threaded latency even further.

It would be similar to AMD's Bulldozer gamble: they used the extra transistors and power to beef up multithreaded performance in a core by adding a second integer cluster, instead of improving their single-threaded performance. And we all know the results from that.
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,044
3,831
136
I'm not exactly sure what you mean by latency.

Are you referring to the relative processing delay of running four threads through one Airmont Atom core (Knights Landing) vs. running four threads through four Airmont cores (Braswell, etc.)?

If that is what you mean, then yes, it does make sense to me that four threads through such a narrow core are going to delay things.

But with a wide uarch (even wider than Haswell) developed for four threads, is latency really going to be anywhere near that of four threads going through a narrow Airmont core?

No, you have what he means wrong.

Knights Landing has 4 threads because it's going to be working on massively wide, very high throughput workloads where cache won't be able to cover the amount of data needed by a single thread. The 4 threads are there so cache misses don't stall the core. Just like GCN, for example: memory access is slow, execution is fast.

Now Core arch is all about having the data in the cache so you don't stall: lots of transistors for fetch, decode, predictors, and the L/S system. If you share that among 4 threads vs. say 1, each thread now has 1/4 of all those resources. You can't just magically make all these things wider either, or you pay in other metrics like power or max clocks.
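
A rough way to see why four threads help a miss-heavy core but add little to a cache-friendly one is the idealized multithreading model below. It is a minimal sketch, not a description of how Knights Landing or Core actually behave: the function name and the cycle counts are made-up assumptions, and the model ignores switch overhead and the cache contention that extra threads introduce.

```python
def core_utilization(threads: int, run_cycles: float, miss_latency: float) -> float:
    """Fraction of cycles the core is busy in an idealized multithreading model.

    Each thread computes for run_cycles, then waits miss_latency cycles on
    memory. Other threads can fill that wait, so utilization grows linearly
    with the thread count until the core saturates at 1.0.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return min(1.0, threads * run_cycles / (run_cycles + miss_latency))


# Hypothetical cycle counts, chosen only for illustration.
cases = [
    ("throughput-style (misses often)", 50, 150),
    ("latency-style (mostly cache hits)", 950, 50),
]
for label, run, miss in cases:
    for t in (1, 2, 4):
        print(f"{label}: {t} thread(s) -> {core_utilization(t, run, miss):.2f}")
```

With these assumed inputs the miss-heavy case climbs from 0.25 to 1.00 going from one thread to four, while the cache-friendly case is already at 0.95 with a single thread, so the extra threads have almost nothing left to hide.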
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
It would be similar to AMD's Bulldozer gamble: they used the extra transistors and power to beef up multithreaded performance in a core by adding a second integer cluster, instead of improving their single-threaded performance. And we all know the results from that.

I don't think this is the same as Bulldozer CMT because we are talking about an even more powerful core (using additional SMT to help gain back efficiency lost due to the increased core width), not one core split up into two weaker cores.

05.jpg
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Now Core arch is all about having the data in the cache so you don't stall: lots of transistors for fetch, decode, predictors, and the L/S system. If you share that among 4 threads vs. say 1, each thread now has 1/4 of all those resources. You can't just magically make all these things wider either, or you pay in other metrics like power or max clocks.

I see your point about making things wider affecting power and max clocks (see Pollack's rule above).

However, Core is already at 2 threads per core and I don't think Intel can push clocks up any further without inducing other types of efficiency penalties.

As far as the xtor budget goes for each core, yes, that would increase... but then adding additional cores also increases the xtor budget.

Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?
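
One hedged way to frame that trade-off is Pollack's rule, the rule of thumb that single-thread performance grows roughly with the square root of core area (or complexity). The sketch below splits the same area budget two ways. The area units and function names are assumptions for illustration, and the throughput figure assumes a perfectly parallel workload with no SMT gain at all, so it only shows the gap that extra SMT threads would have to close.

```python
import math


def single_thread_perf(core_area: float) -> float:
    """Pollack's rule of thumb: performance scales with sqrt(core area)."""
    return math.sqrt(core_area)


def ideal_throughput(cores: int, core_area: float) -> float:
    """Aggregate throughput for a perfectly parallel workload, ignoring SMT."""
    return cores * single_thread_perf(core_area)


# Same total area budget (8 hypothetical units), split two ways.
small = ideal_throughput(cores=8, core_area=1.0)
big = ideal_throughput(cores=4, core_area=2.0)

print(f"8 small cores: single-thread {single_thread_perf(1.0):.2f}, throughput {small:.2f}")
print(f"4 big cores:   single-thread {single_thread_perf(2.0):.2f}, throughput {big:.2f}")
print(f"Extra MT uplift the big cores need to break even: {small / big:.2f}x")
```

Under these assumptions the four big cores win on single-thread performance by about 1.41x but need roughly the same factor of additional multithreaded uplift, on top of whatever the smaller cores already get from their own SMT, just to match the eight-core throughput.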
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,044
3,831
136
You're ignoring things like bypass and forwarding networks, things that scale as n*(n-1)/2. The circuits used to move data start exploding. Have a look at the way the really high-performance core that is doing SMT4 looks...
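
To put numbers on the n*(n-1)/2 growth mentioned above, here is a minimal sketch that counts the distinct pairs among n endpoints. Treating every pair of units as needing its own path is a simplifying assumption (real bypass networks are not necessarily fully connected), and the function name is hypothetical.

```python
def pairwise_paths(units: int) -> int:
    """Number of distinct pairs among `units` endpoints: n*(n-1)/2."""
    if units < 0:
        raise ValueError("units must be non-negative")
    return units * (units - 1) // 2


# Doubling the number of units roughly quadruples the pair count.
for n in (4, 8, 16):
    print(f"{n} units -> {pairwise_paths(n)} pairwise paths")
```

Going from 4 to 8 to 16 units takes the count from 6 to 28 to 120, which is the quadratic blow-up the post is pointing at.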
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
I don't know about 4 threads, but 3 makes a lot of sense. The Core M core is wide enough to support 3 threads already with its 7 execution units. They could push that up to 9 and then add a second HT thread. That would give us a 2C6T die which would only be fractionally larger than a 2C4T die. Mobile could force this if 4 threads cease to be sufficient, whereas a 4C8T die would definitely be overkill. It probably depends on how thread-hungry mobile workloads become over the next couple of years.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?

4 cores x 4T will probably consume fewer transistors than 8 cores x 2T (no comment on the frequency of the two). However, I suspect a large set of programs will either not benefit in performance or hit a lower performance/watt. We already have a set of programs that complete a workload faster in 1T vs. 2T; imagine that list growing when you have 1T/2T vs. 4T.

I don't know about 4 threads, but 3 makes a lot of sense. The Core M core is wide enough to support 3 threads already with its 7 execution units. They could push that up to 9 and then add a second HT thread. That would give us a 2C6T die which would only be fractionally larger than a 2C4T die. Mobile could force this if 4 threads cease to be sufficient, whereas a 4C8T die would definitely be overkill. It probably depends on how thread-hungry mobile workloads become over the next couple of years.

Engineers dislike non-powers of 2. :)
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
Adding the extra circuitry to support four threads without increasing single-threaded latency will burn a lot of transistors and power, transistors and power that could instead have been used to reduce single-threaded latency even further.

Since we seem to be reaching the limits of improving single-thread performance, I don't see the problem.

A better argument is that the transistors could be used for graphics.
 

PhIlLy ChEeSe

Senior member
Apr 1, 2013
962
0
0
NEVER!
Until a new CPU competitor comes along, you won't see much new coming; they are in business to make money. They're going to sop up all the market first. Why produce something when no one is challenging them to begin with?
 

DrMrLordX

Lifer
Apr 27, 2000
22,695
12,642
136
4 cores x 4T will probably consume fewer transistors than 8 cores x 2T (no comment on the frequency of the two). However, I suspect a large set of programs will either not benefit in performance or hit a lower performance/watt. We already have a set of programs that complete a workload faster in 1T vs. 2T; imagine that list growing when you have 1T/2T vs. 4T.

It might make more sense for Intel to explore 4-way SMT for chips at the level of their current i3s. Why sell 2C/4T chips when you could do 1C/4T with a smaller die size? The modern software climate has already demonstrated that it will throw at least four "high priority" threads at the CPU on a fairly regular basis. There's no reason for them to go crazy and start putting 4C/16T chips onto consumer sockets.

Then they could do the classic Intel product segmentation, and move their Celeron and Pentium lines onto 1C/2T chips.

NEVER!
Until a new CPU competitor comes along, you won't see much new coming; they are in business to make money. They're going to sop up all the market first. Why produce something when no one is challenging them to begin with?

See above, I think Intel might be willing to do it if it helped them to sell fewer transistors at the same price.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
It might make more sense for Intel to explore 4-way SMT for chips at the level of their current i3s. Why sell 2C/4T chips when you could do 1C/4T with a smaller die size? The modern software climate has already demonstrated that it will throw at least four "high priority" threads at the CPU on a fairly regular basis. There's no reason for them to go crazy and start putting 4C/16T chips onto consumer sockets.

Then they could do the classic Intel product segmentation, and move their Celeron and Pentium lines onto 1C/2T chips.

See above, I think Intel might be willing to do it if it helped them to sell fewer transistors at the same price.

For the same price, you had better have a performance improvement, or a power reduction at close enough to iso-performance. I'm thinking that a large set of traces will have a big performance regression moving from 2C4T to 1C4T.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
We already have a set of programs that complete a workload faster in 1T vs. 2T; imagine that list growing when you have 1T/2T vs. 4T.

Can you give me an example of that?

Originally I was thinking you meant something along the lines of a 3.6 GHz Haswell 4C/4T being faster than a 3.6 GHz Haswell 2C/4T, but I am wondering if you mean something different?

P.S. The way I see things, a 4C/16T (with bigger cores) would have better single-thread performance than an 8C/16T... but I don't know where the MT performance would land.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Four wider cores with 4 thread SMT vs eight cores with 2 thread SMT? What is a better deal for the xtor budget?

You can see what Intel is thinking in their MorphCore presentation.

You can get better multi-thread performance with SMT-4, but there's a sacrifice in area and single-thread performance.

If they do it, they are probably better off going SMT-8 with MorphCore.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Can you give me an example of that?

Originally I was thinking you meant something along the lines of a 3.6 GHz Haswell 4C/4T being faster than a 3.6 GHz Haswell 2C/4T, but I am wondering if you mean something different?

P.S. The way I see things, a 4C/16T (with bigger cores) would have better single-thread performance than an 8C/16T... but I don't know where the MT performance would land.

Linpack is probably the first thing that comes to mind. I'm predicting that 4c/4t outperforms 4c/8t. And it will demolish 2c/4t.
 

DrMrLordX

Lifer
Apr 27, 2000
22,695
12,642
136
Right, Linpack is just one example of a workload that really does not benefit from HT at all. I'm pretty sure it's been documented that turning off HT actually improves Linpack performance, and it's still like that on Haswell.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Right, Linpack is just one example of a workload that really does not benefit from HT at all. I'm pretty sure it's been documented that turning off HT actually improves Linpack performance, and it's still like that on Haswell.

The throughput in Linpack is higher with HT on if you avoid old CPUs.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
The throughput in Linpack is higher with HT on if you avoid old CPUs.

That scenario is pretty rare, if not almost nonexistent, for the vast majority of users. One article I was reading stated that enabling Hyper-Threading on the Pentium 4-based architectures was bad, but on Nehalem-based chips it ranged from negligible (-1%) to a great gain (20%+).

Linpack is practically a synthetic benchmark to lots of people anyway. Who knows of another benchmark that gets so close to its theoretical FLOPS and performance?

The reality is that CPUs are a lot less utilized, and that is where 2 threads come into play. Of course, adding 2 more diminishes the gain greatly.
 

Lorne

Senior member
Feb 5, 2001
873
1
76
It's not that Linpack suffers from it. It uses averaged readings, i.e. 4C at 100% = 100%, and 4C with HT (8 threads running at 50%, depending on how you look at it) = 50%, even though the same work is done.
Don't forget that the more working threads are open, the more they need to be fed, so if there is not enough memory bandwidth to dedicate you will choke a core, which will also give a bad result in Linpack.

Suggestion: dynamic Hyper-Threading. Either have it adjustable in BIOS or in software by percentage and max thread count, depending on how the system is used.
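
The averaging arithmetic described above can be sketched as follows. This only illustrates the claim as stated in the post, under the assumption that the same work is spread evenly over twice as many logical CPUs; it is not how Linpack or any particular monitoring tool computes its figures, and the function name is hypothetical.

```python
def average_utilization(per_cpu_load: list[float]) -> float:
    """Mean load across logical CPUs, in percent."""
    if not per_cpu_load:
        raise ValueError("need at least one logical CPU")
    return sum(per_cpu_load) / len(per_cpu_load)


ht_off = [100.0] * 4  # 4 cores, each reported as fully busy
ht_on = [50.0] * 8    # assumed: same work, split evenly over 8 logical CPUs

print(f"HT off: {average_utilization(ht_off):.0f}% average")
print(f"HT on:  {average_utilization(ht_on):.0f}% average")
print(f"Same summed load: {sum(ht_off) == sum(ht_on)}")
```

The summed load is identical in both cases, yet the averaged reading drops from 100% to 50%, which is the effect the post attributes to the way the readings are averaged.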