AMD summit today; Kaveri cuts out the middle man in Trinity.


bronxzv

Senior member
Jun 13, 2011
460
0
71
Knights Corner has 4-way SMT so it can simply switch to another thread on a branch miss.

how can you do that? it's typically too late by the time you know it was a miss, and you must flush (part of) the pipeline, so do you mean it switches threads on all branches? in other words, do MIC cores block speculative execution when several threads are active on them? do you have a source for this?

according to [1], branch speculation is still very effective on SMT processors; see for example section 3.1, page 8, with speedups ranging from 9% to 32% for speculative vs. non-speculative branch execution

[1]: An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors http://www.cs.washington.edu/research/smt/papers/smtspeculation.pdf
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I said "more than 2 cycles", what is your estimate?
1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions and by its uop decomposition in IACA. And it seems pretty obvious what each of those uops does, by comparing their functionality against VMOVMSK and VBLEND.
If we don't agree on a methodology to assess the throughput of current instructions on actual hardware, I'm afraid we won't be able to assess the speed of gather instructions when we can finally test them, so I suppose the next step will be to define a common test framework that all peers can compile and run
Reciprocal throughput is a well-defined quantity: it's how many cycles it takes to issue the entire instruction. But that doesn't mean you can just add the numbers to determine the total cycle count of a sequence of instructions, since the instructions could be using different execution ports. We have to look at the utilization of each port individually, and take critical-path dependencies between instructions into account, to determine the peak reciprocal throughput of a sequence of instructions.

But I've also already shown that emulating a gather operation on current architectures occupies ports 0 and 5 for 8 cycles each, and port 1 for 6. So there's practically no opportunity left for doing anything in parallel. With the proposed implementation, gather on Haswell only takes one cycle each on ports 0 and 5, and port 1 could be used to set the mask register. So the reciprocal throughput is just one cycle (when all data elements are within a single cache line), and there would be up to an eightfold increase in performance for gather operations.
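For concreteness, the emulation being compared against looks roughly like this in AVX intrinsics (a minimal sketch: gather_emulated is a hypothetical helper name, and the exact uop count and port mix depend on how the compiler lowers the element insertion):

```c
#include <immintrin.h>

/* Sketch of emulating VGATHERDPS on pre-Haswell AVX hardware:
   result[i] = base[idx[i]] for 8 single-precision elements.
   Spilling the indices and reassembling the result lowers to a
   chain of scalar loads plus insert/shuffle uops, which is where
   the multi-cycle occupation of the vector ports comes from. */
static inline __m256 gather_emulated(const float *base, __m256i vidx)
{
    int idx[8] __attribute__((aligned(32)));    /* GCC-style alignment */
    _mm256_store_si256((__m256i *)idx, vidx);   /* spill the indices   */
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}
```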

how can you do that? it's typically too late by the time you know it was a miss, and you must flush (part of) the pipeline, so do you mean it switches threads on all branches? in other words, do MIC cores block speculative execution when several threads are active on them? do you have a source for this?
You're right, it can't wait for the branch result to switch threads. Instead it switches threads every cycle: Programming for the Intel Many Integrated Core Architecture (slide 10).
[1]: An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors http://www.cs.washington.edu/research/smt/papers/smtspeculation.pdf
Interesting paper, thanks!
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
1 cycle. And that's not an estimate but a fact

a theoretical fact, maybe; in practice it's clearly 1.5 clocks higher reciprocal throughput for a full loop in all the examples I have tested. masked stores are also fast in theory but slower in practice than VBLENDVPS + VMOVAPS; that's easy to test and a well-known optimization for Sandy Bridge. the rule of thumb is to use masked moves only to avoid protection violations
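for illustration, that workaround is simply this (a minimal sketch, the helper name is mine; it assumes every destination lane is writable, which is exactly why masked moves remain necessary near protection boundaries):

```c
#include <immintrin.h>

/* sketch: masked store replaced by VBLENDVPS + VMOVAPS.
   only legal when all 8 destination floats are writable,
   since the untouched lanes get re-stored as-is. */
static inline void masked_store_blend(float *dst, __m256 mask, __m256 val)
{
    __m256 old = _mm256_load_ps(dst);                       /* VMOVAPS load              */
    _mm256_store_ps(dst, _mm256_blendv_ps(old, val, mask)); /* VBLENDVPS + VMOVAPS store */
}
```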

So there would be up to an eightfold increase in performance for gather operations.

OK, so, after all, your prediction is an 8x best-case speedup and mine is 4x; we now have one year left to agree on a test methodology

You're right, it can't wait for the branch result to switch threads. Instead it switches threads every cycle: Programming for the Intel Many Integrated Core Architecture (slide 10).
I'm quite sure the P5 pipeline is deep enough to start at least one unnecessary gather; though there is certainly an early out for the cases where the mask is 0x00
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
OK, so, after all, your prediction is an 8x best-case speedup and mine is 4x...
Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?

no, I'm afraid I couldn't; like you, I'm just a software guy, and, like you, I have no insider information

as already explained, my estimate is based on the actual behavior of VMASKMOV, the instruction most similar to VGATHERDx that we can test today. just test it in a gather-like scenario where you compute the mask right before the load; let's practice a bit after all these walls of theoretical text!
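something along these lines would do (a rough sketch, not a calibrated benchmark: the loop sizes are arbitrary, the helper names are mine, and a real measurement needs warm-up and care with turbo/frequency scaling; compile with -mavx):

```c
#include <immintrin.h>
#include <stdio.h>
#include <x86intrin.h>                          /* __rdtsc */

#define N    (8 * 1024)
#define REPS 100000

int main(void)
{
    static float data[N] __attribute__((aligned(32)));   /* zero-initialized */
    const __m256 limit = _mm256_set1_ps(0.5f);
    __m256 acc[4] = { _mm256_setzero_ps(), _mm256_setzero_ps(),
                      _mm256_setzero_ps(), _mm256_setzero_ps() };

    unsigned long long t0 = __rdtsc();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < N; i += 32)
            for (int k = 0; k < 4; ++k) {       /* 4 independent dependency chains */
                const float *p = data + i + 8 * k;
                /* compute the mask right before the masked load */
                __m256 m = _mm256_cmp_ps(_mm256_load_ps(p), limit, _CMP_LT_OQ);
                acc[k] = _mm256_add_ps(acc[k],
                         _mm256_maskload_ps(p, _mm256_castps_si256(m)));
            }
    unsigned long long t1 = __rdtsc();

    /* N/8 masked loads per repetition */
    printf("~%.2f cycles per VMASKMOVPS load\n",
           (t1 - t0) / ((double)REPS * (N / 8)));

    /* keep the accumulators live so the loop isn't optimized away */
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc[0], acc[1]),
                             _mm256_add_ps(acc[2], acc[3]));
    volatile float sink = _mm256_cvtss_f32(s);
    (void)sink;
    return 0;
}
```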

now, man, seriously, neither of us is coming up with new arguments, so I suggest continuing this discussion on a more technical forum like RWT [1] with more hardware-oriented people; we will enjoy some welcome fresh input from EEs with better insight than we have. but, please, please, try to avoid clumsy comments there such as: switching threads on a branch miss [2], number of instructions = number of cycles [3], or four threads being enough to avoid branch miss penalties on a dual-pipeline CPU [4]

[1] http://www.realworldtech.com/forums/index.cfm?action=list&roomid=2
[2] http://forums.anandtech.com/showpost.php?p=33627135&postcount=123
[3] http://forums.anandtech.com/showpost.php?p=33607136&postcount=93
[4] http://forums.anandtech.com/showpost.php?p=33628049&postcount=127
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
This thread is so far above my understanding of tech, it's almost nonsense to read.

But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.

While the thread at hand (by the TS) was about AMD's "HSA"
(Heterogeneous System Architecture).

Which is something about having the CPU do the tasks it does best, and having the GPU do the tasks it does best, "together", so you get a speedup whenever things are coded to work this way.

"This approach, in which the CPU and GPU combine their efforts to boost overall performance, has previously been nigh-on impossible thanks to the separation between GPU and CPU in silicon. With AMD forging ahead with the architecture formerly known as Fusion, which bonds the two into a single cohesive whole, however, it becomes far simpler."

"This is more efficient because it allows CPUs and GPUs to do what they are good at. GPUs are good at performing computations. CPUs are good at making decisions and flexible data retrieval."

Using synthetic benchmarks, Zhou's team was able to show significant performance gains using the CPU-assisted GPU model. On average, benchmarks ran 21.4 per cent faster while some tasks were boosted by 113 per cent.

"Chip manufacturers are now creating processors that have a "fused architecture," meaning that they include CPUs and GPUs on a single chip. This approach decreases manufacturing costs and makes computers more energy efficient. However, the CPU cores and GPU cores still work almost exclusively on separate functions. They rarely collaborate to execute any given program, so they aren't as efficient as they could be,' explains Zhou. 'That's the issue we’re trying to resolve."

http://www.bit-tech.net/news/hardware/2012/02/08/gpgpu-performance-boost/1

So all this HSA stuff is about how to squeeze another 20-110% of performance out of an APU (without it consuming more power).
From what I can understand, right? This is for general tasks, I understand, which could work with more or less anything.


"Its about a open standart, more effecient processing, recognising that we have heterogeneous modern workloads that need to run at the lowest possible power, and the power saveings is only achieved if its easy for the applications to use the platforms and we 've drastically reduced the complexity of the programming model."

http://www.youtube.com/watch?v=UXeAGRbZroc



-------------------

From OPs LINK:

"..By allowing the individual parts to play to their strengths, the company estimates 2.5 times the performance and up to a 40% reduction in power usage versus running the algorithm on either the CPU or GPU only."

250% performance and a 40% reduction in power use.

Those seem like pretty good benefits from using this HSA stuff with facial-detection programs.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
no, I'm afraid I couldn't; like you, I'm just a software guy, and, like you, I have no insider information
Why contest my theory when you can't defend your own?

And if I were "just a software guy" like you, then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support while you can't? Please don't make such assumptions to try to get personal because you're out of technical arguments.
as already explained, my estimate is based on the actual behavior of VMASKMOV
And as already explained, you are drawing false conclusions about that instruction. VMOVAPS doesn't use any arithmetic ports, while VMASKMOV does; hence when you pair the latter with other instructions which use the same arithmetic ports, the cycle count goes up. It doesn't mean VMASKMOV is a 2-cycle instruction. VMOVAPS executes "for free" in your example because the load ports are available, so you shouldn't compare against that. You should be comparing against the emulation of gather, which also uses many arithmetic uops.
neither of us is coming up with new arguments
I don't think I have to come up with new arguments. My theory is complete and still stands. So pardon me, but if anyone needs to come up with new arguments it's you. You say Haswell's gather will take at least 2 cycles, but you have yet to present a plausible uop breakdown.
...so I suggest continuing this discussion on a more technical forum like RWT [1] with more hardware-oriented people; we will enjoy some welcome fresh input from EEs with better insight than we have...
If my arguments don't convince you, then by all means go ahead and ask the people in that forum to come up with an implementation that is more plausible than mine. I get my arguments from multiple people too.
...but, please, please, try to avoid clumsy comments there such as: switching threads on a branch miss [2], number of instructions = number of cycles [3], or four threads being enough to avoid branch miss penalties on a dual-pipeline CPU [4]
Sorry for my "clumsiness". This is a complex matter and I merely attempt to reach the right conclusions about heterogeneous versus homogeneous computing. I'm figuring some of these advanced things out as I go, and I try to correct my mistakes. After all, it's still speculation. That said, I do believe four threads are enough to avoid most branch miss penalties, if Knights Corner's branch predictor can track loop counts.

So if those aspects are settled, can we get back to the open issue of Haswell's gather implementation, which is going to be vital to the breakthrough of homogeneous computing? I'm eagerly awaiting your uop breakdown or arguments why mine isn't likely.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.
AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact, the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will compare with competing technology, including AVX2. After all, we have a pretty good understanding of Kaveri and there isn't much more to say about it, but the exact implementation of AVX2 will determine whether HSA will still be meaningful.

Heterogeneous versus homogeneous computing is reminiscent of the "RISC versus CISC" debate, so you'll likely see a lot more of these discussions in the next few years when Kaveri and Haswell become available. You can't really discuss one without bringing up the other...
Which is something about having the CPU do the tasks it does best, and having the GPU do the tasks it does best, "together", so you get a speedup whenever things are coded to work this way.
Yes, but coding such a heterogeneous architecture is quite complex. So the idea of homogeneous computing is to bring GPU technology within the CPU cores. Same benefits, but much easier for a wide range of applications to take advantage of!
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
And as already explained you are making false conclusions about that instruction.

vmaskmov is slow; it's been well known for 3 years now [1]. see this LLVM reference [2] if you're too lazy to test it yourself; after all, you discovered its very existence less than a week ago [3], so you probably have a thing or two to learn about it

[1] http://software.intel.com/en-us/forums/showthread.php?t=68554
[2] http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-March/038657.html
[3] http://forums.anandtech.com/showpost.php?p=33614704&postcount=108
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
vmaskmov is slow; it's been well known for 3 years now [1]. see this LLVM reference [2] if you're too lazy to test it yourself; after all, you discovered its very existence less than a week ago [3], so you probably have a thing or two to learn about it

[1] http://software.intel.com/en-us/forums/showthread.php?t=68554
[2] http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-March/038657.html
[3] http://forums.anandtech.com/showpost.php?p=33614704&postcount=108
I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1-cycle reciprocal throughput. Also, in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing that's faster (when the arithmetic ports are already occupied) is a VMOVAPS, i.e. without a blend, but this requires ensuring sufficient "data overrun padding", as Mark calls it.

How it compares to VMOVAPS isn't relevant to the gather implementation since to emulate gather the arithmetic ports are almost completely occupied for 8 cycles. Hence an implementation which occupies all the arithmetic ports, but for just one cycle (like what I'm presenting), is up to eight times faster even though it's not as fast as a VMOVAPS when there's no load port contention. So it's all relative.

That LLVM discussion refers to the latency of VMOVMSK, which is 2 cycles (one for the actual compaction of the mask and one for moving the result from the SIMD domain to the scalar domain, as shown in table 2-18 of the Intel optimization manual). VMASKMOV also contains a uop to compact the mask, but it stays within the SIMD domain and feeds a VBLEND uop. So it consists of more uops and has a higher total latency, but it's still a fast instruction compared to gather emulation, since it has a 1-cycle reciprocal throughput.
as you said "The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles" [1]
i.e. a 2x speedup vs. the 8-cycle emulation, so my 4x speedup estimate is pretty much within your range [2x, 8x]

[1] http://forums.anandtech.com/showpost.php?p=33622545&postcount=116
As I explained in that link, the 4 cycle implementation is not just "worst" in cycle count, it also has a high cost and thus it's a highly unlikely one to be chosen. Why would they send the whole 256-bit index register to both load ports, four times? That would require two new wider interconnects and two multiplexers, just to make gather a little faster. Instead they could send the index vector once to just one port, with one multiplexer to retrieve the first scalar index, and in the best case collect all the elements in a single cycle!

And you really can't just interpolate between those two implementations. Besides, what exactly are you trying to improve over the 1 cycle one?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way.
so why the strange 1.4 cycles on port 5 here http://forums.anandtech.com/showpost.php?p=33625877&postcount=121
for example?

That LLVM discussion refers to the latency of VMOVMSK, which is 2 cycles

it looks like you missed this part
" The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive."

VMASKMOV consists of a uop to compact the mask, but it stays within the SIMD domain to perform a VBLEND uop
masked loads don't need a VBLEND, just a VAND; masked stores do need a VBLEND though
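a minimal sketch of that distinction (assuming the mask lanes are all-ones or all-zero, as VCMPPS produces; the helper names are mine):

```c
#include <immintrin.h>

/* masked load: disabled lanes just read as zero, so a plain load plus
   one VANDPS models it -- unlike VMASKMOVPS, though, this touches all
   32 bytes and is unsafe next to an unmapped page */
static inline __m256 maskload_model(const float *p, __m256 mask)
{
    return _mm256_and_ps(_mm256_loadu_ps(p), mask);
}

/* masked store: disabled lanes must keep their old memory contents,
   hence read + VBLENDVPS + write back */
static inline void maskstore_model(float *p, __m256 mask, __m256 v)
{
    _mm256_storeu_ps(p, _mm256_blendv_ps(_mm256_loadu_ps(p), v, mask));
}
```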

Besides, what exactly are you trying to improve over the 1 cycle one?
where did I tell you that I want to improve something? we are just trying to guess the performance of chips that are already finalized
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
port 3 can do the actual gather load
and we have no idea about the timings for this "gather load", which is obviously the most complex operation in your breakdown; your argument isn't more convincing than saying something like "port DV can do the actual division" and using it to prove that VDIVPS has a 1-cycle reciprocal throughput

assuming each load port can fetch half a cache line per cycle [1], the best possible reciprocal throughput will be 2 cycles for the cases we are discussing (all elements in the same cache line)

[1] http://forums.anandtech.com/showpost.php?p=33616759&postcount=113
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
This thread is so far above my understanding of tech, it's almost nonsense to read.

But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.

While the thread at hand (by the TS) was about AMD's "HSA"
(Heterogeneous System Architecture).

The understanding is that he's more or less an Intel employee whose purpose is to downplay as much as possible AMD's eventual propositions, which are on a collision course with Intel's own plans, which can be summarized as desperately maintaining its grip on ISAs through proprietary instruction set extensions, a means to render useless the preceding ISAs that are progressively entering the public domain.

As such, it doesn't matter what is discussed; the essential point is to trash the competition, as proved by the fact that no evaluation/comparison between HSA and AVX2 is even discussed in these lines, where the troll insists on Intel's superior implementation without providing the slightest substantiated explanation of the advantages/drawbacks of each proposition...
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
The understanding is that he's more or less an Intel employee
whose purpose is to downplay as much as possible AMD

Whose understanding is that? Have any proof of your accusations? If not, you would be best served to keep this kind of crap to yourself, or take it to the zone.
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
Whose understanding is that? Have any proof of your accusations? If not, you would be best served to keep this kind of crap to yourself, or take it to the zone.

When I need the POV of someone who publicly claimed that he didn't sell a single AMD-based server even in the Pentium 4 era, and who comes to brand people as fanboys, I'll call you, be sure about it...

and your own contribution is?

To display evidence of technical trolling traps, into which you are easily falling...

There was a preceding thread about an AMD product (Trinity?) and it was purportedly changed into the official AVX thread, since this could perhaps lead to a sane discussion on that particular topic, mind you, but it seems that the CPU troll prefers to use whatever thread is about AMD as a convenient trash bin where he can glorify Intel's AVX speed, comparable to the Planck time, when it comes to executing 256-bit instructions in a row that take a miserable handful of clock cycles...

http://forums.anandtech.com/showthread.php?t=2252363&page=5
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
To display evidence of technical trolling traps, into which you are easily falling...

I don't feel really trapped, but, indeed, in retrospect I have to confess that I wasted way too much time in this "18x speedup thanks to gather" sub-thread
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
I don't feel really trapped, but, indeed, in retrospect I have to confess that I wasted way too much time in this "18x speedup thanks to gather" sub-thread

Whatever, there were some interesting pieces of technical explanation, albeit I personally have difficulty catching most of the specific language.

Anyway, there were also some valuable contributors in the AVX thread, yet it seems they didn't follow into this one...

As for the said speedup, some share your point about what is really feasible...

http://www.thingsstuff.com/2011/12/05/writing-maintainable-simd-intrinsics-using-c-templates/
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
As for the said speedup, some share your point about what is really feasible...

http://www.thingsstuff.com/2011/12/05/writing-maintainable-simd-intrinsics-using-c-templates/

yes, the theory vs. practice thing; if you have too-high expectations for future technologies, your code will probably always leave a lot of performance on the table, since instead of optimizing for today's targets you will keep dreaming of the next magical technology

concerning the SSE-to-AVX speedup discussed at your link, this post of mine http://software.intel.com/en-us/forums/showpost.php?p=186862 shows how big an impact the size of the workload has:
depending on the dataset size, AVX may be slower than SSE (!) or more than twice as fast as SSE
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
concerning the SSE-to-AVX speedup discussed at your link, this post of mine http://software.intel.com/en-us/forums/showpost.php?p=186862 shows how big an impact the size of the workload has:
depending on the dataset size, AVX may be slower than SSE (!) or more than twice as fast as SSE

So far, the little data I did find here and there suggests that theoretical max throughput is reached only on highly parallelizable code, typically in HPC, where repetitive instructions can make full use of AVX and/or FMA.

This should be inherently efficient on HSA since it's useful mainly for such highly repetitive instructions; I don't see a GPU being capable of anything other than simple but highly parallelized code.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
so why the strange 1.4 cycles on port 5 here http://forums.anandtech.com/showpost.php?p=33625877&postcount=121
for example?
Because that's an averaged-out result, and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0.
it looks like you missed this part
" The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive."
No, that's exactly what I was referring to. Please read my response again.
where did I tell you that I want to improve something? we are just trying to guess the performance of chips that are already finalized
The only way a 2-cycle implementation of gather is more likely than the 1-cycle one is if it improves the cost/gain ratio. I'm very open to that idea, but obviously there has to be a significantly lower cost to improve on the 1-cycle one. I have yet to see any indication of that.
and we have no idea about the timings for this "gather load", which is obviously the most complex operation in your breakdown
Sure we do. All the elements can be extracted from the cache line in parallel. So no latency added there. The only thing adding latency is to select a single index from the index vector for an element that has not been gathered yet. But that's a really trivial amount of logic. Note that a 64-bit LEA can execute and forward the result within a single cycle, while computing the cache line to fetch only requires computing the lower bits. So there's plenty of time to add a simple multiplexer into that path.
assuming each load port can fetch half a cache line per cycle [1], the best possible reciprocal throughput will be 2 cycles for the cases we are discussing (all elements in the same cache line)
I'm sorry but you're going to have to provide more details to defend that theory. How are the elements being extracted after each half of the cache line has been fetched? How does it support unaligned elements and elements straddling cache lines? How would it blend the result and update the mask register? And last but not least, how is that any better on any metric than the 1 cycle version?
 