AMD summit today; Kaveri cuts out the middle man in Trinity.


bronxzv

Senior member
Jun 13, 2011
460
0
71
Knights Corner has 4-way SMT so it can simply switch to another thread on a branch miss.

how can you do that? it's typically too late by the time you know it was a miss, and you must flush (part of) the pipeline, so do you mean it switches threads on all branches? in other words, do MIC cores block speculative execution when several threads are active on them? do you have a source for this?

according to [1], branch speculation is still very effective on SMT processors; see for example section 3.1, page 8, with speedups ranging from 9% to 32% for speculative vs. non-speculative branch execution

[1]: An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors http://www.cs.washington.edu/research/smt/papers/smtspeculation.pdf
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I said "more than 2 cycles", what is your estimate?
1 cycle. And that's not an estimate but a fact, confirmed both by timing an unrolled loop containing only VMASKMOV instructions and by its uop decomposition in IACA. And it seems pretty obvious what each of those uops does, by comparing their functionality against VMOVMSK and VBLEND.
If we don't agree on a methodology to assess the throughput of current instructions on actual hardware, I'm afraid we won't be able to assess the speed of gather instructions when we can finally test them, so I suppose the next step will be to define a common test framework that all peers can compile and run
Reciprocal throughput is a well-defined quantity: it's how many cycles it takes to issue the entire instruction. But that doesn't mean you can just add the numbers to determine the total cycle count of a sequence of instructions, since the instructions could be using different execution ports. We have to look at the utilization of each port individually, and take critical-path dependencies between instructions into account, to determine the peak reciprocal throughput of a sequence of instructions.

But I've also already shown that emulating a gather operation on current architectures occupies ports 0 and 5 for 8 cycles each, and port 1 for 6. So there's practically no opportunity left for doing anything in parallel. With the proposed implementation, gather on Haswell only takes one cycle each on ports 0 and 5, and port 1 could be used to set the mask register. So the reciprocal throughput is just one cycle (when all data elements are within a single cache line), and there would be up to an eightfold increase in performance for gather operations.
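For concreteness, the emulation being compared against looks roughly like this in AVX intrinsics (a minimal sketch: gather_emulated is a hypothetical helper name, and the exact uop count and port mix depend on how the compiler lowers the element insertion):

```c
#include <immintrin.h>

/* Sketch of emulating VGATHERDPS on pre-Haswell AVX hardware:
   result[i] = base[idx[i]] for 8 single-precision elements.
   Spilling the indices and reassembling the result lowers to a
   chain of scalar loads plus insert/shuffle uops, which is where
   the multi-cycle occupation of the vector ports comes from. */
static inline __m256 gather_emulated(const float *base, __m256i vidx)
{
    int idx[8] __attribute__((aligned(32)));    /* GCC-style alignment */
    _mm256_store_si256((__m256i *)idx, vidx);   /* spill the indices   */
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}
```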

how can you do that? it's typically too late by the time you know it was a miss, and you must flush (part of) the pipeline, so do you mean it switches threads on all branches? in other words, do MIC cores block speculative execution when several threads are active on them? do you have a source for this?
You're right, it can't wait for the branch result to switch threads. Instead it switches threads every cycle: Programming for the Intel Many Integrated Core Architecture (slide 10).
[1]: An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors http://www.cs.washington.edu/research/smt/papers/smtspeculation.pdf
Interesting paper, thanks!
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
1 cycle. And that's not an estimate but a fact

a theoretical fact, maybe; in practice it's clearly 1.5 clocks higher reciprocal throughput for a full loop in all the examples I have tested. masked stores are also fast in theory but slower in practice than VBLENDVPS + VMOVAPS; that's easy to test and a well-known optimization for Sandy Bridge. the rule of thumb is to use masked moves only to avoid protection violations
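for illustration, that workaround is simply this (a minimal sketch, the helper name is mine; it assumes every destination lane is writable, which is exactly why masked moves remain necessary near protection boundaries):

```c
#include <immintrin.h>

/* sketch: masked store replaced by VBLENDVPS + VMOVAPS.
   only legal when all 8 destination floats are writable,
   since the untouched lanes get re-stored as-is. */
static inline void masked_store_blend(float *dst, __m256 mask, __m256 val)
{
    __m256 old = _mm256_load_ps(dst);                       /* VMOVAPS load              */
    _mm256_store_ps(dst, _mm256_blendv_ps(old, val, mask)); /* VBLENDVPS + VMOVAPS store */
}
```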

So there would be up to an eightfold increase in performance for gather operations.

OK, so, after all, your prediction is an 8x best-case speedup and mine is 4x; we now have one year left to agree on a test methodology

You're right, it can't wait for the branch result to switch threads. Instead it switches threads every cycle: Programming for the Intel Many Integrated Core Architecture (slide 10).
I'm quite sure the P5 pipeline is deep enough to start at least one unnecessary gather; though there is certainly an early out for the cases where the mask is 0x00
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
OK, so, after all, your prediction is an 8x best-case speedup and mine is 4x...
Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Could you tell me the uop breakdown that would result in a 2 cycle reciprocal issue throughput?

no, I'm afraid I couldn't; like you, I'm just a software guy, and, like you, I have no insider information

as already explained, my estimate is based on the actual behavior of VMASKMOV, the instruction most similar to VGATHERDx that we can test today. just test it in a gather-like scenario where you compute the mask right before the load; let's practice a bit after all these walls of theoretical text!
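something along these lines would do (a rough sketch, not a calibrated benchmark: the loop sizes are arbitrary, the helper names are mine, and a real measurement needs warm-up and care with turbo/frequency scaling; compile with -mavx):

```c
#include <immintrin.h>
#include <stdio.h>
#include <x86intrin.h>                          /* __rdtsc */

#define N    (8 * 1024)
#define REPS 100000

int main(void)
{
    static float data[N] __attribute__((aligned(32)));   /* zero-initialized */
    const __m256 limit = _mm256_set1_ps(0.5f);
    __m256 acc[4] = { _mm256_setzero_ps(), _mm256_setzero_ps(),
                      _mm256_setzero_ps(), _mm256_setzero_ps() };

    unsigned long long t0 = __rdtsc();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < N; i += 32)
            for (int k = 0; k < 4; ++k) {       /* 4 independent dependency chains */
                const float *p = data + i + 8 * k;
                /* compute the mask right before the masked load */
                __m256 m = _mm256_cmp_ps(_mm256_load_ps(p), limit, _CMP_LT_OQ);
                acc[k] = _mm256_add_ps(acc[k],
                         _mm256_maskload_ps(p, _mm256_castps_si256(m)));
            }
    unsigned long long t1 = __rdtsc();

    /* N/8 masked loads per repetition */
    printf("~%.2f cycles per VMASKMOVPS load\n",
           (t1 - t0) / ((double)REPS * (N / 8)));

    /* keep the accumulators live so the loop isn't optimized away */
    __m256 s = _mm256_add_ps(_mm256_add_ps(acc[0], acc[1]),
                             _mm256_add_ps(acc[2], acc[3]));
    volatile float sink = _mm256_cvtss_f32(s);
    (void)sink;
    return 0;
}
```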

now, man, seriously, neither of us is coming up with new arguments, so I suggest continuing this discussion on a more technical forum like RWT [1] with more hardware-oriented people; we will enjoy some welcome fresh input from EEs with better insight than we have. but, please, please, try to avoid clumsy comments there such as: switching threads on a branch miss [2], number of instructions = number of cycles [3], or four threads being enough to avoid branch miss penalties on a dual-pipeline CPU [4]

[1] http://www.realworldtech.com/forums/index.cfm?action=list&roomid=2
[2] http://forums.anandtech.com/showpost.php?p=33627135&postcount=123
[3] http://forums.anandtech.com/showpost.php?p=33607136&postcount=93
[4] http://forums.anandtech.com/showpost.php?p=33628049&postcount=127
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
This thread is so far above my understanding of tech, it's almost nonsense to read.

But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.

While the thread at hand (by the TS) was about AMD's "HSA"
(Heterogeneous System Architecture).

Which is something about having the CPU do the tasks it does best, and having the GPU do the tasks it does best, "together", so you get a speedup whenever things are coded to work this way.

"This approach, in which the CPU and GPU combine their efforts to boost overall performance, has previously been nigh-on impossible thanks to the separation between GPU and CPU in silicon. With AMD forging ahead with the architecture formerly known as Fusion, which bonds the two into a single cohesive whole, however, it becomes far simpler."

"This is more efficient because it allows CPUs and GPUs to do what they are good at. GPUs are good at performing computations. CPUs are good at making decisions and flexible data retrieval."

Using synthetic benchmarks, Zhou's team was able to show significant performance gains using the CPU-assisted GPU model. On average, benchmarks ran 21.4 per cent faster while some tasks were boosted by 113 per cent.

"Chip manufacturers are now creating processors that have a "fused architecture," meaning that they include CPUs and GPUs on a single chip. This approach decreases manufacturing costs and makes computers more energy efficient. However, the CPU cores and GPU cores still work almost exclusively on separate functions. They rarely collaborate to execute any given program, so they aren't as efficient as they could be,' explains Zhou. 'That's the issue we’re trying to resolve."

http://www.bit-tech.net/news/hardware/2012/02/08/gpgpu-performance-boost/1

So all this HSA stuff is about how to squeeze another 20-110% of performance out of an APU (without it consuming more power).
From what I can understand, right? This is for general tasks, I understand, which could work with more or less anything.


"Its about a open standart, more effecient processing, recognising that we have heterogeneous modern workloads that need to run at the lowest possible power, and the power saveings is only achieved if its easy for the applications to use the platforms and we 've drastically reduced the complexity of the programming model."

http://www.youtube.com/watch?v=UXeAGRbZroc



-------------------

From OPs LINK:

"..By allowing the individual parts to play to their strengths, the company estimates 2.5 times the performance and up to a 40% reduction in power usage versus running the algorithm on either the CPU or GPU only."

250% performance and a 40% reduction in power use.

Those seem like pretty good benefits from using this HSA stuff with facial-detection programs.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
no, I'm afraid I couldn't; like you, I'm just a software guy, and, like you, I have no insider information
Why contest my theory when you can't defend your own?

And if I were "just a software guy" like you, then why am I able to present a perfectly plausible uop breakdown of Haswell's gather support while you can't? Please don't make such assumptions to try to get personal because you're out of technical arguments.
as already explained, my estimate is based on the actual behavior of VMASKMOV
And as already explained, you are drawing false conclusions about that instruction. VMOVAPS doesn't use any arithmetic ports, while VMASKMOV does; hence when you pair the latter with other instructions which use the same arithmetic ports, the cycle count goes up. It doesn't mean VMASKMOV is a 2-cycle instruction. VMOVAPS executes "for free" in your example because the load ports are available, so you shouldn't compare against that. You should be comparing against the emulation of gather, which also uses many arithmetic uops.
neither of us is coming up with new arguments
I don't think I have to come up with new arguments. My theory is complete and still stands. So pardon me, but if anyone needs to come up with new arguments it's you. You say Haswell's gather will take at least 2 cycles, but you have yet to present a plausible uop breakdown.
...so I suggest continuing this discussion on a more technical forum like RWT [1] with more hardware-oriented people; we will enjoy some welcome fresh input from EEs with better insight than we have...
If my arguments don't convince you, then by all means go ahead and ask the people in that forum to come up with an implementation that is more plausible than mine. I get my arguments from multiple people too.
...but, please, please, try to avoid clumsy comments there such as: switching threads on a branch miss [2], number of instructions = number of cycles [3], or four threads being enough to avoid branch miss penalties on a dual-pipeline CPU [4]
Sorry for my "clumsiness". This is a complex matter and I merely attempt to reach the right conclusions about heterogeneous versus homogeneous computing. I'm figuring some of these advanced things out as I go, and I try to correct my mistakes. After all, it's still speculation. That said, I do believe four threads are enough to avoid most branch miss penalties, if Knights Corner's branch predictor can track loop counts.

So if those aspects are settled, can we get back to the open issue of Haswell's gather implementation, which is going to be vital to the breakthrough of homogeneous computing? I'm eagerly awaiting your uop breakdown or arguments why mine isn't likely.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.
AVX2 is not just about floating-point performance. It also doubles the throughput of parallel integer workloads. I didn't mean to "derail" this thread at all. In fact, the thread was split into an official AVX2 thread, but the discussion here still ended up being about how Kaveri and HSA will compare with competing technology, including AVX2. After all, we have a pretty good understanding of Kaveri and there isn't much more to say about it, but the exact implementation of AVX2 will determine whether HSA will still be meaningful.

Heterogeneous versus homogeneous computing is reminiscent of the "RISC versus CISC" debate, so you'll likely see a lot more of these discussions in the next few years when Kaveri and Haswell become available. You can't really discuss one without bringing up the other...
Which is something about having the CPU do the tasks it does best, and having the GPU do the tasks it does best, "together", so you get a speedup whenever things are coded to work this way.
Yes, but coding such a heterogeneous architecture is quite complex. So the idea of homogeneous computing is to bring GPU technology within the CPU cores. Same benefits, but much easier for a wide range of applications to take advantage of!
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
And as already explained you are making false conclusions about that instruction.

vmaskmov is slow; it's been well known for 3 years now [1]. see this LLVM reference [2] if you're too lazy to test it yourself; after all, you discovered its very existence less than a week ago [3], so you probably have a thing or two to learn about it

[1] http://software.intel.com/en-us/forums/showthread.php?t=68554
[2] http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-March/038657.html
[3] http://forums.anandtech.com/showpost.php?p=33614704&postcount=108
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
vmaskmov is slow; it's been well known for 3 years now [1]. see this LLVM reference [2] if you're too lazy to test it yourself; after all, you discovered its very existence less than a week ago [3], so you probably have a thing or two to learn about it

[1] http://software.intel.com/en-us/forums/showthread.php?t=68554
[2] http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-March/038657.html
[3] http://forums.anandtech.com/showpost.php?p=33614704&postcount=108
I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way. And I'm not lazy, I tested it in practice too: it has a 1-cycle reciprocal throughput. Also, in the Intel thread you're linking to, engineer Mark Buxton explains why they are "extremely useful". The only thing that's faster (when the arithmetic ports are already occupied) is a VMOVAPS, i.e. without a blend, but this requires ensuring sufficient "data overrun padding", as Mark calls it.

How it compares to VMOVAPS isn't relevant to the gather implementation since to emulate gather the arithmetic ports are almost completely occupied for 8 cycles. Hence an implementation which occupies all the arithmetic ports, but for just one cycle (like what I'm presenting), is up to eight times faster even though it's not as fast as a VMOVAPS when there's no load port contention. So it's all relative.

That LLVM discussion refers to the latency of VMOVMSK, which is 2 cycles (one for the actual compaction of the mask and one for moving the result from the SIMD domain to the scalar domain, as shown in table 2-18 of the Intel optimization manual). VMASKMOV also contains a uop to compact the mask, but it stays within the SIMD domain and feeds a VBLEND uop. So it consists of more uops and has a higher total latency, but it's still a fast instruction compared to gather emulation, since it has a 1-cycle reciprocal throughput.
as you said "The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles" [1]
i.e. a 2x speedup vs. the 8-cycle emulation, so my 4x speedup estimate is pretty much within your range [2x, 8x]

[1] http://forums.anandtech.com/showpost.php?p=33622545&postcount=116
As I explained in that link, the 4 cycle implementation is not just "worst" in cycle count, it also has a high cost and thus it's a highly unlikely one to be chosen. Why would they send the whole 256-bit index register to both load ports, four times? That would require two new wider interconnects and two multiplexers, just to make gather a little faster. Instead they could send the index vector once to just one port, with one multiplexer to retrieve the first scalar index, and in the best case collect all the elements in a single cycle!

And you really can't just interpolate between those two implementations. Besides, what exactly are you trying to improve over the 1 cycle one?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I couldn't find any evidence of VMASKMOV loads being unexpectedly slow in any way.
so why the strange 1.4 cycles on port 5 here http://forums.anandtech.com/showpost.php?p=33625877&postcount=121
for example?

That LLVM discussion refers to the latency of VMOVMSK, which is 2 cycles

it looks like you missed this part
" The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive."

VMASKMOV consists of a uop to compact the mask, but it stays within the SIMD domain to perform a VBLEND uop
masked loads don't need a VBLEND, just a VAND; masked stores do need a VBLEND though
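a minimal sketch of that distinction (assuming the mask lanes are all-ones or all-zero, as VCMPPS produces; the helper names are mine):

```c
#include <immintrin.h>

/* masked load: disabled lanes just read as zero, so a plain load plus
   one VANDPS models it -- unlike VMASKMOVPS, though, this touches all
   32 bytes and is unsafe next to an unmapped page */
static inline __m256 maskload_model(const float *p, __m256 mask)
{
    return _mm256_and_ps(_mm256_loadu_ps(p), mask);
}

/* masked store: disabled lanes must keep their old memory contents,
   hence read + VBLENDVPS + write back */
static inline void maskstore_model(float *p, __m256 mask, __m256 v)
{
    _mm256_storeu_ps(p, _mm256_blendv_ps(_mm256_loadu_ps(p), v, mask));
}
```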

Besides, what exactly are you trying to improve over the 1 cycle one?
where did I tell you that I want to improve something? we are just trying to guess the performance of chips that are already finalized
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
port 3 can do the actual gather load
and we have no idea about the timings for this "gather load", which is obviously the most complex operation in your breakdown; your argument isn't more convincing than saying something like "port DV can do the actual division" and using it to prove that VDIVPS has a 1-cycle reciprocal throughput

assuming each load port can fetch half a cache line per cycle [1], the best possible reciprocal throughput will be 2 cycles for the cases we are discussing (all elements in the same cache line)

[1] http://forums.anandtech.com/showpost.php?p=33616759&postcount=113
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
This thread is so far above my understanding of tech, it's almost nonsense to read.

But from what I gather... the thread is derailed by people like CPUarchitect,
who talk about:

"Intel AVX", which is an instruction set for 256-bit SSE coding?
that can "help" certain tasks that are very floating-point intensive.

While the thread at hand (by the TS) was about AMD's "HSA"
(Heterogeneous System Architecture).

The understanding is that he's more or less an Intel employee whose purpose is to downplay as much as possible AMD's eventual propositions, which are on a collision course with Intel's own plans, which can be summarized as desperately maintaining its grip on ISAs through proprietary instruction set extensions, a means to render useless the preceding ISAs that are progressively entering the public domain.

As such, it doesn't matter what is discussed; the essential point is to trash the competition, as proved by the fact that no evaluation/comparison between HSA and AVX2 is even discussed in these lines, where the troll insists on Intel's superior implementation without providing the slightest substantiated explanation of the advantages/drawbacks of each proposition...
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
The understanding is that he's more or less an Intel employee
whose purpose is to downplay as much as possible AMD

Whose understanding is that? Have any proof of your accusations? If not, you would be best served to keep this kind of crap to yourself, or take it to the zone.
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
Whose understanding is that? Have any proof of your accusations? If not, you would be best served to keep this kind of crap to yourself, or take it to the zone.

When I need the POV of someone who publicly claimed that he didn't sell a single AMD-based server even in the Pentium 4 era, and who comes to brand people as fanboys, I'll call you, be sure about it...

and your own contribution is?

To display evidence of technical trolling traps, into which you are easily falling...

There was a preceding thread about an AMD product (Trinity?) and it was purportedly changed into the official AVX thread, since this could perhaps lead to a sane discussion on that particular topic, mind you, but it seems that the CPU troll prefers to use whatever thread is about AMD as a convenient trash bin where he can glorify Intel's AVX speed, comparable to the Planck time, when it comes to executing 256-bit instructions in a row that take a miserable handful of clock cycles...

http://forums.anandtech.com/showthread.php?t=2252363&page=5
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
To display evidence of technical trolling traps, into which you are easily falling...

I don't feel really trapped, but, indeed, in retrospect I have to confess that I wasted way too much time in this "18x speedup thanks to gather" sub-thread
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
I don't feel really trapped, but, indeed, in retrospect I have to confess that I wasted way too much time in this "18x speedup thanks to gather" sub-thread

Whatever, there were some interesting pieces of technical explanation, albeit I personally have difficulty catching most of the specific language.

Anyway, there were also some valuable contributors in the AVX thread, yet it seems they didn't follow into this one...

As for the said speedup, some share your point about what is really feasible...

http://www.thingsstuff.com/2011/12/05/writing-maintainable-simd-intrinsics-using-c-templates/
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
As for the said speedup, some share your point about what is really feasible...

http://www.thingsstuff.com/2011/12/05/writing-maintainable-simd-intrinsics-using-c-templates/

yes, the theory vs. practice thing; if you have too-high expectations for future technologies, your code will probably always leave a lot of performance on the table, since instead of optimizing for today's targets you will keep dreaming of the next magical technology

concerning the SSE-to-AVX speedup discussed at your link, this post of mine http://software.intel.com/en-us/forums/showpost.php?p=186862 shows how big an impact the size of the workload has:
depending on the dataset size, AVX may be slower than SSE (!) or more than twice as fast as SSE
 

Abwx

Lifer
Apr 2, 2011
11,851
4,825
136
concerning the SSE-to-AVX speedup discussed at your link, this post of mine http://software.intel.com/en-us/forums/showpost.php?p=186862 shows how big an impact the size of the workload has:
depending on the dataset size, AVX may be slower than SSE (!) or more than twice as fast as SSE

So far, the little data I did find here and there suggests that theoretical max throughput is reached only on highly parallelizable code, typically in HPC, where repetitive instructions can make full use of AVX and/or FMA.

This should be inherently efficient on HSA since it's useful mainly for such highly repetitive instructions; I don't see a GPU being capable of anything other than simple but highly parallelized code.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
so why the strange 1.4 cycles on port 5 here http://forums.anandtech.com/showpost.php?p=33625877&postcount=121
for example?
Because that's an averaged-out result, and the two arithmetic uops can execute on either port 0 or port 5. Also look at the result for the add instruction in your example: it's a single uop which can execute on port 0, 1, or 5, and hence the numbers for each port add up to 1.0.
it looks like you missed this part
" The MOVMSKPS instruction is cheap (2 cycles). Not to be confused with VMASKMOV, the AVX masked move, which is expensive."
No, that's exactly what I was referring to. Please read my response again.
where did I tell you that I want to improve something? we are just trying to guess the performance of chips that are already finalized
The only way a 2-cycle implementation of gather is more likely than the 1-cycle one is if it improves the cost/gain ratio. I'm very open to that idea, but obviously there has to be a significantly lower cost to improve on the 1-cycle one. I have yet to see any indication of that.
and we have no idea about the timings for this "gather load", which is obviously the most complex operation in your breakdown
Sure we do. All the elements can be extracted from the cache line in parallel. So no latency added there. The only thing adding latency is to select a single index from the index vector for an element that has not been gathered yet. But that's a really trivial amount of logic. Note that a 64-bit LEA can execute and forward the result within a single cycle, while computing the cache line to fetch only requires computing the lower bits. So there's plenty of time to add a simple multiplexer into that path.
assuming each load port can fetch half a cache line per cycle [1], the best possible reciprocal throughput will be 2 cycles for the cases we are discussing (all elements in the same cache line)
I'm sorry but you're going to have to provide more details to defend that theory. How are the elements being extracted after each half of the cache line has been fetched? How does it support unaligned elements and elements straddling cache lines? How would it blend the result and update the mask register? And last but not least, how is that any better on any metric than the 1 cycle version?
 