
CPUarchitect

Senior member
Jun 7, 2011
223
0
0
sure, thus the 4x best speedup figure, otherwise it will be less
No, you appear to be forgetting about the extract/insert instructions. With the conservative Haswell architecture of one gather port and one regular load port, it can ideally sustain one gather and one regular load each cycle, right? Without gather that's 9 regular load operations, and you're simply dividing by the two load ports to conclude it can be sustained at ~4 cycles... But there are also 9 extract instructions and 9 insert instructions. It needs 9 cycles for those (leaving the load ports underutilized).

Granted, it's not 18x sustained, but it's better than 4x.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
No, you appear to be forgetting about the extract/insert instructions.

nope, that's why I talk about *actual timings*, based on production code using a software-synthesized gather: I basically build 2 *independent* XMM registers in parallel, with 3 inserts and 1 scalar load per XMM and a final XMM + XMM => YMM merge. do you perhaps have another instruction sequence in mind? I was assuming you were basing your 18-instruction figure on previous discussions with me (at the Intel AVX forum, where you post as "c0d1f1ed" AFAIK); if that isn't the case please provide a link to your source (btw you generally do a good job of providing pointers to your sources)
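
for illustration, here is a minimal sketch in C intrinsics of the kind of sequence I mean (a reconstruction written for this post, with my own names and types, not the actual production code):

Code:
#include <immintrin.h>

/* emulated 8-element gather with AVX1 + SSE4.1 (a sketch, not the real
   production code): two independent XMM halves built in parallel, each
   from 1 plain scalar load plus 3 load-op vinsertps; the index of the
   low-order word of each half is read with a move (movd) instead of an
   extract, and the halves are merged with a final vinsertf128
   (XMM + XMM => YMM) */
static inline __m256 gather8_ps(const float *base, __m256i vindex)
{
    __m128i ilo = _mm256_castsi256_si128(vindex);
    __m128i ihi = _mm256_extractf128_si256(vindex, 1);        /* vextractf128 */

    __m128 lo = _mm_load_ss(base + _mm_cvtsi128_si32(ilo));   /* movd + movss */
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 1)), 0x10);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 2)), 0x20);
    lo = _mm_insert_ps(lo, _mm_load_ss(base + _mm_extract_epi32(ilo, 3)), 0x30);

    __m128 hi = _mm_load_ss(base + _mm_cvtsi128_si32(ihi));   /* movd + movss */
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 1)), 0x10);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 2)), 0x20);
    hi = _mm_insert_ps(hi, _mm_load_ss(base + _mm_extract_epi32(ihi, 3)), 0x30);

    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}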

the rcp throughput of vgather will not be less than 2, and actual timings on Ivy Bridge of the fastest backward-compatible solution are under an 8-clock rcp throughput => the best actual speedup of vgather in isolation will be less than 4x, i.e. actual gather-intensive kernels will enjoy less than a 2x speedup overall in the cache-friendly cases
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
... But there are also 9 extract instructions and 9 insert instructions.
Granted, it's not 18x sustained, but it's better than 4x.

I just checked my code: it's 6 extract instructions and 6 load-op insert instructions (no port conflict according to IACA); I use moves for the low-order word
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I just checked my code: it's 6 extract instructions and 6 load-op insert instructions (no port conflict according to IACA); I use moves for the low-order word
If this is what you're referring to, then I'm counting 7 extract instructions and 7 insert instructions. But yes, using movd for the low-order elements is pretty clever; I didn't realize they can execute on any arithmetic port. Still, all 18 instructions will be replaced with a single one on Haswell, with fewer temporary registers used too.
at the Intel AVX forum where you post as "c0d1f1ed" AFAIK
I'm not him, but I know him very well.
the rcp throughput of vgather will be not less than 2
Why won't it be less than 2? Gather requires turning the mask register into a mask immediate (1 uop), issuing to the gather port (1 uop), and blending the result based on the mask (1 uop). Each of those can be executed by a different port, so the peak throughput would be one per cycle.

They wouldn't implement it in a way that 18 instructions could be faster than 1.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
If this is what you're referring to, then I'm counting 7 extract instructions and 7 insert instructions.

yes that's it, it's quite old now, good catch!

note that the impact of vextractf128/vinsertf128 on timings is much smaller than that of vpextrd/vinsertps; also, in real code some ports will be free to execute other instructions in parallel (typically arithmetic), so the difference in rcp throughput due to gather is typically less than 8 clocks compared to vector loads

replaced with a single one on Haswell, with fewer temporary registers being used too.
note that depending on the actual register pressure the Intel compiler uses far fewer intermediate registers than the example linked above

I'm not him, but I know him very well.
so you must know him very, very well indeed; out of curiosity I googled "c0d1f1ed" and found the similarity with your posts here truly striking

Why won't it be less than 2

it's an estimate based on the actual behavior of VMASKMOVPS on Sandy Bridge; gather will be more complex, with at least one stage in the generic permute unit (speculation of mine). more generally I won't count on the first implementation of gather to reach 100% of its potential, much like 256-bit unaligned loads aren't top speed on SNB/IVB
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
yes that's it, it's quite old now, good catch!

note that the impact of vextractf128/vinsertf128 on timings is much smaller than that of vpextrd/vinsertps; also, in real code some ports will be free to execute other instructions in parallel (typically arithmetic), so the difference in rcp throughput due to gather is typically less than 8 clocks compared to vector loads
That would be true if you could freely increase the arithmetic workload. In reality the ratio between loading data and processing it is practically fixed by the algorithm. It looks like Haswell should be capable of sustaining one gather instruction and one FMA each cycle. That way things like polynomials with table lookups and matrix multiplications can execute at high performance. The floating-point throughput for such algorithms would be considerably lower without gather. Not eightfold in practice, but still enough of a difference for gather support to be a game changer.
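
To make that concrete, here is a hypothetical sketch of a table-driven Horner step in AVX2/FMA intrinsics (the table, names, and layout are illustrative assumptions of mine): one gather feeding one FMA, exactly the 1:1 load/arithmetic ratio such a machine would need to sustain.

Code:
#include <immintrin.h>

/* hypothetical per-bucket coefficient table, filled elsewhere */
static const float coeffs[256];

/* one Horner step of a polynomial whose coefficients come from a table
   lookup: exactly one gather and one FMA per step, so a core sustaining
   one gather and one FMA per cycle keeps both pipes busy */
static inline __m256 horner_step(__m256 acc, __m256 x, __m256i bucket)
{
    __m256 c = _mm256_i32gather_ps(coeffs, bucket, 4);  /* vgatherdps */
    return _mm256_fmadd_ps(acc, x, c);                  /* acc = acc*x + c */
}
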
so you must know him very, very well indeed; out of curiosity I googled "c0d1f1ed" and found the similarity with your posts here truly striking
We're partners. He's in graphics, I'm in physics, and we both use throughput computing technology so we exchange ideas regularly.
it's an estimate based on the actual behavior of VMASKMOVPS on Sandy Bridge
It's probably VMOVMSKPS you want to be looking at, not VMASKMOVPS.
gather will be more complex with a least one stage in the generic permute unit (speculation of mine)
So you imagine it loading 256-bit parts that are each sent to the permute unit? That's an interesting theory, but how would it handle all the address generation VSIB supports, and unaligned elements?

Note that the load units already contain shifter logic to select an element from a cache line, and it already has logic for dealing with data split over multiple cache lines. These shifters are relatively cheap; they've existed for ages. Gather simply has to extend them to eight elements. Keep in mind that reusing logic should not be one of Intel's goals. Achieving high performance/Watt is, so highly dedicated logic in the load unit makes the most sense.
more generally I'll not count on the first implementation of gather to reach 100% of its potential, much like 256-bit unaligned loads aren't top speed on SNB/IVB
Knights Corner has 50+ cores capable of 512-bit gather. Surely Haswell can have four dedicated 256-bit ones.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
That would be true if you could freely increase the arithmetic workload.
it's true in all the use cases I'm aware of; you always gather data for some processing. if you have a relevant counterexample, post it


It's probably VMOVMSKPS you want to be looking at, not VMASKMOVPS.
no, no, I really mean masked loads


We're partners. He's in graphics, I'm in physics, and we both use throughput computing technology so we exchange ideas regularly.
IIRC you mentioned in some old threads that you know graphics well and that you have published books about DirectX; I suppose I'll be convinced if you provide a pointer to these books and it's not a reference to c0d1f1ed's paper in ShaderX3
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
it's true in all the use cases I'm aware of; you always gather data for some processing. if you have a relevant counterexample, post it
As explained, I was referring to the ratio of loading data and processing it. You typically can't increase the arithmetic workload without also loading more data. Hence the fact that emulating gather with extract/insert leaves some execution ports underutilized shouldn't be regarded as a useful benefit. More often than not it represents a loss.
no,no, I really mean masked loads
My bad, MASKMOV used to be a store instruction only. I didn't realize AVX added V(P)MASKMOV variants which support both load and store. I should test how they behave... What exactly did you want me to pay attention to that could be related to Haswell's gather implementation?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
As explained, I was referring to the ratio of loading data and processing it.
me too; there are typically several instructions to process the data you have gathered, and these can be scheduled in parallel with the next gather since several iterations of your loop are typically in flight at the same moment. it shows very well when you study the timings: you can basically add instructions using non-conflicting ports for free, up to the point where you pass beyond the gather critical path

test how they behave... What exactly did you want me to pay attention to that could be related to Haswell's gather implementation?
for example, you can compare its throughput with an equivalent vmovaps + vblendvps (2 uops) sequence; I'd advise testing it for masked stores as well. another interesting experiment is to compare the timings of 256-bit movups (loads & stores) on unaligned memory vs. the same access split into two 128-bit loads/stores. in all these examples the multi-instruction sequence is faster on SNB/IVB than the single instruction it replaces
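
expressed in intrinsics, the first comparison is simply the following (a sketch with my own names; for the timing both variants must of course run in a loop over valid, L1-resident memory):

Code:
#include <immintrin.h>

/* single-instruction version: vmaskmovps */
__m256 load_masked(const float *p, __m256i mask)
{
    return _mm256_maskload_ps(p, mask);
}

/* two-instruction version: vmovaps + vblendvps, equivalent as long as
   the whole 256-bit load targets accessible memory (vmaskmovps also
   suppresses faults on masked-out elements, vmovaps doesn't) */
__m256 load_blend(const float *p, __m256 mask)
{
    __m256 v = _mm256_load_ps(p);
    return _mm256_blendv_ps(_mm256_setzero_ps(), v, mask);
}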
 

iCyborg

Golden Member
Aug 8, 2008
1,350
62
91
To solve that dilemma you need a homogeneous architecture that can execute both sequential and parallel code efficiently. AVX2 is a massive leap in that direction. And with AVX-1024 they would basically be able to 'morph' the CPU into behaving much more like a GPU architecture, fully dynamically, with zero effort from software developers. That is the future of computing.

It is the only future for computing.
So why did Intel develop Xeon Phi? It comes with the same PCIe limitations as GPUs, and instead of going very wide, they've opted for a many-core approach with 512-bit wide SIMD. That goes pretty much against what you say is the future of computing. Why not 15 cores that are 2048-bit wide, or 30 x 1024-bit? Or why bother spending resources at all on something that CPUs should be making obsolete in the near future?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So why did Intel develop Xeon Phi? It comes with the same PCIe limitations as GPUs, and instead of going very wide, they've opted for a many-core approach with 512-bit wide SIMD. That goes pretty much against what you say is the future of computing. Why not 15 cores that are 2048-bit wide, or 30 x 1024-bit? Or why bother spending resources at all on something that CPUs should be making obsolete in the near future?

Don't you foresee the evolution here for Intel of the eventual merging of Phi and Core on-die in their own big.LITTLE version of heterogeneous hardware processing?

14nm: keep 4 or 6 "big" cores and then bolt on, say, 8 or 10 "Phi" cores. Make the OS "core type" aware, the same as they do now when scheduling hyperthreading or CMT loadings, so the user sees performance from their big cores first, and the small cores get loaded only when called upon.

Then for laptops/mobile you do the reverse: power down all the big cores and use the little cores to do what needs doing while sipping power, depending on the power profile set by the user/OS/BIOS.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
How fast can an AVX2 gather instruction complete, start to finish, assuming all L1 hits? You can only get a couple of words at most from the L1 per cycle, so it's going to take several cycles to complete, right? How much faster is this in practice compared to doing multiple independent loads (which can execute in parallel as per the out-of-order engine)?

I see that once all the registers are loaded the high vector degree will help a lot, but I'm still not seeing the advantage of the scatter/gather instructions, other than slightly reducing code size (maybe).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
How fast can an AVX2 gather instruction complete, start to finish, assuming all L1 hits?

start-to-finish (execution latency) isn't very useful for practical purposes; that's why we talk about the reciprocal throughput (~ the average number of cycles per instruction in sustained use) here. my estimate is 2 cycles best case (all elements from the same cache line)

You can only get a couple of words at most from the L1 per cycle,
On Haswell, 64B per cycle for loads (1 cache line, or 2 YMM registers) looks likely, based on the actual SSE vs. AVX2 speedups disclosed at IDF Spring

so it's going to take several cycles to complete, right? How much faster is this in practice compared to doing multiple independent loads
my estimate is that it will be roughly 4x faster best case; CPUarchitect thinks it will be more but hasn't provided a revised value so far after his overly optimistic 18x speedup
 
Last edited:

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
start-to-finish (execution latency) isn't very useful for practical purposes; that's why we talk about the reciprocal throughput (~ the average number of cycles per instruction in sustained use) here. my estimate is 2 cycles best case (all elements from the same cache line)

On Haswell, 64B per cycle for loads (1 cache line, or 2 YMM registers) looks likely, based on the actual SSE vs. AVX2 speedups disclosed at IDF Spring

my estimate is that it will be roughly 4x faster best case; CPUarchitect thinks it will be more but hasn't provided a revised value so far after his overly optimistic 18x speedup

Thanks for the insight. Again, my understanding was that the point of scatter/gather was that the data could be in arbitrary locations in memory, and you didn't have to worry about co-locating it within the same cache line. Am I missing or misunderstanding something here? If you have to co-locate everything in the same cache line, is that very different from existing vector extensions? Just wider?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Thanks for the insight. Again, my understanding was that the point of scatter/gather was that the data could be in arbitrary locations in memory, and you didn't have to worry about co-locating it within the same cache line. Am I missing or misunderstanding something here?

elements can be anywhere in memory, even gigabytes apart with 64-bit indices (some elements cached, others not yet cached, etc.); the single-cache-line scenario is only for the sake of argument when discussing the best potential speedup

now, in a lot of use cases the indices have low variance within a vector, so it's not that uncommon to touch only one cache line or two with an 8-element gather
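
a small example of what I mean, written with the upcoming AVX2 instruction (the AOS layout and names are a hypothetical illustration of mine):

Code:
#include <immintrin.h>

/* hypothetical 16-byte AOS layout */
struct particle { float x, y, z, w; };

/* gathering the 'x' field of 8 consecutive particles: every lane has a
   different address, yet the 8 elements span only 128 bytes, i.e. 2
   (at worst 3) cache lines */
__m256 gather_x(const struct particle *p)
{
    const __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    return _mm256_i32gather_ps(&p->x, idx, 4);   /* stride of 4 floats */
}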
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Alright, I've collected some hard data to get a better picture of how homogeneous and heterogeneous computing will compare...

Sandy/Ivy Bridge's emulation of a gather operation currently takes:
  • 8 uops on port 0 (ALU)
  • 6 uops on port 1 (ALU)
  • 4 uops on port 2 (load)
  • 4 uops on port 3 (load)
  • 0 uops on port 4 (store)
  • 8 uops on port 5 (ALU)
Note that all the ALU ports are pretty swamped. The maximum reciprocal throughput is 8 cycles, and there are only two unused cycles, on port 1. In theory this means you should be able to squeeze in something like two additions, but in practice everything I tested caused the throughput to go down significantly. And in practice it would be very rare to be able to freely add more arithmetic work, even more so work that only uses port 1.

So this irrefutably proves that emulating gather has bad throughput, and no matter how Haswell implements dedicated gather support it will be a profound improvement. The worst possible implementation I can imagine just uses the two scalar load ports to achieve a reciprocal throughput of 4 cycles, allowing Haswell's gather performance to keep up with the doubling of the arithmetic throughput. So even the worst option is good enough.

But given that Intel focuses on performance/Watt, it doesn't make sense to always fetch 4096 bits' worth of cache lines for a 256-bit gather. It also wouldn't make sense to send 256-bit index register data to each load port, four times. Also Larrabee/MIC was presented as getting support for collecting any number of elements from one cache line each cycle. Hence I'm sticking with the widely supported speculation that Haswell will have a single 256-bit gather port (and a 256-bit regular load port) capable of sustaining one gather instruction each cycle when the elements are in the same cache line. It could use the second load port to pass back the updated mask register!

I've seriously considered the possibility that they'd use the permute unit (currently bound to port 5), but there's no clear way it could support unaligned elements, write back the mask register, or issue a varying number of load operations.

So feel free to prove me wrong with other hard data or by presenting an alternative gather implementation which supports all its features, but there are clearly strong arguments to support that Haswell's gather implementation will be much more efficient than emulating it as done today.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Haswell's gather implementation will be much more efficient than emulating it as done today.

indeed, as stated in my post above [1], I estimate it will be up to 4x faster with the new instruction than with the legacy code path; that's a very strong enhancement in my book. what's your estimate?

don't forget that we have to initialize the mask before each gather, so my 2-clock rcp throughput may well be too optimistic

[1]: http://forums.anandtech.com/showpost.php?p=33616759&postcount=113
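
for illustration, here is a sketch (my own names) of a gather loop with the AVX2 masked form; since the instruction clears its mask on completion, the mask has to be re-initialized before every gather:

Code:
#include <immintrin.h>

__m256 sum_gathered(const float *base, const __m256i *idx, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; ++i) {
        /* the mask is an input *and* an output of vgatherdps (cleared
           when the instruction completes), hence this per-iteration
           re-initialization */
        __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
        __m256 g = _mm256_mask_i32gather_ps(_mm256_setzero_ps(),
                                            base, idx[i], mask, 4);
        acc = _mm256_add_ps(acc, g);
    }
    return acc;
}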
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
Sandy/Ivy Bridge's emulation of a gather operation currently takes:
  • 8 uops on port 0 (ALU)
  • 6 uops on port 1 (ALU)
  • 4 uops on port 2 (load)
  • 4 uops on port 3 (load)
  • 0 uops on port 4 (store)
  • 8 uops on port 5 (ALU)

is it something you have actually tested? or is it simply based on my IACA dump that we discussed in the past? note that IACA [1] was recently updated with a proper SNB profile plus a new IVB profile; don't forget to use the latest version

[1]: IACA download http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
Also Larrabee/MIC was presented as getting support for collecting any number of elements from one cache line each cycle.

the exact phrasing in [1], page 61 is:

"Note that, depending on the implementation, only one 64-byte memory access is performed for a variable number of vector elements located in that region."

it says nothing about the number of cycles

moreover, in [1], all the gather instructions have this comment in their description:

"Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero)."

it definitely doesn't look like something you can do with single-cycle throughput


[1]: MIC ref. manual June 6, 2012 http://software.intel.com/file/44500
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
indeed, as stated in my post above [1], I estimate it will be up to 4x faster with the new instruction than with the legacy code path; that's a very strong enhancement in my book. what's your estimate?

don't forget that we have to initialize the mask before each gather, so my 2-clock rcp throughput may well be too optimistic
After the detailed analysis I concluded it will likely be capable of sustaining a peak throughput of one gather operation each cycle. The mask register can be initialized using vcmpeq on port 1 and compacted by a vmovmsk on port 0; then port 3 can do the actual gather load, port 2 writes back the emptied mask register, and port 5 performs a vblend into the destination register.

Note that this uses the port mappings already present in Sandy Bridge. They just have to widen the load ports to 256-bit and make one capable of extracting 8 elements from a cache line.
is it something you have actually tested? or is it simply based on my IACA dump that we discussed in the past? note that IACA [1] was recently updated with a proper SNB profile plus a new IVB profile; don't forget to use the latest version
I couldn't find any discussion of the throughput of a generic gather emulation, so yes, I've run it through the latest IACA and verified the timings on an i7-2600 with Turbo Boost disabled.
the exact phrasing in [1], page 61 is:

"Note that, depending on the implementation, only one 64-byte memory access is performed for a variable number of vector elements located in that region."

it says nothing about the number of cycles

moreover, in [1], all the gather instructions have this comment in their description:

"Programmers should always enforce the execution of a gather/scatter instruction to be re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are zero)."

it definitely doesn't look like something you can do with single-cycle throughput
Someone actually pointed me to that same text as an indication that Knights Corner can gather any number of elements from a whole cache line each cycle. It needs a JKNZD instruction to loop around when the elements are spread over multiple cache lines, but this executes in parallel on the scalar pipeline, so the peak throughput is one cycle.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
AVX added V(P)MASKMOV variants which support both load and store. I should test how they behave...

out of curiosity I have measured precisely the difference in timing between a kernel using VMOVAPS and one using VMASKMOVPS, all other things being equal; the difference I get (workload entirely in the L1D cache) is:

1.55 +/- 0.08 cycles

in other words, the total rcp throughput of a 256-bit masked load is more than 2 cycles on Ivy Bridge

test config: Core i7 3770K @ 3.5 GHz, Enhanced SpeedStep off, turbo off

the IACA v2.0.1 dump is as follows:
Code:
| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   3    | 0.6       |     | 0.5   1.0 | 0.5   1.0 |     | 1.4 | CP | vmaskmovps ymm1, ymm0, ymmword ptr [ebp+ecx*1-0x458]
|   2^   | 1.0       |     | 0.5   1.0 | 0.5   1.0 |     |     | CP | vmulps ymm2, ymm1, ymmword ptr [0x1238520]
|   2^   |           | 1.0 | 0.5   1.0 | 0.5   1.0 |     |     | CP | vaddps ymm3, ymm2, ymmword ptr [0x1238540]
|   2    |           |     | 0.5       | 0.5       | 2.0 |     |    | vmovups ymmword ptr [ebp+ecx*1-0x858], ymm3
|   1    | 0.1       | 0.6 |           |           |     | 0.3 |    | add ecx, 0x20
|   1    |           |     |           |           |     | 1.0 |    | cmp edx, 0x100
|   0F   |           |     |           |           |     |     |    | jb 0xffffffbf
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
out of curiosity I have measured precisely the difference in timing between a kernel using VMOVAPS and one using VMASKMOVPS, all other things being equal; the difference I get (workload entirely in the L1D cache) is:

1.55 +/- 0.08 cycles

in other words, the total rcp throughput of a 256-bit masked load is more than 2 cycles on Ivy Bridge
I don't think that's a correct conclusion. VMASKMOV consists of more uops than VMOVAPS, so if you're already occupying the ports with the extra uops then it adds to the critical path. It's just a coincidence that your VMOVAPS can use underutilized ports, so you basically got it for free. That doesn't mean VMASKMOV takes 2 cycles, nor that VMOVAPS takes 0 cycles.

It's no surprise that VMASKMOV adds one uop on port 0 and one on port 5: it compacts the mask register into an immediate operand (like VMOVMSK) on port 0, then loads the data on ports 2+3, and then uses a VBLEND uop on port 5 to write the result into the destination register (using the immediate mask).

It fully supports the theory that VGATHER would use ports 0 and 5 for the mask compaction and the blend respectively. That only leaves the gather load and the write-back of the full mask register, which can be achieved by widening ports 2 and 3 to 256-bit and having one of them collect all elements within a cache line. Hence, just like VMASKMOV, it would have a reciprocal throughput of 1 cycle.
and the impact of branch misses is?
Nothing. Knights Corner has 4-way SMT so it can simply switch to another thread on a branch miss.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I don't think that's a correct conclusion.

If we can't agree on a methodology to assess the throughput of current instructions on actual hardware, I'm afraid we won't be able to assess the speed of the gather instructions when we're finally able to test them, so I suppose the next step is to define a common test framework that all peers can compile and run
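
as a possible starting point, here is a minimal sketch of such a framework (gcc-flavored, with a placeholder kernel; as in my config above, turbo and SpeedStep must be off for rdtsc to count core clocks):

Code:
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc */

#define N    256          /* 1 KB of floats: stays in the L1D cache */
#define REPS 100000

static float buf[N] __attribute__((aligned(32)));

int main(void)
{
    __m256 acc = _mm256_setzero_ps();
    uint64_t t0 = __rdtsc();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < N; i += 8)
            /* placeholder kernel under test: replace with the
               instruction sequence being measured */
            acc = _mm256_add_ps(acc, _mm256_load_ps(buf + i));
    uint64_t t1 = __rdtsc();

    float sink[8];
    _mm256_store_ps(sink, acc);   /* keep the result live */
    printf("%.2f cycles per iteration (%g)\n",
           (double)(t1 - t0) / ((double)REPS * (N / 8)), sink[0]);
    return 0;
}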
 