Rumor: Intel to delay releasing Ivy Bridge


Tuna-Fish

Golden Member
Mar 4, 2011
1,645
2,464
136
Can anyone explain to me WHAT gather and scatter instructions are?! Are they supposed to make things multithreaded automagically?

Scatter gather = vector indirect memory access. Gather eats a vector of memory addresses, and spits out a vector of values retrieved from those addresses. Scatter eats a vector of values and a vector of addresses, and stores those values into those addresses in memory. Potentially with various more complex addressing modes.

The primary thing that makes vectorizing code hard is memory access, and scatter-gather is a proven solution that has worked on many other platforms. For some problems it just makes life easier for programmers and compiler writers, while other problems simply cannot be solved with sensible performance without it (for instance, texture fetch).

While scatter-gather would be a very major advancement, it would not be a magic bullet that makes all code vectorizable -- it just makes it a lot easier. It is, however, quite expensive to implement -- for example, the present L1 caches would have to get a lot more complex to support 8 simultaneous memory accesses.
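
To spell out the semantics, here's a minimal C sketch of what gather and scatter boil down to, emulated with scalar loads and stores (function names are mine, purely illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative scalar emulation of an 8-wide gather/scatter.
   A hardware implementation would do each of these as a single vector
   instruction instead of eight serial accesses plus insert/extract work. */

/* gather: for each lane i, load base[idx[i]] into out[i] */
static void gather8(float *out, const float *base, const int32_t *idx)
{
    for (size_t i = 0; i < 8; ++i)
        out[i] = base[idx[i]];
}

/* scatter: for each lane i, store val[i] to base[idx[i]] */
static void scatter8(float *base, const int32_t *idx, const float *val)
{
    for (size_t i = 0; i < 8; ++i)
        base[idx[i]] = val[i];
}
```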
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
It is, however, quite expensive to implement -- for example, the present L1 caches would have to get a lot more complex to support 8 simultaneous memory accesses.
The way it's supposedly implemented in Larrabee isn't overly complex. The only restriction is that it can only gather elements from one cache line at a time. But typically the data locality is quite high so it only needs to access a few cache lines and thus merely takes a few clock cycles (instead of tens of clock cycles when emulating it with serial loads and extract/insert type instructions).

That said, Sandy Bridge already uses the cache banking technique to support two simultaneous load operations. So even a fairly straightforward implementation could gather 8 elements in 4 clock cycles. Combining this with Larrabee's technique could even lower that to 1 cycle. It would still be able to access two cache lines per clock, so even with less ideal data locality this would perform really well.

Anyway, there are many trade-offs to be made, but it doesn't have to be very complex per se. They can also initially use a straightforward implementation (it would already increase throughput and reduce power consumption considerably), and when the transistor budget increases and developers start to depend on it more, they can go for a more advanced implementation...
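
To make the cache-line argument concrete, here's a rough C model of the cycle count (my own simplification, not Intel's actual design): count the distinct cache lines the 8 addresses touch, and divide by how many lines the hardware can service per clock (one for a Larrabee-style scheme, two with SNB-style dual banking):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* bytes per L1 line on Sandy Bridge */

/* Rough cycle estimate for an 8-element gather when the hardware can
   service `lines_per_cycle` distinct cache lines each clock. Real hardware
   has many more constraints; this only captures the locality argument. */
static unsigned gather_cycles(const uint64_t addr[8], unsigned lines_per_cycle)
{
    uint64_t lines[8];
    size_t nlines = 0;

    for (size_t i = 0; i < 8; ++i) {
        uint64_t line = addr[i] / CACHE_LINE;
        int seen = 0;
        for (size_t j = 0; j < nlines; ++j)
            if (lines[j] == line) { seen = 1; break; }
        if (!seen)
            lines[nlines++] = line;
    }
    /* ceil(distinct lines / lines serviced per cycle) */
    return (unsigned)((nlines + lines_per_cycle - 1) / lines_per_cycle);
}
```

With high locality (all eight elements in one or two lines) this comes out at 1 cycle either way; with worst-case indices it degrades to 8 or 4 cycles, which is the trade-off described above.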
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Scatter gather = vector indirect memory access. Gather eats a vector of memory addresses, and spits out a vector of values retrieved from those addresses. Scatter eats a vector of values and a vector of addresses, and stores those values into those addresses in memory. Potentially with various more complex addressing modes.

The primary thing that makes vectorizing code hard is memory access, and scatter-gather is a proven solution that has worked on many other platforms. For some problems it just makes life easier for programmers and compiler writers, while other problems simply cannot be solved with sensible performance without it (for instance, texture fetch).

While scatter-gather would be a very major advancement, it would not be a magic bullet that makes all code vectorizable -- it just makes it a lot easier. It is, however, quite expensive to implement -- for example, the present L1 caches would have to get a lot more complex to support 8 simultaneous memory accesses.

SB's L1 cache already allows for 1024-bit access, as shown in the links already posted. That's why Intel told us that AVX scales easily from 256-bit to 1024-bit. Also, as already posted, the VEX prefix allows for 5 operands, 1 of which can be a memory operand. This is why the VEX prefix is so important in Intel's future plans. Intel has a game plan and they're following that plan. Also, the links already posted to the Intel forum cover scatter/gather pretty well.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,645
2,464
136
SB's L1 cache already allows for 1024-bit access, as shown in the links already posted. That's why Intel told us that AVX scales easily from 256-bit to 1024-bit.
Umm, no. The width of the cache access on SNB has practically nothing to do with how well AVX will scale. Also, just accessing long blocks of contiguous memory is a very different thing from vector indirect access.


Also, as already posted, the VEX prefix allows for 5 operands, 1 of which can be a memory operand.

Memory operands have nothing to do with scatter-gather. Gather and scatter take a vector register operand, which is used indirectly to access memory. Hence, "vector indirect access."
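
To put the difference in concrete terms, here's a small C/AVX sketch (function names are mine, purely illustrative): a contiguous 256-bit load takes a single base pointer, while the indexed case has no single-instruction equivalent in AVX as it stands, so the compiler has to fall back to scalar loads plus inserts.

```c
#include <immintrin.h>

/* Contiguous: the whole 256-bit vector comes from one base address.
   This is the case a wide L1 port handles directly. */
static __m256 load_contiguous(const float *base)
{
    return _mm256_loadu_ps(base);
}

/* Indirect: each lane has its own index. Without a gather instruction
   this expands to eight scalar loads plus insert/shuffle work. */
static __m256 load_indexed(const float *base, const int idx[8])
{
    return _mm256_set_ps(base[idx[7]], base[idx[6]], base[idx[5]], base[idx[4]],
                         base[idx[3]], base[idx[2]], base[idx[1]], base[idx[0]]);
}
```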
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Also in links already posted to intel forum they cover scatter gather pretty well.
Gather/scatter is a hot topic, but I don't know of any confirmation from an Intel engineer or official that it's actively in development. Given that AVX focuses on high throughput, and accessing irregular memory patterns has become a serious bottleneck (even more so once FMA support is added), I have no doubt it's taken into consideration by their research team. But if you have additional information suggesting it has entered the development stage, or that it's targeted for a specific architecture (Haswell or Skylake), please share. ^_^
 

WhoBeDaPlaya

Diamond Member
Sep 15, 2000
7,414
402
126
Atom is better now? I'm using an Asus 1015PN netbook with an Atom N550 (1.5GHz dual core) and NV ION2. With Intel graphics, battery life is 6+ hours of browsing/video playback (brightness turned down). With ION2 on I can play 1080p x264 for about 3-4 hours, more if just outputting over HDMI. It does get hot, though.
My M11x R1 gets the same (or slightly more) battery life on a SU7300 and GMA4500/GT335M.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Thanks for the explanation, Tuna!

Will be very interesting to see what Intel does.