The big deal with AVX2 is that it offers a vector equivalent of every scalar instruction, which is the same principle that makes GPUs so fast. It allows any code loop with independent iterations to run in SPMD fashion. So with AVX2, a fair bit of code could run up to 8 times faster, since a 256-bit register holds eight 32-bit elements.
AVX1 didn't make much of an impact because it widened only the floating-point instructions to 256-bit. Doubling the theoretical floating-point performance looked good on paper, but the extension is very awkward to use when you need to mix in a few integer operations: you have to process the upper and lower 128-bit halves separately, with additional instructions to move data to and from the upper half of the vector register. Furthermore, Sandy Bridge doesn't increase the cache bandwidth, so AVX1 code is often starved for data. But it set the stage for AVX2 by introducing the VEX encoding format, widening the registers, using two execution stacks, etc. Basically, AVX2 was too big to implement all at once, so a compromise had to be made, and that compromise became AVX1. The announcement of AVX was still pretty huge in the sense that Intel made the bold move to widen the SIMD instructions, although it quickly became clear that the first version would lack some vital pieces.
Thankfully, Haswell will fix all the shortcomings of Sandy Bridge at once. The integer vector instructions are widened to 256-bit (meaning a single instruction can do what previously took 3-4), it will bring the long-awaited gather support (allowing up to 8 data elements to be loaded in parallel instead of sequentially), it will add vector-vector shifts (a relatively simple instruction that has nonetheless been missing from the SIMD instruction set for 10 years), it doubles the theoretical floating-point performance again with FMA, and it will inevitably double the cache bandwidth.