That wouldn't work, and it's not that difficult to see why.

I'll try to explain but I'm not good at that...
But he said L2, and I have no idea how L2 hits are supposed to be representative of anything related to instruction throughput. I guess by "intentionally hitting" he means having a big instruction stream that blows out the L1 icache. I'd be really alarmed if the core could sustain its maximum fetch rate while streaming from L2 cache.
The idea is to avoid L1d, not L1i.
You can measure issue width by intentionally letting your decode run ahead by repeatedly missing L1d.
Basically, build a large block of pointers, sized to fit in L2 but not in L1d, where each one points to a random other one within the block. Then pointer chase through all of them like:
void **a = start;
for (size_t i = 0; i < count; i++) {
    a = *a;   /* each load's address comes from the previous load */
}
Measure the time taken; this is the time it takes to just pointer chase.
Then add arithmetic to each iteration, so that all the operations in the iteration depend on the previous memory load, and the next load depends on the arithmetic. Measure the time taken, subtract the time for the bare pointer chase, and you have a reasonably close measurement of the time it took to do the arithmetic, all of it running from the ROB, already decoded.
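For what it's worth, here is a rough C sketch of that setup. Everything named here (`build_chain`, `chase`, `chase_with_alu`, `elapsed_seconds`, `opaque_zero`, `chase_sink`) is made up for illustration. Two deliberate deviations from the description above: the chain is built as a single random cycle (Sattolo's algorithm) instead of fully random pointers, so the chase can't get stuck in a short loop, and the arithmetic is masked with a runtime zero so the compiler can't fold it away. A serious measurement would use hand-written assembly and check the disassembly, since the compiler decides how many ops really end up in the loop.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helpers: a zero the compiler cannot see through, and a sink
 * that keeps the chase result alive so the loop is not optimized out. */
volatile uintptr_t opaque_zero = 0;
void **volatile chase_sink;

/* Build n pointer slots forming one random cycle (Sattolo's algorithm).
 * Returns NULL on allocation failure; the caller frees the result. */
void **build_chain(size_t n, unsigned seed)
{
    assert(n >= 2);
    void **slots = malloc(n * sizeof *slots);
    size_t *perm = malloc(n * sizeof *perm);
    if (!slots || !perm) { free(slots); free(perm); return NULL; }
    for (size_t i = 0; i < n; i++) perm[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* 0 <= j < i */
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++) slots[i] = &slots[perm[i]];
    free(perm);
    return slots;
}

/* Baseline: nothing but dependent loads. */
void **chase(void **start, size_t count)
{
    void **a = start;
    for (size_t i = 0; i < count; i++)
        a = *a;
    return a;
}

/* Same chase, plus a few ops that all depend on the loaded value and feed
 * the next load address. With k == 0 at runtime the address is unchanged. */
void **chase_with_alu(void **start, size_t count)
{
    const uintptr_t k = opaque_zero;            /* read once, value unknown to the compiler */
    void **a = start;
    for (size_t i = 0; i < count; i++) {
        uintptr_t v = (uintptr_t)*a;
        uintptr_t x0 = (v + 1) & k;
        uintptr_t x1 = (v + 2) & k;
        uintptr_t x2 = (v + 3) & k;
        uintptr_t x3 = (v + 4) & k;
        a = (void **)(v + ((x0 | x1) | (x2 | x3)));
    }
    return a;
}

/* Wall-clock seconds spent running fn(start, count). */
double elapsed_seconds(void **(*fn)(void **, size_t), void **start, size_t count)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    chase_sink = fn(start, count);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

To use it, pick `n` so that `n * sizeof(void *)` is bigger than L1d but fits in L2 (32768 slots is 256 KiB with 8-byte pointers, which is only a guess for any given core), then take `elapsed_seconds(chase_with_alu, slots, count) - elapsed_seconds(chase, slots, count)` as the arithmetic time. Padding each slot out to a full cache line would make the L1d misses more reliable; I left that out to keep it short.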
The point is not to get average IPC above decode width; it's to make sure average IPC is pinned at 0 for a known amount of time, so you can deduce the peak IPC from the average.
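To spell out that deduction, here is the arithmetic under a simplifying assumption: the core does nothing useful while a load is outstanding, then runs the arithmetic for that iteration at its peak rate. The symbols are my own labels: $N$ iterations, $k$ arithmetic ops per iteration, clock frequency $f$ in Hz, and times in seconds.

```latex
\[
T_{\text{arith}} = T_{\text{total}} - T_{\text{chase}},
\qquad
\text{IPC}_{\text{peak}} \approx \frac{N \cdot k}{f \cdot T_{\text{arith}}}
\]
```

Here $T_{\text{chase}}$ is the time for the bare pointer chase and $T_{\text{total}}$ is the time with the arithmetic added. This should only be trusted when $k$ is large enough to keep the execution units busy for a few cycles, yet small enough that a whole iteration's worth of ops can sit decoded in the ROB while the load is in flight.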
For a beginner, what are the downsides of making a CPU wider? Is it harder to increase clock speed? Without the proper software design, will the extra-wide CPU go unused and thus be wasted die space? Any other problems?
Getting results from units to each other gets more complex and expensive the more there are of them. Having more units means you need to spread them out further away from your register file to fit them. These both cost clock speed and area. If you have variable-width instructions, like x86, fast decode gets really expensive and hard really quickly with added width. All software has some ILP limit, and it varies a lot, so adding more width gives diminishing utility.
Note that these penalties don't apply in some situations, specifically when you let the FP units and the integer units issue in parallel -- as they don't typically forward data between each other, you don't need fast paths, and they operate on independent register files.