You're pretty close. Just because a processor is "pipelined" doesn't mean every single one has to be 5 stages. Plus, prefetch is when you grab instructions you don't need yet -- so the first general step is just "fetch".
And you're right on about it taking 5 cycles (if it has a pipeline length of 5 stages). And the idea is exactly as you said -- to be able to issue another instruction every cycle even though each one takes 5 cycles to get all the way through. Just like an assembly line. (Hennessy and Patterson like the "washing machine" analogy).
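If it helps to see the arithmetic, here's a rough sketch of the assembly-line idea. It assumes an ideal pipeline (no stalls, no branches, one instruction issued per cycle), and the function names are just made up for this post:

```python
def pipelined_cycles(num_instructions, stages=5):
    """Cycles to finish a stream on an ideal pipeline (assumes no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs `stages` cycles to get through;
    # after that, one more instruction finishes every cycle.
    return stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions, stages=5):
    """Cycles if each instruction must finish before the next one starts."""
    return stages * num_instructions


for n in (1, 10, 100):
    print(n, "instructions:", unpipelined_cycles(n), "cycles unpipelined,",
          pipelined_cycles(n), "cycles pipelined")
```

So a single instruction still takes 5 cycles either way, but for a long stream the pipelined version gets close to one instruction per cycle.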
The Pentium Pro is something like 14 stages (I think that's right -- I don't have the number memorized). It's actually slower to have more instructions in at once, due to something called a pipeline flush. When there's a branch in the code (like an "if this, do this" "else, do that" section in a program), the processor will guess which way it goes. Why wait 14 cycles when you can guess? That's what all modern processors do, most with a ~90%+ correct "guessing" rate. This is called "branch prediction". Well, I'd explain more, but you're getting the ideas correct from aceshardware, and they explain that later on.

Just keep reading.
So it's technically slower to have more instructions in the 'pipeline' at once, because when it guesses wrong, there's more work that has to be thrown away and restarted. But the more stages (all else equal), the higher the chip will be able to clock in the same manufacturing process, since each stage has less work to do per cycle.
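To put rough numbers on that tradeoff, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not a measurement of any real chip: the 20% branch fraction, the flush penalty of roughly (stages - 1) cycles, and the 100 MHz vs 200 MHz clocks.

```python
def effective_cpi(branch_fraction, mispredict_rate, flush_penalty, base_cpi=1.0):
    """Average cycles per instruction once wrong guesses are paid for."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


def ns_per_instruction(cpi, clock_mhz):
    """Average time per instruction in nanoseconds at a given clock."""
    return cpi * 1000.0 / clock_mhz


# Made-up workload numbers (assumptions, not data).
BRANCH_FRACTION = 0.20   # assume 1 in 5 instructions is a branch
MISPREDICT_RATE = 0.10   # i.e. a 90% correct guessing rate

# Assume a wrong guess costs about (stages - 1) cycles of thrown-away work.
short_cpi = effective_cpi(BRANCH_FRACTION, MISPREDICT_RATE, flush_penalty=4)
long_cpi = effective_cpi(BRANCH_FRACTION, MISPREDICT_RATE, flush_penalty=13)

# Hypothetical clocks: the deeper pipeline is assumed to clock twice as high.
print(f"5-stage:  CPI {short_cpi:.2f}, "
      f"{ns_per_instruction(short_cpi, 100):.1f} ns per instruction")
print(f"14-stage: CPI {long_cpi:.2f}, "
      f"{ns_per_instruction(long_cpi, 200):.1f} ns per instruction")
```

With these made-up numbers the 14-stage design burns more cycles per instruction (about 1.26 vs about 1.08), but it still comes out ahead in real time if it really does clock that much higher. Change the clocks or the guessing rate and the answer can flip.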
Yep, superscalar means that it can issue more than one instruction per cycle, so it can have, say, 2, 3, 4, etc. instructions starting down the pipeline at the same time.
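Same kind of sketch for superscalar, again assuming an ideal machine: no dependencies between instructions, no stalls, and the issue width is always fully used (which real code never quite manages). The function name is just for illustration:

```python
def superscalar_cycles(num_instructions, width=2, stages=5):
    """Cycles to finish a stream when `width` instructions issue per cycle."""
    if num_instructions == 0:
        return 0
    # Number of issue groups, rounded up (the last group may be partly empty).
    groups = (num_instructions + width - 1) // width
    # The first group takes `stages` cycles; one more group finishes each cycle.
    return stages + (groups - 1)


for width in (1, 2, 4):
    print("width", width, "->", superscalar_cycles(100, width=width), "cycles")
```

With a width of 1 this is just the plain pipeline again; wider issue cuts the cycle count, but only as far as the code actually has independent instructions to issue together.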
You're getting there....
Welcome to computer architecture.