It takes time for a logic signal to propagate. Say, for instance, you have a chain of inverters (NOT gates) driven by a single input. Each gate has some propagation delay; call it T_inv.
So if you have N inverters, it takes N * T_inv for the output to settle after the input bit changes. If this is a clocked system and the output has to be correct before the input changes again, the clock period must be at least N * T_inv seconds, meaning your clock frequency is limited to 1 / (N * T_inv).
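As a quick numeric check of that limit, here is a minimal sketch. The function name and the example values (10 inverters at 100 ps each) are made up for illustration, and delays are kept in integer picoseconds so the arithmetic stays exact.

```python
def max_clock_ghz(n_inverters, t_inv_ps):
    """Upper bound on clock frequency (GHz) for a chain of inverters.

    The signal needs n_inverters * t_inv_ps picoseconds to settle, and the
    clock period can be no shorter than that. A period of 1000 ps is 1 GHz.
    """
    path_delay_ps = n_inverters * t_inv_ps
    return 1000.0 / path_delay_ps


# Illustrative values only: 10 inverters, 100 ps per inverter.
print(max_clock_ghz(10, 100))  # 1000 ps path, so 1.0 GHz
print(max_clock_ghz(20, 100))  # twice the path, so 0.5 GHz
```

Doubling the number of gates in the path halves the achievable clock frequency.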
Now let's say you split the chain so there are N/2 inverters, followed by a register, followed by another N/2 inverters. After the input changes, the signal at the middle register only needs (N/2) * T_inv to settle before the register can sample it. Similarly, the second set of N/2 inverters only needs (N/2) * T_inv to settle after the middle register last changed its value.
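The split chain can be sketched at the cycle level to see what the register does to the output. This is a toy model, not real hardware: the function names, the register's reset value of 0, and the input sequence are all assumptions for illustration.

```python
def inverter_chain(bit, n):
    """Pass a single bit through n inverters in series."""
    for _ in range(n):
        bit ^= 1
    return bit


def run_unpipelined(inputs, n):
    """One new input per clock cycle; all n inverters settle within the cycle."""
    return [inverter_chain(b, n) for b in inputs]


def run_pipelined(inputs, n, reset=0):
    """Same chain with one register in the middle (reset value is assumed)."""
    reg = reset
    outputs = []
    for b in inputs:
        # During the cycle, the second half works on the value the register
        # captured at the previous clock edge.
        outputs.append(inverter_chain(reg, n - n // 2))
        # At the next clock edge, the register captures the first half's result.
        reg = inverter_chain(b, n // 2)
    return outputs


inputs = [1, 0, 0, 1, 1, 0]
print(run_unpipelined(inputs, 4))  # [1, 0, 0, 1, 1, 0]
print(run_pipelined(inputs, 4))    # [0, 1, 0, 0, 1, 1]
```

The pipelined output is the unpipelined output shifted by one cycle, with the first value coming from the register's reset state.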
Of course, this means the output now appears one clock cycle later than it did before (i.e., we've added 1 clock cycle of latency). However, we get to run the clock roughly twice as fast. There is a direct trade-off here. There is a limit to how much you can pipeline, though: every register adds its own fixed delay (setup time and clock-to-output delay), and a stage can't be shorter than a single gate, so at some point adding more pipeline stages doesn't improve performance any more.
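To put rough numbers on the trade-off and on the limit, here is a minimal sketch under a simplified timing model. It assumes every stage ends in a register that costs a fixed t_reg_ps (setup plus clock-to-output), and that latency is the stage count times the clock period. The function name and the values (16 gates, 100 ps per gate, 50 ps per register) are invented for illustration.

```python
import math


def pipeline_timing(n_gates, t_gate_ps, t_reg_ps, stages):
    """Return (clock period, total latency) in ps for an evenly split chain.

    Assumed model: the slowest stage holds ceil(n_gates / stages) gates, and
    each stage pays a fixed register overhead of t_reg_ps.
    """
    gates_per_stage = math.ceil(n_gates / stages)
    period_ps = gates_per_stage * t_gate_ps + t_reg_ps
    latency_ps = stages * period_ps
    return period_ps, latency_ps


# Illustrative values only: 16 gates, 100 ps per gate, 50 ps per register.
for stages in (1, 2, 4, 8, 16):
    period, latency = pipeline_timing(16, 100, 50, stages)
    print(f"{stages:2d} stages: period {period} ps, latency {latency} ps")
```

With these numbers, going from 1 to 2 stages cuts the period from 1650 ps to 850 ps, but going from 8 to 16 stages only takes it from 250 ps to 150 ps, while the total latency climbs from 1650 ps to 2400 ps. Past one gate per stage, the period cannot shrink any further in this model.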