It's about dependencies, not total execution time. When the lower 16 bits of an ALU operation are completed, they can be forwarded to the next dependent ALU operation within the same clock cycle, so the dependency latency drops to half: throughput-wise, two dependent ALU operations can complete per clock cycle. It's well explained in the "low latency integer ALU" part of that Intel document.
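To make "dependent ALU operations" concrete, here is a minimal sketch of such a chain in C (my own example, not from the Intel document); every operation needs the previous result, so the pace is set by the ALU-to-ALU forwarding latency, not by issue width:

    /* dep_chain.c - a serial dependency chain: each op consumes the previous
       result, so extra issue width does not help. On a double-pumped ALU that
       forwards the low 16 bits after half a cycle, two such dependent ops can
       complete per clock instead of one. */
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        uint32_t x = (uint32_t)argc;          /* not a compile-time constant */
        for (uint32_t i = 0; i < 100000000u; i++) {
            x = x + i;                        /* depends on previous x */
            x = x ^ 0x9e3779b9u;              /* depends on the add above */
        }
        printf("%u\n", x);                    /* keep the chain observable */
        return 0;
    }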
I understand the concept, but overall it doesn't bring that much extra throughput. Even if it can cope with dependencies, those are still bound by other latencies, among them the load/store unit's.
And related to Zen 4: if you read that Pentium 4 Intel document, you'll find Intel describing how their FPU uses 128-bit registers and ports but 64-bit arithmetic hardware, completing a full 128-bit SSE operation in two clock cycles. It's essentially the same approach AMD uses for Zen 4, just with 512-bit registers and execution ports and 256-bit arithmetic units.
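For what it's worth, the split is invisible at the ISA level; a quick sketch with AVX-512 intrinsics (my own example, needs a CPU and compiler with AVX-512F, e.g. gcc -O2 -mavx512f):

    /* avx512_add.c - a single 512-bit add from software's point of view.
       On a core with 256-bit vector datapaths this one instruction is
       executed internally as two 256-bit halves; the architectural result
       is identical either way. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m512i a = _mm512_set1_epi32(1);
        __m512i b = _mm512_set1_epi32(2);
        __m512i c = _mm512_add_epi32(a, b);   /* one 512-bit op at ISA level */

        int out[16];
        _mm512_storeu_si512(out, c);
        printf("%d\n", out[0]);               /* prints 3 */
        return 0;
    }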
I don't know all of AMD's approaches in detail, but the operands are indeed 64 bits whatever the vector width; 128b is just 2 x 64b in a row.
Whether to put in enough arithmetic units to execute everything in one cycle, or to reuse fewer units and execute the ops in several passes, is up to the designers. The latter can be efficient enough for mixed code; I guess AMD uses this approach to save some complexity, and the power and added silicon that come with it. So far this seems to work well for AVX-512 compared to Intel's wider units, at least from a perf/watt point of view.
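A back-of-envelope comparison of the two design points, with purely illustrative numbers (2 vector pipes, 512-bit ops on 32-bit elements), not vendor specs:

    /* peak.c - rough peak throughput when a vector op is executed in
       vector_bits/datapath_bits passes through each pipe. */
    #include <stdio.h>

    static double peak_elems_per_cycle(int vector_bits, int datapath_bits,
                                       int pipes, int elem_bits) {
        /* instructions per cycle across all pipes, then elements per instr */
        double instr_per_cycle = (double)pipes * datapath_bits / vector_bits;
        return instr_per_cycle * (vector_bits / elem_bits);
    }

    int main(void) {
        printf("2 x 256-bit pipes: %.0f elems/cycle\n",
               peak_elems_per_cycle(512, 256, 2, 32));   /* 16 */
        printf("2 x 512-bit pipes: %.0f elems/cycle\n",
               peak_elems_per_cycle(512, 512, 2, 32));   /* 32 */
        return 0;
    }

Half the peak on paper, but whether that gap shows up in real code depends on how often the wider units can actually be kept fed, which is where the perf/watt argument comes in.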