I don't have simulation tools to verify my claims - but if Apple switched to a 32/32/512 cache arrangement on that low-clock scheme, they could get L1D down to 2-cycle latency, with that 512KB L2 backing it up at sub-10-cycle latency.
I don't have simulation tools either, but applying common sense would work too in this case. Why would Apple trade 192+128KB of excellent L1 for 576KB of tiered cache?
Apple's L1I is very busy, since they have no uop cache and at least 8 decoders to feed, so the requirements on that hypothetical L2 would be huge. Ports and bandwidth would be eaten up by instruction streams, and that would result in either wasted energy or increased latency (or both).
A sub-10-cycle 512KB L2 is frankly a fantasy. Intel/AMD have been doing 256-512KB L2 caches and they've always come out at ~12-13 cycles. Shave a cycle for the faster L1 vs AMD, add some secret sauce, and I think 10 cycles is about what is achievable. Some other ARM designs had an 11-cycle 512KB L2, so maybe Apple could pull it off and still target 4GHz?
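For a sense of what those cycle counts mean in absolute time, here is a quick back-of-the-envelope helper. It only converts cycles to nanoseconds at a given clock; the function name is mine, and 4GHz is just the target mentioned above, not a measured Apple clock.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


if __name__ == "__main__":
    # Cycle counts discussed above, all evaluated at the assumed 4GHz target.
    for cycles in (10, 11, 12, 13):
        print(f"{cycles} cycles @ 4GHz = {cycles_to_ns(cycles, 4.0):.2f} ns")
```

So a 10-cycle L2 at 4GHz has to complete in 2.5 ns end to end, versus 3.25 ns for a 13-cycle one at the same clock.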
L1D with 2-cycle latency? Now that's where we get into "was last done on the 130nm Pentium 4 and never since" territory. Of course, the P4 was doing it with an 8KB L1D and Apple would have 32KB instead. Overall I don't think this is a good tradeoff, and it would put huge pressure on the L2 as well.
Inserting another cache level currently only makes sense for Intel, where L1 is 32KB + 48KB at 5 cycles and the 2nd level is 16 cycles (and probably heading to 17-18 cycles with 3MB).
So it makes sense for Intel to insert a ~256KB private L2 at some 10-11 cycles (and maybe combine that with putting 2 P-cores into a cluster that shares the "current L2", which loses even more latency but saves on power?).
Still, even for Intel, I wonder whether such a small private L2 would make much sense after the L1I expansion to 64KB; 512KB is probably the minimum.
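To put rough numbers on the Intel case, here is a minimal sketch of the break-even math, not a simulation. It assumes the quoted latencies are total load-to-use figures, that every L1 miss hits either the new mid-level cache or the outer cache, and that inserting the new level pushes the outer cache from 16 to some higher latency. The 18-cycle figure and both function names are my own illustrative assumptions.

```python
def avg_l1_miss_latency(mid_hit_rate: float, mid_latency: float,
                        outer_latency: float) -> float:
    """Average latency (cycles) of loads that miss L1, given the fraction
    caught by the new mid-level cache; the rest go to the outer cache."""
    return mid_hit_rate * mid_latency + (1.0 - mid_hit_rate) * outer_latency


def break_even_hit_rate(mid_latency: float, outer_before: float,
                        outer_after: float) -> float:
    """Mid-level hit rate at which the extra level neither helps nor hurts.

    Solves h*mid + (1 - h)*outer_after = outer_before for h.
    """
    if outer_after <= mid_latency:
        raise ValueError("outer cache must be slower than the mid-level cache")
    return (outer_after - outer_before) / (outer_after - mid_latency)


if __name__ == "__main__":
    # 16 cycles: current 2nd level (from the post). 18 cycles: ASSUMED
    # latency of that cache once another level sits in front of it.
    for mid in (10, 11):
        h = break_even_hit_rate(mid, outer_before=16, outer_after=18)
        print(f"mid-level at {mid} cycles: break-even hit rate = {h:.3f}")
```

With those assumed numbers and an 11-cycle mid-level, the break-even is 2/7: the new level has to catch roughly 29% of L1 misses before it pays off on latency alone. Real hit rates depend on the workload and on how much of that small L2 gets eaten by instruction traffic, which is exactly the worry with a 256KB size.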