Any time you break data up into smaller packets for transmission, you add time on the other end to reassemble those chunks into their original state. That's fine for slower access pathways, but if you're aiming for performance you want to keep data in its original state during transmission. Trace lengths are obviously critical at each cache level: as trace length increases, the time it takes to transmit and receive signals across that trace increases accordingly. The lengths of the traces, and the sheer number of them, also create timing skew in parallel interfaces.
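To put rough numbers on that (back of the envelope, assuming ~15 cm/ns signal propagation in FR4; actual boards vary with the stackup):

# Rough propagation delay and skew on PCB traces.
# Assumes ~15 cm/ns in FR4, roughly half the speed of light in vacuum.
PROP_CM_PER_NS = 15.0

def prop_delay_ns(trace_cm):
    """Time for a signal edge to travel the length of one trace."""
    return trace_cm / PROP_CM_PER_NS

def skew_ns(shortest_cm, longest_cm):
    """Timing skew between the shortest and longest trace in a parallel bus."""
    return prop_delay_ns(longest_cm) - prop_delay_ns(shortest_cm)

print(prop_delay_ns(10.0))  # ~0.67 ns for a 10 cm trace out to a DIMM
print(prop_delay_ns(2.0))   # ~0.13 ns for a 2 cm on-package trace
print(skew_ns(9.0, 10.0))   # ~0.07 ns of skew across a bus with 1 cm of mismatch

Shorter traces cut the flight time, and a wide parallel bus has to budget for the worst-case mismatch across all of its lines.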
I find the notion that placing memory on the main board in a way that minimizes trace distances offers 'no advantage' ridiculous. Apple obviously saw the advantage. Game consoles do it. Video cards do it. It cuts latency and decreases access time. You cannot match that latency by simply serializing data over longer trace lengths; reassembling the data always pays an additional overhead cost. SSDs moved to serial interfaces because they simply don't need the latency figures that something like an L3 cache would need to improve performance.
We also shouldn't assume that a serialized device's data travels over the same narrow set of traces all the way from the device to the CPU. We often see localized caching or root hosts close to the device attached to wider interfaces. So from the device to its local cache the interface is narrow by nature; from that device-attached cache to the CPU, the data moves through stages of a bus architecture. The M.2 interface is 67 pins. The M.2 connects over PCIe x4. The PCIe x4 link connects to the PCIe x16 controller. The PCIe x16 interface is 82 pins per side. Pin counts generally increase with each incremental interface closer to the CPU; these narrower interfaces eventually reach the CPU through a common southbridge interface.
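As a rough sketch of those stages (the lane widths and the chipset-uplink topology here are just an illustrative assumption for a PCIe 4.0 board, not any specific product):

# Illustrative only: per-direction bandwidth at each stage of a hypothetical
# NVMe -> chipset -> CPU path on a PCIe 4.0 system.
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding.
GT_PER_LANE = 16e9
ENCODING = 128 / 130

def link_bandwidth_gbs(lanes):
    """Approximate usable bandwidth of a PCIe 4.0 link in GB/s, one direction."""
    return lanes * GT_PER_LANE * ENCODING / 8 / 1e9

stages = [
    ("NVMe SSD over M.2 (x4)", 4),
    ("Chipset uplink to CPU (x4)", 4),   # the shared link everything funnels through
    ("GPU slot (x16)", 16),
]
for name, lanes in stages:
    print(f"{name}: ~{link_bandwidth_gbs(lanes):.1f} GB/s")
# ~7.9 GB/s, ~7.9 GB/s, ~31.5 GB/s

Everything hanging off the chipset funnels through that one uplink, which is where the pressure builds.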
A bottleneck eventually emerges from the limits on data transmission between the CPU and the southbridge. The solution wasn't more pins on the main board, but rather cutting parts out of the southbridge and moving them onto the CPU. Pin counts keep climbing to feed the components that moved: not only did they shrink, but the shorter trace lengths dramatically tightened their timings, so they need more lines of data to keep them fed. So even as the southbridge lost functions, pin counts continued to grow, and CPU pin counts grew too. Last-generation AM4 is 1,331 pins; the next-generation Zen 4 socket is going to 1,718 pins.
AMD is talking about stacking 64 MB L3 slices for a total of up to 192 MB on their next-generation chips. Caches are great, but eventually you need access to RAM. While integrated memory will never become as fast as L3 cache, the aggressive move to decrease trace lengths to the first 8-16 GB of RAM will certainly help keep these future CPUs fed. To suggest it would just be another thing to break on the main board minimizes the fact that main boards evolve in complexity as a matter of progress. Systems that use integrated graphics have the most to gain.
I am still not sure what you are arguing for or against here. Game consoles mostly use unified memory that is all graphics-type memory; that's because the gpu needs the bandwidth and it is cheaper not to have separate system memory. I don't think I said anything about trace lengths being irrelevant. They are definitely important for L1 cache, but signal propagation speed becomes less important as you move further out the memory hierarchy. I doubt that the absolute signal propagation time in a DDR connection is a significant fraction of the latency; most of the latency will be the sense amps reading the DRAM cells. Since it is a parallel interface, keeping all of the traces matched as closely as possible is important, and keeping them short matters for signal integrity. I once tried to get an old cpu to work attached through a ribbon cable in the electronics lab. It was nearly impossible, since the ribbon cable was essentially an antenna for interference.
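Rough numbers to show the scale I mean (these are my assumptions, not measurements: ~7.5 cm of trace, ~15 cm/ns in FR4, DDR4-3200 at CL16):

# Trace flight time vs. DRAM array latency, ballpark figures only.
trace_cm = 7.5                 # assumed CPU-to-DIMM routing distance
prop_ns = trace_cm / 15.0      # ~15 cm/ns in FR4 -> ~0.5 ns each way

ddr_io_clock_mhz = 1600        # DDR4-3200 I/O clock
cas_cycles = 16                # CL16
cas_ns = cas_cycles / (ddr_io_clock_mhz / 1e3)   # ~10 ns just for CAS

print(f"trace flight time: {prop_ns:.2f} ns")   # ~0.50 ns
print(f"CAS latency alone: {cas_ns:.1f} ns")    # ~10 ns
# Full load-to-use latency (queuing, RAS, CAS, transfer) is typically
# tens of nanoseconds, so the wire itself is close to a rounding error.

The wire delay is real, but it is buried under the time spent in the DRAM array and the controller.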
Serial interfaces have length restrictions also. It has been an issue with pci-express 4 and it will be more of an issue with pci-express 5. The trace length matters in different ways, though: with serial connections each lane is independent and carries an embedded clock, so the traces do not need to be length-matched. As for accessing DRAM via serial interfaces, we already do that. The higher latency on pci-express devices doesn't really have much to do with the physical-level interface; it has to do with it being a high-level, software-driven protocol. On AMD systems, and on Intel systems with multiple processors, memory is already accessed over a serialized interface. For AMD, the connection between the cpu chiplet and the IO die that holds the memory controller is infinity fabric, which is based on a physical layer very similar to, if not the same as, pci-express. It is a multi-layer protocol, but it is all in hardware. In multi-socket systems, Intel or AMD, remote memory accesses go over a serialized link. AMD has their system matched very well: an AMD IO die has a 128-bit (2x64-bit channel) memory controller, the connection to the cpu chiplets is 32 bits wide but at more than 4x the memory clock, and the connection to another socket in Epyc is 16 bits wide but at more than 8x the memory clock.
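The matching is easy to check with round numbers (normalized, since the actual fabric and xGMI clocks vary by product and the real links run a bit faster than the break-even point):

# Why a narrow serial link can keep up with a wide parallel memory bus:
# bandwidth ~ width * clock, so a link 1/4 as wide at 4x the clock breaks even.
mem_width_bits = 128        # 2 x 64-bit DDR channels on the IO die
mem_clock = 1.0             # normalized memory clock

links = {
    "DDR channels (128-bit @ 1x)":         (128, 1.0),
    "IO die -> cpu chiplet (32-bit @ 4x)":  (32, 4.0),
    "socket -> socket xGMI (16-bit @ 8x)":  (16, 8.0),
}
base = mem_width_bits * mem_clock
for name, (width, clock_mult) in links.items():
    rel = width * clock_mult / base
    print(f"{name}: {rel:.2f}x the DDR bandwidth")
# All come out at ~1.0x -- the serialized links are sized to just keep up
# with the memory controller, and in practice run slightly faster than that.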
Anyway, as you move further out the memory hierarchy, latency goes up and speed goes down, and you also generally handle larger and larger chunks. The virtual memory system generally works with 4k pages. I saw something about Apple using 16k pages on their systems, which might give them a performance advantage. AMD treats the HBM memory on their GPUs as cache with a fully virtualized memory system, but I don't know what the page or cache line size is. If you have a massive amount of DRAM on die, then you probably don't need more external DRAM; the bandwidth would be wasted. The consoles are already like this: instead of on-package DRAM, they have unified graphics memory backed up by a fast SSD. If someone made a laptop like that, it is something I might consider, if it had a lot of graphics memory. We generally don't get an APU with graphics memory though; they get slow system memory.
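The page-size point is easy to put numbers on (the TLB entry count below is a made-up example, not Apple's actual hardware):

# With the same number of TLB entries, 16 KiB pages cover 4x as much
# memory as 4 KiB pages before you start taking TLB misses.
tlb_entries = 1536   # hypothetical TLB size for illustration

for page_kib in (4, 16):
    reach_mib = tlb_entries * page_kib / 1024
    print(f"{page_kib:>2} KiB pages: TLB reach ~{reach_mib:.0f} MiB")
# 4 KiB pages:  ~6 MiB of reach
# 16 KiB pages: ~24 MiB of reach

Bigger pages also mean fewer page-table walks per amount of memory touched, which is one plausible reason the larger page size could help.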