(continuing the discussion of a possible "4 DRAM channels configured as one channel per die")
When you consider the base, certified configuration for TR1, which had a rated "speed" of DDR4-2667, a single channel at 3200 does seem like it could significantly restrict single-threaded local memory performance for heavily memory-bound applications.
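To put rough numbers on that, here's a back-of-the-envelope sketch, assuming standard 64-bit DDR4 channels (the function name is just for illustration):

```python
# Peak DDR4 bandwidth: transfers/s (MT/s) x 8 bytes per 64-bit channel.
def ddr4_peak_gbs(mts, channels=1):
    return mts * 8 * channels / 1000  # GB/s

print(ddr4_peak_gbs(2667, channels=4))  # TR1 certified, quad channel: ~85.3 GB/s
print(ddr4_peak_gbs(3200, channels=1))  # one 3200 channel per die:    ~25.6 GB/s
```

That gap is what the extra cache discussed below would need to paper over.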
However, as the ship has sailed on the TR2 configuration, I'll focus my thoughts on TR3.
The rough idea out there is that TR3 will be largely based on EPYC 2, featuring 4 dies, each at 7nm. It is speculated that EPYC 2 could have as many as 64 cores via 4 x 16-core dies. In the process of shrinking everything, AMD will either expand the current CCXs or just "paste" in two additional CCX units. Either way, it seems logical that the L3 cache per die would double to 32MB (8MB per CCX in a 4-CCX config, or 16MB per CCX in a 2-CCX config). That's a lot of L3 cache to mitigate immediate DRAM channel demands. It's also possible that they could make a significant change of direction and use the CCX from Raven Ridge for compactness, configuring 4 CCXs with 4MB L3 each, then using the extra die space to create an L4 cache. Either way, there will likely be more cache per die in TR3 and EPYC 2, and that additional cache should help keep demands on the DRAM channels under control. That being the case, a slight reduction in DRAM bandwidth per die may not be as apparent in system performance metrics.
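For what it's worth, here's a quick tally of the per-die L3 totals under those speculated layouts (all of this is guesswork, as noted):

```python
# Per-die L3 totals for the speculated 7nm layouts (all guesswork).
layouts_mb = {
    "current die: 2 CCX x 8MB": 2 * 8,
    "7nm, 4 CCX x 8MB":         4 * 8,
    "7nm, 2 CCX x 16MB":        2 * 16,
    "Raven Ridge style, 4 CCX x 4MB (+ hypothetical L4)": 4 * 4,
}
for layout, mb in layouts_mb.items():
    print(f"{layout}: {mb}MB L3 per die")
```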
If most of the above is true, then compared to the original certified TR1 configuration, would a TR3 with a single enabled DDR4-3600 (or even approaching 4000) channel per die be significantly hampered in performance? Local cores would have a mountain of cache to help with memory demands and a pretty good amount of local DRAM bandwidth. If we assume that IF throughput will be increased in EPYC 2/TR3, then access to remote DRAM channels will also be at or near full bandwidth, with only initial transaction setup latencies to contend with. Given how spread out access to remote DRAM would be, inter-die data transfers wouldn't be as disrupted as they are in the current TR2 setup, where remote-die memory calls can heavily saturate the IF links between the dies. Increased IF bandwidth can also serve to reduce the latency penalty for remote-die memory calls.

Looking at the numbers for EPYC 1 from the latency and bandwidth testing that Serve The Home did, you can see a relative drop in latency of around 11ns going from 2400 DRAM to 2667. That's 11ns for 266MHz. While the scaling is not linear, going from 2667 to 3200 is a jump of 533MHz; that should be good enough to shave another 18ns or so off of remote DRAM access latencies. The local die has a memory-call latency (when configured as an EPYC core, for what that's worth) of 81ns at 2667, with remote-die DRAM at ~135ns. Running at 3200, you'd expect that remote latency to be, again roughly, 117ns. If the trend line continues, then at 3600 it should be around 105ns, and at 4000 it should be around 100ns or less. Now, none of us have any idea if AMD is capable of getting the IF links between the dies to run that fast on their MCM. However, if they can, those are NOT bad memory latency numbers for remote-node DRAM accesses. I suspect, however, that the addition of more cores in the 7nm die will incur some sort of latency penalty, as the routing of memory transactions will have a bigger table to look through for each new transaction, so those numbers are likely optimistic.
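To show where that trend line comes from, here's the extrapolation as a naive linear fit anchored on STH's EPYC 1 numbers (~135ns remote at 2667, ~11ns per 266MHz step). A pure linear fit comes out a bit more optimistic than my figures above, which already discount for diminishing returns, so treat these as best-case floors:

```python
# Naive linear extrapolation of remote-die DRAM latency from STH's EPYC 1 data:
# ~135 ns remote at DDR4-2667, improving ~11 ns per 266 MHz of DRAM speed.
# Scaling is sub-linear in practice, so these are best-case numbers.
def remote_latency_ns(mts, base_mts=2667, base_ns=135.0, ns_per_mhz=11 / 266):
    return base_ns - (mts - base_mts) * ns_per_mhz

for mts in (3200, 3600, 4000):
    print(f"DDR4-{mts}: ~{remote_latency_ns(mts):.0f} ns remote (linear best case)")
```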
The idea with having a local DRAM channel per die is to make sure there is always an opportunity for a die to make a memory call with the lowest latency. At DDR4-3600 to 4000 (again, I'm speculating on the certified DRAM speed for TR3 here, using AMD's demonstrated DDR4-3200 on TR2 as a baseline), that's a latency that can be as low as the high-50 to low-60ns range, compared to a remote latency that's easily twice that. I think that, for a task that can be heavily threaded and managed so that its working memory stays local to its own NUMA node, this can make a noticeable difference, especially in cases where there are lots of small transactions instead of large blocks streamed in bulk. In cases where one die needs maximum bandwidth, having a single channel at the end of each inter-die IF link means the die should be better able to sustain maximum bandwidth across those links, as that data transfer should not saturate the link (whereas it can now).
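As a toy illustration of why locality matters, here's an average-latency calc using the speculative numbers above (~60ns local, ~120ns remote; the access-split fractions are made up):

```python
# Toy average memory latency for a thread, given what fraction of its
# accesses land on the local die's DRAM channel vs. a remote die's.
# ~60 ns local / ~120 ns remote are the speculative figures from above.
def avg_latency_ns(local_fraction, local_ns=60.0, remote_ns=120.0):
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

for frac in (0.5, 0.9, 0.99):
    print(f"{frac:.0%} local accesses: ~{avg_latency_ns(frac):.0f} ns average")
```

Keep a latency-sensitive task 99% local and the average sits right at the local channel's latency; let it spray accesses evenly across nodes and you're halfway to the full remote penalty.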
Remember, this is not an EPYC processor where we're going all out for maximum performance in all cases. TR is in the middle of the stack. Having reduced bandwidth is OK. It's about having a LOT of cores to throw at a problem, but without some of the unneeded server features, validation, etc., that can make the platform too expensive. I just think this works out better for general use cases.