Quoting myself here
- 8-core CCX (also mentioned by S|A; the patent drawings indicate as much, though the drawings are only exemplary)
- Single L3 per CCX (from #20180239708, #20180143829, #20180165202), same as Zen 1
- Memory controller located in a separate chiplet (what AMD calls a bridge chiplet) connected by an interconnect (from #20180239708, #20180143829, #20180165202)
- Data compression across IF (see patents #20180167082 (across sockets) and #20180052631 (across dies)), not present in Zen 1. If the compressed data is smaller than the bus width, the extra bits are not even signalled (#20180314655); see the sketch after the quote.
- Directory controller for L3 sync across dies (see patent #20180239708), which is not the case in Zen 1
- According to David Schor and the gcc patches, load/store costs for >=256-bit SSE are halved. I don't know if that is definitive, but it would be a significant improvement.
- Many cache-related improvements, if the patents are to be believed. Something like 8-10 patents in the last year.
Those patent applications were for real after all.
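On the compression point above, here is a minimal sketch of how I understand it, assuming a 256-bit link and using zlib as a stand-in for whatever compressor the hardware actually uses; none of the names or numbers below come from the patents.

```python
# Hypothetical sketch of the idea in #20180052631 / #20180314655: compress a payload
# before it crosses the link, and if the compressed form needs fewer bits than the
# bus width, leave the unused lanes idle instead of driving them.
import zlib

BUS_WIDTH_BITS = 256  # assumed link width, purely for illustration

def send_over_link(payload: bytes) -> dict:
    compressed = zlib.compress(payload)      # stand-in for the hardware compressor
    bits_needed = len(compressed) * 8
    if bits_needed < BUS_WIDTH_BITS:
        # Only the lanes carrying compressed data toggle; the rest stay idle,
        # which is where the power saving comes from.
        return {"signalled_bits": bits_needed, "idle_bits": BUS_WIDTH_BITS - bits_needed}
    # Compression did not help enough; send as-is (simplified).
    return {"signalled_bits": len(payload) * 8, "idle_bits": 0}

print(send_over_link(b"\x00" * 32))  # a highly compressible 256-bit chunk
```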
I'll add some new ones below, cache-related only.
20190179758 CACHE TO CACHE DATA TRANSFER ACCELERATION TECHNIQUES
Systems, apparatuses, and methods for accelerating cache to cache data transfers are disclosed. A system includes at least a plurality of processing nodes and prediction units, an interconnect fabric, and a memory. A first prediction unit is configured to receive memory requests generated by a first processing node as the requests traverse the interconnect fabric on the path to memory. When the first prediction unit receives a memory request, the first prediction unit generates a prediction of whether data targeted by the request is cached by another processing node. The first prediction unit is configured to cause a speculative probe to be sent to a second processing node responsive to predicting that the data targeted by the memory request is cached by the second processing node. The speculative probe accelerates the retrieval of the data from the second processing node if the prediction is correct.
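Rough sketch of how I read that: a predictor sitting on the fabric guesses which other node holds the line and fires a probe at it in parallel with the normal path to memory. The table structure, granularity, and all names below are my own assumptions, not the patent's.

```python
# Toy model of the speculative-probe idea from #20190179758 (names invented).
class Fabric:
    def send_speculative_probe(self, node, addr):
        print(f"speculative probe -> node {node} for {addr:#x}")
    def forward_to_memory(self, addr):
        print(f"request -> memory for {addr:#x}")

class PredictionUnit:
    def __init__(self, node_id):
        self.node_id = node_id
        self.last_owner = {}  # address region -> node predicted to cache that data

    def on_memory_request(self, addr, fabric):
        region = addr >> 12                    # 4 KiB regions, arbitrary choice
        target = self.last_owner.get(region)
        if target is not None and target != self.node_id:
            fabric.send_speculative_probe(target, addr)  # early cache-to-cache pull
        fabric.forward_to_memory(addr)         # the normal request continues regardless

    def on_observed_fill(self, addr, owner_node):
        # Train the predictor once we learn which node actually supplied the data.
        self.last_owner[addr >> 12] = owner_node

pu = PredictionUnit(node_id=0)
pu.on_observed_fill(0x1000, owner_node=1)
pu.on_memory_request(0x1040, Fabric())  # same region, so node 1 gets probed early
```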
20190179760 CACHE CONTROL AWARE MEMORY CONTROLLER
20190095330 PREEMPTIVE CACHE WRITEBACK WITH TRANSACTION SUPPORT
A method of preemptive cache writeback includes transmitting, from a first cache controller of a first cache to a second cache controller of a second cache, an unused bandwidth message representing an unused bandwidth between the first cache and the second cache during a first cycle. During a second cycle, a cache line containing dirty data is preemptively written back from the second cache to the first cache based on the unused bandwidth message. Further, the cache line in the second cache is written over in response to a cache miss to the second cache.
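My toy model of that mechanism, with invented names: the lower-level controller advertises spare write bandwidth, and the upper cache spends exactly that budget pushing dirty lines down early, so a later miss can simply overwrite a clean victim.

```python
# Sketch of preemptive writeback per my reading of #20190095330 (names are mine).
class L3Controller:
    def writeback(self, addr, data):
        print(f"preemptive writeback of {addr:#x}")

class L2Controller:
    def __init__(self):
        self.dirty_lines = []  # (addr, data) pairs waiting to be written back

    def on_unused_bandwidth(self, free_slots, l3):
        # Spend only the advertised spare bandwidth on preemptive writebacks.
        for _ in range(min(free_slots, len(self.dirty_lines))):
            addr, data = self.dirty_lines.pop(0)
            l3.writeback(addr, data)

    def on_miss_fill(self, addr, data):
        # The victim line was already cleaned, so it can simply be overwritten.
        print(f"fill {addr:#x} without an eviction stall")

l2, l3 = L2Controller(), L3Controller()
l2.dirty_lines = [(0x2000, b"a"), (0x2040, b"b"), (0x2080, b"c")]
l2.on_unused_bandwidth(free_slots=2, l3=l3)  # two lines cleaned with spare bandwidth
l2.on_miss_fill(0x3000, b"d")
```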
20190108154 METHOD AND APPARATUS FOR POWER REDUCTION FOR DATA MOVEMENT
A method of and device for transferring data is provided. The method includes determining a difference between a data segment that was transferred last relative to each of one or more data segments available to be transferred next. In some embodiments, for so long as no data segment available to be sent has been waiting too long, the data segment chosen to be sent next is the data segment having the smallest difference relative to the data segment transferred last. The chosen data segment is then transmitted as the next data segment transferred.
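As I understand it, the arbiter picks the pending segment whose bits differ least from the last one sent (fewer wires toggle), with an age limit so nothing starves. A small sketch, all numbers assumed:

```python
# Selection policy per my simplification of #20190108154.
MAX_WAIT = 8  # arbitrary age limit in arbitration rounds

def popcount_diff(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # number of bit positions that would toggle

def pick_next(last_sent: int, pending: list) -> int:
    # pending: list of dicts {"data": int, "age": int}
    overdue = [p for p in pending if p["age"] >= MAX_WAIT]
    candidates = overdue if overdue else pending   # never starve an old segment
    best = min(candidates, key=lambda p: popcount_diff(last_sent, p["data"]))
    pending.remove(best)
    for p in pending:
        p["age"] += 1
    return best["data"]

queue = [{"data": 0b1111_0000, "age": 0}, {"data": 0b1010_1010, "age": 0}]
print(bin(pick_next(0b1111_0001, queue)))  # picks 0b11110000: only one bit flips
```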
20190188137 REGION BASED DIRECTORY SCHEME TO ADAPT TO LARGE CACHE SIZES
Systems, apparatuses, and methods for maintaining a region-based cache directory are disclosed. A system includes multiple processing nodes, with each processing node including a cache subsystem. The system also includes a cache directory to help manage cache coherency among the different cache subsystems of the system. In order to reduce the number of entries in the cache directory, the cache directory tracks coherency on a region basis rather than on a cache line basis, wherein a region includes multiple cache lines. Accordingly, the system includes a region-based cache directory to track regions which have at least one cache line cached in any cache subsystem in the system. The cache directory includes a reference count in each entry to track the aggregate number of cache lines that are cached per region. If a reference count of a given entry goes to zero, the cache directory reclaims the given entry.
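The core of it seems to be one directory entry per region plus a reference count, reclaimed when it hits zero. Minimal sketch, with the region size and structure assumed by me:

```python
# Region-based directory per my reading of #20190188137.
REGION_SHIFT = 12   # assuming 4 KiB regions, i.e. 64 cache lines per region

class RegionDirectory:
    def __init__(self):
        self.entries = {}   # region -> number of that region's lines cached anywhere

    def line_cached(self, addr):
        region = addr >> REGION_SHIFT
        self.entries[region] = self.entries.get(region, 0) + 1

    def line_evicted(self, addr):
        region = addr >> REGION_SHIFT
        self.entries[region] -= 1
        if self.entries[region] == 0:
            del self.entries[region]   # reclaim the entry, keeping the directory small

d = RegionDirectory()
d.line_cached(0x1000)
d.line_cached(0x1040)    # same 4 KiB region: still just one directory entry
d.line_evicted(0x1000)
d.line_evicted(0x1040)   # reference count hits zero, entry reclaimed
print(d.entries)         # {}
```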
20190196974 TAG ACCELERATOR FOR LOW LATENCY DRAM CACHE
Systems, apparatuses, and methods for implementing a tag accelerator cache are disclosed. A system includes at least a data cache and a control unit coupled to the data cache via a memory controller. The control unit includes a tag accelerator cache (TAC) for caching tag blocks fetched from the data cache. The data cache is organized such that multiple tags are retrieved in a single access. This allows hiding the tag latency penalty for future accesses to neighboring tags and improves cache bandwidth. When a tag block is fetched from the data cache, the tag block is cached in the TAC. Memory requests received by the control unit first lookup the TAC before being forwarded to the data cache. Due to the presence of spatial locality in applications, the TAC can filter out a large percentage of tag accesses to the data cache, resulting in latency and bandwidth savings.
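The way I read it: tags for the DRAM cache live in DRAM as whole tag blocks, so one fetch brings in several neighbouring tags, and a small SRAM cache of those blocks is checked before touching DRAM at all. Sketch with made-up sizes:

```python
# Tag accelerator cache (TAC) per my reading of #20190196974 (names invented).
TAGS_PER_BLOCK = 8   # assumed: one DRAM access returns 8 neighbouring tags

class TagAcceleratorCache:
    def __init__(self):
        self.blocks = {}        # tag-block index -> list of tags
        self.dram_tag_reads = 0

    def fetch_block_from_dram(self, block_idx):
        self.dram_tag_reads += 1
        # Stand-in for the real tag-array read; returns dummy tags here.
        return [f"tag{block_idx}_{i}" for i in range(TAGS_PER_BLOCK)]

    def lookup(self, set_index):
        block_idx, offset = divmod(set_index, TAGS_PER_BLOCK)
        if block_idx not in self.blocks:             # TAC miss: go to DRAM once
            self.blocks[block_idx] = self.fetch_block_from_dram(block_idx)
        return self.blocks[block_idx][offset]        # TAC hit for the neighbours

tac = TagAcceleratorCache()
for s in range(16):          # 16 lookups over neighbouring sets
    tac.lookup(s)
print(tac.dram_tag_reads)    # only 2 DRAM tag reads thanks to spatial locality
```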
20190163632 REDUCING CACHE FOOTPRINT IN CACHE COHERENCE DIRECTORY
A method includes monitoring, at a cache coherence directory, states of cachelines stored in a cache hierarchy of a data processing system using a plurality of entries of the cache coherence directory. Each entry of the cache coherence directory is associated with a corresponding cache page of a plurality of cache pages, with each cache page representing a corresponding set of contiguous cachelines. The method further includes selectively evicting cachelines from a first cache of the cache hierarchy based on cacheline utilization densities of cache pages represented by the corresponding entries of the plurality of entries of the cache coherence directory.
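If I read it right, eviction prefers lines from pages that have few of their lines cached anywhere, so sparse directory entries can be freed sooner. Very rough sketch; everything below is my own guess at the heuristic:

```python
# Utilization-density eviction per my reading of #20190163632 (details assumed).
LINES_PER_PAGE = 64   # assumed number of contiguous cachelines per "cache page"

def pick_victim(resident_lines, directory):
    # resident_lines: line indices currently in this cache
    # directory: page -> how many of that page's lines are cached anywhere
    def density(line):
        page = line // LINES_PER_PAGE
        return directory.get(page, 0) / LINES_PER_PAGE
    # Evict a line belonging to the least densely used page.
    return min(resident_lines, key=density)

directory = {0: 60, 1: 2}   # page 0 is heavily used, page 1 barely at all
resident = [5, 67]          # one resident line from each page
print(pick_victim(resident, directory))  # 67: moves page 1 closer to being reclaimed
```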