The Bulldozer patents had everything AMD needed. CSMT is hinted at, and so is a much larger core.
The Bulldozer patents say:
(Quad AGLU)
- In one embodiment, the ALU 220 and the AGU 222 are implemented as the same unit.
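A minimal sketch of what a fused AGLU might look like, with a hypothetical µop encoding: one execution port whose unit accepts either an arithmetic op or an address-generation op, so one adder serves both plain integer math and base+index*scale+disp address math.

```python
from dataclasses import dataclass

@dataclass
class Uop:
    kind: str        # "alu" (e.g. add) or "agu" (address generation)
    operands: tuple

class AGLU:
    """Fused ALU/AGU: one execution port accepting either op class.
    Hypothetical sketch -- the point is that the same unit handles
    plain arithmetic and x86-style effective-address math."""
    def execute(self, uop: Uop) -> int:
        if uop.kind == "alu":
            a, b = uop.operands
            return a + b                         # stand-in for any ALU op
        if uop.kind == "agu":
            base, index, scale, disp = uop.operands
            return base + index * scale + disp   # effective address
        raise ValueError(uop.kind)

port = AGLU()
print(port.execute(Uop("alu", (3, 4))))                     # 7
print(hex(port.execute(Uop("agu", (0x1000, 2, 8, 0x10)))))  # 0x1020
```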
(Multiple FPUs)
- By utilizing multiple integer execution units that share an FPU (or share multiple FPUs) and that share a single pre-processing front-end unit, increased processing bandwidth afforded by multiple execution units can be achieved while reducing or eliminating the design complexity and power consumption attendant with conventional designs that utilize a separate pre-processing front-end for each integer execution unit. Further, because in many instances it is the execution units that result in bottlenecks in processing pipelines, the use of a single shared front-end may introduce little, if any, delay in the processing bandwidth as the fetch, decode, and dispatch operations of the front-end unit often can be performed at a higher instruction-throughput than the instruction-throughput of two or more execution units combined.
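The throughput claim in that last sentence is easy to see in a toy cycle model: if one shared front-end decodes at least as many ops per cycle as the two integer clusters can retire combined, neither cluster ever starves. All rates below are made-up illustration numbers, not Bulldozer's.

```python
def simulate(cycles, decode_rate, cluster_rates):
    """Toy model: one shared front-end feeds N execution clusters.
    Rates are ops/cycle; returns (ops executed, leftover backlog)."""
    queues = [0] * len(cluster_rates)   # dispatched-but-unexecuted ops
    done = [0] * len(cluster_rates)
    for _ in range(cycles):
        # Front-end: spread decoded ops round-robin across clusters.
        for i in range(decode_rate):
            queues[i % len(queues)] += 1
        # Back-end: each cluster drains up to its own rate.
        for i, rate in enumerate(cluster_rates):
            n = min(rate, queues[i])
            queues[i] -= n
            done[i] += n
    return done, queues

# Decode bandwidth (4) matches the clusters' combined rate (2+2):
# both clusters run flat out and no backlog accumulates at either queue.
print(simulate(1000, decode_rate=4, cluster_rates=[2, 2]))
```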
(Variant rSMT)
- Eager execution is a technique frequently used to improve single-threaded execution by concurrently pursuing both paths of possible execution following a conditional branch. Many branches are difficult to predict and it may be advantageous to fetch and execute down both branch paths rather than making a prediction and continuing with fetch and execution down only the predicted branch path. This mode of execution naturally creates two “streams” of integer operation execution that could each individually be directed to one of the clusters of execution. One path (e.g. the “not-taken” path) could continue to execute on the original cluster, while the “taken” path could begin execution on the other cluster. When the branch is resolved, one path is terminated while the other continues. The difficulty with this use of the previously independent clusters is that they now need to communicate architectural state in order to “fork” two streams from the initial single thread. It is also advantageous to have any cached microarchitectural state (L1 data caches, L1 translation lookaside buffers (TLBs), etc.) be present in both clusters for improved performance for both the taken and not-taken paths.
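A toy version of that fork-and-squash flow, using two worker threads as stand-ins for the two integer clusters; the function names and the resolver here are invented for illustration.

```python
import concurrent.futures

def eager_execute(resolve_branch, taken_path, not_taken_path):
    """Run both sides of a hard-to-predict branch concurrently,
    then keep only the architecturally correct result."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        taken = pool.submit(taken_path)          # "taken" forked to cluster 1
        not_taken = pool.submit(not_taken_path)  # "not-taken" stays on cluster 0
        winner, loser = ((taken, not_taken) if resolve_branch()
                         else (not_taken, taken))
        loser.cancel()      # squash the wrong path (may have finished already)
        return winner.result()

print(eager_execute(
    resolve_branch=lambda: 17 % 2 == 1,    # resolves "taken"
    taken_path=lambda: sum(range(100)),    # -> 4950
    not_taken_path=lambda: max(range(100)),
))
```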
(CSMT/Virtual Core)
- Each pipeline stage can independently select between threads such that, at any given pipeline cycle, the pipeline stage can have instruction data from different threads distributed among its substages. This independent selection at each pipeline stage can facilitate more even progress between threads. In at least one embodiment, the first selected thread and the second selected thread can be the same thread or different threads. The selection of the first selected thread and the selection of the second selected thread can be performed based on thread priority, based on a comparative amount of instruction data buffered for one thread versus another (e.g., based on a ratio of the amount of buffered instruction data for one thread to the amount of buffered instruction data for another thread), based on a round-robin method, or a combination thereof.
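The three selection policies the patent lists are simple to sketch. Below, each pipeline stage owns its own selector, so different stages can pick different threads in the same cycle; the two-thread setup and the policy names are mine, not the patent's.

```python
import itertools

def make_selector(policy, priorities=None):
    """Per-stage thread chooser: priority, buffer ratio, or round-robin."""
    rr = itertools.cycle([0, 1])
    def select(buffered):   # buffered[t] = ops queued for thread t at this stage
        ready = [t for t in (0, 1) if buffered[t] > 0]
        if not ready:
            return None                          # stage idles this cycle
        if policy == "priority":
            return max(ready, key=lambda t: priorities[t])
        if policy == "ratio":                    # favor the bigger backlog
            return max(ready, key=lambda t: buffered[t])
        if policy == "round_robin":
            while True:
                t = next(rr)
                if t in ready:
                    return t
        raise ValueError(policy)
    return select

# Two stages, same cycle, same buffer state -- different picks.
fetch_pick  = make_selector("round_robin")
decode_pick = make_selector("ratio")
print(fetch_pick({0: 3, 1: 5}), decode_pick({0: 3, 1: 5}))   # 0 1
```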
(Individualistic Dispatch)
- A front-end unit coupled to the first execution unit via a first dispatch bus and coupled to the second execution unit via a second dispatch bus separate from the first dispatch bus, the first dispatch bus configured to concurrently transmit a first dispatch group of up to N instruction operations from the front-end unit to the first execution unit for a dispatch cycle and the second dispatch bus configured to concurrently transmit a second dispatch group of up to N instruction operations from the front-end unit to the second execution unit for the dispatch cycle.
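That is up to 2*N ops per cycle, N to each cluster over its own bus, rather than two clusters splitting one bus's bandwidth. A trivial sketch (the width N and the queue contents are placeholders; the claim leaves N open):

```python
N = 4   # dispatch-group width per bus; illustrative, not a stated value

def dispatch_cycle(queue0, queue1):
    """One cycle: up to N ops to execution unit 0 AND up to N ops to
    execution unit 1, concurrently, each group on its own bus."""
    group0 = [queue0.pop(0) for _ in range(min(N, len(queue0)))]
    group1 = [queue1.pop(0) for _ in range(min(N, len(queue1)))]
    return group0, group1   # up to 2*N ops total this cycle

q0 = [f"t0_op{i}" for i in range(6)]
q1 = [f"t1_op{i}" for i in range(3)]
print(dispatch_cycle(q0, q1))
# (['t0_op0', 't0_op1', 't0_op2', 't0_op3'], ['t1_op0', 't1_op1', 't1_op2'])
```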
(Dispatch 2.0 CSMT)
- Alternately, integer instruction operations can be dispatched to the integer execution units 212 and 214 opportunistically. To illustrate, assume again that two threads T0 and T1 are being processed by the processing pipeline 200. In this example, the instruction dispatch module 210 can dispatch integer instruction operations from the threads T0 and T1 to either of the integer execution units 212 and 214 depending on thread priority, loading, forward progress requirements, and the like.
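A sketch of that opportunistic policy: each op goes to the currently less-loaded cluster, with higher-priority threads dispatched first. The heuristic itself is invented; the patent only names the inputs (thread priority, loading, forward progress).

```python
def opportunistic_dispatch(pending, loads, priorities):
    """pending: list of (thread, op); loads: per-cluster occupancy.
    Returns (op, cluster) assignments for this dispatch cycle."""
    plan = []
    # Higher-priority threads dispatch first...
    for thread, op in sorted(pending, key=lambda p: -priorities[p[0]]):
        cluster = min((0, 1), key=lambda c: loads[c])   # ...to the idler cluster
        loads[cluster] += 1
        plan.append((op, cluster))
    return plan

pending = [("T0", "add"), ("T1", "load"), ("T0", "cmp"), ("T1", "mul")]
print(opportunistic_dispatch(pending, loads=[2, 0], priorities={"T0": 1, "T1": 2}))
# [('load', 1), ('mul', 1), ('add', 0), ('cmp', 1)]
```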
All patents: 2007–2009.