Cliff's Notes: the article is making a big deal about something a computer architect wouldn't find surprising, but is a pretty good read anyway. They used a pathological access pattern which you wouldn't see in the real world.
I would like to stress that this article is the world's first detailed discussion of the replay feature and its functioning peculiarities.
That's odd, I've read about it in lots of academic papers.
Here is a paper from 2003 (which I have not read, but the title looks promising).
This paragraph is actually mostly covered in later pages of the article, which I hadn't read when I wrote it 😉. The basic idea is this: some operations take a variable amount of time - for example, accessing memory. Ideally, you'd know ahead of time how many cycles a memory access is going to take, and schedule the next dependent operation to issue at exactly that moment. However, the scheduler becomes very complicated once you add in all the corner cases, and it's usually part of the critical path of the chip anyway (that is to say, making it more complicated directly impacts cycle time). Additionally, for long pipelines, there can be multiple cycles between the time the scheduler has to make a decision and the time the instructions it picks actually execute. To maximize performance, the scheduler has to just assume that the operation will take a certain amount of time, and fix mistakes later. This isn't as bad as you might think, because for normal workloads, you probably won't encounter a huge number of accesses that miss in the L1 and hit in the L2 - the L1 is supposed to cover a very large fraction of accesses. A theme my computer architecture profs discussed incessantly was, "make the common case fast and the uncommon case correct". The common case, an L1 hit, needs to be fast - the scheduler can't be allowed to slow down the whole chip, and you need to issue a dependent instruction the very next cycle. The uncommon case, an L1 miss (this document shows L1s hitting over 95% of the time, sometimes 99%), just needs to be correct.
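To make that tradeoff concrete, here's a toy cycle-count model of the "assume a hit, replay on a miss" policy. All the latencies are made-up illustrative numbers, not real Northwood figures:

```python
import random

L1_LAT, L2_LAT = 2, 9      # assumed load-to-use latencies (illustrative, not real P4 numbers)
REPLAY_COST = 7            # extra cycles a replayed dependent loses looping back (assumption)

def cycles_for(accesses, l1_hit_rate):
    """Cycle count for a chain of dependent loads under speculative scheduling.

    The scheduler always assumes an L1 hit and issues the dependent
    instruction L1_LAT cycles later; on a miss, the dependent executed with
    stale data and must be replayed once the line arrives from the L2.
    """
    cycles = 0
    for _ in range(accesses):
        if random.random() < l1_hit_rate:
            cycles += L1_LAT                  # common case: the guess was right
        else:
            cycles += L2_LAT + REPLAY_COST    # uncommon case: L2 latency plus the replay
    return cycles

random.seed(0)
print(cycles_for(10_000, 0.97) / 10_000)  # realistic ~97% L1 hit rate: close to L1_LAT
print(cycles_for(10_000, 0.00) / 10_000)  # pathological always-miss pattern: far worse
```

With a 95-99% hit rate, the replay penalty barely moves the average; only the pathological always-miss pattern makes it dominate.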
Regarding performance counters:
True, how can we debug and polish off the branch prediction algorithms if we have no idea how many times the processor made an incorrect prediction and what type of predictions turned out the most difficult? How do we find out if the processor needs a larger cache if we don't know the cache miss statistics? Does the CPU have enough execution units? Or maybe we could do with fewer execution units, because most of them are idling anyway?
I don't know this for sure, but I would suspect those sort of decisions are made with simulations done YEARS in advance of the release (or even first silicon) for a processor. If you don't realize you have a problem like these (bad predictor, not enough execution units, etc) until you get back first silicon, you're in a VERY tough spot.
Besides, the L2 cache is usually much bigger than the L1 data cache, which makes the probability of the requested data being in the L2 cache very high.
Actually, that's often not true. IIRC, L2 caches usually have hit rates in the 80% range. You might wonder why that is, given that an L2 might be 1MB while the L1 is 64KB... but what's happening is that the L1 is filtering out all the accesses with really good locality. For example, if we have a program that reads 128 contiguous bytes, jumps somewhere completely random, and reads another block of 128 bytes, the L1 cache is probably going to hit on 127 of the 128 accesses, and the L2 is only going to see the first access from each group - which is somewhere random, and thus the L2 can't even hope to prefetch it. If an access misses in the L1, there's a good chance it has poor locality, and won't be in the (huge) L2 either.
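A quick way to convince yourself of this filtering effect is to simulate exactly the pattern described above. The cache model here is a deliberately crude fully-associative LRU sketch, with capacities picked to mimic a 64KB L1 and 1MB L2:

```python
from collections import OrderedDict
import random

LINE = 64                         # bytes per cache line
L1_CAP, L2_CAP = 1024, 16384      # 64 KB and 1 MB, as line counts (crude fully-assoc. LRU)

def access(cache, cap, line):
    """Return True on a hit; on a miss, insert the line with LRU eviction."""
    if line in cache:
        cache.move_to_end(line)
        return True
    cache[line] = None
    if len(cache) > cap:
        cache.popitem(last=False)  # evict the least recently used line
    return False

random.seed(1)
l1, l2 = OrderedDict(), OrderedDict()
l1_hits = l1_refs = l2_hits = l2_refs = 0
for _ in range(10_000):
    base = random.randrange(1 << 32)       # jump somewhere completely random...
    for off in range(128):                 # ...then read 128 contiguous bytes
        line = (base + off) // LINE
        l1_refs += 1
        if access(l1, L1_CAP, line):
            l1_hits += 1
        else:                              # the L2 only ever sees the random first touches
            l2_refs += 1
            l2_hits += access(l2, L2_CAP, line)
print(f"L1 hit rate: {l1_hits / l1_refs:.3f}, L2 hit rate: {l2_hits / max(l2_refs, 1):.3f}")
```

The L1 soaks up all the contiguous accesses (hit rate around 98% here), while the L2 is handed nothing but random addresses it has never seen, so its hit rate collapses - despite being sixteen times larger.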
Note that the penalties imposed by replay do not depend on the quality of the program code or on the number of branches in it.
I don't think I agree with that - good code is going to hit in the L1 enough that replay doesn't have to happen for every memory access. The pathological access patterns used to measure L2 latency are just that: pathological. Even OLTP workloads, which tend to be awful for caches aren't going to have this kind of access pattern.
In the terms of semiconductor electronic hole conduction, it will be "a hole".
:roll: I've usually heard it called a "bubble" in computer architecture.
From Chapter VI we remember that there are FIVE schedulers like that in the Pentium 4 core. They are all independent of one another, each of them has its own queue, and it means…
It means that each scheduler has its own replay system. In other words, there are 10 fictitious pipelines hidden from the user in Northwood core!
That's not necessarily true - if there's a scheduler for just regular integer/boolean operations, it probably doesn't need replay. You only need to support replay if you have to speculate about latencies.
1. In the overwhelming majority of computational algorithms, the data locality space is much bigger than the L1 cache.
2. No type of data prefetch command can load a cache line directly into the L1.
1. Correct, your complete working set for a whole program is usually larger than the L1, but well-written apps usually handle chunks of data at a time, where each chunk is sized specifically to fit in the L1.
2. That's not true - there are a variety of schemes to prefetch into the L1. I'm pretty sure the P4 has ~4 stride prefetchers to fetch into the L1D.
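The "chunks sized to fit in the L1" point is exactly what loop tiling (cache blocking) does. A minimal sketch - the tile size here is an illustrative assumption; in practice you'd pick it so the active blocks of all three matrices fit in the L1:

```python
def matmul_blocked(A, B, n, tile=64):
    """n x n matrix multiply over flat row-major lists, processed tile x tile.

    Each inner i/k/j triple only touches tile-sized blocks of A, B and C, so
    the working set stays cache-resident instead of streaming all of B
    through the L1 for every single row of A. The arithmetic is identical to
    the naive triple loop; only the traversal order acknowledges the cache.
    """
    C = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C
```

In a language like C this reordering alone can be worth an integer factor on large matrices, precisely because it turns L1 misses (and hence replays, on a P4) back into hits.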
This particular quote comes from his discussion of floating point-related memory accesses, and for what it's worth, FP apps are the ones that usually have
very good locality - i.e., they have very high L1 hit rates.
The worst thing is that there is nothing software developers can do to prevent replay: the CPU very aggressively reorders all instructions inside, which is why any data loading command may change its position relative to other instructions in the program code, even if it was placed very far away initially (and as we know, new independent command threads usually start with data loading).
I would be surprised if the P4 didn't slip some independent operations in. VERY surprised... with 2-way issue, if you have a load followed by a few instructions that depend on the load, you can stick some other independent operations in the second pipe while the load and its dependent operations execute/stall/whatever.
What would be really nice is if somebody could run some benchmarks on a CPU simulator modelling a real Northwood, and again modelling a Northwood with an oracle scheduler (one that sees the future and thus makes completely perfect decisions) - this is a technique commonly used to see how far from the unattainable ideal a particular design is. I would expect that in real-world applications, you won't be picking up anywhere near the 50-100% performance improvement this article might lead you to expect.
For what it's worth, future designs are likely to have MORE kinds of things like replay - more and more things have to be done speculatively to get further performance boosts. A really far-out proposal is
value prediction - when a load misses in the L1, you continue execution and guess what the value would have been. If you guess wrong? Replay. Another proposal related to memory has to do with handling loads and stores to the same address - in an out-of-order processor, you need to make sure that loads only see the value written by the latest earlier store... since you don't always know the addresses loads and stores use until very late in their execution, you can make guesses as to whether or not a given load needs to wait for an earlier store before executing. Again, if you guess wrong, you have to use some sort of replay.
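As a sketch of how value prediction plus replay might be modeled - the latencies are made-up, but the "last value" predictor is one of the schemes actually proposed in the literature (guess that a load returns whatever that address returned last time):

```python
HIT_LAT, MISS_LAT, REPLAY = 2, 20, 10   # illustrative latencies, not real silicon numbers

class LastValue:
    """Predict that a load returns whatever that address returned last time."""
    def __init__(self):
        self.table = {}
    def guess(self, addr):
        return self.table.get(addr, 0)
    def train(self, addr, value):
        self.table[addr] = value

def run_trace(trace, predictor):
    """trace: (address, loaded_value, l1_hit) tuples; returns total cycles.

    On an L1 miss, dependents proceed immediately on the guessed value; a
    correct guess hides the entire miss latency, while a wrong guess pays
    the full miss latency plus a replay of the dependent work.
    """
    cycles = 0
    for addr, value, hit in trace:
        if hit or predictor.guess(addr) == value:
            cycles += HIT_LAT
        else:
            cycles += MISS_LAT + REPLAY
        predictor.train(addr, value)
    return cycles

# A load that keeps missing on the same stable value: one mispredict, then the
# predictor hides every subsequent miss.
trace = [(0x40, 7, False)] * 5
print(run_trace(trace, LastValue()))
```

The gamble is the same one the replay system already makes for L1 hits, just pushed one level further: speculate on the common case, and eat a replay when you're wrong.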