pasted from another site.
Steve Wone said:Denormalising to improve performance? It sounds convincing, but it doesn't hold water.
Chris Date, which advanced the relational data model along with Dr Ted Codd, its creator, got tired of misinformed arguments against normalisation and systematically demolished them using scientific method - he got large databases and tested these assertions. I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems. This is the definitive text on database theory and design, currently in its eighth edition. Chris Date was an expert in this field when most of us were still running around barefoot.
He found that:
Some of them hold for special cases
All of them fail to pay off for general use
All of them are significantly worse for other special cases
It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.
Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.
It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.
There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
A type coercion is required before the columns can be compared (WTF?! fix it or go home)
One of the arguments of the comparison is an expression (no index)
Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.
If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerates the spread of Date's findings.
For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.
You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.
I'd also like to respond to
Joins are just cartesian products with some lipgloss
What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as cartesian products to which predicates apply only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.
The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B