
SQL structure critique

pasted from another site.
Steve Wone said:
Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who advanced the relational data model alongside its creator, Dr Ted Codd, got tired of misinformed arguments against normalisation and systematically demolished them using the scientific method - he took large databases and tested these assertions. I think he wrote it up in Relational Database Writings 1988-1991, but this material was later rolled into the sixth edition of Introduction to Database Systems. This is the definitive text on database theory and design, currently in its eighth edition. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

Some of them hold for special cases
All of them fail to pay off for general use
All of them are significantly worse for other special cases
It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
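The claim that a join logically needs only the keys shows up directly in query plans. A minimal sketch, using SQLite as a stand-in (the customer/orders schema is invented for illustration): with an index on the join column, the plan uses an index SEARCH on the inner table rather than a second full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

# EXPLAIN QUERY PLAN returns (id, parent, notused, detail) rows; the detail
# column describes each step. Typically: one SCAN of the outer table, then
# an index/rowid SEARCH for each inner-table probe - no materialised rows
# are touched until the keys have matched.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT c.name, o.total FROM customer c JOIN orders o ON o.customer_id = c.id"
).fetchall()
for row in plan:
    print(row[3])
```

The exact wording of the plan varies by SQLite version, but the point stands: with a suitable index on the join column, one side of the join is probed via the index rather than scanned.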

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
A type coercion is required before the columns can be compared (WTF?! fix it or go home)
One of the arguments of the comparison is an expression (no index)
Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.
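Two of the scan conditions above - a usable index versus an expression wrapped around the column - are easy to demonstrate in a plan. A sketch, again using SQLite as a stand-in with a hypothetical single table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE INDEX idx_t_val ON t(val)")

# Indexed column compared directly: the optimiser can SEARCH the index.
indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 'x'"
).fetchall()[0][3]

# Wrapping the column in an expression defeats the index: full table SCAN.
expr = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE lower(val) = 'x'"
).fetchall()[0][3]

print(indexed)  # an index SEARCH, e.g. using idx_t_val
print(expr)     # a SCAN of t
```

The same comparison, with and without a function around the column, forces exactly the scan/search split described above.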

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerate the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

Joins are just cartesian products with some lipgloss

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as cartesian products with predicates applied only inside the query optimiser. This is a symbolic representation (a normalisation, in fact) that facilitates symbolic decomposition, so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity in order to select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
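A quick illustration of that last point, using SQLite and two throwaway tables: with no predicate, the result really is the full cartesian product - every row of A paired with every row of B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (x INTEGER);
    CREATE TABLE B (y INTEGER);
    INSERT INTO A VALUES (1),(2),(3);
    INSERT INTO B VALUES (10),(20);
""")

# No join predicate: 3 rows x 2 rows = 6 result rows, nothing pruned.
rows = conn.execute("SELECT * FROM A, B").fetchall()
print(len(rows))  # 6
```

Add any equality predicate on the two tables and the optimiser immediately stops materialising the product.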
 
Articles (and interview questions) that support different types of denormalization:


There are lots of articles and trains of thought either way. Every database is different, and there isn't a cut-and-dried time to do it; it depends on different factors and database usage.

I can sit here and quote chapter and verse on the different things I have done to normalize or denormalize data (the ones that aren't confidential anyway), but it is immaterial to the discussion. RedSquirrel asked for opinions, I gave one. You can have a different opinion, and RedSquirrel can do whatever he wants. It doesn't affect me or you in any way, so I couldn't care less. I'm just posting to pass on knowledge I have gained, because I wish people had done that when I was learning years ago.
 
There are lots of articles and trains of thought either way. Every database is different, and there isn't a cut-and-dried time to do it; it depends on different factors and database usage.

Which is why I said benchmark it in my very first post. It's the ONLY way to know. Like what I pasted: scientific method to the rescue. Myths busted.

There's a reason "Premature Optimization is the root of all evil" is thrown around so frequently. Because 9 times out of 10, the optimization wasn't needed in the first place, so why bother?
 
Well, I for instance understand logical database table design very well. Normalization, denormalization, etc.. What datatypes to use, etc.

Many of the higher end tuning skills i know nothing about. I just haven't had the experience or need. I'd love to learn about it though.
 
Which is why I said benchmark it in my very first post. It's the ONLY way to know. Like what I pasted: scientific method to the rescue. Myths busted.

There's a reason "Premature Optimization is the root of all evil" is thrown around so frequently. Because 9 times out of 10, the optimization wasn't needed in the first place, so why bother?

I guess it depends.

Are you building a database? or a database AND the app that's going to use it?

In some situations, you may be building a database that many different developers may be working on, for different apps, so you can't take their app building approach into consideration. You have to stick to the basics to build a solid database structure. In this case, you can't really denormalize because you don't know how the apps' selects are going to be asking for data.

I've never had the urge to denormalize UNTIL after I was building an app and I said to myself "you know, every select I've written is asking for all the data from both tables, I wonder if I would be better off just combining them".

Even then, I think benchmarking is important because you never really know. Every DBMS can handle things differently.
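A rough sketch of what such a benchmark might look like, using an in-memory SQLite database and a made-up person/address schema (on a real system you would run this against your own DBMS, data volumes, and actual queries):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (person_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE person_flat (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
""")
conn.executemany("INSERT INTO person VALUES (?,?)",
                 [(i, f"name{i}") for i in range(10_000)])
conn.executemany("INSERT INTO address VALUES (?,?)",
                 [(i, f"city{i % 50}") for i in range(10_000)])
conn.executemany("INSERT INTO person_flat VALUES (?,?,?)",
                 [(i, f"name{i}", f"city{i % 50}") for i in range(10_000)])

def bench(sql, n=20):
    """Average wall-clock time of running the query n times."""
    start = time.perf_counter()
    for _ in range(n):
        conn.execute(sql).fetchall()
    return (time.perf_counter() - start) / n

joined = bench("SELECT p.name, a.city FROM person p "
               "JOIN address a ON a.person_id = p.id")
flat = bench("SELECT name, city FROM person_flat")
print(f"join: {joined:.6f}s  flat: {flat:.6f}s")
```

Whichever way the numbers come out here says nothing about your system - the point is only that the comparison takes minutes to set up, so there is no excuse for guessing.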

But I don't really know much either so I'm just speaking from my limited experience.
 
I've never had the urge to denormalize UNTIL after I was building an app and I said to myself "you know, every select I've written is asking for all the data from both tables, I wonder if I would be better off just combining them".

If a query gets all data from two tables, there's something wrong with the schema. But even then, denormalizing later on is easier than trying to normalize, so you should always start normalized, then work backwards when tests warrant it.
 
If a query gets all data from two tables, there's something wrong with the schema.

No, it doesn't. There may be optimization opportunities, but that can be a horizontally partitioned table, which is a way to speed up searches, especially on heaps.
 
Sorry, I said horizontally partitioned (rows); I meant vertically partitioned (columns). Either type of partition requires more than one table anyway - that is why it is called partitioned. MySQL (the new version, anyway) does vertical partitioning physically but combines the partitions logically. All other DB programs I know of require it to be done manually, where the tables look like more than one table logically as well.
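For DBMSs where the split has to be done manually, the usual pattern is two tables sharing a primary key - hot, frequently-read columns in one, bulky or rarely-read columns in the other - optionally recombined with a view. A sketch in SQLite (the product_hot/product_cold schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hot columns: small, read constantly.
    CREATE TABLE product_hot  (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    -- Cold columns: bulky, read rarely. Same primary key.
    CREATE TABLE product_cold (id INTEGER PRIMARY KEY, long_description TEXT);
    -- A view recombines the partitions when the full row is wanted.
    CREATE VIEW product AS
        SELECT h.id, h.name, h.price, c.long_description
        FROM product_hot h JOIN product_cold c USING (id);
    INSERT INTO product_hot  VALUES (1, 'widget', 9.99);
    INSERT INTO product_cold VALUES (1, 'a very long description...');
""")

# Common queries hit only the narrow hot table...
hot = conn.execute("SELECT name, price FROM product_hot WHERE id = 1").fetchone()
print(hot)  # ('widget', 9.99)

# ...while the occasional full read goes through the view.
full = conn.execute("SELECT long_description FROM product WHERE id = 1").fetchone()
print(full[0])
```

The narrow hot table keeps more rows per page, so the common queries read far fewer pages than a scan of the full-width table would.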
 
I guess it depends.

Are you building a database? or a database AND the app that's going to use it?

In some situations, you may be building a database that many different developers may be working on, for different apps, so you can't take their app building approach into consideration. You have to stick to the basics to build a solid database structure. In this case, you can't really denormalize because you don't know how the apps' selects are going to be asking for data.

I've never had the urge to denormalize UNTIL after I was building an app and I said to myself "you know, every select I've written is asking for all the data from both tables, I wonder if I would be better off just combining them".

Even then, I think benchmarking is important because you never really know. Every DBMS can handle things differently.

But I don't really know much either so I'm just speaking from my limited experience.

In my case I am actually writing an app that uses a DB. Only this app will be accessing the DB. This is actually what has been happening, I normalized it in a way that seemed the most logical for expandability, then while writing some of the code I realized I was constantly joining and getting all data. There are a few situations where I only grab from the main table.

Think I will just go ahead and keep it the way it is now. The way I'm coding things, it should be fairly easy to change it in the future if I really need to. Eventually I want to add an "object explorer" so my existing structure will be needed for that.
 
I've seen a number of enterprise applications with database structures similar to the way you've described it. Typically, the application metadata (e.g. the objects or entities used by the applications) is highly configurable, and this requires a very flexible underlying data table structure.

To that end, unless you are giving the end user a high level of customization/configuration, you may find this type of structure limiting down the road.
 