SQL structure critique

Red Squirrel

No Lifer
www.anyf.ca
I posted something similar a while back when I was thinking of going with a very generic structure where my program takes care of everything. It would have been highly normalized, but it would have created very complex queries. I decided to simplify it: I have a main object table with generic fields such as ID, parent ID, type, etc., and each object type has its own table with the first field always being the object ID.

When I query, I query both the main object table and the table that matches that object's type. This is working out nicely so far.

Now, I was thinking, the only advantage this has is that I'm not constantly repeating the same generic fields for each object type, and if I add a new generic field later on I only have to edit one table. Other than that, querying is still a tad more complex, and I'm wondering if it will be slower in the future when there are tons of objects.

Would you suggest combining these into one table per object type?

Here is my current structure:


Code:
dataobject table (main table, everything "derives" from this)

CREATE TABLE `dataobject` (
  `oid` bigint(20) unsigned NOT NULL auto_increment,
  `opid` bigint(20) unsigned NOT NULL,
  `otype` char(255) NOT NULL,
  `oownerid` bigint(20) unsigned NOT NULL,
  `oserverid` bigint(20) unsigned NOT NULL,
  `oslaveserverid` bigint(20) unsigned NOT NULL,
  UNIQUE KEY `oid` (`oid`),
  KEY `opid` (`opid`,`oownerid`)
) ENGINE=MyISAM AUTO_INCREMENT=207 DEFAULT CHARSET=latin1;


A typical object type table (each object type has its own table with its own fields)


Code:
CREATE TABLE `dataobject_user` (
  `objectid` bigint(20) NOT NULL,
  `uname` varchar(255) default NULL,
  `upass` varchar(255) default NULL,
  `uhomedir` text NOT NULL,
  `ushell` text NOT NULL,
  PRIMARY KEY  (`objectid`)
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;


Right now if I want to list all the users, or do a search, I do a SELECT on the dataobject table with a join to dataobject_user, matching the oid and objectid fields.
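For example, listing all the users ends up looking roughly like this (a sketch; I'm assuming here that the otype value for users is simply 'user'):

Code:
SELECT o.oid, o.opid, o.oownerid, u.uname, u.uhomedir, u.ushell
FROM dataobject AS o
JOIN dataobject_user AS u ON u.objectid = o.oid
WHERE o.otype = 'user';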

Is this a decent setup (fewer fields per table, overall), or am I better off just repeating the main fields in each table so I'm not doing joins all the time?
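(Just to illustrate the alternative I'm asking about: repeating the main fields would mean each type table carries its own copy of the generic columns, so a hypothetical combined user table would look something like this.)

Code:
CREATE TABLE `dataobject_user_combined` (
  `oid` bigint(20) unsigned NOT NULL auto_increment,
  `opid` bigint(20) unsigned NOT NULL,
  `oownerid` bigint(20) unsigned NOT NULL,
  `oserverid` bigint(20) unsigned NOT NULL,
  `oslaveserverid` bigint(20) unsigned NOT NULL,
  -- no otype column needed; the table itself is the type
  `uname` varchar(255) default NULL,
  `upass` varchar(255) default NULL,
  `uhomedir` text NOT NULL,
  `ushell` text NOT NULL,
  PRIMARY KEY (`oid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;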
 

Evadman

Administrator Emeritus / Elite Member
Sometimes it is useful to denormalize a table for speed reasons. It is possible to normalize a database too much. There is no hard-and-fast rule on when to denormalize; it depends on many factors, the biggest one being how busy the server is.

If I understand correctly, you are going to be reading from the dataobject table on every query, correct? You will have lock issues if your disk isn't up to the task, or if you don't have enough memory to pin the dataobject table.
 

Red Squirrel

No Lifer
www.anyf.ca
Evadman said:
Sometimes it is useful to denormalize a table for speed reasons. It is possible to normalize a database too much. There is no hard-and-fast rule on when to denormalize; it depends on many factors, the biggest one being how busy the server is.

If I understand correctly, you are going to be reading from the dataobject table on every query, correct? You will have lock issues if your disk isn't up to the task, or if you don't have enough memory to pin the dataobject table.

What would happen in the event of a lock issue? Will it wait until the lock is released, or will it actually crash? I had the impression locking was not really an issue with MySQL or most other DBMSes.

I am doing my best to keep the number of queries down, but it could potentially get high depending on how many users are on the app, e.g. in the future there could be 100+ users managing their sites at once (this is for a control panel system).

I've even thought of just loading everything into memory and working with it that way, but that's very inefficient. It's the lazy way to get performance, IMO. As your DB grows, so does your RAM usage.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
www.markbetz.net
Red Squirrel said:
What would happen in the event of a lock issue? Will it wait until the lock is released, or will it actually crash? I had the impression locking was not really an issue with MySQL or most other DBMSes.

I am doing my best to keep the number of queries down, but it could potentially get high depending on how many users are on the app, e.g. in the future there could be 100+ users managing their sites at once (this is for a control panel system).

I've even thought of just loading everything into memory and working with it that way, but that's very inefficient. It's the lazy way to get performance, IMO. As your DB grows, so does your RAM usage.

Locking is an intrinsic issue for SQL databases. Typically, if a query cannot complete because it can't acquire access to a table or row, it will time out and return an error, and the data-access layer will then throw an exception.
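In MySQL specifically (worth double-checking against whichever engine you end up on), MyISAM locks at the table level, so a write simply queues other statements behind it until it finishes, while InnoDB locks rows and a blocked statement waits up to innodb_lock_wait_timeout before failing with error 1205, which your code can catch and retry. For example:

Code:
-- Show which connections are currently waiting on locks
SHOW FULL PROCESSLIST;

-- On InnoDB, shorten how long a blocked statement waits before giving up with error 1205
SET SESSION innodb_lock_wait_timeout = 5;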
 

Train

Lifer
www.bing.com
Red Squirrel said:
Is this a decent setup (fewer fields per table, overall), or am I better off just repeating the main fields in each table so I'm not doing joins all the time?

One of the most famous computer scientists of all time is oft quoted:
Donald Knuth said:
Premature optimization is the root of all evil

And for good reason. If you were to actually do a performance test on the above scenario, you would likely not see a speed benefit until you hit Google-sized databases. Anyone telling you otherwise is full of it.

The downside is that you've denormalized your DB and made inserts/updates harder and more error-prone, which, in the real world, FAR, FAR, FAR outweighs the premature optimization you are thinking of.

"Joins all the time" is nothing for a modern RDBMS; don't worry about it at all. As long as the joined columns are indexed, you're covered: people far smarter than you or me have already taken care of this when they wrote the RDBMS.
 

Red Squirrel

No Lifer
www.anyf.ca
Train said:
One of the most famous computer scientists of all time is oft quoted:
Donald Knuth said:
Premature optimization is the root of all evil

And for good reason. If you were to actually do a performance test on the above scenario, you would likely not see a speed benefit until you hit Google-sized databases. Anyone telling you otherwise is full of it.

The downside is that you've denormalized your DB and made inserts/updates harder and more error-prone, which, in the real world, FAR, FAR, FAR outweighs the premature optimization you are thinking of.

"Joins all the time" is nothing for a modern RDBMS; don't worry about it at all. As long as the joined columns are indexed, you're covered: people far smarter than you or me have already taken care of this when they wrote the RDBMS.

I'd like to half disagree with the optimization statement. You're most likely right that I won't see a difference now, but I'd rather do it properly now, so that when/if I do hit Google size I don't start running into issues. I see this way too often in IT environments: something is way undersized, and then you start running into issues a year later.

Good to know that the joins aren't such a big deal, though; I really wasn't sure whether that was putting a lot of stress on the DB or not.

The locking issue is what scares me; I really did not know this was a problem with MySQL. What are the proper ways to prevent it from happening, or at least to catch the error so I can try again? Using mysql++, would it throw some kind of exception I can catch so I can just retry the query? Also, how probable is a lock scenario? I've abused MySQL DBs quite a lot and have never run into issues where a query outright crashes due to too much load.
 

Train

Lifer
www.bing.com
Red Squirrel said:
The locking issue is what scares me; I really did not know this was a problem with MySQL. What are the proper ways to prevent it from happening, or at least to catch the error so I can try again? Using mysql++, would it throw some kind of exception I can catch so I can just retry the query? Also, how probable is a lock scenario? I've abused MySQL DBs quite a lot and have never run into issues where a query outright crashes due to too much load.

I'm not as familiar with MySQL, but with SQL Server I have beaten up servers with tens of millions of inserts/updates/queries in a day and never run into an actual locking issue. If I ever DID see a lock in the SQL logs that lasted longer than a few milliseconds, it was due to some buggy code, not the load on the DB.

What kind of locks are you talking about, specifically? Inserts and updates? How many rows per lock? How many tables per lock? If it's just a row or two at a time to do some updates, DON'T EVEN BOTHER.

The only real consideration with this sort of thing is if you want certain read guarantees, such as REPEATABLE READ (google that), which can get hairy.
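(For reference, the isolation level is set per session or per transaction, roughly like this; it only matters on a transactional engine such as InnoDB, since MyISAM doesn't do transactions at all.)

Code:
-- Only meaningful on InnoDB; MyISAM tables ignore isolation levels entirely
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT COUNT(*) FROM dataobject WHERE otype = 'user';  -- reads in this transaction see a consistent snapshot
COMMIT;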

Before you do anything, benchmark it. Most likely you won't need some harebrained optimization.
 

TechBoyJK

Lifer
How come your IDs are bigints and not just ints? Do you really need that many possibilities? INT offers approx 2.1 BILLION uniques.
 

Red Squirrel

No Lifer
www.anyf.ca
TechBoyJK said:
How come your IDs are bigints and not just ints? Do you really need that many possibilities? INT offers approx 2.1 BILLION uniques.

I use bigint more or less to play it safe, as I want to avoid having any kind of potential program limitations. With that said, how much of a performance impact do bigints have over ints? Maybe I can use ints instead. Now, are these ints 32-bit, so I should have 2^32 uniques, right? What about on a 64-bit platform, are they actually 64-bit?

I can't really see my app actually hitting these limits, so maybe I should in fact convert to int.

Some of these may get big, mind you; for example, when I get into graphing, each graph sample will be a single data object. But even then... billions is quite a lot.
 

Evadman

Administrator Emeritus / Elite Member
Red Squirrel said:
how much of a performance impact do bigints have over ints?
bigint is twice the size of int, so in theory it will take twice as long to scan an index. In reality, that isn't actually the case due to how hard disks work.

Train, if you haven't seen a locking issue on MS SQL, then you are working with some low-user, low-query databases. Over-normalization is a real issue and needs to be taken into account when creating the DB, no matter what backend you deal with. If every query is joining to a half dozen tables, concurrency will suck because the disk will not be able to keep up without some decent hardware.
 

TechBoyJK

Lifer
Red Squirrel said:
I use bigint more or less to play it safe, as I want to avoid having any kind of potential program limitations. With that said, how much of a performance impact do bigints have over ints? Maybe I can use ints instead. Now, are these ints 32-bit, so I should have 2^32 uniques, right? What about on a 64-bit platform, are they actually 64-bit?

I can't really see my app actually hitting these limits, so maybe I should in fact convert to int.

Some of these may get big, mind you; for example, when I get into graphing, each graph sample will be a single data object. But even then... billions is quite a lot.

Yeah, billions is quite high. Even if you get to 1.1 billion records, you still need to add another billion to hit the ceiling. You're adding time to search the index, which, if you DO get close to that billion mark, could have an impact.

Seriously, do you really think you'll need more than 1 billion IDs? That'd mean that table would have over 1 billion records. So if you do get close to 1 billion, just the fact that BIGINT takes 8 bytes instead of INT's 4 bytes (a 4-byte difference x 1 billion rows) could be an issue, along with the size of 1 billion x all the other columns. Seriously, if my math is correct, the difference in table size between an ID stored as int and an ID stored as bigint, at 1 billion records, is about 4 gigabytes. That's a huge amount for a single table, especially since that 4GB chunk is absolutely unnecessary: at 1 billion records (the cause of the extra 4GB of space) you still haven't proven the need for BIGINT.
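If you do decide to drop down to plain INT, it's one ALTER per table; something like this against the tables from the first post (unsigned INT still gives you the full ~4.29 billion range):

Code:
-- Shrink the ID columns from 8 bytes to 4
ALTER TABLE dataobject
  MODIFY oid int(10) unsigned NOT NULL auto_increment,
  MODIFY opid int(10) unsigned NOT NULL,
  MODIFY oownerid int(10) unsigned NOT NULL,
  MODIFY oserverid int(10) unsigned NOT NULL,
  MODIFY oslaveserverid int(10) unsigned NOT NULL;

ALTER TABLE dataobject_user
  MODIFY objectid int(10) unsigned NOT NULL;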

I'd do everything I can to squeeze as many records into a single page as possible. I'm thinking on MS SQL Server each page is about 8 KB.

Another thing: yeah, don't over-normalize. For instance, if most of your selects are joining two tables, you might be better off just making them one big table.

When you get to the point of 1 billion or more records in a table, I'd say that's bad design. At that point, you probably want to split that into two tables.
 

Train

Lifer
www.bing.com
Evadman said:
bigint is twice the size of int, so in theory it will take twice as long to scan an index. In reality, that isn't actually the case due to how hard disks work.
That's not how indexes work either. Ever heard of a binary search? Indexes are set up as binary trees. C'mon, man, this is CS 101. Even with a billion records, there is a maximum of about 30 comparisons to find a record in a binary index; the average would be half that, so 15 comparisons. The difference (assuming a 64-bit CPU) of 15 comparisons on 32-bit vs 64-bit values is incredibly insignificant. Hell, even on 32-bit, a couple of cycles, when you're talking about a CPU that can do 2-3 billion cycles per second times x cores, is insignificant.

That being said, OP, stick to INT for indexes; the real issue here is using these ints in your application code, which will prefer 32-bit ints by default.
Evadman said:
Train, if you haven't seen a locking issue on MS SQL, then you are working with some low-user, low-query databases. Over-normalization is a real issue and needs to be taken into account when creating the DB, no matter what backend you deal with. If every query is joining to a half dozen tables, concurrency will suck because the disk will not be able to keep up without some decent hardware.

I have beaten up servers with tens of millions of inserts/updates/queries in a day and never run into an actual locking issue. If I ever DID see a lock in the SQL logs that lasted longer than a few milliseconds, it was due to some buggy code, not the load on the DB.
And yes, this includes queries joining a dozen tables.

Indexes are everything on large DBs.
 

Train

Lifer
www.bing.com
TechBoyJK said:
When you get to the point of 1 billion or more records in a table, I'd say that's bad design. At that point, you probably want to split that into two tables.

How so? What if he legitimately has 1 billion of something, like car parts? I've seen DBs of every part that can fit on every car ever made; those can get big, and with dozens of new models coming out every year, and more and more 3rd-party and aftermarket parts, they keep getting bigger.

Normalization is for splitting up columns, not rows.

Even if you look at the uber DBs like Google's BigTable, they don't LOGICALLY split tables, they only PHYSICALLY do so, across servers. SQL Server can do the same thing, splitting tables across files or even HDs, but logically keeping it as one table.
 

Crusty

Lifer
Train said:
How so? What if he legitimately has 1 billion of something, like car parts? I've seen DBs of every part that can fit on every car ever made; those can get big, and with dozens of new models coming out every year, and more and more 3rd-party and aftermarket parts, they keep getting bigger.

Normalization is for splitting up columns, not rows.

Even if you look at the uber DBs like Google's BigTable, they don't LOGICALLY split tables, they only PHYSICALLY do so, across servers. SQL Server can do the same thing, splitting tables across files or even HDs, but logically keeping it as one table.

I have a table with 900 million records; does that make me special? ():)
 

TechBoyJK

Lifer
Train said:
How so? What if he legitimately has 1 billion of something, like car parts? I've seen DBs of every part that can fit on every car ever made; those can get big, and with dozens of new models coming out every year, and more and more 3rd-party and aftermarket parts, they keep getting bigger.

Normalization is for splitting up columns, not rows.

Even if you look at the uber DBs like Google's BigTable, they don't LOGICALLY split tables, they only PHYSICALLY do so, across servers. SQL Server can do the same thing, splitting tables across files or even HDs, but logically keeping it as one table.

I wasn't referring to normalization. I understand what it is; I scored a perfect score on the logical table design section of the SQL 2000 design exam.

What I was saying is that if you actually let a single table get to 1 billion records or more, that's probably bad design. 'Probably.'

Now if a particular use justifies it, and you have the resources to manage a table with 1billion+ records, you take that into design consideration...

To me it seems like it would be wise to consider alternatives in design if a single table is going to have over 1 billion records.
 

Evadman

Administrator Emeritus / Elite Member
Train, CPU is almost never the limiting factor on a large database; it is disk read speed. Comparing a few hundred records (or a few hundred thousand) is almost inconsequential, just like you mentioned. Database speed is all about disk latency and the disk subsystem. Looking through a B-tree for a half dozen records takes time, especially when you are going through an index that is larger than it needs to be.

In databases, first you optimize for read speed, then you optimize for CPU speed. In some cases, that means denormalizing, and this design is a perfect candidate for denormalization. If you don't recognize that, it is probably because you are overbuilding your hardware, or you haven't run across this in a production environment.

Also, like TechBoy said, a table with a billion records is almost never the best solution. At minimum it should be horizontally partitioned. MySQL makes that easy; it is more difficult with other engines.
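In MySQL (5.1 and later) that can be as little as a partitioning clause on the table definition; roughly, using a hypothetical partitioned copy of the dataobject table from the first post:

Code:
-- Rows are spread across partitions by hashing the ID; each partition is stored as a separate chunk
CREATE TABLE dataobject_partitioned (
  oid bigint(20) unsigned NOT NULL auto_increment,
  opid bigint(20) unsigned NOT NULL,
  otype char(255) NOT NULL,
  oownerid bigint(20) unsigned NOT NULL,
  oserverid bigint(20) unsigned NOT NULL,
  oslaveserverid bigint(20) unsigned NOT NULL,
  PRIMARY KEY (oid)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
PARTITION BY HASH (oid)
PARTITIONS 8;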
 

Train

Lifer
www.bing.com
Evadman said:
Train, CPU is almost never the limiting factor on a large database; it is disk read speed. Comparing a few hundred records (or a few hundred thousand) is almost inconsequential, just like you mentioned. Database speed is all about disk latency and the disk subsystem. Looking through a B-tree for a half dozen records takes time, especially when you are going through an index that is larger than it needs to be.

More made-up BS. Besides, that's not what you said:

Evadman said:
bigint is twice the size of int, so in theory it will take twice as long to scan an index. In reality, that isn't actually the case due to how hard disks work.

OP, like I said before, optimize only after benchmarks show you there is actually a need. The rest is a bunch of hot air people want to blow up your ass.

Trust the facts!

Keep it civil. This is not P&N.
Markbnj
Programming moderator
 

Evadman

Administrator Emeritus / Elite Member
Train, at this very moment I am working on a database that has more data than you could believe. I think you are taking my comments in some different fashion than how I am stating them. I do know what I am talking about.

select sum(maxperm) as usedbytes from dbc.DiskSpace

usedbytes
-----------
173433215376766

(1 row(s) affected)
 

Evadman

Administrator Emeritus / Elite Member
If he had access to benchmarking tools, why would he be asking us to critique his design? We are acting as his benchmarking tools. Changing a database structure after an application is utilizing it is much more difficult than doing it now.
 

Train

Lifer
www.bing.com
Evadman said:
If he had access to benchmarking tools, why would he be asking us to critique his design?

Because he didn't realize (as many DB coders don't) that benchmarking is the only real test you need.

If he can write code to access a DB, he can benchmark it. Generating a large set of test data and timing queries is trivial in any language.
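For example (a rough sketch against the tables from the first post, again assuming 'user' as the otype value): keep re-inserting a table into itself to inflate it, then time the query you actually care about.

Code:
-- Each run doubles the number of rows in dataobject; repeat until the table is big enough
INSERT INTO dataobject (opid, otype, oownerid, oserverid, oslaveserverid)
SELECT opid, otype, oownerid, oserverid, oslaveserverid
FROM dataobject;

-- Then time the real query (SQL_NO_CACHE keeps the query cache from hiding the true cost)
SELECT SQL_NO_CACHE o.oid, u.uname
FROM dataobject AS o
JOIN dataobject_user AS u ON u.objectid = o.oid
WHERE o.otype = 'user';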
 

Evadman

Administrator Emeritus / Elite Member
You would be surprised how many DBAs apply for the jobs I have and yet have no idea what a cross join is. Generating a bunch of data isn't as trivial for everyone as you expect, and the same goes for generating a workload that will be accurate to what a user base will do. I have 2 people on my team, and all they do is generate test workloads for different applications. It is a very specific skill that needs much better definition (and a lot of practice), and I usually end up teaching it because very few people have the skillset.
 

Train

Lifer
www.bing.com
Evadman said:
You would be surprised how many DBAs apply for the jobs I have and yet have no idea what a cross join is. Generating a bunch of data isn't as trivial for everyone as you expect, and the same goes for generating a workload that will be accurate to what a user base will do. I have 2 people on my team, and all they do is generate test workloads for different applications. It is a very specific skill that needs much better definition (and a lot of practice), and I usually end up teaching it because very few people have the skillset.

Well, if he can't even time a query, he DEFINITELY shouldn't be trying to denormalize a DB. Not sure what point you're trying to make here.
 