"Per-user" relational model? Need some conceptual advice.

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Hey guys. So I'm tinkering with an idea for an app, but I'm a bit stumped for a good approach for the back-end. I have a good schema/data model already, but it's only set up for one user. For this app to work using my existing techniques, algorithms, and data model, each user basically needs its own database. For scalability reasons, giving each user its own micro-DB like SQLite is out of the question (i.e. I don't want one instance per user).

The naive approach is to simply add the user ID to each table's primary key, which would also be a foreign key to a parent user table. I'm worried about the performance implications of this, however. I'm not really sure how this would affect real-world DB performance - is there some kind of clever partitioning and indexing we could do to mitigate the performance penalty?
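To make that concrete, here's roughly what I mean, with placeholder table and column names (the real schema has more tables and columns than this):

Code:
-- Placeholder names; the real schema is more involved.
CREATE TABLE dbo.AppUser (
    UserId INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Node (
    UserId   INT NOT NULL REFERENCES dbo.AppUser (UserId),  -- FK to the parent user table
    NodeId   INT NOT NULL,
    NodeData NVARCHAR(200) NULL,
    CONSTRAINT PK_Node PRIMARY KEY (UserId, NodeId)         -- user ID folded into the primary key
);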

Another approach is to normalize a bit more, where only one table has a reference to the parent user table and each child table references that one instead. I would be worried about the cost of the extra joins here, since we'd automatically be introducing two joins for any query. And really, each user's records are completely unrelated. A textbook would say that this is the right answer, but I just don't buy it.
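In other words, something shaped like this (again, made-up names), where only one table knows about the user and everything else hangs off it, so reaching any per-user row costs two joins:

Code:
-- Made-up names; alternative layout where only Graph references the user.
CREATE TABLE dbo.Graph (
    GraphId INT NOT NULL PRIMARY KEY,
    UserId  INT NOT NULL REFERENCES dbo.AppUser (UserId)
);

CREATE TABLE dbo.GraphNode (
    NodeId   INT NOT NULL PRIMARY KEY,
    GraphId  INT NOT NULL REFERENCES dbo.Graph (GraphId),
    NodeData NVARCHAR(200) NULL
);

-- Any per-user query has to hop through Graph:
DECLARE @UserId INT = 42;
SELECT n.NodeId, n.NodeData
FROM dbo.AppUser u
JOIN dbo.Graph g     ON g.UserId  = u.UserId
JOIN dbo.GraphNode n ON n.GraphId = g.GraphId
WHERE u.UserId = @UserId;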

What other approaches would you recommend?

If it matters, I'm planning on targeting MS SQL Server 2012 and the whole thing would be written using Entity Framework.

Thanks!
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
I guess I don't know enough about what you mean when you say each user needs his or her own "database." Just on the surface I would approach it from a normalized, relational schema until I proved that doing that is insufficient for some reason. Unless the stuff you're storing in the user "database" has some odd characteristics I can't see why it would be. I wouldn't add the user ID to the primary key for the tables. I would have the user ID be a foreign key on the dependent tables, again until such time as I could show why that wouldn't work.
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
I guess I don't know enough about what you mean when you say each user needs his or her own "database." Just on the surface I would approach it from a normalized, relational schema until I proved that doing that is insufficient for some reason. Unless the stuff you're storing in the user "database" has some odd characteristics I can't see why it would be. I wouldn't add the user ID to the primary key for the tables. I would have the user ID be a foreign key on the dependent tables, again until such time as I could show why that wouldn't work.

So what I'm reading from you is "premature optimization is the root of all evil."

Let me try to clarify what I meant about a user needing its own database - I really have no idea what I was thinking. My coffee had not kicked in yet! Forget it.

Let's talk storage and indices, shall we? But first, let's talk about the data.

The type of data structure required by the program is essentially a disconnected, stateless, directed graph. So basically, no single node is considered a root, and each node in the same graph may or may not be connected to the rest of the graph. The only thing they have in common is that they're in the same graph. There is a one-to-one relationship between user and graph: one user represents one graph.

Suppose we simply have a user table, where the node table has a foreign key to it. What I'm worried about is performance due to fragmented data. The algorithms that'll be acting on this structure are either very recursive or very pseudo-random within that user's graph. In other words, they require locality for performance. Knowing this, would clustering on the foreign key column solve this problem?
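To be concrete (still placeholder names, building on the Node sketch from my first post, whose PRIMARY KEY (UserId, NodeId) is clustered by default): the idea is that clustering on the user FK keeps each user's graph physically contiguous, edges included.

Code:
-- Placeholder names. Node is already clustered on (UserId, NodeId) as sketched
-- earlier, so one user's nodes are physically adjacent; do the same for edges.
CREATE TABLE dbo.Edge (
    UserId     INT NOT NULL,
    FromNodeId INT NOT NULL,
    ToNodeId   INT NOT NULL,
    CONSTRAINT PK_Edge PRIMARY KEY CLUSTERED (UserId, FromNodeId, ToNodeId),
    CONSTRAINT FK_Edge_From FOREIGN KEY (UserId, FromNodeId)
        REFERENCES dbo.Node (UserId, NodeId),
    CONSTRAINT FK_Edge_To FOREIGN KEY (UserId, ToNodeId)
        REFERENCES dbo.Node (UserId, NodeId)
);

-- Pulling one user's whole graph is then two range scans on the clustered keys:
DECLARE @UserId INT = 42;
SELECT NodeId, NodeData     FROM dbo.Node WHERE UserId = @UserId;
SELECT FromNodeId, ToNodeId FROM dbo.Edge WHERE UserId = @UserId;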
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
There are a couple of other options beyond a relational database (which basically necessitates a design where user data is separated by a compound key containing the user's unique identifier). The first is that there are dedicated graph storage systems; you might find one of those (a NoSQL DB) highly appropriate to your design. The other possibility is a document-based NoSQL solution, where you store each user's entire graph in one or more documents. That would keep the users entirely unrelated to each other, and the user ID would be nothing more than the key into the graph storage.

Bear in mind that people have been doing what you are suggesting with SQL databases for over 30 years. With indexes on the user ID they can be remarkably fast even with incredibly large tables (tens of millions of rows). If you really have a problem that is going to break a SQL solution, then sure, go ahead and try something else; but if you are talking about less than 1 million rows, you aren't likely to have an issue.
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
There are a couple of other options beyond a relational database (which basically necessitates a design where user data is separated by a compound key containing the user's unique identifier). The first is that there are dedicated graph storage systems; you might find one of those (a NoSQL DB) highly appropriate to your design. The other possibility is a document-based NoSQL solution, where you store each user's entire graph in one or more documents. That would keep the users entirely unrelated to each other, and the user ID would be nothing more than the key into the graph storage.

Bear in mind that people have been doing what you are suggesting with SQL databases for over 30 years. With indexes on the user ID they can be remarkably fast even with incredibly large tables (tens of millions of rows). If you really have a problem that is going to break a SQL solution, then sure, go ahead and try something else; but if you are talking about less than 1 million rows, you aren't likely to have an issue.

Due to other constraints, this must be in MS SQL Server, so any NoSQL implementation is out, unfortunately. I do think we'll see less than 1 million rows per user. What are the performance implications of larger data sets?
 

uclabachelor

Senior member
Nov 9, 2009
448
0
71
You haven't provided any valid reason as to why you need a database per user.

You can normalize the schema and include the user id as a key. JOIN performance won't take much of a hit (if any) if indexed and queried properly.

If the reason is that your users each have unique data and need to store it without serializing the data, you can always utilize the entity-attribute-value (EAV) approach and still maintain a single database across all users.
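As a rough sketch of what I mean by EAV (all names invented), it's one skinny table that can hold arbitrarily shaped per-user data:

Code:
-- Rough entity-attribute-value (EAV) sketch; all names are invented.
CREATE TABLE dbo.UserAttribute (
    UserId         INT           NOT NULL,  -- key back to the user table
    EntityId       INT           NOT NULL,  -- which of the user's objects this row describes
    AttributeName  NVARCHAR(100) NOT NULL,
    AttributeValue NVARCHAR(400) NULL,
    CONSTRAINT PK_UserAttribute PRIMARY KEY (UserId, EntityId, AttributeName)
);

-- Differently shaped data per user, no schema changes:
INSERT INTO dbo.UserAttribute (UserId, EntityId, AttributeName, AttributeValue)
VALUES (42, 1, N'color',  N'red'),
       (42, 1, N'weight', N'7');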
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Below a million rows, pretty much anything works within reasonable time scales. Once tables start going beyond that, it requires more careful thought about how queries run and how you join them. Joining two tables of 10 million rows each, in ways that can produce a lot of results or can't utilise good indexes on both sides, can be really slow. Databases do tend to get trickier to optimise once you pass 1 million rows in tables that have to be joined, even on today's hardware; that tends to be the point at which you have to start thinking about denormalising.
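By denormalising I mean things like pushing the user ID down onto the child rows so the hot per-user query doesn't have to join through the parent at all. A rough illustration, assuming a parent/child layout like the Graph/GraphNode sketch earlier in the thread (placeholder names):

Code:
-- Illustrative denormalisation: copy UserId onto the child table so the
-- common per-user query can skip the join through Graph entirely.
ALTER TABLE dbo.GraphNode ADD UserId INT NULL;

UPDATE n
SET n.UserId = g.UserId
FROM dbo.GraphNode n
JOIN dbo.Graph g ON g.GraphId = n.GraphId;

CREATE INDEX IX_GraphNode_UserId ON dbo.GraphNode (UserId);

-- Hot path now hits a single table:
DECLARE @UserId INT = 42;
SELECT NodeId, NodeData FROM dbo.GraphNode WHERE UserId = @UserId;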
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
Below a million rows, pretty much anything works within reasonable time scales. Once tables start going beyond that, it requires more careful thought about how queries run and how you join them. Joining two tables of 10 million rows each, in ways that can produce a lot of results or can't utilise good indexes on both sides, can be really slow. Databases do tend to get trickier to optimise once you pass 1 million rows in tables that have to be joined, even on today's hardware; that tends to be the point at which you have to start thinking about denormalising.

I don't know that I totally agree. You can get a lot done with good indexes on a table. At my company, we have one table that is used very heavily with over 50 million rows in it. It is pretty responsive and, like the OP's case, it is commonly joined against a second table.

We haven't changed the structure in a long time. Mostly, it is just good indices that have gotten us this far.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up. Like I said, my company regularly works with tables > 50 million rows.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
So what I'm reading from you is "premature optimization is the root of all evil."

Let me try to clarify what I meant about a user needing its own database - I really have no idea what I was thinking. My coffee had not kicked in yet! Forget it.

Let's talk storage and indices, shall we? But first, let's talk about the data.

The type of data structure required by the program is essentially a disconnected, stateless, directed graph. So basically, no single node is considered a root, and each node in the same graph may or may not be connected to the rest of the graph. The only thing they have in common is that they're in the same graph. There is a one-to-one relationship between user and graph: one user represents one graph.

Suppose we simply have a user table, where the node table has a foreign key to it. What I'm worried about is performance due to fragmented data. The algorithms that'll be acting on this structure are either very recursive or very pseudo-random within that user's graph. In other words, they require locality for performance. Knowing this, would clustering on the foreign key column solve this problem?

Perhaps not the root of all evil, but at least not the best place to start. More to the point, it's probably not the main issue here. Seems to me the big decision is how to represent the graph efficiently in SQL Server, not how to key a set of nodes to a specific user.

I've never dealt with anything hairier than a simple hierarchy in SQL Server. I know it can be done, and I know the queries get exponentially less efficient as the graph gets deeper.
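The usual tool for that kind of walk is a recursive CTE. Here's a minimal sketch, assuming node/edge tables along the lines of the ones sketched earlier in the thread (placeholder names throughout):

Code:
-- Minimal recursive-CTE walk; assumes an Edge(UserId, FromNodeId, ToNodeId)
-- table like the one sketched earlier. Placeholder names throughout.
DECLARE @UserId INT = 42, @StartNodeId INT = 1;

WITH Reachable AS (
    SELECT e.ToNodeId, 1 AS Depth
    FROM dbo.Edge e
    WHERE e.UserId = @UserId AND e.FromNodeId = @StartNodeId

    UNION ALL

    SELECT e.ToNodeId, r.Depth + 1
    FROM dbo.Edge e
    JOIN Reachable r ON e.FromNodeId = r.ToNodeId
    WHERE e.UserId = @UserId
      AND r.Depth < 10   -- cap the depth; cycles would otherwise recurse until MAXRECURSION trips
)
SELECT DISTINCT ToNodeId
FROM Reachable
OPTION (MAXRECURSION 100);

Each additional level is another pass over the edge table, which is where the cost piles up as the graph gets deeper.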

Here's a good piece I came across while reading up before replying to this:

http://www.neotechnology.com/relational-database-graph-database/
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
I don't know that I totally agree. You can get a lot done with good indexes on a table. At my company, we have one table that is used very heavily with over 50 million rows in it. It is pretty responsive and, like the OP's case, it is commonly joined against a second table.

We haven't changed the structure in a long time. Mostly, it is just good indices that have gotten us this far.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up. Like I said, my company regularly works with tables > 50 million rows.

I've pulled rows out of a MySQL table with almost 1 billion records in milliseconds. A good index goes a long way.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
SQL Server has several features to make hierarchies work efficiently, but you'll probably need to make the DB itself a bit less normalized, wrapping the edge cases that entails into the app, sprocs, triggers, etc. One potentially useful trick would be to separate out graph roots and leaves, rather than using one uber-generic M:M linker table.
Code:
root (parent_branch_data, child_branch_data)
leaf (branch_data, leaf_data)
Same data, but if most queries are going to look bottom-up or top-down, it can simplify things.
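In actual DDL, one reading of the sketch above might look roughly like this (names and ID columns are only illustrative; I've included the user key so each user's rows stay together):

Code:
-- Rough DDL for the root/leaf split above; names are only illustrative.
CREATE TABLE dbo.RootEdge (
    UserId         INT NOT NULL,
    ParentBranchId INT NOT NULL,
    ChildBranchId  INT NOT NULL,
    CONSTRAINT PK_RootEdge PRIMARY KEY (UserId, ParentBranchId, ChildBranchId)
);

CREATE TABLE dbo.LeafEdge (
    UserId   INT NOT NULL,
    BranchId INT NOT NULL,
    LeafId   INT NOT NULL,
    CONSTRAINT PK_LeafEdge PRIMARY KEY (UserId, BranchId, LeafId)
);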

Having dealt with a smaller and simpler but similar graph problem (making graphs user-editable in a web interface, without sacrificing normalization, and still allowing point-in-time historical queries), I can say it's a mental PITA, because there isn't a way that perfectly handles every corner case, and computer-generated keys end up working their way up from the physical implementation into the logical design.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up.
That's a contradiction in terms. A denormalized structure is by definition less good and clean than a normalized one.

Looking stuff up in 50 million rows should be no big deal. Joining across millions of filtered rows can be. That's something to worry about when you get there, though.
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
That's a contradiction in terms. A denormalized structure is by definition less good and clean than a normalized one.

Looking stuff up in 50 million rows should be no big deal. Joining across millions of filtered rows can be. That's something to worry about when you get there, though.

Sorry, I mistyped. I really did mean a normalized table model.