"Per-user" relational model? Need some conceptual advice.

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
Hey guys. So I'm tinkering with an idea for an app, but I'm a bit stumped for a good approach for the back-end. I have a good schema/data model already, but it's only set up for one user. For this app to work using my existing techniques, algorithms, and data model, each user basically needs its own database. For scalability reasons, giving each user its own micro-DB like SQLite is out of the question (i.e. I don't want one instance per user).

The naive approach is to simply add the user ID to each table's primary key, which would also be a foreign key to a parent user table. I'm worried about the performance implications of this, however. I'm not really sure how this would affect real-world DB performance - is there some kind of clever partitioning and indexing we could do to mitigate the performance penalty?
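To make that concrete, here's roughly what I mean, with placeholder table and column names (the real schema has more tables and columns than this):

Code:
-- Placeholder names; the real schema is more involved.
CREATE TABLE dbo.AppUser (
    UserId INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Node (
    UserId   INT NOT NULL REFERENCES dbo.AppUser (UserId),  -- FK to the parent user table
    NodeId   INT NOT NULL,
    NodeData NVARCHAR(200) NULL,
    CONSTRAINT PK_Node PRIMARY KEY (UserId, NodeId)         -- user ID folded into the primary key
);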

Another approach is to normalize a bit more, where only one table has a reference to the parent user table and each child table references that one instead. I would be worried about the cost of the extra joins here, since we'd automatically be introducing two joins for any query. And really, each user's records are completely unrelated. A textbook would say that this is the right answer, but I just don't buy it.
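In other words, something shaped like this (again, made-up names), where only one table knows about the user and everything else hangs off it, so reaching any per-user row costs two joins:

Code:
-- Made-up names; alternative layout where only Graph references the user.
CREATE TABLE dbo.Graph (
    GraphId INT NOT NULL PRIMARY KEY,
    UserId  INT NOT NULL REFERENCES dbo.AppUser (UserId)
);

CREATE TABLE dbo.GraphNode (
    NodeId   INT NOT NULL PRIMARY KEY,
    GraphId  INT NOT NULL REFERENCES dbo.Graph (GraphId),
    NodeData NVARCHAR(200) NULL
);

-- Any per-user query has to hop through Graph:
DECLARE @UserId INT = 42;
SELECT n.NodeId, n.NodeData
FROM dbo.AppUser u
JOIN dbo.Graph g     ON g.UserId  = u.UserId
JOIN dbo.GraphNode n ON n.GraphId = g.GraphId
WHERE u.UserId = @UserId;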

What other approaches would you recommend?

If it matters, I'm planning on targeting MS SQL Server 2012 and the whole thing would be written using Entity Framework.

Thanks!
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
I guess I don't know enough about what you mean when you say each user needs his or her own "database." Just on the surface I would approach it from a normalized, relational schema until I proved that doing that is insufficient for some reason. Unless the stuff you're storing in the user "database" has some odd characteristics I can't see why it would be. I wouldn't add the user ID to the primary key for the tables. I would have the user ID be a foreign key on the dependent tables, again until such time as I could show why that wouldn't work.
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
I guess I don't know enough about what you mean when you say each user needs his or her own "database." Just on the surface I would approach it from a normalized, relational schema until I proved that doing that is insufficient for some reason. Unless the stuff you're storing in the user "database" has some odd characteristics I can't see why it would be. I wouldn't add the user ID to the primary key for the tables. I would have the user ID be a foreign key on the dependent tables, again until such time as I could show why that wouldn't work.

So what I'm reading from you is "premature optimization is the root of all evil."

Let me try to clarify what I meant about a user needing its own database - I really have no idea what I was thinking. My coffee had not kicked in yet! Forget it.

Let's talk storage and indices, shall we? But first, let's talk about the data.

The type of data structure required by the program is essentially a disconnected, stateless, directed graph. So basically, no single node is considered a root, and each node in the same graph may or may not be connected to the rest of the graph. The only thing they have in common is that they're in the same graph. There is a one-to-one relationship between user and graph: one user represents one graph.

Suppose we simply have a user table, where the node table has a foreign key to it. What I'm worried about is performance due to fragmented data. The algorithms that'll be acting on this structure are either very recursive or very pseudo-random within that user's graph. In other words, they require locality for performance. Knowing this, would clustering on the foreign key column solve this problem?
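To be concrete (still placeholder names, building on the Node sketch from my first post, whose PRIMARY KEY (UserId, NodeId) is clustered by default): the idea is that clustering on the user FK keeps each user's graph physically contiguous, edges included.

Code:
-- Placeholder names. Node is already clustered on (UserId, NodeId) as sketched
-- earlier, so one user's nodes are physically adjacent; do the same for edges.
CREATE TABLE dbo.Edge (
    UserId     INT NOT NULL,
    FromNodeId INT NOT NULL,
    ToNodeId   INT NOT NULL,
    CONSTRAINT PK_Edge PRIMARY KEY CLUSTERED (UserId, FromNodeId, ToNodeId),
    CONSTRAINT FK_Edge_From FOREIGN KEY (UserId, FromNodeId)
        REFERENCES dbo.Node (UserId, NodeId),
    CONSTRAINT FK_Edge_To FOREIGN KEY (UserId, ToNodeId)
        REFERENCES dbo.Node (UserId, NodeId)
);

-- Pulling one user's whole graph is then two range scans on the clustered keys:
DECLARE @UserId INT = 42;
SELECT NodeId, NodeData     FROM dbo.Node WHERE UserId = @UserId;
SELECT FromNodeId, ToNodeId FROM dbo.Edge WHERE UserId = @UserId;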
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
There are a couple of other options beyond a relational database (which basically necessitates a design where user data is separated by a compound key containing the user's unique identifier). The first is that there are dedicated graph storage systems; you might find one of those (a NoSQL DB) highly appropriate to your design. The other possibility is a document-based NoSQL solution, where you store each user's entire graph in one or more documents. That would keep the users entirely unrelated to each other, and the user ID would be nothing more than the key into the graph storage.

Bear in mind that people have been doing what you are suggesting with SQL databases for over 30 years. With indexes on the user ID they can be remarkably fast even with incredibly large tables (tens of millions of rows). If you really have a problem that is going to break a SQL solution, then sure, go ahead and try something else; but if you are talking about less than 1 million rows, you aren't likely to have an issue.
 

slugg

Diamond Member
Feb 17, 2002
4,723
80
91
There are a couple of other options beyond a relational database (which basically necessitates a design where user data is separated by a compound key containing the user's unique identifier). The first is that there are dedicated graph storage systems; you might find one of those (a NoSQL DB) highly appropriate to your design. The other possibility is a document-based NoSQL solution, where you store each user's entire graph in one or more documents. That would keep the users entirely unrelated to each other, and the user ID would be nothing more than the key into the graph storage.

Bear in mind that people have been doing what you are suggesting with SQL databases for over 30 years. With indexes on the user ID they can be remarkably fast even with incredibly large tables (tens of millions of rows). If you really have a problem that is going to break a SQL solution, then sure, go ahead and try something else; but if you are talking about less than 1 million rows, you aren't likely to have an issue.

Due to other constraints, this must be in MS SQL Server, so any NoSQL implementation is out, unfortunately. I do think we'll see less than 1 million rows per user. What are the performance implications of larger data sets?
 

uclabachelor

Senior member
Nov 9, 2009
448
0
71
You haven't provided any valid reason as to why you need a database per user.

You can normalize the schema and include the user id as a key. JOIN performance won't take much of a hit (if any) if indexed and queried properly.

If the reason is that your users each have unique data and need to store it without serializing the data, you can always utilize the entity-attribute-value (EAV) approach and still maintain a single database across all users.
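As a rough sketch of what I mean by EAV (all names invented), it's one skinny table that can hold arbitrarily shaped per-user data:

Code:
-- Rough entity-attribute-value (EAV) sketch; all names are invented.
CREATE TABLE dbo.UserAttribute (
    UserId         INT           NOT NULL,  -- key back to the user table
    EntityId       INT           NOT NULL,  -- which of the user's objects this row describes
    AttributeName  NVARCHAR(100) NOT NULL,
    AttributeValue NVARCHAR(400) NULL,
    CONSTRAINT PK_UserAttribute PRIMARY KEY (UserId, EntityId, AttributeName)
);

-- Differently shaped data per user, no schema changes:
INSERT INTO dbo.UserAttribute (UserId, EntityId, AttributeName, AttributeValue)
VALUES (42, 1, N'color',  N'red'),
       (42, 1, N'weight', N'7');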
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Below a million rows, pretty much anything works within reasonable time scales. Once tables start going beyond that, it requires more careful thought about how queries run and how you join them. Joining two tables of 10 million rows each, in ways that can produce a lot of results or can't utilise good indexes on both sides, can be really slow. Databases do tend to get trickier to optimise once you pass 1 million rows in tables that have to be joined, even on today's hardware; that tends to be the point at which you have to start thinking about denormalising.
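By denormalising I mean things like pushing the user ID down onto the child rows so the hot per-user query doesn't have to join through the parent at all. A rough illustration, assuming a parent/child layout like the Graph/GraphNode sketch earlier in the thread (placeholder names):

Code:
-- Illustrative denormalisation: copy UserId onto the child table so the
-- common per-user query can skip the join through Graph entirely.
ALTER TABLE dbo.GraphNode ADD UserId INT NULL;

UPDATE n
SET n.UserId = g.UserId
FROM dbo.GraphNode n
JOIN dbo.Graph g ON g.GraphId = n.GraphId;

CREATE INDEX IX_GraphNode_UserId ON dbo.GraphNode (UserId);

-- Hot path now hits a single table:
DECLARE @UserId INT = 42;
SELECT NodeId, NodeData FROM dbo.GraphNode WHERE UserId = @UserId;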
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
Below a million rows, pretty much anything works within reasonable time scales. Once tables start going beyond that, it requires more careful thought about how queries run and how you join them. Joining two tables of 10 million rows each, in ways that can produce a lot of results or can't utilise good indexes on both sides, can be really slow. Databases do tend to get trickier to optimise once you pass 1 million rows in tables that have to be joined, even on today's hardware; that tends to be the point at which you have to start thinking about denormalising.

I don't know that I totally agree. You can get a lot done with good indexes on a table. At my company, we have one table that is used very heavily with over 50 million rows in it. It is pretty responsive and, like the OP's case, it is commonly joined against a second table.

We haven't changed the structure in a long time. Mostly, it is just good indices that have gotten us this far.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up. Like I said, my company regularly works with tables > 50 million rows.
 

Markbnj

Elite Member / Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
So what I'm reading from you is "premature optimization is the root of all evil."

Let me try to clarify what I meant about a user needing its own database - I really have no idea what I was thinking. My coffee had not kicked in yet! Forget it.

Let's talk storage and indices, shall we? But first, let's talk about the data.

The type of data structure required by the program is essentially a disconnected, stateless, directed graph. So basically, no single node is considered a root, and each node in the same graph may or may not be connected to the rest of the graph. The only thing they have in common is that they're in the same graph. There is a one-to-one relationship between user and graph: one user represents one graph.

Suppose we simply have a user table, where the node table has a foreign key to it. What I'm worried about is performance due to fragmented data. The algorithms that'll be acting on this structure are either very recursive or very pseudo-random within that user's graph. In other words, they require locality for performance. Knowing this, would clustering on the foreign key column solve this problem?

Perhaps not the root of all evil, but at least not the best place to start. More to the point, it's probably not the main issue here. Seems to me the big decision is how to represent the graph efficiently in SQL Server, not how to key a set of nodes to a specific user.

I've never dealt with anything hairier than a simple hierarchy in SQL Server. I know it can be done, and I know the queries get exponentially less efficient as the graph gets deeper.
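The usual tool for that kind of walk is a recursive CTE. Here's a minimal sketch, assuming node/edge tables along the lines of the ones sketched earlier in the thread (placeholder names throughout):

Code:
-- Minimal recursive-CTE walk; assumes an Edge(UserId, FromNodeId, ToNodeId)
-- table like the one sketched earlier. Placeholder names throughout.
DECLARE @UserId INT = 42, @StartNodeId INT = 1;

WITH Reachable AS (
    SELECT e.ToNodeId, 1 AS Depth
    FROM dbo.Edge e
    WHERE e.UserId = @UserId AND e.FromNodeId = @StartNodeId

    UNION ALL

    SELECT e.ToNodeId, r.Depth + 1
    FROM dbo.Edge e
    JOIN Reachable r ON e.FromNodeId = r.ToNodeId
    WHERE e.UserId = @UserId
      AND r.Depth < 10   -- cap the depth; cycles would otherwise recurse until MAXRECURSION trips
)
SELECT DISTINCT ToNodeId
FROM Reachable
OPTION (MAXRECURSION 100);

Each additional level is another pass over the edge table, which is where the cost piles up as the graph gets deeper.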

Here's a good piece I came across while reading up before replying to this:

http://www.neotechnology.com/relational-database-graph-database/
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
I don't know that I totally agree. You can get a lot done with good indexes on a table. At my company, we have one table that is used very heavily with over 50 million rows in it. It is pretty responsive and, like the OP's case, it is commonly joined against a second table.

We haven't changed the structure in a long time. Mostly, it is just good indices that have gotten us this far.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up. Like I said, my company regularly works with tables > 50 million rows.

I've pulled rows out of a MySQL table with almost 1 billion records in milliseconds. A good index goes a long way.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
SQL Server has several features to make hierarchies work efficiently, but you'll probably need to make the DB itself a bit less normalized, wrapping the edge cases that entails into the app, sprocs, triggers, etc. One potentially useful trick would be to separate out graph roots and leaves, rather than using one uber-generic M:M linker table.
Code:
root (parent_branch_data, child_branch_data)
leaf (branch_data, leaf_data)
Same data, but if most queries are going to look bottom-up or top-down, it can simplify things.
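In actual DDL, one reading of the sketch above might look roughly like this (names and ID columns are only illustrative; I've included the user key so each user's rows stay together):

Code:
-- Rough DDL for the root/leaf split above; names are only illustrative.
CREATE TABLE dbo.RootEdge (
    UserId         INT NOT NULL,
    ParentBranchId INT NOT NULL,
    ChildBranchId  INT NOT NULL,
    CONSTRAINT PK_RootEdge PRIMARY KEY (UserId, ParentBranchId, ChildBranchId)
);

CREATE TABLE dbo.LeafEdge (
    UserId   INT NOT NULL,
    BranchId INT NOT NULL,
    LeafId   INT NOT NULL,
    CONSTRAINT PK_LeafEdge PRIMARY KEY (UserId, BranchId, LeafId)
);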

Having dealt with a smaller and simpler but similar graph problem (making graphs user-editable in a web interface, without sacrificing normalization, and still allowing point-in-time historical queries), I can say it's a mental PITA, because there isn't a way that perfectly handles every corner case, and computer-generated keys end up working their way up from the physical implementation into the logical design.

Either way, I would suggest building out a good clean denormalized structure first and then worry about performance issues when they come up.
That's a contradiction in terms. A denormalized structure is by definition less good and clean than a normalized one.

Looking stuff up in 50 million rows should be no big deal. Joining across millions of filtered rows can be. That's something to worry about when you get there, though.
 

Cogman

Lifer
Sep 19, 2000
10,286
145
106
That's a contradiction in terms. A denormalized structure is by definition less good and clean than a normalized one.

Looking stuff up in 50 million rows should be no big deal. Joining across millions of filtered rows can be. That's something to worry about when you get there, though.

Sorry, I mistyped. I really did mean a normalized table model.