
Database formatting advice

GWestphal

Golden Member
I have a database comprised of approximately 20 million rows and about 25 columns. 23 rows are just numbers/text, 2 rows are blobs with an actual file stored inside: one column holds files between 1-400KB in size, while the other is consistently about 10KB. Of the 23 number/text columns, 11 of them have a fixed number of possible values, between 4-10 options.

My question is: should I just slam everything into one 20Mx25 table, or should I break it up into tables, and if so, in what way? I don't really care about database size as much as I do about database performance.

We will need to run queries to figure out how many rows have property X in column 5 and property Y in column 6 and things like that.

We'll also be pulling the files out and analyzing them, and writing the results, duration, important markers into another table.

My concern with different tables would be that having to do joins or whatever might affect performance if we need to run 20 million queries to run a particular process on each of the files.
 
Why look through 20M records for 20 possible answers?
Split those unique record groups into separate tables.
 
I think you meant...
23 columns are just numbers/text, 2 columns are blobs
Right?

BLOBs generally seem to be slow. Can you just write them as files somewhere else, and index their filenames? Or, even better, could you name the files based on the database table index?

Strings in a table generally aren't that slow, but integers are faster. When there is a consistent set of strings for a column, for best performance the idea is to make a lookup table for each such column, mapping the strings to an integer (the automatic table index). Then when searching on a string you do one quick lookup per string, and the rest is numerical lookups.
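The string-to-integer mapping described above can be sketched like this (SQLite in-memory, with invented table/column names; the same pattern carries over to SQL Server or MySQL):

```python
import sqlite3

# Lookup table mapping a small set of strings to integer ids; the big
# table stores only the integer. Names here are made up for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE status_lookup (
        id    INTEGER PRIMARY KEY,   -- the integer the big table stores
        label TEXT UNIQUE NOT NULL   -- the original string
    );
    CREATE TABLE readings (
        id        INTEGER PRIMARY KEY,
        status_id INTEGER REFERENCES status_lookup(id)
    );
""")
con.executemany("INSERT INTO status_lookup(label) VALUES (?)",
                [("normal",), ("elevated",), ("critical",)])

# Store rows using the small integer id instead of the string.
con.execute("INSERT INTO readings(status_id) "
            "SELECT id FROM status_lookup WHERE label = 'elevated'")

# Searching on a string: one quick lookup, then purely numeric work.
rows = con.execute("""
    SELECT COUNT(*) FROM readings
    WHERE status_id = (SELECT id FROM status_lookup WHERE label = 'elevated')
""").fetchone()
print(rows[0])  # 1
```

MySQL's ENUM type (mentioned later in the thread) does essentially this mapping internally.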
 
How you organize your database is going to depend largely on the relationships between your data and the queries you will be making. It sounds like it is a simple database with no real parent-child relationships, so leaving it as one table isn't a problem. If some data is never used again you might consider creating history tables to archive that data, to improve performance when querying the main table. But even if you reduce the table by half you still won't get as much performance benefit as you would if you create proper indexes based on the parameters you query.
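A toy demonstration of that last point (SQLite in-memory; table and column names invented): a composite index on exactly the columns in the WHERE clause lets the engine satisfy the query from the index instead of scanning the table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col5 INTEGER, col6 INTEGER, payload TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(i % 10, i % 7, "x") for i in range(1000)])

# Composite index matching the query's WHERE clause.
con.execute("CREATE INDEX idx_t_col5_col6 ON t(col5, col6)")

cnt = con.execute(
    "SELECT COUNT(*) FROM t WHERE col5 = 3 AND col6 = 2").fetchone()[0]
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM t WHERE col5 = 3 AND col6 = 2"
).fetchall()
print(cnt)          # 14
print(plan[0][-1])  # plan shows the query is answered via idx_t_col5_col6
```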
 
@Ken yes I meant columns there.

I plan to write them out, because we do need to have physical copies of the data. This database of blobs is currently the only copy of that dataset, so we need to pull them out as individual files.

Not sure about renaming them, I guess if we kept a record of the original name we could, but it is FDA regulated data so I'm not sure if that is kosher or not.

Most of the strings are date/time, so they are all unique; how would the string mapping work there?

Also, with multiple tables would I need to use joins? If so, would that significantly slow performance?
 

It's called table normalization.

A 20M x 25 table like you've described will be really slow. Proper indexes will speed it up a little, but it will still be slow.
 

Meh, it depends on the DB.

For example, if a common search is "which rows have column x set to y," then in SQL Server 2012 (which calls them filtered indexes) and Postgres you can create a partial index on rows where column x equals y, which will make querying for those rows wickedly fast.
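A minimal sketch of a partial index, here shown in SQLite (which also supports them); the table, flag values, and index name are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, flag TEXT, payload TEXT)")
con.executemany("INSERT INTO events (flag, payload) VALUES (?, ?)",
                [("rare" if i % 1000 == 0 else "common", "x")
                 for i in range(5000)])

# Only rows matching the predicate get indexed, so the index stays
# tiny no matter how big the table grows.
con.execute("CREATE INDEX idx_rare ON events(flag) WHERE flag = 'rare'")

rare_ids = [r[0] for r in con.execute(
    "SELECT id FROM events WHERE flag = 'rare'")]
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM events WHERE flag = 'rare'").fetchall()
print(len(rare_ids))  # 5
```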
 
Convert your date field to a BCD number
yyyymmddhh

do the same with the time in 24 hour format
hhmmss

Separate columns for each.
This will allow search by date to be efficient vs. string comparison.
Keep the date string in a separate column for reference if desired.

Have the file info in a separate table - that way only an index is returned; not the info. You can then query on that index if you need the data.
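The date-to-integer packing above can be sketched as follows. Note this is a sketch using yyyymmdd for the date column and hhmmss for the time column; the input format string is an invented example, not the thread's actual data format.

```python
from datetime import datetime

def pack_datetime(s):
    """'2013-06-05 14:30:07' -> (20130605, 143007)"""
    dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    date_int = dt.year * 10000 + dt.month * 100 + dt.day      # yyyymmdd
    time_int = dt.hour * 10000 + dt.minute * 100 + dt.second  # hhmmss
    return date_int, time_int

print(pack_datetime("2013-06-05 14:30:07"))  # (20130605, 143007)
```

Because both integers sort in chronological order, range queries like `WHERE date_int BETWEEN 20130101 AND 20131231` work directly. (A native datetime column, as suggested later in the thread, achieves the same thing.)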
 
The blob/file data will never be altered. We will pull each data file into matlab process it and output the results into another table.

With that in mind which is the highest performing option, considering the files/blobs are read only:

1)inline blobs like the majority of the data is already (some data as just files on the filesystem hasn't been added yet)

2)Filestream column

3)just a pathname to a location on the filesystem
 
My concern is that when adding the files into the database, the queries have to do extra work to either include the files (a large amount of data coming back, which will slow things up) or exclude those columns within the query (the data does not come back but needs to be filtered out by the DB engine).

By having a path name to the location, the database engine has to do no extra work other than return the string of the path. Then whoever needs the data has the ability to access it via their own tools.
 
I'm with EagleKeeper. I'd store the file system path, or serve the data up via http from a central server, and store the urls.
 
I agree with previous posters. LOB columns can be painfully slow even for relatively small stuff below 400KB. So slow it can easily be the bottleneck for everything else you do; in fact it almost certainly will be, unless these matlab calculations are very, very expensive.

What are those files? Chemical structures?
 
Signal processing on cardiac signals. Each one takes between 1-15 secs (working on speeding that up; this is legacy code from someone else that I'm trying to make more efficient).
 
A fairly simple thing to do would be to simply split up the table into 2 pieces: the easy numeric data and the blob data.

TABLE1
Index
data1
...
data23

TABLE2
Index
blob1
blob2

The index of course relates the two tables in a 1:1 manner. (Which breaks the rules of normalization, but I think it will improve performance to have a "lightweight" table separate from the data-intensive table. The entirety of TABLE1 in this instance would probably fit within the RAM of the host computer, which is always a good thing.) This is assuming that I'm reading you right, that you're not going to be doing crazy things that rely on selecting rows based on both data13 and blob2. This split-up will be good if you often do things like "give me all blob2's for rows matching data13=XYZ and data15=ABC and data19=PDQ" or "search all blob2's for a matching data pattern, then give me data1...data23 for those that match".
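A minimal SQLite sketch of that TABLE1/TABLE2 split: the shared primary key gives the 1:1 relationship, and the lightweight table can be scanned without ever touching the blob pages. Column names follow the post; the sample values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (
        id    INTEGER PRIMARY KEY,
        data1 TEXT,
        data2 INTEGER
    );
    CREATE TABLE table2 (
        id    INTEGER PRIMARY KEY REFERENCES table1(id),
        blob1 BLOB,
        blob2 BLOB
    );
""")
con.execute("INSERT INTO table1 VALUES (1, 'XYZ', 42)")
con.execute("INSERT INTO table2 VALUES (1, x'DEADBEEF', x'CAFE')")

# "give me all blob2's for rows matching data1 = 'XYZ'"
row = con.execute("""
    SELECT t2.blob2
    FROM table1 t1 JOIN table2 t2 ON t2.id = t1.id
    WHERE t1.data1 = 'XYZ'
""").fetchone()
print(row[0] == b"\xca\xfe")  # True
```

The join here is a primary-key lookup, so it costs almost nothing; the win is that filtering on the data columns never drags blob pages through the buffer cache.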

Definitely use the appropriate datetime/timestamp data structure provided by your DBMS; keeping those as strings will kill you if you query those often. The datetime structure basically stores the datetime information as integers. The more data that you can cast as integers, the better. MySQL has an "ENUM" (enumerated) data type that lets you specify a given number of string values that it stores internally as integers.

Oh and.... back up everything, EVERYTHING, before you start screwing around with the table structure! It is all too easy to lose it all in the blink of an eye due to a poorly constructed query. Put it on an external hard drive, and leave that drive disconnected, and never touch that copy again, except to make fresh copies when you need them.
 
Question: Is the filename part of a file's binary data, or is it just a filesystem container kind of thing? What I mean is, I have this huge number of blobs and they were stored without filenames. So is the binary data in the blob just the "data" and I need to give them a name when I write them out, or are the names included? I suspect the former, which is a huge pain, since I can get most of the file info from the saved metadata
("XXX SN XXX MN Type X YYYYMMDDHHMMSSMsMsMs.xxx"). The problem is the file name is generated with the Ms at the end (no idea how), but the time stamp in the .xml parameters file only has a quasi-Unix-style time stamp that is seconds since Jan 1 2005, so I'm not sure where or how to get the milliseconds part.
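If that XML field really is seconds since Jan 1 2005, converting it to a real timestamp is straightforward; what it can't give you is the millisecond digits, which simply aren't present in a whole-seconds field. A sketch (function name invented):

```python
from datetime import datetime, timedelta

# Assumed epoch per the post: seconds since Jan 1 2005.
DEVICE_EPOCH = datetime(2005, 1, 1)

def device_ts_to_datetime(seconds):
    """Convert the device's quasi-Unix timestamp to a datetime.
    This only resolves down to the second; the millisecond digits in
    the original filename cannot be recovered from this field alone."""
    return DEVICE_EPOCH + timedelta(seconds=seconds)

print(device_ts_to_datetime(0))      # 2005-01-01 00:00:00
print(device_ts_to_datetime(86400))  # 2005-01-02 00:00:00
```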
 

It would probably be helpful if you told us exactly how this data is stored now. Which DBMS, for one thing? Or is it all or partly stored in XML files? You're kind of giving mixed information here. You are talking about the "file names" in the blahblahblahMsMsMs format as if they already existed, but from what I've gathered so far, the files don't actually exist (they're just blobs in a DB), so how could the filenames exist?

My knee-jerk response is yes, the binary info in the blob is just the "data" (i.e., what would go inside the file) and the "metadata" such as filename would not be included in the blob. Usually file metadata resides in the filesystem structure of the OS, not in the file itself. However, it's possible that the blobs were purposely imported in such a way as to preserve metadata. If that were the case, I would expect it to be properly documented (i.e., "first xxxx bytes of each blob are filesystem metadata").
 
The blob is just data; you need to create a file to write that data to.
Once you create the file, the OS will allow you to track the file and then interface with it.
If you need metadata for the blob file, create a second file containing the metadata, using the same name as the first file but a different extension.
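That scheme can be sketched in a few lines. The `.dat`/`.xml` extensions and the record name here are invented for the example, not part of the actual dataset's naming convention:

```python
import os
import tempfile

def write_blob(out_dir, name, blob_bytes, meta_xml):
    """Write a blob to <name>.dat and its metadata to <name>.xml
    (same base name, different extension, as suggested above)."""
    data_path = os.path.join(out_dir, name + ".dat")
    meta_path = os.path.join(out_dir, name + ".xml")
    with open(data_path, "wb") as f:
        f.write(blob_bytes)
    with open(meta_path, "w", encoding="utf-8") as f:
        f.write(meta_xml)
    return data_path, meta_path

out_dir = tempfile.mkdtemp()
d, m = write_blob(out_dir, "record_0001", b"\x00\x01\x02",
                  "<meta><sn>123</sn></meta>")
```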
 
Sorry, it's kind of complicated. I was given a bunch of data sources and told to pool them, make them user accessible, run a bunch of analysis on the files, and pull analytics.

The device creates two files: one cardiac data file and one XML file with metadata. Both files are named XXX SN XXX MN Type X YYYYMMDDHHMMSSMsMsMs.xxx. That data is then sent to a third party processing center. Different centers in the past have processed the data in different ways; the first stored everything in databases, while the later ones kept filesystem copies.

Data Pool 1 (majority of data, ~80GB): cardiac data as blobs in a MSSQL database. No file names; only certain metadata parsed into tables. Can recreate most of the filename except for the millisecond digits at the end. Some metadata is not accessible because it wasn't parsed and saved into a table.

Data Pool 2 (minority of data, ~30GB): cardiac data and XML in file form. This is how I know what the filename format should be for Data Pool 1.


So basically Data Pool 1 is kind of incomplete due to how that 3rd party processed and saved it. Data Pool 2 is the original files.

So I need to write out the Data Pool 1 blobs to files, in case a single file is ever needed for, say, an FDA audit or some such. Then I should probably pull the Data Pool 2 files into a database.

I'm just a poor research engineer who got all this database stuff dropped on him. But reading up on things, it seems like all-blob storage might be the fastest for "streaming," but since I also need to make it user accessible in an easy way, using a filestream or filetable would accomplish both of the things I need to do.

Since the analysis part is going to be done in matlab, I'll likely just return the file paths and let matlab reference the files on the filesystem vs. "streaming" the blobs to matlab, which would have to (re)format them anyway.

[I should note, this is all being done locally at the moment SQL 2012 and files are on the local box I'm working on, but later on, they will likely want some sort of web front end]

So I've been fiddling with filestream and filetable. The filetable thing is kinda neat: I can just toss the data into a directory and it gets autopopulated into the filetable. But the Data Pool 1 data already has its own GUIDs that are used as unique keys to link the blobs to their associated metadata, and I haven't been able to figure out how to move the Data Pool 1 blobs and GUIDs into a filetable, so maybe filestream is more appropriate since it's lower level than filetable.

Ideally, I'd love to be able to drag files into a directory and have them autopopulate the database.
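SQL Server's FileTable does exactly that natively; outside of it, a small polling script can approximate the behavior. A sketch (table name, schema, and the one-pass design are all invented; a real watcher would loop on a timer or use filesystem notifications):

```python
import os
import sqlite3
import tempfile
import uuid

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ingested (
    guid     TEXT PRIMARY KEY,
    filename TEXT UNIQUE,
    data     BLOB)""")

def scan_and_ingest(watch_dir):
    """One polling pass: load any file not yet recorded in the table."""
    added = 0
    for name in sorted(os.listdir(watch_dir)):
        path = os.path.join(watch_dir, name)
        seen = con.execute(
            "SELECT 1 FROM ingested WHERE filename = ?", (name,)).fetchone()
        if seen or not os.path.isfile(path):
            continue  # already ingested, or not a regular file
        with open(path, "rb") as f:
            con.execute("INSERT INTO ingested VALUES (?, ?, ?)",
                        (str(uuid.uuid4()), name, f.read()))
        added += 1
    return added

watch = tempfile.mkdtemp()
with open(os.path.join(watch, "a.bin"), "wb") as f:
    f.write(b"\x01")
n1 = scan_and_ingest(watch)  # first pass picks up the new file
n2 = scan_and_ingest(watch)  # second pass finds nothing new
print(n1, n2)  # 1 0
```

Note this generates fresh GUIDs on insert; linking back to the existing Data Pool 1 GUIDs would require carrying them in, e.g., a sidecar metadata file.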
 