Software/database/method to work with 100-200 million rows / 3-5 columns of data?

PwAg

Senior member
Sep 20, 2000
769
0
71
Hi folks.

Looking for a software package/database/method to accurately & easily work with upwards of 200 million rows of data with 3-5 parameter columns.

Here is what I'm looking to do...
1. Pull out and save every 1500th/1501st row. This reduces a 200 million row x 3 column data set to ~266,666 rows x 3 columns. Once it's down to ~266,666 rows, I will likely drop it into Excel for further manipulation.

Here is a 3 column example:

Date, Time, Price
Row1 4/1/2010 08:30:05 1202.25 <- Save
Row2 4/1/2010 08:30:05 1202.50 <- Remove
Row3 4/1/2010 08:30:05 1202.25 <- Remove
..... <- Remove
..... <- Remove
Row1500 4/1/2010 08:31:02 1201.50 <- Save
Row1501 4/1/2010 08:31:02 1201.50 <- Save
......
......
Row3000 4/1/2010 08:31:58 1201.25 <- Save
Row3001 4/1/2010 08:31:58 1201.00 <- Save
.....
.....
Row4500
Row4501
.....
..... etc
Row150,000,000
Row150,000,001

Want to be able to easily switch the removal criterion so that I can save every 1200th/1201st row or 1000th/1001st, etc.

The data comes from the original source either in ASCII format or as a txt file.

Any input much appreciated.
 

bigi

Platinum Member
Aug 8, 2001
2,490
156
106
I'd go with SQL Server 2008. A table with a clustered index, all running on RAIDed (Intel) SSD storage. There are lots of rows to traverse, and SSDs will show their advantage here.

What is the budget?
 

PwAg

Senior member
Sep 20, 2000
769
0
71
Thanks for the input. Unfortunately I have no SQL experience. Will it be much of a learning curve?

Any way to pull this off using a database containing all relevant data and then performing the operation with Excel?

Thoughts on MySQL?

Will be tackling this on my prototyping workstation (XP Pro SP3, i7-920 @ 3.8 GHz, RAIDed Raptors). It will not be a "frequent" process, so utmost speed is not critical.
 

thedosbox

Senior member
Oct 16, 2009
961
0
0
Thanks for the input. Unfortunately I have no SQL experience. Will it be much of a learning curve?

SQL skills are fairly common. Bribe a friendly developer.

It's been a while, but I think they should be able to write an ETL script to load the data from the flat file and ignore the rows you don't want.

Alternatively, they could load all of the data with a row counter, and write SQL to keep just every nth row, where "n" is a parameter.
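Roughly, and only as a sketch (the table and column names here are made up, and it assumes the load step gave each row a counter like row_id in file order), the T-SQL could look like:

Code:
DECLARE @n INT = 1500;               -- the interval; swap in 1200, 1000, etc.

;WITH numbered AS (
    SELECT  [date], [time], price,
            ROW_NUMBER() OVER (ORDER BY row_id) AS rn   -- rn follows the load order
    FROM    tick_data
)
SELECT  [date], [time], price
FROM    numbered
WHERE   rn % @n IN (0, 1)            -- keeps rows 1500, 1501, 3000, 3001, ...
ORDER BY rn;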

Any way to pull this off using a database containing all relevant data and then performing the operation with Excel?

No, Excel won't handle it; it supports at most about a million rows (1,048,576 per sheet in Excel 2007 and later).
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
I'd go with SQL Server 2008. A table with a clustered index, all running on RAIDed (Intel) SSD storage. There are lots of rows to traverse, and SSDs will show their advantage here.

What is the budget?

Waaaaaay overkill. A couple hundred million rows is chump change and any modern RDBMS will be able to handle it on any decent set of SATA drives.

OP:
MySQL would work just fine for your needs. It's GPL-licensed, so take that into consideration for any implications it might have on your code. If you need a more permissive license, you can use PostgreSQL.

MySQL has an ODBC driver for Windows that you can use with Excel, but it has several limitations, both because it's ODBC and because of some quirks in the Office runtime. If you're already familiar with VB, I would suggest using VB.NET, as it has a much more feature-complete driver available. For the most part, you can write queries and stored procedures/functions to do most of your calculations.

More details need to be given about the computations, though... for example, are you using 266k rows at a time because of some limitation of Excel, or is that just the calculation you're running?
 

leeland

Diamond Member
Dec 12, 2000
3,659
0
76
SQL Server 2008 R2 Express would be all you need... 10 GB max database size.

http://www.microsoft.com/express/Database/

As far as loading the data... there are methods to load it directly from code, which I have used when the data is in a correct, consistent format...


http://blog.sqlauthority.com/2008/0...rt-load-comma-delimited-file-into-sql-server/
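Only as a rough sketch (the file path, table name, and column types below are placeholders, and the delimiter needs to match your actual file), the load could look something like:

Code:
-- staging table matching the layout of the source text file
CREATE TABLE tick_staging (
    [date]  DATE          NOT NULL,
    [time]  TIME(0)       NOT NULL,
    price   DECIMAL(10,2) NOT NULL
);

BULK INSERT tick_staging
FROM 'C:\data\ticks.txt'
WITH (
    FIELDTERMINATOR = ',',   -- change to match the actual delimiter
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 1      -- bump to 2 if the file has a header row
);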

The request looks pretty general. If you can set up possibly two indexes after the data has been loaded into the table, you should be able to do something like a derived table (selecting Date and Time and filtering out the duplicates),

then join the derived table back to the original table (on date and time) to get the min price, max price, or first price entered based on row number...

Code:
SELECT  a.date,
        a.time,
        MIN(b.price)  AS min_price,
        MAX(b.price)  AS max_price
FROM
        (SELECT  date,
                 time
         FROM    Table_Name_Here
         GROUP BY date,
                  time
        ) AS a
        INNER JOIN Table_Name_Here AS b
            ON a.date = b.date AND a.time = b.time
GROUP BY a.date,
         a.time

I would suggest adding an IDENTITY column to the table and making that the clustered index, since the other two columns are so non-unique... or, if the 'row_number' is supposed to take that role, use that column for the unique clustered index...

and then try creating an index on the date and time fields as well...
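Something along these lines, purely as a sketch (names are placeholders, not from your actual data):

Code:
CREATE TABLE tick_data (
    row_id  INT IDENTITY(1,1) NOT NULL,      -- load-order counter
    [date]  DATE          NOT NULL,
    [time]  TIME(0)       NOT NULL,
    price   DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_tick_data PRIMARY KEY CLUSTERED (row_id)
);

CREATE NONCLUSTERED INDEX IX_tick_data_date_time
    ON tick_data ([date], [time]);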

The cool part of SQL Server 2008 is the Database Engine Tuning Advisor, which figures out tuning for you... not sure if SQL 2008 Express has that or not.
 

davecason

Member
Jun 4, 2000
100
0
0
If you could tell me how you want to look at the data, I can tell you the easiest way to process it.

Right up front, you could easily bind the text file to an MS Access database (use the Advanced button on the import wizard to skip any fields you don't need) and either pivot it inside MS Access using the form view, or make a "query" (which is like a "view" in other databases) that groups up the data and then pivot it in Excel using that query as a source. Your data volume may make MS Access impractical on an old computer, so you will need to factor that into your decision.
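For example (purely illustrative; the table and field names are hypothetical, and Access's First() just grabs whichever record it reads first in each group, which is usually, but not guaranteed to be, the load order), the grouping query could be as simple as:

Code:
SELECT   TradeDate, TradeTime, First(Price) AS FirstPrice
FROM     tblTicks
GROUP BY TradeDate, TradeTime;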

You could also use a very simple script to group or pivot the data into a smaller text file that could be easily opened in Excel. This would be a better option if you have a lesser computer.

So if you know which fields of data are important to you and how you want to evaluate them, let me know and I will give you a solution.

-Dave
 

PwAg

Senior member
Sep 20, 2000
769
0
71
Thanks for the further input, all. Much appreciated.

I've started to look into BerkeleyDB, MySQL, Infobright, and SQL 2008 R2 Express.

Leeland, thank you very much for the example code. That is headed in the right direction.

Dave or anyone else:
I've uploaded a sample of one day of data in Excel (~500,000 rows), about 7 MB in size. I have shown on the right side the two manipulations I would like to accomplish. Hope it is clear.
http://www.megaupload.com/?d=W24XJEF9

Does the fact that Access has only a 2 GB capacity not interfere with the fact that I will be working with upwards of 5-7 GB at a time? Should I consider using Access as a GUI/wrapper to work with a SQL DB?

I am not DB proficient in the slightest, and will be playing catch-up this weekend. ;)
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
Access's 2 GB capacity is for the .mdb files themselves. I've got users who use Access to run queries on a MySQL database with a billion-plus rows and 200 GB of data in it. I wouldn't really recommend it, though, as you'll end up wanting to use "pass-through queries" often.
 

leeland

Diamond Member
Dec 12, 2000
3,659
0
76
Is it possible to populate your Excel file / text file / CSV with the following fields?


row_number / date / time / amount

vs. the current

date / time / amount
 

leeland

Diamond Member
Dec 12, 2000
3,659
0
76
Based on your spreadsheet, it was hard to determine what kind of filter you wanted to obtain... it didn't seem to match any kind of min or max amount.

What I tested quickly was loading the spreadsheet into a table and then filtering on the unique date and time combination... from there I was able to pull the minimum row (i.e., the first-entered combination of date and time),

and from that point I had the first unique date & time combo with the matching row number...I was able to go back and get the corresponding amount...

the row_id and org_row_id are only for reference...

the row_id just happens to be the IDENTITY column from the final table I was using...

the org_row_id is the original row position inside that Excel spreadsheet at the time of import from Excel into a table... it should correlate to your spreadsheet 1-to-1.
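As a rough sketch of the kind of query that produces the results below (imported_ticks is a placeholder name; the row_id column in the output is just the IDENTITY of the final table these rows were inserted into):

Code:
SELECT  f.org_row_id,
        t.[date],
        t.[time],
        t.amt
FROM   (SELECT  [date], [time], MIN(org_row_id) AS org_row_id   -- first row loaded per second
        FROM    imported_ticks
        GROUP BY [date], [time]) AS f
       INNER JOIN imported_ticks AS t
           ON t.org_row_id = f.org_row_id
ORDER BY t.[date], t.[time];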

Code:
row_id	org_row_id	date		time		amt
---------------------------------------------------------------
15641	1		4/27/2010	08:30:00	1201.75
15642	414		4/27/2010	08:30:01	1201.75
15643	577		4/27/2010	08:30:02	1201.75
15644	607		4/27/2010	08:30:03	1202
15645	624		4/27/2010	08:30:04	1202
15646	675		4/27/2010	08:30:05	1201.75
15647	745		4/27/2010	08:30:06	1202
15648	881		4/27/2010	08:30:07	1201.75
15649	911		4/27/2010	08:30:08	1202
15650	959		4/27/2010	08:30:09	1202
15651	968		4/27/2010	08:30:10	1202
15652	982		4/27/2010	08:30:11	1201.75
15653	1088		4/27/2010	08:30:12	1202
15654	1102		4/27/2010	08:30:13	1202
15655	1112		4/27/2010	08:30:14	1202
15656	1116		4/27/2010	08:30:15	1202
15657	1137		4/27/2010	08:30:16	1202
15658	1172		4/27/2010	08:30:17	1202
15659	1248		4/27/2010	08:30:18	1202
15660	1254		4/27/2010	08:30:19	1202.25
15661	1261		4/27/2010	08:30:20	1202.25
15662	1267		4/27/2010	08:30:21	1202.25
15663	1278		4/27/2010	08:30:22	1202
15664	1283		4/27/2010	08:30:23	1202
15665	1305		4/27/2010	08:30:24	1202.25
15666	1322		4/27/2010	08:30:25	1202.25
15667	1335		4/27/2010	08:30:26	1202.25
15668	1337		4/27/2010	08:30:27	1202.25
15669	1360		4/27/2010	08:30:28	1202.25