So this guy at work has these data files. Several thousand of them, each about 150MB, and each containing about 3000 floating-point parameters. I wrote him a quick script to parse the files and generate statistics (count, min, max, avg, stdev) for each of the parameters. An easy task.
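For the curious, the guts of the script look roughly like this. This is a from-memory sketch, not the actual code, and it assumes the files are plain text with one whitespace-separated name/value pair per line (the real format doesn't matter much for the performance question):

import math
from collections import defaultdict

def file_stats(path):
    # One pass per file: accumulate count, sum, sum of squares,
    # min, and max for each parameter, then derive avg and stdev.
    acc = defaultdict(lambda: [0, 0.0, 0.0, float("inf"), float("-inf")])
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:   # skip blank or malformed lines
                continue
            name, x = fields[0], float(fields[1])
            a = acc[name]
            a[0] += 1
            a[1] += x
            a[2] += x * x
            if x < a[3]: a[3] = x
            if x > a[4]: a[4] = x
    stats = {}
    for name, (n, s, sq, lo, hi) in acc.items():
        avg = s / n
        # population variance from the running sums; clamped at zero
        # because rounding can push it slightly negative
        var = max(sq / n - avg * avg, 0.0)
        stats[name] = (n, lo, hi, avg, math.sqrt(var))
    return stats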
I wrote the first version in Python, and the performance is abysmal. It takes between 2000 and 3000 seconds to process a file. Given the amount of data this guy has (and more arriving soon), this was clearly not acceptable. I spent some time mucking around with the Python, but didn't make any significant improvement.
Next step was rewriting it in C++. It is, as much as possible, a line-by-line translation of the Python. Had to add some stuff of course ... variable declarations, etc. Used an STL map where the Python version used a dict, and so on. And I did some of the FP calcs in long double instead of double for the stddev & avg calcs.
The C++ version takes about 17 seconds to process the same files!
I have to find some time to take another look at the Python code and figure out what's smoking it so badly. But in general, I've found that Python seems to suck wind pretty badly when handling large files & large in-memory data sets. Memory on this machine isn't a problem ... 2GB & dual Xeons.
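When I do get back to it, the first step will probably be profiling rather than guessing. Something along these lines, where 'sample.dat' is just a placeholder for one of the real files:

import cProfile
import pstats

# Profile one representative file; file_stats must be defined or
# imported in the current namespace for cProfile.run to find it.
cProfile.run("file_stats('sample.dat')", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)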
Anybody else have any experience like this?