Searching a Binary File Stored in a Database

clamum

Lifer
Feb 13, 2003
26,252
403
126
Is it possible to search the contents of a binary file that is stored in a database table (using SQL Server 2005)?

A table has an image column which stores uploaded files (user can upload many file types: .zip, .txt, .xls, .docx, ...). So would it be possible to somehow re-construct the file and then search the contents of it?

For example, say this table has a some records in it and stored in the image column are resumes, in Microsoft Word format. Could functionality be setup (using ASP.NET 3.5 and SQL Server 2005) to search the resumes for say an applicant's name or university?

I would think the file would have to be first queried from the database in a SELECT query, then outputted to the filesystem by ASP.NET code as the actual binary file, and then somehow that file's contents searched.

Anyone done something like this, or know if it is possible? Thanks.
 

imported_Dhaval00

Senior member
Jul 23, 2004
573
0
0
Multiple ways to go about doing this:
1) When binary column contains documents with one of the following file extensions, the full-text search service uses a filter to interpret the binary data and extract the textual information: .doc, .txt, .xls, .ppt, .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full text search has to be enabled. I am not sure the side effects of full-text index - i.e., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.

2) You can write a CLR SPROC which searches for keywords after selecting only specific types of BLOBs. At the highest level, this seems the most non scalable solution, but CLR SPROCs/functions have surprised me the past. You'll have to test it to make sure of the speed implications. A couple of advantages I can think of include the ability to use System.IO and System.Text.RegularExpressions.

3) This is how I would do it (there could be better ways), but from a scalability and speed stand-point, this seems to be the best solution [at 7 am]: Depending on your table size, you can create a "meta-data" table for your BLOBs. This table can contain an nvarchar(max) column - feed the table using a one-time TSQL query with existing documents. For newer records, you'll have to take into account additional logic of populating this new table every time something is inserted/updated. What this will allow you is create indexes, use TSQL natively, and enable full-text search on that table... making your queries super-fast.

I am assuming this is not your creation, but just a design caveat - it is usually not a good idea to store files in the DB. You can instead store the file paths while your raw files sit on the file system (this also ties with #3 above... you'll need a meta-data table for faster queries, but overall, this would be the most manageable solution. Again, just putting it out there... ignore the comment if you already took this into consideration or if this is an existing solution.

For all I know, #1 can be efficient if you are also storing the "type of file" in a separate column - for your full-text search query, you can select only specific files (for example, ignore .zip) via JOINs before running the search.