Opinions on a strategy for XML processing

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
Wanted to get everyone's opinions about this problem I'm working on. I have an incoming chunk of data from a provider that consists of many (3000-5000) individual pieces of digital content. Each piece is in the form of a file, and is stored out at the end of a long directory structure, one file per leaf directory, along with a smallish chunk of XML metadata.

I have to search through these metadata files and, based on the properties of their elements, reassemble the pieces of content into a new structure by copying them to new directories and running the metadata through an XSLT.

After puzzling over it for part of today I concluded that what I need to do is assemble all the existing metadata files into a single XML document, which can be queried using XPath to select out the nodes and relationships I want for the new structure.

The alternative would be to implement a search of the XML metadata in the leaf directories every time I need to select out a set of nodes with certain properties.

I like the idea of building one XML file out of all the small metadata files for two reasons. The first is that there are a lot of pieces, and pieces of pieces, in these sets, so the leaf nodes will be visited a lot, maybe 1000 times, to build the whole structure. The second is that I think the logic will be easier to write and maintain in XPath.
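Something like this rough sketch is what I have in mind (untested, and assumes .NET with LINQ to XML; the "combined" root element, the *.xml file pattern, and the sample XPath query are just placeholders):

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;

class CombineMetadata
{
    static void Main(string[] args)
    {
        string providerRoot = args[0];   // top of the provider's directory tree

        // Pull every leaf metadata file into one flat container document.
        XDocument combined = new XDocument(new XElement("combined"));
        foreach (string path in Directory.GetFiles(
                     providerRoot, "*.xml", SearchOption.AllDirectories))
        {
            XElement meta = XElement.Load(path);
            meta.SetAttributeValue("srcPath", path);  // remember where it came from
            combined.Root.Add(meta);
        }

        // Then the whole set can be queried in one place with XPath.
        foreach (XElement hit in combined.XPathSelectElements(
                     "/combined/*[@id='123']"))
        {
            Console.WriteLine(hit.Attribute("srcPath").Value);
        }
    }
}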

Any thoughts?
 

imported_Dhaval00

Senior member
Jul 23, 2004
Originally posted by: Markbnj
I have to search through these metadata files and, based on the properties of their elements, reassemble the pieces of content into a new structure by copying them to new directories and running the metadata through an XSLT.

Worked on similar issues over multiple projects (of course, M$-specific). There are two strategies:

1) Store the XML blobs in SQL Server and then use XQuery/XPath inside SQL Server to search for nodes. This is quite fast, but you'll have to maintain indexes over the XML column. Also, from a maintainability standpoint, the XQueries can get kind of annoying. From a longevity standpoint, I don't think this is super-scalable (compared to the next strategy). I don't use this strategy anymore.

2) If the metadata is small enough, then parse the elements and store them in a relational table upon receipt. This is almost exclusively what I do now because my experience has taught me that search-related queries perform best inside relational DB engines. Plus your structure can be exploited beautifully, I think - you can store the parsed metadata in a relational structure with extra columns that point to the data file and the XSLT. When you need to put the metadata back together, you can use XQuery inline inside SQL Server.
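In rough C# terms, the shredding step in strategy 2 is just something like this (only a sketch - the table name, the property/value columns, and the connection string are all made up):

using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

class ShredMetadata
{
    static void LoadFile(string xmlPath, string connectionString)
    {
        // One property/value row per attribute, plus a pointer back to the file.
        DataTable rows = new DataTable();
        rows.Columns.Add("SourcePath", typeof(string));
        rows.Columns.Add("Element", typeof(string));
        rows.Columns.Add("Property", typeof(string));
        rows.Columns.Add("Value", typeof(string));

        XElement meta = XElement.Load(xmlPath);
        foreach (XAttribute attr in meta.Attributes())
        {
            rows.Rows.Add(xmlPath, meta.Name.LocalName, attr.Name.LocalName, attr.Value);
        }

        using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.ContentMetadata";
            bulk.WriteToServer(rows);
        }
    }
}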

There might be better solutions, but I have been served well by the second strategy. Of course, both the strategies are moot if you can't use a DB engine.
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
The records are going to end up in the database in the third stage of the pipeline. The piece I described in the OP is the second stage. Essentially we have the same types of content coming in from numerous providers, all with slightly different metafile schemas. So the second stage is a transformation from their input formats to our common format, which is then processed in the third stage and fed into the database, with the content stored on a SAN. So assume I don't want to go to the database to perform the parsing in this intermediate stage. Most of our providers send one large XML file that we process through a transform. It's this one provider that happens to send a deeply nested dir structure with individual metafiles in each leaf folder.
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
Thinking this through, I infer that you have a tree like this:

dir/path/1:
<a/>
<b/>
<c/>
<d/>

dir/path/2:
<a/>
<b/>
<c/>
<d/>

And you want to transform that to roughly:
<a>
<dir/path/1 />
<dir/path/2 />
</a>

<b>
<dir/path/1 />
<dir/path/2 />
</b>

<c>
<dir/path/1 />
<dir/path/2 />
</c>

<d>
<dir/path/1 />
<dir/path/2 />
</d>

Initially, it seems like there are two plausible ways of doing this. One of the trees must be accessed randomly, and the other would naturally be accessed sequentially. Since there are so many files, it makes sense not to access them all more than once. So you can either create that big XML document (or some other searchable data structure) from the original tree, do random queries against it, and build the result tree sequentially; or, if you know all the values of <a> through <d> that you want to fill in (which I infer you do), you can read the input data sequentially and add the results to the output tree in the correct fields as they come in.

This second approach is probably faster, but isn't likely to scale well if you can't do it in memory. If you have only a few output fields, I imagine you could open one filehandle for each and combine them later. But this doesn't scale well in the number of output fields.

A third idea is to do the equivalent of making a searchable data structure on the output rather than the input. You could just read everything linearly, tag each entry with a prefix to make it sortable (or is that suffix each entry with the path?), output everything to one file, and turn it into an external sorting problem. Then just use Mergesort on the results and wrap the grouped tags into XML.
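For that third idea, I'm picturing roughly this (completely untested; it assumes one element per leaf file and just sorts in memory to show the shape, where a real version would spill to disk and do the external mergesort):

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class SortThenGroup
{
    static void Main(string[] args)
    {
        string root = args[0];

        // Tag each leaf file with a sortable key (its element name) and its path.
        var tagged = Directory.GetFiles(root, "*.xml", SearchOption.AllDirectories)
            .Select(path => new
            {
                Key = XElement.Load(path).Name.LocalName,  // e.g. a, b, c, d
                Path = path
            })
            .OrderBy(entry => entry.Key, StringComparer.Ordinal);

        // Sorted input means each key's entries are contiguous; wrap each run in XML.
        XElement output = new XElement("result");
        foreach (var group in tagged.GroupBy(entry => entry.Key))
        {
            output.Add(new XElement(group.Key,
                group.Select(entry => new XElement("file",
                    new XAttribute("path", entry.Path)))));
        }
        Console.WriteLine(output);
    }
}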

It's late, so I hope these ideas aren't complete nonsense. :Q
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
Not nonsense at all. The transformation is something like that, only somewhat more involved. Imagine that the ultimate digital entities are books, and that some books have a hierarchy like work/volume/part/chapter. You get books from several publishers, and all but one send an XML file that looks like:

<work id="123">
  <volume num="1">
    <part num="1">
      <chapter num="1">
      </chapter>
    </part>
  </volume>
</work>

The one publisher prefers to send a deeply nested directory structure based on the decomposition of a long numeric ID into tokens, and that ID is completely orthogonal to the problem of grouping entities together. Each leaf directory contains one of:

<work id="123"/>

or

<volume id="345" workid="123" num="1" />

or

<part id="678" volumeid="345" num="1" />

or

<chapter id="90a" partid="678" num="1" />

The type of entity contained in each leaf directory is not constrained; they occur more or less randomly. Although a manifest is provided with the complete path to each metadata file, the type of the entity cannot be determined from the filename.

The problem is to reassemble the one publisher's whacko format into something resembling what the others send, i.e. a properly ordered and complete XML document.

So the algorithm will be something like: find the first work, select all the volumes in the work, all the parts in each volume, and all the chapters in each part, then write the completed work tree to the output XML file. It's all that searching for the pieces that I hope to make easier by first building a complete document. It will just be a flat container with all the metafiles from the leaf nodes in it, but at that point I can at least use LINQ and XPath to get at all the components I need in the right order.
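Roughly what I'm picturing for that step, just as a sketch (the attribute names come from the samples above; the flat container, the numeric num attribute on every volume/part/chapter, and the output file name are assumptions):

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class ReassembleWorks
{
    static void Main(string[] args)
    {
        string providerRoot = args[0];

        // Flat container holding every leaf metadata element.
        XElement flat = new XElement("flat",
            Directory.GetFiles(providerRoot, "*.xml", SearchOption.AllDirectories)
                     .Select(path => XElement.Load(path)));

        // Rebuild work -> volume -> part -> chapter from the id/parent-id attributes,
        // ordering siblings by their num attribute.
        XElement output = new XElement("works",
            from work in flat.Elements("work")
            select new XElement("work", work.Attributes(),
                from volume in flat.Elements("volume")
                where (string)volume.Attribute("workid") == (string)work.Attribute("id")
                orderby (int)volume.Attribute("num")
                select new XElement("volume", volume.Attributes(),
                    from part in flat.Elements("part")
                    where (string)part.Attribute("volumeid") == (string)volume.Attribute("id")
                    orderby (int)part.Attribute("num")
                    select new XElement("part", part.Attributes(),
                        from chapter in flat.Elements("chapter")
                        where (string)chapter.Attribute("partid") == (string)part.Attribute("id")
                        orderby (int)chapter.Attribute("num")
                        select new XElement("chapter", chapter.Attributes())))));

        output.Save("combined_metadata.xml");
        Console.WriteLine("Wrote {0} works", output.Elements("work").Count());
    }
}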
 

imported_Dhaval00

Senior member
Jul 23, 2004
How much of a performance hit can you take, Mark? I mean, does this have to be super-scalable? How big do you think these documents are going to be?
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
Originally posted by: Dhaval00
How much of a performance hit can you take, Mark? I mean, does this have to be super-scalable? How big do you think these documents are going to be?

I don't think performance is going to be the big issue. Initially we will load all of the provider's inventory in a backfill operation, followed by regular updates as new content is added, old content is removed, or data associated with content changes. I don't expect more than a few thousand records per day from each provider.
 

imported_Dhaval00

Senior member
Jul 23, 2004
In that case, I think you'll be fine just using the basics - XPath and LINQ. Initially, I thought performance was an issue for you. I don't know if you've observed this, but I have been amazed by how a DataSet outperforms XPath queries in .NET - so see if you can leverage that functionality. Also, LINQ seems faster than the .NET 2.0-based XPath APIs - I haven't tested this extensively, but I have done a few test runs on files in the 5-10MB range, and LINQ, in general, seems faster (the difference is in milliseconds).

To over-engineer this: you could create a kind of staging table in SQL Server and read in the XML document upon arrival. The XML column would need a schema, and possibly a PROPERTY and a VALUE index - then querying will be super-fast. Once you're done with the load, you can truncate the table every night. But as I said, this is probably overkill - you'll have to maintain the index, and the index size will probably be three times the data size. This is another strategy that I have used over time and have been fairly happy with. I hate the index-size issue, but it doesn't matter for me because I truncate the table from within my SSIS package at the end of the ETL.

In any case, I think your requirements sound straightforward, so you should be OK exploiting the built-in data structures and types.
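If you do try the DataSet route, it's only a handful of lines (assuming your combined file looks like the samples earlier in the thread; the table and column names come from whatever schema ReadXml infers, so adjust as needed):

using System;
using System.Data;

class DataSetQuery
{
    static void Main(string[] args)
    {
        // Infer a schema from the flat combined XML file passed on the command line.
        DataSet ds = new DataSet();
        ds.ReadXml(args[0]);

        // All volumes that belong to work 123.
        foreach (DataRow row in ds.Tables["volume"].Select("workid = '123'"))
        {
            Console.WriteLine("volume {0}, num {1}", row["id"], row["num"]);
        }
    }
}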
 

Markbnj

Elite Member, Moderator Emeritus
Moderator
Sep 16, 2005
www.markbetz.net
Interesting that almost all of these same ideas came up on our stand-up call this morning. My colleagues are a little nervous about the XPath approach. They've tried it in the past with very large XML files and had memory problems, but I think those were an order of magnitude larger. As best I can tell, the ones I will get won't grow larger than about 25-30 megs.

The idea about round-tripping it to SQL Server came up as well. That may still be a valid approach. I believe I could use a single staging table to hold all three primary entity types. But I think we're going to try a straight-up directory scan first.
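And if memory ever does become a problem, I figure the flat file can be streamed to disk with an XmlWriter instead of being built in memory, roughly like this (untested; the output name and root element are placeholders):

using System.IO;
using System.Xml;
using System.Xml.Linq;

class StreamCombined
{
    static void Main(string[] args)
    {
        string providerRoot = args[0];

        using (XmlWriter writer = XmlWriter.Create("combined_metadata.xml",
                   new XmlWriterSettings { Indent = true }))
        {
            writer.WriteStartElement("flat");
            foreach (string path in Directory.GetFiles(
                         providerRoot, "*.xml", SearchOption.AllDirectories))
            {
                // Each leaf file is small, so loading it individually is cheap;
                // only one is ever held in memory at a time.
                XElement.Load(path).WriteTo(writer);
            }
            writer.WriteEndElement();
        }
    }
}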