Ideas for aging data out of Elasticsearch queries

Markbnj

Elite Member / Moderator Emeritus
Looking for some feedback on an idea, or potential alternatives. I'm working on a system that will be acquiring records from a number of outside sources and indexing them into Elasticsearch. The overall strategy is that we'll index all sources repeatedly. New stuff will appear in those sources, and sometimes old stuff will go away. The process acquiring the data won't be able to detect when something is no longer there; it will just get what is there and send it down the pipe.

Our requirement is that when things disappear from the outside sources they also disappear from our search index as soon as possible. One way to do that might be to index into a new index and then switch indexes when the process is complete.

But we got to thinking about alternatives to that strategy, and came up with the idea of assigning a "generation number" to records each time we run the process. The generation number would start at 0 and increment. Our search queries would filter out records with a generation number < max(generation_number) - 1. We could either leave the old records in there or have another process reap them periodically (perhaps moving them to an archival index).
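To make that concrete, the filter we have in mind would look something like the sketch below. This is just a sketch using the Python client; the index and field names are invented, and it assumes the acquisition process records the latest generation number somewhere cheap to look up.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

current_gen = 42  # latest generation assigned by the acquisition process

# Every search carries a range filter that drops records more than one
# generation behind, i.e. generation_number < max(generation_number) - 1.
resp = es.search(
    index="records",
    body={
        "query": {
            "bool": {
                "must": {"match": {"title": "some search terms"}},
                "filter": {
                    "range": {"generation_number": {"gte": current_gen - 1}}
                },
            }
        }
    },
)
```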

Any thoughts on the idea?
 

KLin

Lifer
Were you planning on assigning a generation_number per source as new records come in? Otherwise you could end up filtering out records from one source that still need to be shown, just because other sources have moved on to a newer generation_number and deprecated their records.

Just a thought.
 

Markbnj

Elite Member / Moderator Emeritus
KLin said:
Were you planning on assigning a generation_number per source as new records come in? Otherwise you could end up filtering out records from one source that still need to be shown, just because other sources have moved on to a newer generation_number and deprecated their records.

Just a thought.

Actually we did discuss that. Haven't made up our minds yet. It complicates the filtering on the search side, and I haven't convinced myself it's necessary yet. We'll visit all of the sources in our set on each "cycle" of the acquisition process, and gather everything that is there. Everything gathered on that cycle would get the same generation number, so theoretically every record that remains present in the source data should get its generation incremented on every pass. Records that are removed from the source data would sink to the bottom of the results (when ordered by generation number desc).
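For what it's worth, the per-source variant would mean carrying one clause per source in every filter, which is the search-side complication I mean. Roughly like this (purely a sketch, with invented field names and a hypothetical per-source generation table):

```python
# Hypothetical per-source generation tracking: one range clause per
# source, OR'd together with a bool/should.
latest = {"source_a": 42, "source_b": 40}  # per-source max generations

per_source = [
    {
        "bool": {
            "must": [
                {"term": {"source": src}},
                {"range": {"generation_number": {"gte": gen - 1}}},
            ]
        }
    }
    for src, gen in latest.items()
]

query = {"bool": {"should": per_source, "minimum_should_match": 1}}
```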
 

Cerb

Elite Member
What about separating the generations into their own working sets, indices included, and then merging results? Is that feasible? I.e., each batch you apply a generation to would be its own unique set of data, with its own unique indices, so when you threw it away, everything would go away in one step. The counter to that is that you would need an added merge stage in each set of queries on the data, and possibly added copying, if data changed generations based on relevance over time. Duck soup with an SQL DBMS, but I get why that would be an equally arduous can of worms for something like this.

Likewise, if something like that works, you could have the data prioritized within such groups, so it would gradually go into either deprecation buckets or re-use buckets.
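If ES can search across several indices in one request — and I believe it can — the merge stage might be cheap. Something like this rough sketch (Python client, invented index names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Search the two newest generation indices together; Elasticsearch
# merges and scores results across indices in a single request.
resp = es.search(
    index="records_gen_41,records_gen_42",
    body={"query": {"match_all": {}}},
)

# Retiring a generation is then a single index delete.
es.indices.delete(index="records_gen_40")
```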
 

beginner99

Diamond Member
Why not just use a flag for active / deleted? If required, maybe also add a date field recording when a record was inactivated. It could also come in handy down the line sometime when doing analytics.

Or does the generation number actually have a further meaning?
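In sketch form with the Python client (field names and the document id are made up), the search side and the flag update would be something like:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Search side: only ever return active records.
resp = es.search(
    index="records",
    body={"query": {"bool": {"filter": {"term": {"active": True}}}}},
)

# Whatever process detects a deletion flips the flag and stamps the date.
es.update(
    index="records",
    id="some-record-id",
    body={
        "doc": {
            "active": False,
            "inactivated_at": datetime.now(timezone.utc).isoformat(),
        }
    },
)
```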
 

Markbnj

Elite Member / Moderator Emeritus
beginner99 said:
Why not just use a flag for active / deleted? If required, maybe also add a date field recording when a record was inactivated. It could also come in handy down the line sometime when doing analytics.

Or does the generation number actually have a further meaning?

How and when do we set the deleted flag? We don't know that something is deleted unless we compare the state we got on the last cycle with the state from previous cycles.

Right now I am looking into whether there is a way to build the newly acquired data into a new index, and then switch indexes at the end of the process.
 

Markbnj

Elite Member / Moderator Emeritus
Cerb said:
What about separating the generations into their own working sets, indices included, and then merging results? Is that feasible? I.e., each batch you apply a generation to would be its own unique set of data, with its own unique indices, so when you threw it away, everything would go away in one step. The counter to that is that you would need an added merge stage in each set of queries on the data, and possibly added copying, if data changed generations based on relevance over time. Duck soup with an SQL DBMS, but I get why that would be an equally arduous can of worms for something like this.

Likewise, if something like that works, you could have the data prioritized within such groups, so it would gradually go into either deprecation buckets or re-use buckets.

You got me thinking here. I don't need to merge results, since each generation wholly replaces the previous one. I just need some way to filter out the old generations. The problem with our original idea, aside from it being fairly complex on its face, is that there seems to be no efficient way to get the min and max values of a field. You have to actually open an iterator on a sorted query result, grab the first value, then iterate to the end and grab the last. We can't do that for every use of the filter.

So I went back spelunking in the ES/Lucene docs and on Stack Overflow, and I think we've come up with a better way, one that is similar to what you suggested.

ES supports index aliases, so that you can assign an alias name to an index and then use the alias in all search queries. You could, for example, search against "index" but have that be an alias to "index_20140301" or whatever. You can also execute alias changes atomically, so you could build "index_20140302" and then when you're done remove the alias "index" from the earlier index and add it to the new one in an atomic operation. You can then drop the older index.
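In sketch form with the Python client (the index names here are just the examples from above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

old_index = "index_20140301"
new_index = "index_20140302"

# After the new index is fully built, move the "index" alias over in a
# single atomic update_aliases call, so searches never see a mix of
# old and new data.
es.indices.update_aliases(
    body={
        "actions": [
            {"remove": {"index": old_index, "alias": "index"}},
            {"add": {"index": new_index, "alias": "index"}},
        ]
    }
)

# With the alias moved, the old index can be dropped.
es.indices.delete(index=old_index)
```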

So this seems like a pretty good solution for what we want to do.