Looking for some feedback on an idea, or potential alternatives. I'm working on a system that will acquire records from a number of outside sources and index them into Elasticsearch. The overall strategy is that we'll re-index all sources repeatedly. New records will appear in those sources, and sometimes old ones will go away. The process acquiring the data can't detect when something is missing; it just gets whatever is there and sends it down the pipe.
Our requirement is that when things disappear from the outside sources they also disappear from our search index as soon as possible. One way to do that might be to index into a new index and then switch indexes when the process is complete.
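If you go the index-swap route, the switch can be made atomic with an index alias: clients always query the alias, and a single `_aliases` request repoints it from the old index to the new one once the new build is complete. A sketch of the request (the index and alias names here are illustrative):

```json
POST /_aliases
{
  "actions": [
    { "remove": { "index": "records_v1", "alias": "records" } },
    { "add":    { "index": "records_v2", "alias": "records" } }
  ]
}
```

Because both actions execute in one request, there's no window where searches against the alias see an empty or partially built index.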
But we got to thinking about alternatives to that strategy, and came up with the idea of assigning a "generation number" to records each time we run the process. The generation number would start at 0 and increment. Our search queries would filter out records with a generation number < max(generation_number) - 1. We could either leave the old records in there or have another process reap them periodically (perhaps moving them to an archival index).
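The generation scheme can be sketched in plain Python, using dicts to stand in for indexed documents (all names here are illustrative, not a real API). A record that stops appearing in the source keeps its last generation number, so once the current generation advances past it by more than one, the query filter hides it and the reaper can pick it up:

```python
def visible(records, current_generation):
    """Records a search would return: generation >= current - 1.

    Keeping the previous generation visible means an in-progress
    re-index doesn't hide records that simply haven't been
    re-acquired yet.
    """
    return [r for r in records if r["generation"] >= current_generation - 1]


def reap(records, current_generation):
    """Records older than the previous generation, eligible for
    deletion or for moving to an archival index."""
    return [r for r in records if r["generation"] < current_generation - 1]


records = [
    {"id": "a", "generation": 0},  # vanished from the source after gen 0
    {"id": "b", "generation": 2},
    {"id": "c", "generation": 2},
]

print([r["id"] for r in visible(records, 2)])  # ['b', 'c']
print([r["id"] for r in reap(records, 2)])     # ['a']
```

In Elasticsearch terms, the visibility check would presumably be a `range` filter on the generation field added to every query, and the reaper could be a periodic delete-by-query (or reindex-then-delete) over the stale range.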
Any thoughts on the idea?