Algorithms for pairing

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
I am working with 30 years of data, one file per year. Each file contains a list of variables, each with a corresponding ID. The problem is that the same variable name can have a different ID in different years. I am getting around this by matching "by name".

Each of these variables has perhaps 10 to 100 different "values" with the same problem (this is gov't data): the name or description is the same every year, but the ID can differ. Here I can't simply match by name, since I have to report a change in ID over time (e.g., to help a researcher). So I end up with stuff like this:

For a value with the name "sleet and hail":
year::ID
1975::5
1976::5
1977::6
1978::5

Ideally, I would collate this data into a simplified version using year ranges, like so:
1975-1976::5
1977::6
1978::5

What is this called, and is there an algorithm made for doing this? I do have one that works for me, although it is in VB and pretty messy.
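For reference, the collation described above amounts to grouping consecutive runs, much like run-length encoding applied to year-sorted data. Here is a minimal sketch in C, assuming the pairs are already sorted by ascending year. The names (year_id, year_range, collapse_ranges, print_ranges) are made up for illustration and are not taken from the VB version.

```c
#include <stddef.h>
#include <stdio.h>

/* One observation: the ID a value carried in a given year (hypothetical type). */
struct year_id {
    int year;
    int id;
};

/* A run of consecutive years that share one ID (hypothetical type). */
struct year_range {
    int first_year;
    int last_year;
    int id;
};

/*
 * Collapse year-sorted observations into ranges.
 * Assumes `in` is sorted by ascending year and `out` has room for n entries.
 * A run is extended only when the ID matches AND the year is exactly one
 * past the run's last year, so a gap in the data starts a new range.
 * Returns the number of ranges written to `out`.
 */
size_t collapse_ranges(const struct year_id *in, size_t n,
                       struct year_range *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (count > 0 &&
            out[count - 1].id == in[i].id &&
            out[count - 1].last_year + 1 == in[i].year) {
            /* Same ID, next year: extend the current run. */
            out[count - 1].last_year = in[i].year;
        } else {
            /* ID changed or years are not adjacent: start a new run. */
            out[count].first_year = in[i].year;
            out[count].last_year = in[i].year;
            out[count].id = in[i].id;
            count++;
        }
    }
    return count;
}

/* Print ranges in the "year::ID" / "first-last::ID" notation used above. */
void print_ranges(const struct year_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].first_year == r[i].last_year)
            printf("%d::%d\n", r[i].first_year, r[i].id);
        else
            printf("%d-%d::%d\n", r[i].first_year, r[i].last_year, r[i].id);
    }
}
```

Fed the four "sleet and hail" rows, collapse_ranges should produce three ranges: 1975-1976 with ID 5, 1977 with ID 6, and 1978 with ID 5. It is a single pass over the data, so the sort is the only step that costs more than linear time.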
 

tfinch2

Lifer
Feb 3, 2004
22,114
1
0
Make an array of Queue objects, 1 Queue for each ID. Then when you pop them off after you have sorted all the data into its Queue, you can have a condition for the ranges.
 

xtknight

Since nobody knew what this was called, I decided to create a working prototype in C (and verified that there were no memory leaks with valgrind). With VB I was using strings and separating my elements with colons within the string. With C I used a 3D array. (Ahh the beauty of arbitrary arrays :))

Does this look like the most optimized way to do this? If so, well enjoy the code if you ever have to do this.

tfinch2: I'm not sure what you mean by "Queue objects". Is this a C++ class or do you just mean create an array like I did in the code below? (Not sure how this could be done any quicker but I'd appreciate criticism. I'm not exactly the most experienced C programmer out there.)

I will also be doing a qsort to sort the years passed in. This is another place where I run into a conundrum, as I have to sort the years and then rebuild the whole array. Is there some class that lets me sort values but keep the pairs together? (like sorting a table in Word or whatever) I haven't converted them into ranges yet, either.
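One common way to keep the pairs together in plain C is to store each year and its value in a single struct and hand qsort an array of those structs, so each pair moves as a unit. A minimal sketch follows. The names year_value, compare_by_year, and sort_by_year are invented for illustration, and it assumes the pairs have already been parsed into ints.

```c
#include <stdlib.h>

/* A year and the value that belongs to it, kept together (hypothetical type). */
struct year_value {
    int year;
    int value;
};

/*
 * Comparator for qsort: orders by ascending year.
 * The (a > b) - (a < b) form avoids the overflow that `a - b` can cause.
 */
static int compare_by_year(const void *a, const void *b)
{
    const struct year_value *pa = a;
    const struct year_value *pb = b;
    return (pa->year > pb->year) - (pa->year < pb->year);
}

/* Sort the pairs in place; each year keeps its own value. */
void sort_by_year(struct year_value *pairs, size_t n)
{
    qsort(pairs, n, sizeof pairs[0], compare_by_year);
}
```

With the sample input above (1977 3, 1976 3, 1978 6, 1975 6), the array should come back as 1975 6, 1976 3, 1977 3, 1978 6. Note that qsort is not guaranteed to be stable, which does not matter here as long as each year appears only once.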

Output so far:

./a.out 1977 3 1976 3 1978 6 1975 6
4 years detected with corresponding values.
correspondingYears[0][0]: 1977 (3)
correspondingYears[0][1]: 1976 (3)
correspondingYears[1][0]: 1978 (6)
correspondingYears[1][1]: 1975 (6)

I either want to go for speed or simplicity. If there's a simpler (human wise) way to do this, I'd like to know it. If there's a faster (CPU wise) way, I'd like to know it. Actually, I'd like to know both. I'm probably already gaining a lot of speed by using C or C++ over VB anyway. I plan to call a C or C++ library from VB. This is just a prototype/test EXE.