Do something similar to SQL 'Group By' in java?

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
Hey guys,

I need to write a method in Java, that does something similar to SQL's group by.

Here's an example input file:
1, 2, 3, 4
1, 2, 5, 7
1, 2, 5, 8
1, 2, 5, 9
1, 2, 3, 4

And I want to group by column 3,
so my output file should be:
1, 2, 3, 4
1, 2, 3, 4
1, 2, 5, 7
1, 2, 5, 8
1, 2, 5, 9

Whats the best way to do this? Just sort based on values in that column? What if I want to group by more than 1 column?

Thanks for the help in advance!!
 

kamper

Diamond Member
Mar 18, 2003
5,513
0
0
How is that output in any way related to a 'group by' action? As far as I can tell, all you did was shuffle rows around.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
If your input will always be the same structure, I would create a Class that mimics it. In this case, 4 integers and way to store the sort information; an integer that records the column to sort perhaps. In that class overload the compareTo(...) function. For each row in the input, create a new instance of your class and add them to a standard array. Then just use the built in java array sort function, something like java.util.Arrays.Sort(YourClass[] items), then just spit them back out to the output.
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
Originally posted by: kamper
How is that output in any way related to a 'group by' action? As far as I can tell, all you did was shuffle rows around.

It's not.... I just assumed he meant a sort instead.
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
Actually I do mean group by, I will eventually perform some aggregate function on the groups -

For example, when I group by column 3 in the above example, lets say I calculate the average value in column 4 for each group.
Then I would get,

1,2,3,[4]
1,2,5,[8]
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
how would you do it for more than 2 group by's?

For example I say group by column 2 and 3, and for those groups calculate the SUM of values in column 4.

Input:
1, 2, 3, 4
2, 2, 3, 5
3, 2, 4, 7
4, 3, 4, 8

Output:
4+5 = 9 [from (2,3)]
7 = 7 [from (2,4)]
8 = 8 [from (3,4)]
5
 

statik213

Golden Member
Oct 31, 2004
1,654
0
0
The easy way would be to sort the results. Lets assume that each row is an object FourTuple with (int) properties a,b,c and d.

Object[] data; // your array
Arrays.sort(data, new Compartor() {
public int compare(Object o1, Object o2){
FourTuple first = (FourTuple) o1;
FourTuple second = (FourTuple) o2;

if (first.a > second.a) {
return 1;
}
if (first.a < second.a) {
return -1;
}
// this mean the first column is tied in thee two rows
if (first.b > second.b) {
return 1;
}
if (first.b < second.b) {
return -1;
}
if (first.c > second.c) {
return 1;
}
if (first.c < second.c) {
return -1;
}
if (first.d > second.d) {
return 1;
}
if (first.d < second.d) {
return -1;
}
// at this point all a.b,c,d are equal... so return "equal"
return 0;
}
public boolean equals(Object that) {
return this.a == that.a && this.b == that.b && this.c == that.c && this.d == that.d;
}
});

Aggregating the results is much easier now since the list is sorted.
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
Thanks, I used the same idea to write a generic group by for n number of values in each row.

I used break's in the inner loops to get the same effect as you got though those if else's above.

Thanks a million, really appreciate it.
 

kamper

Diamond Member
Mar 18, 2003
5,513
0
0
Originally posted by: wantedSpidy
Actually I do mean group by, I will eventually perform some aggregate function on the groups -

For example, when I group by column 3 in the above example, lets say I calculate the average value in column 4 for each group.
Then I would get,

1,2,3,[4]
1,2,5,[8]
You need to apply an aggregate to every column that you are not grouping by. Otherwise what would you have done if the first two columns hadn't all had identical values?

The easiest way I can see to do a group by on two columns is to treat the two columns together as a key so let's say you've got the following input:
1, 2, 3, 4
1, 2, 5, 7
1, 3, 5, 8
1, 3, 5, 10
1, 2, 5, 9
1, 2, 3, 4

and you want to group by columns 2 and 3 and display the avg of columns 1 and 4. Then you'd get:
1, 2, 3, (4+4)/2=4
1, 2, 5, (7+9)/2=8
1, 3, 5, (8+10)/2=9

If you wanted to first group by one column and then by a second column in the result, you'd have to apply new aggregates to all the other columns. You could think of it as a new query on the first result. At least that's what I think, I haven't though too in depth about it.

Coincidentally, I actually did implement a rough sql subset the other week at work for mining some log files which had tabular data in them. The interesting thing about group by is that, unlike other operators like 'where' which take a table and return a table, it takes a table and returns a dictionary of tables, keyed on the value of the column that you're grouping by. That's why you have to apply aggregates to every other column to get a single table back. That was a fun (and easier than expected) experience.
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
"and you want to group by columns 2 and 3 and display the avg of columns 1 and 4. Then you'd get:
1, 2, 3, (4+4)/2=4
1, 2, 5, (7+9)/2=8
1, 3, 5, (8+10)/2=9 "
That is what I wanted to do. And was successful last night :D

"The easiest way I can see to do a group by on two columns is to treat the two columns together as a key"
yup, I ended up using the same idea.

Thanks
 

kamper

Diamond Member
Mar 18, 2003
5,513
0
0
Cool. Out of curiosity, how are you representing your tables in memory? I used a dictionary (Map) per row with column names as keys and cell values obviously as values and a table was just a list of such dictionaries. It was convenient to parse into that format but if I were to do it over I'd probably abstract the idea of a table and have a tuple (array) of column titles as well as a list of tuples, one for each row and rely on the order to get the correct values.
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
kill me but I'm using ArrayLists as my temp solution.

I'll revamp the project during christmas break when I have more time on my hands :D
 

statik213

Golden Member
Oct 31, 2004
1,654
0
0
Originally posted by: wantedSpidy
Thanks, I used the same idea to write a generic group by for n number of values in each row.

I used break's in the inner loops to get the same effect as you got though those if else's above.

Thanks a million, really appreciate it.

np... btw how much data are you processing? If your operations become more complex, you might be better off using sqlite and creating in-memory databases. There's a JDBC wrapper for sqlite out there, i've never used it though.
 

wantedSpidy

Senior member
Nov 16, 2006
557
0
0
right now, I just have some static files with about maybe 1000 tuples.

Need to make it more dynamic, do some polling maybe?

I'll make another thread, to get all your suggestions :D