MySQL NOT IN() Range List failures

EricMartello

Senior member
Apr 17, 2003
910
0
0
Does anyone know of a way to exclude rows from a query if they are already in the table? I am using something like this:

SELECT rowID FROM table1 WHERE rowID NOT IN(SELECT row2ID FROM table2)

The amount of IDs returned in the subselect are in the millions, and as far as I can see using IN() or NOT IN() is definitely a recipe for failure.

Apparently, it has been around for a while:
http://bugs.mysql.com/bug.php?id=15872

The thing is, I need to exclude the rows before adding new ones do the DB so is there a way to do it in a single query? Alternately, is there a way to INSERT records while having insert ignore IDs that are already there WITHOUT throwing an error? I dunno...I gotta think of something and it needs to work quickly.

My second choice would be to remove the NOT IN subselect, and do the subselect first as a standalone query. At that point I would put the IDs returned into an array and just that array to exclude the new entries based on their IDs already being in the array. The drawback here is that I would need to loop through every row one-by-one to test if they are already in the array. Not an optimal solution...any ideas are welcome. :)
 

KLin

Lifer
Feb 29, 2000
30,213
562
126
Is the row2id a primary key or foreign key in table2?

EDIT: If the answer is yes, then my 2nd question would be is it a many to one relationship(there are multiple records with the same rowID in table2 related back to one rowID in table1)? If yes, then try a SELECT DISTINCT within the subquery.
 

EricMartello

Senior member
Apr 17, 2003
910
0
0
Originally posted by: Markbnj
According to the change log that big has been fixed. Which version of the server are you using?

It is fixed in 5.1, I am using 5.0 at this time and for the moment upgrading is not an option. Also funny that they didn't bother to fix it until 5.1 after knowing about it since 2006. hehe

Originally posted by: KLin
Is the row2id a primary key or foreign key in table2?

EDIT: If the answer is yes, then my 2nd question would be is it a many to one relationship(there are multiple records with the same rowID in table2 related back to one rowID in table1)? If yes, then try a SELECT DISTINCT within the subquery.

It is a foreign key in table 2.

TABLE1
tableID1 | ....

TABLE2
tableID2 | TABLE1.tableID1

 

KLin

Lifer
Feb 29, 2000
30,213
562
126
SELECT tableID1 FROM table1 WHERE tableID1 NOT IN(SELECT DISTINCT tableID1 FROM table2)

Try that?
 

EricMartello

Senior member
Apr 17, 2003
910
0
0
Nope, no dice. It still takes nearly 80 seconds for the query to complete as it is now. Select Distinct wouldn't help much in this situation because tableID1 consists largely of unique entries; it's not like the same ID is repeated thousands of times. Let me elaborate more on my query because I've been messing with it and I found that "NOT IN()" is not the bottleneck.

There are 3 tables being queried. T1, T2, T3 and this is how I have it:

SELECT id1, id2, id3, id4 FROM T1
WHERE id1 NOT IN (SELECT id1 FROM T2 WHERE id4 = '1')
AND id3 IN (SELECT id3 FROM T3 WHERE id4 = '1')
AND id4 = '0'

The above query takes about 65-80 seconds to execute on the primary table which contains 1.1 million rows. I was trying to use a LEFT JOIN to replace the second subselect but it won't work right. If I use LEFT JOIN it just pulls all the rows instead of limiting them as I am trying to do.

Here is what I get if I do an EXPLAIN on this:

id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY T1 ALL NULL NULL NULL NULL 1164502 Using where
3 DEPENDENT SUBQUERY T3 ALL NULL NULL NULL NULL 2 Using where
2 DEPENDENT SUBQUERY T2 index_subquery idx1,idx2 idx1 4 func 1 Using where

Notice that the T2 subselect is using the indexes I set up for it. T3 does not have indexes because there are so few rows in that table that it is not necessary. T1 is not using any indexes even though both columns in the WHERE clause are in fact indexed, so it is doing a full table scan of all 1.1 million rows. :/ I think MySQL needs to rename their query optimizer to something a little more accurate. hah
 

GeekDrew

Diamond Member
Jun 7, 2000
9,099
19
81
Originally posted by: EricMartello
I was trying to use a LEFT JOIN to replace the second subselect but it won't work right. If I use LEFT JOIN it just pulls all the rows instead of limiting them as I am trying to do.

LEFT JOIN y ON x.id = y.id
means that it should return ALL rows from x, and only those rows from y that match your condition.

You want an inner join in order to restrict the results similar to the way your IN(sub select) statement is working.

Try this:

SELECT t1.id1, t1.id2, t1.id3, t1.id4
FROM t1
INNER JOIN t3 ON ((t1.id3 = t3.id3) AND (t3.id4 = '1'))
WHERE t1.id1 NOT IN (SELECT t2.id1 FROM t2 WHERE t2.id4 = '1')
AND t1.id4 = '0'
 

EricMartello

Senior member
Apr 17, 2003
910
0
0
Yeah, you are right. The INNER JOIN does do the job but it is too slow - in fact, slower than doing two subselects by about 30%. It was one of the first things I tried. I don't understand why it slows down so much if I add in a second subselect. If I just do it with the "NOT IN" part the query is quick, albeit returning incorrect results. I think I'm going to have to go ghetto on this one and do it manually. It doesn't seem like there is a suitable query to get all the results without taking an absurdly long time. If only there was something like INSERT IF NOT EXIST id1 that would solve this problem nicely...
 

BoberFett

Lifer
Oct 9, 1999
37,562
9
81
I don't know MySQL or it's nuances, but it appears you're trying to find all values from Table1 which never appears in Table2. Have you tried a outer join rather than an inner?

select ID
from Table1 left join Table2 on Table1.ID = Table2.ID
where Table2.ID is null
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,610
4,530
75
Originally posted by: EricMartello
My second choice would be to remove the NOT IN subselect, and do the subselect first as a standalone query. At that point I would put the IDs returned into an array and just that array to exclude the new entries based on their IDs already being in the array. The drawback here is that I would need to loop through every row one-by-one to test if they are already in the array. Not an optimal solution...any ideas are welcome. :)
Definitely not an optimal solution. Do you have a higher-level language that you would be looping in? (E.g. Java, .NET, Perl, Python, etc.)

If so, you could create a hash/map (it's called different things in different languages) of the subquery IDs, and check if each main-query ID exists in it.
 

MrChad

Lifer
Aug 22, 2001
13,507
3
81
What about this?

SELECT rowID FROM table1 WHERE NOT EXISTS (SELECT 1 FROM table2 WHERE row2ID = table1.rowID)
 

KIAman

Diamond Member
Mar 7, 2001
3,342
23
81
If performance is a priority and storage is not a concern, you should de-normalize your tables.

Consider making a new redundant table T4 that combines your criteria from T2 and T3 distinctly. Create a trigger to update T4 whenever T2 and T3 is updated. Keep T4 indexed and maintained.

Then your select statement can be simplified.

Select ID1, ID2, ID3, ID4
From T1
Where ID1 in (Select ID1 from T4)
and ID4 = '0'
 

MrChad

Lifer
Aug 22, 2001
13,507
3
81
I'm not well-versed in MySQL specifically, but IN and NOT IN queries can almost always be easily replaced by either JOINs (such as BoberFett's solution) or EXISTS/NOT EXISTS for better performance. The only time I use IN / NOT IN is if I have a short, pre-defined list (e.g. NOT IN (1, 2, 3)), not when I have an inner query.
 

blahblah99

Platinum Member
Oct 10, 2000
2,689
0
0
Why can't you do something like this:

SELECT a.rowID FROM table1 AS a RIGHT JOIN table2 AS b ON a.rowID=b.rowID WHERE b.rowID IS NULL

That would select all rows in table 2 that exists in table 1 but not in 2.

replacing RIGHT JOIN with LEFT JOIN and WHERE b.rowID IS NULL with WHERE a.rowID IS NULL would select all rows in table 1 that is in table 2 but not in 1.