I Need Some MySQL Query Optimization

TridenT

Lifer
Sep 4, 2006
16,800
45
91
Here are the three tables I'm dealing with:
[attached image: i6e2x5.png, a screenshot of the table schemas]


The customers table holds customers. The checkins table holds the checkins that customers perform. Once a customer is entered into the customers table, they can "check in" (e.g., you check in at a hotel or an event). A record is kept for each time a customer checks in, holding some information about that checkin, and each checkin row references the customer's id. You find out how many times a customer "visited" by tallying up the number of checkins for that customer.

Currently, customers holds about 36,000 entries; checkins only holds about 30. I plan to add about 500,000 entries to checkins soon enough, so I imagine that will add more strain.

I started adding indexes like crazy when my performance took a hit. I just started adding shit like nobody's business. So, for customers there's a unique index on name & id, one on id & name & on, one on id & on, and a plain index on name. Some of those indexes make no sense (since id is the primary key, just having an index on name and maybe name & on should be enough). For checkins I've just had an index on customer_id and event_id. And both tables have the default unique primary index on id.
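For reference, here's roughly what I think the pared-down set would look like (index names are made up by me; note that on is a reserved word in MySQL, so it needs backticks when it isn't qualified):

```sql
-- Sketch of a minimal index set; both tables already have PRIMARY KEY (id).
CREATE INDEX idx_customers_name      ON customers (name);
CREATE INDEX idx_customers_name_on   ON customers (name, `on`);
CREATE INDEX idx_checkins_customer   ON checkins (customer_id);
CREATE INDEX idx_checkins_cust_event ON checkins (customer_id, event_id);
```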

Here's the query that's the biggest issue (takes 20-150ms on my local machine, an i7-4770 with a 256GB Samsung EVO SSD). It's part of a customer-search function.

Code:
SELECT COUNT(ch.on) AS visits, cu.id AS cid, cu.name AS name, cu.email AS email
    FROM checkins AS ch
    RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
    WHERE cu.name LIKE '%$name%'
    AND cu.on = '1'
    GROUP BY cu.id
    ORDER BY visits DESC, name ASC
LIMIT <?php= function here, 12-30 is common. ?>
What I'm trying to get is the customers who contain the name searched for and are turned on in the database, along with the number of visits they have. (And yes, on is a terrible name choice for a column. I'll change it eventually.) I want the results back ordered by visits descending and then by name ascending, limited to some integer.

This seems to be a difficult query because I have to count all the visits each customer has. And I have to use an outer join, because someone can be a customer in the database with 0 checkins, and you still want that customer to show up in your search results.
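In case it helps diagnose things, the query can be run under EXPLAIN to see which index (if any) MySQL picks and how many rows it expects to scan; here it is with a literal LIMIT substituted in for the PHP bit:

```sql
EXPLAIN
SELECT COUNT(ch.`on`) AS visits, cu.id AS cid, cu.name AS name, cu.email AS email
    FROM checkins AS ch
    RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
    WHERE cu.name LIKE '%$name%'
    AND cu.`on` = '1'
    GROUP BY cu.id
    ORDER BY visits DESC, name ASC
LIMIT 12;
```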

Any suggestions about how to optimize this query or maybe how to rewrite my code in such a way to optimize this?

I didn't want to spend any more time thinking about this without consulting the internet. I'm not a DBA or an SQL expert. It's been over a year since I took my intro databases class, and even though I found clever ways to do things there, I can't remember much of it, since I rarely hit a performance wall. (I don't often work with large data sets that require expensive operations.)
 

TridenT

Lifer
Sep 4, 2006
16,800
45
91
This seems fundamentally impossible to get around.

I think of it this way:
If I do $name = "", I basically have to do a count operation for every customer. I don't see how to get around that when I need the most frequently visiting customers returned first.
 

uclabachelor

Senior member
Nov 9, 2009
448
0
71
This seems fundamentally impossible to get around.

It only seems impossible because you're trying to do everything in one query, including hunting for a "magical" index for it, when in reality this problem needs to be broken up into two or three queries (or nested queries) with the proper indexes on each.

First of all, clear all of the indexes from both tables. Next, write a script to insert about 100k customers into the customers table and about 10k entries into checkins to start, slowly ramping up to 1M rows and testing the queries below along the way.
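For the seeding script, a digits cross join avoids a slow client-side loop; something like this (column names guessed from your screenshot, and the generated names/emails are obviously synthetic):

```sql
-- One-off seeding helper: 10 digits cross-joined 5 times = 100,000 rows.
CREATE TABLE digits (d INT NOT NULL);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

INSERT INTO customers (name, email, `on`)
SELECT CONCAT('cust', n), CONCAT('cust', n, '@example.com'), '1'
FROM (
    SELECT a.d*10000 + b.d*1000 + c.d*100 + e.d*10 + f.d AS n
    FROM digits a, digits b, digits c, digits e, digits f
) AS nums;
```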

Inner query:

Code:
SELECT COUNT(*) AS visits, customer_id FROM checkins GROUP BY customer_id

Add an index on customer_id, id.

Main query:

Code:
SELECT customers.<fields>, checkins.visits
    FROM customers
    LEFT JOIN ( <inner query above> ) AS checkins
        ON checkins.customer_id = customers.id
    WHERE customers.name LIKE '%$name%'
    ORDER BY checkins.visits DESC, customers.name ASC

Add a full text index on customers.name
Add an index on customers.id, customers.email, and possibly customers.name

That should get you to a pretty optimal query. There may be some additional optimization that can be done with better indexes, but it's Friday and I'm done for the week.

And paypal me some lunch money when you see this. :D
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Code:
SELECT DISTINCT name, customers.id, COUNT(checkins.on) AS visits
    FROM customers
    LEFT JOIN checkins ON customers.id = checkins.customer_id
    WHERE name LIKE '%$name%'
    GROUP BY customers.id
    ORDER BY visits DESC, name ASC
I'm pretty sure that's what you're after. It may still take some time with InnoDB, given the use of COUNT(). Triggers or Flexviews would be the next step after that, to get materialized views, assuming the query isn't having index problems (like over name). A materialized view would remove the need to COUNT() every time, instead doing so only when checkins gets an update. If the search is hardly ever done without a name pattern, it probably won't be an issue.
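As a rough sketch of the trigger route (the summary table and trigger names here are mine, not anything standard, and updates to checkins.customer_id would need handling too):

```sql
-- Hypothetical materialized visit count, kept current by triggers.
CREATE TABLE customer_visits (
    customer_id INT NOT NULL PRIMARY KEY,
    visits      INT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER checkins_ai AFTER INSERT ON checkins
FOR EACH ROW
BEGIN
    INSERT INTO customer_visits (customer_id, visits)
    VALUES (NEW.customer_id, 1)
    ON DUPLICATE KEY UPDATE visits = visits + 1;
END//
CREATE TRIGGER checkins_ad AFTER DELETE ON checkins
FOR EACH ROW
BEGIN
    UPDATE customer_visits SET visits = visits - 1
    WHERE customer_id = OLD.customer_id;
END//
DELIMITER ;
```

The search then joins against customer_visits instead of counting the base table every time.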

P.S. I slightly misunderstood what he was after, and this might not get the right results as written (I could change it a little so it would in Postgres, but I'm not sure about MySQL).
 
Last edited:

TridenT

Lifer
Sep 4, 2006
16,800
45
91
Sorry for the long post.



Hmm, didn't think about left join. That changes things, obviously.

I don't think DISTINCT is in my interest. It might work for unique names, emails, and so forth, but I could have multiple customers with the same name, maybe even the same name and email combo (for whatever reason; just safeguarding here).

Anyway, I've been iterating on the functions today. I'm having some problems figuring out how to get the best of both worlds here. I ran about 700 automated tests on each of the three methods below: the letters a-z, then all combinations of (a-z)^2 (676 tests + 26 for the single chars).

Forgive the naming conventions, but the idea is that Fastest was my attempt before posting here; FastestRedux uses your SQL plus a reworked method for finding out whether customers are already checked in (basically reducing that from a 170ms query to 10-15ms); and FastestReduxCombo is my old SQL query with the new checked-in lookup.

Here are the functions in testing form:
Fastest
Code:
function searchCustomersFastest($args){
    $timingArray = array();
    
    $name = $args['name'];
    $limit = $args['limit'];
    $eventID = $args['eventID'];
    $numInSystemSQL =
            "SELECT COUNT(*) as count
            FROM customers
            WHERE customers.on = '1'
            AND name LIKE '%$name%'
            ";
    $start_time = microtime(TRUE);
    $numInSystemQuery = mysql_query($numInSystemSQL) or die (returnSQLError($numInSystemSQL));
    $numInSystemNumber = mysql_fetch_array($numInSystemQuery);
    $end_time = microtime(TRUE);
    $timingArray[0] = ($end_time - $start_time);
    $numInSystemNumber = $numInSystemNumber['count'];
    $highestVisitsAndLikeName =
    "SELECT COUNT(ch.on) AS visits, cu.id AS cid, cu.name AS name, cu.email AS email
    FROM checkins AS ch
    RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
    WHERE cu.name LIKE '%$name%'
    AND cu.on = '1'
    GROUP BY cu.id
    ORDER BY visits DESC, name ASC, cu.id DESC
    " . ($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
    $start_time = microtime(TRUE);
    $visitquery = mysql_query($highestVisitsAndLikeName) or die (returnSQLError(mysql_error()));
    $end_time = microtime(TRUE);
    $timingArray[1] = ($end_time - $start_time);
    $alreadycheckedinsql =
    "SELECT customers.name AS cname, checkins.customer_id AS cid, checkins.payment AS payment
    FROM checkins
    JOIN customers
    ON checkins.customer_id = customers.id
    WHERE event_id = '$eventID'
    AND checkins.on = '1'
    AND customers.on = '1'
    AND customer_id IN (
        SELECT cu.id AS cid
        FROM checkins AS ch
        RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
        WHERE cu.name LIKE '%$name%'
        AND cu.on = '1'
        GROUP BY cu.id
        ORDER BY COUNT(ch.on) DESC, name ASC, cu.id DESC
        )";
    $start_time = microtime(TRUE);
    $alreadycheckedinquery = mysql_query($alreadycheckedinsql) or die (returnSQLError($alreadycheckedinsql));
    $end_time = microtime(TRUE);
    $timingArray[2] = ($end_time - $start_time);
    $alreadycheckedin = array();
    while($tmp = mysql_fetch_array($alreadycheckedinquery)){
        $alreadycheckedin[$tmp['cid']] = $tmp['payment'];
    }
    $keysAlreadyCheckedIn = array_keys($alreadycheckedin);
    $customerArray = array();
    $start_time = microtime(TRUE);
    while($visit = mysql_fetch_array($visitquery)){
        $st = microtime(TRUE);
        $cid = $visit['cid'];
        $name = $visit['name'];
        $visits = $visit['visits'];
        $isCheckedIn = in_array($cid, $keysAlreadyCheckedIn);
        $checkinID = getCheckinIDForCustomerAndEvent($cid, $eventID);
        $usedFreeEntrance = $checkinID ? hasUsedFreeEntrance($cid, $checkinID) : false;
        $numberOfFreeEntrances = getNumberOfFreeEntrances($cid);
        array_push($customerArray, 
            array(
            "cid" => $cid,
            "email" => $visit['email'],
            "payment" => ($isCheckedIn ? $alreadycheckedin[$cid] : ''),
            "name" => $name,
            "visits" => $visits,
            "isCheckedIn" => $isCheckedIn,
            "usedFreeEntrance" => $usedFreeEntrance,
            "numberOfFreeEntrances" => $numberOfFreeEntrances,
            )
        );
        $timingArray[4] = (microtime(TRUE) - $st);
    }
    $end_time = microtime(TRUE);
    $timingArray[3] = ($end_time - $start_time);
    $returnJSON = array();
    $returnJSON['customers'] = $customerArray;
    $returnJSON['timingArray'] = $timingArray;
    if($numInSystemNumber > $limit + 1){
        $returnJSON['numberOfExtra'] = $numInSystemNumber - $limit;
    } else {
        $returnJSON['numberOfExtra'] = 0;
    }
    return json_encode($returnJSON);
}
Here's FastestRedux:
Code:
function searchCustomersFastestRedux($args){
    $timingArray = array();
    
    $name = $args['name'];
    $limit = $args['limit'];
    $eventID = $args['eventID'];
    $numInSystemSQL =
            "SELECT COUNT(*) as count
            FROM customers
            WHERE customers.on = '1'
            AND name LIKE '%$name%'
            ";
    $start_time = microtime(TRUE);
    $numInSystemQuery = mysql_query($numInSystemSQL) or die (returnSQLError($numInSystemSQL));
    $numInSystemNumber = mysql_fetch_array($numInSystemQuery);
    $end_time = microtime(TRUE);
    $timingArray[0] = ($end_time - $start_time);
    $numInSystemNumber = $numInSystemNumber['count'];
    $highestVisitsAndLikeName =
    "SELECT customers.id as cid, customers.name as name, customers.email as email, IFNULL(checkins.visits, 0) as visits
    FROM customers 
    LEFT JOIN
        (SELECT COUNT(checkins.on) AS visits, customer_id 
        FROM checkins 
        GROUP BY customer_id)
    AS checkins 
    ON checkins.customer_id=customers.id 
    WHERE customers.name LIKE '%$name%'
    AND customers.on = '1'
    ORDER BY checkins.visits DESC, customers.name ASC, customers.id DESC
    " .($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
    $start_time = microtime(TRUE);
    $visitquery = mysql_query($highestVisitsAndLikeName) or die (returnSQLError(mysql_error()));
    $end_time = microtime(TRUE);
    $timingArray[1] = ($end_time - $start_time);
    $customerArray = array();
    $start_time = microtime(TRUE);
    while($visit = mysql_fetch_array($visitquery)){
        $st = microtime(TRUE);
        $cid = $visit['cid'];
        $name = $visit['name'];
        $visits = $visit['visits'];
        $isCheckedIn = getCustomerCheckedInPayment($cid, $eventID);
        $checkinID = getCheckinIDForCustomerAndEvent($cid, $eventID);
        $usedFreeEntrance = $checkinID ? hasUsedFreeEntrance($cid, $checkinID) : false;
        $numberOfFreeEntrances = getNumberOfFreeEntrances($cid);
        array_push($customerArray, 
            array(
            "cid" => $cid,
            "email" => $visit['email'],
            "payment" => ($isCheckedIn == null ? "" : $isCheckedIn),
            "name" => $name,
            "visits" => $visits,
            "isCheckedIn" => ($isCheckedIn != null),
            "usedFreeEntrance" => $usedFreeEntrance,
            "numberOfFreeEntrances" => $numberOfFreeEntrances,
            )
        );
        $timingArray[3] = (microtime(TRUE) - $st);
    }
    $end_time = microtime(TRUE);
    $timingArray[2] = ($end_time - $start_time);
    $returnJSON = array();
    $returnJSON['customers'] = $customerArray;
    $returnJSON['timingArray'] = $timingArray;
    if($numInSystemNumber > $limit + 1){
        $returnJSON['numberOfExtra'] = $numInSystemNumber - $limit;
    } else {
        $returnJSON['numberOfExtra'] = 0;
    }
    return json_encode($returnJSON);
}
Here's FastestReduxCombo:
Code:
function searchCustomersFastestReduxCombo($args){
    $timingArray = array();
    
    $name = $args['name'];
    $limit = $args['limit'];
    $eventID = $args['eventID'];
    $numInSystemSQL =
            "SELECT COUNT(*) as count
            FROM customers
            WHERE customers.on = '1'
            AND name LIKE '%$name%'
            ";
    $start_time = microtime(TRUE);
    $numInSystemQuery = mysql_query($numInSystemSQL) or die (returnSQLError($numInSystemSQL));
    $numInSystemNumber = mysql_fetch_array($numInSystemQuery);
    $end_time = microtime(TRUE);
    $timingArray[0] = ($end_time - $start_time);
    $numInSystemNumber = $numInSystemNumber['count'];
    $highestVisitsAndLikeName =
    "SELECT COUNT(ch.on) AS visits, cu.id AS cid, cu.name AS name, cu.email AS email
    FROM checkins AS ch
    RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
    WHERE cu.name LIKE '%$name%'
    AND cu.on = '1'
    GROUP BY cu.id
    ORDER BY visits DESC, name ASC, cu.id DESC
    " . ($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
    $start_time = microtime(TRUE);
    $visitquery = mysql_query($highestVisitsAndLikeName) or die (returnSQLError(mysql_error()));
    $end_time = microtime(TRUE);
    $timingArray[1] = ($end_time - $start_time);
    $customerArray = array();
    $start_time = microtime(TRUE);
    while($visit = mysql_fetch_array($visitquery)){
        $st = microtime(TRUE);
        $cid = $visit['cid'];
        $name = $visit['name'];
        $visits = $visit['visits'];
        $isCheckedIn = getCustomerCheckedInPayment($cid, $eventID);
        $checkinID = getCheckinIDForCustomerAndEvent($cid, $eventID);
        $usedFreeEntrance = $checkinID ? hasUsedFreeEntrance($cid, $checkinID) : false;
        $numberOfFreeEntrances = getNumberOfFreeEntrances($cid);
        array_push($customerArray, 
            array(
            "cid" => $cid,
            "email" => $visit['email'],
            "payment" => ($isCheckedIn == null ? "" : $isCheckedIn),
            "name" => $name,
            "visits" => $visits,
            "isCheckedIn" => ($isCheckedIn != null),
            "usedFreeEntrance" => $usedFreeEntrance,
            "numberOfFreeEntrances" => $numberOfFreeEntrances,
            )
        );
        $timingArray[3] = (microtime(TRUE) - $st);
    }
    $end_time = microtime(TRUE);
    $timingArray[2] = ($end_time - $start_time);
    $returnJSON = array();
    $returnJSON['customers'] = $customerArray;
    $returnJSON['timingArray'] = $timingArray;
    if($numInSystemNumber > $limit + 1){
        $returnJSON['numberOfExtra'] = $numInSystemNumber - $limit;
    } else {
        $returnJSON['numberOfExtra'] = 0;
    }
    return json_encode($returnJSON);
}
Here are samples of the run times.
For Fastest, ignore statements 3-4; for FastestRedux and FastestReduxCombo, ignore statements 2-3. The statement numbers reference timingArray[number] in the code above.
Code:
These next runs have a lot of results... check the last one, though.

NAME: a, LENGTH: 1

Begin result 0
Benchmarking searchCustomersFastest with param: a
Time to perform: 0.45402598381042
Time to perform Statements 0: 0.024001121520996 Statements 1: 0.246013879776 Statements 2: 0.17800998687744 Statements 3: 0.0039999485015869 Statements 4: 0
Benchmarking searchCustomersFastestRedux with param: a
Time to perform: 0.14100813865662
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.12700700759888 Statements 2: 0.0040011405944824 Statements 3: 0
Benchmarking searchCustomersFastestReduxCombo with param: a
Time to perform: 0.25001406669617
Time to perform Statements 0: 0.010999917984009 Statements 1: 0.23401403427124 Statements 2: 0.005000114440918 Statements 3: 0.0010001659393311
End result 0

NAME: b, LENGTH: 1

Begin result 1
Benchmarking searchCustomersFastest with param: b
Time to perform: 0.085004806518555
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.029001951217651 Statements 2: 0.043002843856812 Statements 3: 0.003000020980835 Statements 4: 0
Benchmarking searchCustomersFastestRedux with param: b
Time to perform: 0.096005201339722
Time to perform Statements 0: 0.011000156402588 Statements 1: 0.081004858016968 Statements 2: 0.004000186920166 Statements 3: 0
Benchmarking searchCustomersFastestReduxCombo with param: b
Time to perform: 0.043002843856812
Time to perform Statements 0: 0.010000944137573 Statements 1: 0.029001951217651 Statements 2: 0.0039999485015869 Statements 3: 0
End result 1

NAME: c, LENGTH: 1

Begin result 2
Benchmarking searchCustomersFastest with param: c
Time to perform: 0.099005222320557
Time to perform Statements 0: 0.011000156402588 Statements 1: 0.035001993179321 Statements 2: 0.050003051757812 Statements 3: 0.003000020980835 Statements 4: 0
Benchmarking searchCustomersFastestRedux with param: c
Time to perform: 0.098005056381226
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.084005117416382 Statements 2: 0.0039999485015869 Statements 3: 0
Benchmarking searchCustomersFastestReduxCombo with param: c
Time to perform: 0.051002979278564
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.036002159118652 Statements 2: 0.0039999485015869 Statements 3: 0
End result 2

NAME: d, LENGTH: 1

Begin result 3
Benchmarking searchCustomersFastest with param: d
Time to perform: 0.12400698661804
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.047002077102661 Statements 2: 0.062004089355469 Statements 3: 0.0039999485015869 Statements 4: 0
Benchmarking searchCustomersFastestRedux with param: d
Time to perform: 0.10000610351562
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.085005044937134 Statements 2: 0.004000186920166 Statements 3: 0
Benchmarking searchCustomersFastestReduxCombo with param: d
Time to perform: 0.062002897262573
Time to perform Statements 0: 0.010999917984009 Statements 1: 0.046002864837646 Statements 2: 0.005000114440918 Statements 3: 0.00099992752075195
End result 3

NAME: e, LENGTH: 1

Begin result 4
Benchmarking searchCustomersFastest with param: e
Time to perform: 0.26201510429382
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.11700701713562 Statements 2: 0.13000702857971 Statements 3: 0.004000186920166 Statements 4: 0
Benchmarking searchCustomersFastestRedux with param: e
Time to perform: 0.11600685119629
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.10100603103638 Statements 2: 0.0039999485015869 Statements 3: 0.00099992752075195
Benchmarking searchCustomersFastestReduxCombo with param: e
Time to perform: 0.13400793075562
Time to perform Statements 0: 0.011001110076904 Statements 1: 0.11800599098206 Statements 2: 0.0050008296966553 Statements 3: 0.00099992752075195
End result 4

...
...
...
...
...
...

This has 0 results.

NAME: qq, LENGTH: 2

Begin result 16
Benchmarking searchCustomersFastest with param: qq
Time to perform: 0.039001941680908
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.010000944137573 Statements 2: 0.019001007080078 Statements 3: 0
Benchmarking searchCustomersFastestRedux with param: qq
Time to perform: 0.020001173019409
Time to perform Statements 0: 0.010001182556152 Statements 1: 0.0099999904632568 Statements 2: 0
Benchmarking searchCustomersFastestReduxCombo with param: qq
Time to perform: 0.021000862121582
Time to perform Statements 0: 0.011000871658325 Statements 1: 0.0099999904632568 Statements 2: 0
End result 16
Here's results for a full run:
Code:
TOTAL RUNS: 702

Total run time for benchmarkSearchCustomers (Fastest): 
37.668162345886, 
average run time: 0.053658350920066, 
median time: 0.049003124237061, 
longest time: 0.45402598381042, 
standard deviation: 0.025842059549636

Total run time for benchmarkSearchCustomers2 (FastestRedux): 
59.079370260239, 
average run time: 0.084158647094357, 
median time: 0.093004941940308, 
longest time: 0.14100813865662, 
standard deviation: 0.024883380426155

Total run time for benchmarkSearchCustomers3 (FastestReduxCombo): 
19.38410782814, 
average run time: 0.027612689213875, 
median time: 0.0260009765625, 
longest time: 0.25001406669617, 
standard deviation: 0.013567375694995
Clearly, FastestReduxCombo is the quickest: it has the lowest average, the lowest total run time, and the smallest standard deviation. Something I noticed while working on this, though, is that "statement 1" in FastestRedux (your version of the SQL) has consistently high run times. Unless it returns nothing, it frequently takes 80ms+.

The only problem with FastestReduxCombo (my SQL query) is that its longest time is nearly double the next method's. 250ms is a long time for a query, and it will happen a lot, because this search is pinged every time someone types into a prompt. (It's a _.throttle'd prompt because of these speed constraints.)

Here are the current indexes that I added:

On checkins
*Primary - id
*customer_id, id - index on two columns: customer_id, id
*customer_id, event_id - index on two columns: customer_id, event_id

On customers
*Primary - id
*id, name - index on two columns: id, name
*name - FULLTEXT index on name (This didn't do anything. Cardinality shows 1; I can run ANALYZE/REPAIR/CHECK TABLE and it'll populate to 34k or whatever, but it drops back to 1 a few seconds later when I run my tests. Maybe because I run RESET QUERY CACHE before all my functions start, so each one gets a fresh start?)

I removed the name index and got the same results.

Thoughts?
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I can't use DISTINCT in my query. Names are not distinct. The only thing that can be distinct is id. (Weird possibility but someone could have the same name and email address. Or they have multiple customer accounts. Only user_id would be unique but it's not implemented yet and I don't think all customers will have a user_id. Maybe in the future though. Undecided.)
DISTINCT operates on whole rows; it's second nature to me at this point, even where it shouldn't be necessary. That is, the entire {name, id, count()} tuple would need to be duplicated for DISTINCT to remove anything.

SQL is deeply flawed in that it is based on pointer/id-based records, and allows duplicates. It shouldn't create any with primary keys inner joined on primary [foreign] keys, or a left join where the left side has a primary key, but again, 2nd nature :).

Do check the time, and query explanation, of just the count() part, though:
Code:
SELECT COUNT(`on`) AS visits, customer_id FROM checkins GROUP BY customer_id
If that takes enough time by itself, it might be worth keeping an up to date list (materialized view), rather than querying for it from the base data every time. COUNT and subqueries can both absolutely KILL performance with InnoDB, without special indices or views (it's part of the trade-off for concurrent transaction safety, and is the same with MS SQL or Postgres, and likely any other MVCC ACID engines). Remember, your base tables are there to keep things correct and consistent. Additional tables, with derived data, are fine, so long as they are kept up to date, and there is a one-way dependency to the tables they are derived from (IE, they are caches of queries).

200-300ms just seems like quite a long time. It makes sense that the first one takes longest, because it can't filter as many rows with the first query. The 2nd is being slowed down due to count in the subquery. It could be rewritten a few ways, like with a union, but that could end up even slower. I'd personally defer to materialized views first, since they will remain predictable as the data grows, while a given query optimization may not.

P.S. Also consider putting a single-column index on checkins.customer_id, so that it has an index that's usable generically. Multicolumn indices are really only good for queries that filter on the columns in the order they are declared. Without composite keys, that happens less often than you'd think.
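To illustrate the leftmost-prefix rule with your (customer_id, event_id) index; these are just sketch queries against your schema:

```sql
-- With INDEX (customer_id, event_id):
SELECT * FROM checkins WHERE customer_id = 42;                   -- can use the index
SELECT * FROM checkins WHERE customer_id = 42 AND event_id = 7;  -- can use the index
SELECT * FROM checkins WHERE event_id = 7;                       -- cannot; event_id is not a leftmost prefix
```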

P.P.S. Another:
Code:
SELECT cid, name, email, IFNULL(visits,0) AS visits
FROM (
    SELECT id as cid, name, email
    FROM customers
    WHERE on = 1
    AND name LIKE '%$name%'
) AS tcu
LEFT JOIN (
    SELECT customer_id as cid, COUNT(on) AS visits
    GROUP BY customer_id
) AS tch ON tcu.cid = tch.cid
ORDER BY visits DESC, cid ASC
The above is likely to be really fast, or quite slow, depending on how well the optimizer can flatten the subqueries (ideal: match name, then verify on=1, then get counts, with only 1 temp table).

P.P.P.S. Found it (looked earlier, and failed): http://makandracards.com/makandra/1...indexes-and-like-queries-for-full-text-search
Note, however, that %text% pattern matching can't use those indices.
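To illustrate with a sketch (assuming an ordinary B-tree index on customers.name): a pattern anchored at the start can use the index, but a leading % cannot.

```sql
-- With INDEX (name):
EXPLAIN SELECT id FROM customers WHERE name LIKE 'smi%';   -- range scan over the index
EXPLAIN SELECT id FROM customers WHERE name LIKE '%smi%';  -- full scan; the index can't help
```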
 
Last edited:

TridenT

Lifer
Sep 4, 2006
16,800
45
91
Been feeling anal about this whole scenario. Figured I could switch between the two queries when the COUNT(*) from the first LIKE query returns a high enough number. In this case, about 13,000 is the threshold at which to swap SQL queries. (Very easily done, too.)

Code:
if($numInSystemNumber > 13000){
        $highestVisitsAndLikeName =
        "SELECT customers.id as cid, customers.name as name, customers.email as email, IFNULL(checkins.visits, 0) as visits
        FROM customers 
        LEFT JOIN
            (SELECT COUNT(checkins.on) AS visits, customer_id 
            FROM checkins 
            GROUP BY customer_id)
        AS checkins 
        ON checkins.customer_id=customers.id 
        WHERE customers.name LIKE '%$name%'
        AND customers.on = '1'
        ORDER BY checkins.visits DESC, customers.name ASC, customers.id DESC
        " .($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
    }
    else{
        $highestVisitsAndLikeName =
        "SELECT COUNT(ch.on) AS visits, cu.id AS cid, cu.name AS name, cu.email AS email
        FROM checkins AS ch
        RIGHT OUTER JOIN customers AS cu ON ch.customer_id = cu.id
        WHERE cu.name LIKE '%$name%'
        AND cu.on = '1'
        GROUP BY cu.id
        ORDER BY visits DESC, name ASC, cu.id DESC
        " . ($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
    }
And when you look at the results (FinalFrontier is the new method):

Code:
Total run time for benchmarkSearchCustomers (Fastest): 37.998162508011, average run time: 0.054051440267441, median time: 0.049002885818481, longest time: 0.61003518104553, standard deviation: 0.03249400314677

Total run time for benchmarkSearchCustomers2 (FastestRedux): 58.918383598328, average run time: 0.083809933994776, median time: 0.093004941940308, longest time: 0.16200995445251, standard deviation: 0.024842227830373

Total run time for benchmarkSearchCustomers3 (FastestReduxCombo): 19.548100709915, average run time: 0.027806686642838, median time: 0.025002002716064, longest time: 0.37402081489563, standard deviation: 0.018726227124082

Total run time for benchmarkSearchCustomers4 (FinalFrontier): 19.057107686996, average run time: 0.02710826129018, median time: 0.025002002716064, longest time: 0.16100907325745, standard deviation: 0.01197190886371
It doesn't seem like much, but cutting that longest time roughly in half is huge. Suddenly the two slowest searches take half the time in the worst case (significant given how slow they are).
For reference, result -1 is $name = "" (empty, the most often searched-for thing):
Code:
Begin result -1
Benchmarking searchCustomersFastest with param:
Time to perform: 0.60603499412537 ... Number of Results: 12
Time to perform Statements 0: 0.015001058578491 Statements 1: 0.36202001571655 Statements 2: 0.22401285171509 Statements 3: 0.0040009021759033 Statements 4: 0
Counts : Count 0: 34568
Benchmarking searchCustomersFastestRedux with param:
Time to perform: 0.16100907325745 ... Number of Results: 12
Time to perform Statements 0: 0.0090000629425049 Statements 1: 0.14700794219971 Statements 2: 0.0050010681152344 Statements 3: 0
Counts : Count 0: 34568
Benchmarking searchCustomersFastestReduxCombo with param:
Time to perform: 0.37302088737488 ... Number of Results: 12
Time to perform Statements 0: 0.0090000629425049 Statements 1: 0.35902094841003 Statements 2: 0.0049998760223389 Statements 3: 0.00099992752075195
Counts : Count 0: 34568
Benchmarking searchCustomersFinalFrontier with param:
Time to perform: 0.16300916671753 ... Number of Results: 12
Time to perform Statements 0: 0.0080010890960693 Statements 1: 0.15000796318054 Statements 2: 0.0039999485015869 Statements 3: 0
Counts : Count 0: 34568
End result -1

NAME: a, LENGTH: 1

Begin result 0
Benchmarking searchCustomersFastest with param: a
Time to perform: 0.42202496528625 ... Number of Results: 12
Time to perform Statements 0: 0.010000944137573 Statements 1: 0.23401284217834 Statements 2: 0.17401003837585 Statements 3: 0.0030009746551514 Statements 4: 0
Counts : Count 0: 25654
Benchmarking searchCustomersFastestRedux with param: a
Time to perform: 0.14000797271729 ... Number of Results: 12
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.12500691413879 Statements 2: 0.0050010681152344 Statements 3: 0.00099992752075195
Counts : Count 0: 25654
Benchmarking searchCustomersFastestReduxCombo with param: a
Time to perform: 0.24801397323608 ... Number of Results: 12
Time to perform Statements 0: 0.0099999904632568 Statements 1: 0.23401403427124 Statements 2: 0.0039999485015869 Statements 3: 0
Counts : Count 0: 25654
Benchmarking searchCustomersFinalFrontier with param: a
Time to perform: 0.14000797271729 ... Number of Results: 12
Time to perform Statements 0: 0.011000156402588 Statements 1: 0.12500786781311 Statements 2: 0.0039999485015869 Statements 3: 0
Counts : Count 0: 25654
End result 0
 

TridenT

Lifer
Sep 4, 2006
16,800
45
91
P.P.S. Another:
Code:
SELECT cid, name, email, IFNULL(visits,0) AS visits
FROM (
    SELECT id as cid, name, email
    FROM customers
    WHERE on = 1
    AND name LIKE '%$name%'
) AS tcu
LEFT JOIN (
    SELECT customer_id as cid, COUNT(on) AS visits
    GROUP BY customer_id
) AS tch ON tcu.cid = tch.cid
ORDER BY visits DESC, cid ASC
The above is likely to be really fast, or quite slow, depending on how well the optimizer can flatten the subqueries (ideal: match name, then verify on=1, then get counts, with only 1 temp table).

I had to modify yours because it wasn't fully correct; you can see the corrections below (for instance, the second subquery was missing its FROM table).

Code:
$highestVisitsAndLikeName =
    "SELECT tcu.cid as cid, tcu.name as name, tcu.email as email, IFNULL(visits,0) AS visits
    FROM (
        SELECT id as cid, name, email
        FROM customers
        WHERE customers.on = '1'
        AND name LIKE '%$name%'
    ) AS tcu
    LEFT JOIN (
        SELECT customer_id as cid, COUNT(checkins.on) AS visits
        FROM checkins
        GROUP BY customer_id
    ) AS tch ON tcu.cid = tch.cid
    ORDER BY visits DESC, tcu.name ASC, tcu.cid DESC
    " . ($numInSystemNumber > ($limit + 1) ? ("LIMIT " .  $limit) : "");
It looks like yours runs at basically the same speed as uclabachelor's in my meager test. A lot of the differences are within the margin of error, apparently (and flukey). It wasn't returning everything in a consistent order at first (the ORDER BY was too vague to break ties deterministically), but that might be resolved now...
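As a sanity check, the corrected derived-table query can be exercised on a toy dataset. This sketch uses Python's bundled sqlite3 rather than MySQL (IFNULL and the subquery shape behave the same in both); the schema and rows are invented for the demo, and the thread's `on` column is renamed `enabled` here because ON is a reserved word:

```python
import sqlite3

# Toy re-run of the derived-table LEFT JOIN pattern. Schema and rows are
# made up; "on" is renamed "enabled" since ON is a reserved word.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, enabled INTEGER);
CREATE TABLE checkins  (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Alice', 'a@x.com', 1), (2, 'Alan', 'b@x.com', 1), (3, 'Bob', 'c@x.com', 1);
INSERT INTO checkins (customer_id) VALUES (1), (1), (2);
""")
rows = conn.execute("""
SELECT tcu.cid, tcu.name, tcu.email, IFNULL(tch.visits, 0) AS visits
FROM (
    SELECT id AS cid, name, email
    FROM customers
    WHERE enabled = 1 AND name LIKE '%al%'
) AS tcu
LEFT JOIN (
    SELECT customer_id AS cid, COUNT(*) AS visits
    FROM checkins
    GROUP BY customer_id
) AS tch ON tcu.cid = tch.cid
ORDER BY visits DESC, tcu.name ASC
""").fetchall()
print(rows)  # Alice (2 visits) sorts before Alan (1 visit); Bob is filtered out by name
```

Customers with zero checkins would still survive the LEFT JOIN with visits reported as 0, which is the point of the IFNULL wrapper.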

Code:
TOTAL RUNS: 703

Total run time for benchmarkSearchCustomers5 (Cerb): 59.751404762268, average run time: 0.084994885863824, median time: 0.093004941940308, longest time: 0.218013048172, standard deviation: 0.025579658950294

Total run time for benchmarkSearchCustomers2 (FastestRedux): 59.183421373367, average run time: 0.084186943632101, median time: 0.093004941940308, longest time: 0.16300892829895, standard deviation: 0.025073534892265

Total run time for benchmarkSearchCustomers3 (FastestReduxCombo): 19.576103448868, average run time: 0.027846519841917, median time: 0.025002002716064, longest time: 0.37902212142944, standard deviation: 0.018983902235544

Total run time for benchmarkSearchCustomers4 (FinalFrontier): 19.087087869644, average run time: 0.02715090735369, median time: 0.025002002716064, longest time: 0.16600894927979, standard deviation: 0.012070641386925
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I had to modify yours because it wasn't fully correct. You can see the semantic corrections. (for instance, you forgot a FROM table)
Note the editing reason :). I was basically on AT this morning because I couldn't sleep. I see it plain as day, now.

TBH, I'm actually kind of surprised at how well it does. My gut feeling was that it would be as slow as or slower than the left join on a subquery, but with a closer spread between min and max, just due to UNIONs generally being inefficient.
 
Last edited:

cabri

Diamond Member
Nov 3, 2012
3,616
1
81
In the OP you mentioned that you had to count all the visits up:
This seems to be a difficult query because I have to count all the visits that people have.

Could you have an additional field on the customer side that is a total count of all visits? Doing so will reduce that follow-on query load considerably.

If there are no visits, there's no need to look for visitation records (path 1).
If there are visits, then proceed to process the visitation records (path 2).
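One way to keep such a counter accurate is a trigger that bumps it on every checkin. This is only a sketch, shown with Python's bundled sqlite3; the `visit_count` column and `bump_visits` trigger name are invented, and MySQL supports the same idea with an AFTER INSERT trigger (or an increment in application code):

```python
import sqlite3

# Sketch of a denormalized visit counter on the customer row, maintained
# by a trigger so search queries never have to aggregate checkins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, visit_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE checkins  (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TRIGGER bump_visits AFTER INSERT ON checkins
BEGIN
    UPDATE customers SET visit_count = visit_count + 1 WHERE id = NEW.customer_id;
END;
INSERT INTO customers (id, name) VALUES (1, 'Alice');
INSERT INTO checkins (customer_id) VALUES (1), (1), (1);
""")
count = conn.execute("SELECT visit_count FROM customers WHERE id = 1").fetchone()[0]
print(count)  # 3 -- no GROUP BY needed at search time
```

The trade-off is write-time cost and the risk of drift if rows are ever deleted outside the trigger's view, so a periodic reconciliation against COUNT(*) is worth considering.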
 

uclabachelor

Senior member
Nov 9, 2009
448
0
71
Sorry for the long post.
I removed the name index and got the same results.

Thoughts?

If you don't mind, find a way to get me a sample set of the database structure and data you're using to perform these queries....

The query I posted above should perform really fast for large data sets unless 1) there is an index issue, or 2) there is a mysql memory setting issue.

The final query time of 25ms seems like it could be improved for the amount of data you have currently.

Try adding an index on customer_id, name, and email for the customers table... that should speed up the lookup on there as the select and where clauses contain those fields.
 
Last edited:

TridenT

Lifer
Sep 4, 2006
16,800
45
91
If you don't mind, find a way to get me a sample set of the database structure and data you're using to perform these queries....

The query I posted above should perform really fast for large data sets unless 1) there is an index issue, or 2) there is a mysql memory setting issue.

The final query time of 25ms seems like it could be improved for the amount of data you have currently.

Try adding an index on customer_id, name, and email for the customers table... that should speed up the lookup on there as the select and where clauses contain those fields.


Okay.

First table (left) is the checkins table. Second table (right) is the customers table. These are some results out of the very many. Also, adding an index on varchar columns won't speed up SELECT statements when LIKE '%thing%' is used. That kind of LIKE doesn't use an index.

eg4zmp.png
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,312
1,749
136
Also, adding an index on varchar columns won't speed up SELECT statements when LIKE '%thing%' is used. That kind of LIKE doesn't use an index.

That's just what I wanted to post now.

WHERE cu.name LIKE '%$name%'

will never use an index and hence will always be slow. Remove the leading wild-card and create an appropriate index. IMHO this will also make output more intuitive for users.

also look at the query plan and use it to optimize the query:

https://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html

or use a full text index on name:

http://devzone.zend.com/26/using-mysql-full-text-searching/
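Both points can be seen directly in a query plan. Here is a sketch using SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 (MySQL's EXPLAIN reports the analogous contrast as a `range` access versus a full `ALL` scan); the schema and index name are invented for the demo:

```python
import sqlite3

# Query-plan demo: a prefix LIKE can use the index, a leading-wildcard
# LIKE cannot. The COLLATE NOCASE on name is what lets SQLite's LIKE
# optimization kick in for the prefix case.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT COLLATE NOCASE, email TEXT);
CREATE INDEX idx_name ON customers(name);
""")

def plan(sql):
    # Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

prefix = plan("SELECT id, email FROM customers WHERE name LIKE 'mar%'")
substr = plan("SELECT id, email FROM customers WHERE name LIKE '%mar%'")
print(prefix)  # SEARCH ... USING INDEX idx_name ...
print(substr)  # SCAN ... (no index can help)
```

The prefix query is reported as a SEARCH on idx_name; the leading-wildcard query degrades to a full SCAN, which is exactly the behavior described above.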
 

TridenT

Lifer
Sep 4, 2006
16,800
45
91
That's just what I wanted to post now.



will never use an index and hence will always be slow. Remove the leading wild-card and create an appropriate index. IMHO this will also make output more intuitive for users.

also look at the query plan and use it to optimize the query:

https://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html

or use a full text index on name:

http://devzone.zend.com/26/using-mysql-full-text-searching/

Maybe with full-text it would be possible to allow for LIKE '$name%' to be fast (or some kind of matching thing) but I like having LIKE '%$name%' as it really makes things easy. I search for "tma" and I'll still get Batman.

I don't think LIKE '$name%' will work when searching for someone's last name and then getting the full name from it. (Name stores their full name, I just frequently list first names only)

I'm thinking of the cases where people forget a space in firstname lastname (thus, firstnamelastname) and just making it as sure as possible that I get as many matches as I can.
 

beginner99

Diamond Member
Jun 2, 2009
5,312
1,749
136
Maybe with full-text it would be possible to allow for LIKE '$name%' to be fast (or some kind of matching thing) but I like having LIKE '%$name%' as it really makes things easy. I search for "tma" and I'll still get Batman.

That's what I meant about it being bad for users. It's counter-intuitive and confusing.

I don't think LIKE '$name%' will work when searching for someone's last name and then getting the full name from it. (Name stores their full name, I just frequently list first names only)

I'm thinking of the cases where people forget a space in firstname lastname (thus, firstnamelastname) and just making it as sure as possible that I get as many matches as I can.

Make separate fields for first and last name. Then the space thing is a non-issue.

The search term can then be applied on both fields. eg.

Code:
WHERE first_name LIKE 'name%' OR last_name LIKE 'name%'

So this is the typical case: when your query gets complex and slow, you should question your data model, and storing first and last name in the same field is exactly a data-model issue.

Why would you ever search for 'ark' when you are looking for 'Mark', or 'tma' when searching for 'Batman'? It doesn't make much sense to me.
 

TridenT

Lifer
Sep 4, 2006
16,800
45
91
That's what I meant about it being bad for users. it's counter-intuitive and confusing.



Make separate fields for first and last name. Then the space thing is a non-issue.

The search term can then be applied on both fields. eg.

Code:
WHERE first_name LIKE 'name%' OR last_name LIKE 'name%'
So this is the typical case: when your query gets complex and slow, you should question your data model, and storing first and last name in the same field is exactly a data-model issue.

Why would you ever search for 'ark' when you are looking for 'Mark', or 'tma' when searching for 'Batman'? It doesn't make much sense to me.

Some people have spaces in their names or hyphens or whatever. Really confusing names where maybe they have two names in their first and two names in their last. (I know people like this...) It makes it difficult to search for them when they try to have you search for the easiest part of their name. (Which happens to be the last name of their last name or the last name of their first, etc.) Sometimes they have 4 or 5 separate words for their name and only one of them is easy to find when they speak it. For instance, it's easy to find Mark but if they tell you their name is Xyng Iepx Mark Zyukturith then you're gonna have a bad day doing your method. Especially if the pronunciation isn't typical English and the spelling isn't English (But uses English characters anyway). Silent letters, ones that don't actually make that sound but people spell it with them, and all that shit. You might think that's really uncommon but it only takes once for it to be a time sink and make the interaction terrible. It left a bad taste in my mouth when I had to deal with LIKE '$name%' before.

A version of contains($needle, $haystack) is really the only bulletproof way to go on this, and LIKE '%$name%' is that.

The only other option (still a sacrifice, though) would be something like full-text, if full-text can do the equivalent of LIKE '$name%' on each separate word contained in one column value. But hyphens in names will probably cause issues if someone searches for the second part of a hyphenated name.
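If per-word prefix matching would be an acceptable compromise, one hedged middle ground is a helper table of individual name words populated at insert time, so each word gets an index-friendly prefix LIKE. It finds 'Mark' anywhere inside a long multi-word name, though it still won't find interior fragments like 'tma' in 'Batman'. A sketch in Python's sqlite3 with invented table and column names (MySQL's FULLTEXT index in BOOLEAN MODE with 'mark*' gives a similar per-word prefix effect):

```python
import sqlite3

# Helper table of name words: each word of the full name is stored
# separately so a prefix LIKE (which can use an index) matches any word.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE name_words (word TEXT, customer_id INTEGER);
CREATE INDEX idx_word ON name_words(word);
""")

def add_customer(cid, name):
    conn.execute("INSERT INTO customers VALUES (?, ?)", (cid, name))
    for word in name.lower().split():
        conn.execute("INSERT INTO name_words VALUES (?, ?)", (word, cid))

add_customer(1, "Xyng Iepx Mark Zyukturith")
add_customer(2, "Bruce Wayne")

hits = conn.execute("""
    SELECT DISTINCT c.name
    FROM name_words w JOIN customers c ON c.id = w.customer_id
    WHERE w.word LIKE 'mar%'
""").fetchall()
print(hits)  # finds the long multi-word name by its one easy word
```

Hyphen handling could be folded in by also splitting on '-' when populating name_words, which addresses the hyphenated-name concern at insert time rather than search time.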
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
And again, test the queries and their parts.

See if count() is taking a ton of time, or pattern matching, or joining.

Then, do something.

If pattern matching takes forever, for instance, then whether you split up the names or not (if you get them as first and last separately, definitely don't combine them!), you may want a view of names only, to search over more quickly (or even multiple views, like matches for substrings).
 
Last edited:

uclabachelor

Senior member
Nov 9, 2009
448
0
71
Okay.

First table (left) is the checkins table. Second table (right) is the customers table. These are some results out of the very many. Also, adding an index on varchar columns won't speed up SELECT statements when LIKE '%thing%' is used. That kind of LIKE doesn't use an index.

You're right... MySQL won't use the index to seek on a substring LIKE search, but the query does reference the name field in both the SELECT list and the ORDER BY clause, and those parts can still benefit from the index.
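That last point can be illustrated with a covering index: even though LIKE '%...%' can't seek into the index, an index containing every column the query touches can be scanned instead of the wider table, and its order already satisfies ORDER BY name. A sketch using SQLite's planner (index and column names are invented; MySQL reports the equivalent as "Using index" in EXPLAIN's Extra column):

```python
import sqlite3

# Covering-index demo: the leading wildcard rules out an index seek, but
# the planner can still scan the (narrower, pre-sorted) covering index
# rather than the table, avoiding a separate sort for ORDER BY name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT);
CREATE INDEX idx_name_email ON customers(name, email);
""")
detail = " ".join(row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT name, email FROM customers "
    "WHERE name LIKE '%mar%' ORDER BY name"
))
print(detail)  # ... USING COVERING INDEX idx_name_email
```

It's still a full scan of the index, so it won't rescue a substring search on a large table, but it reads less data and skips the sort step.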