MySQL query optimization

_kabal_

Executive Member
Joined
Oct 24, 2005
Messages
5,922
I have a MySQL query that I am trying to optimize. thought it would be a good time to at least learn some sort of decent query optimization.....

PHP:
cre8 tbl foo_event
(
    event_name VARCAR(20) NOT NULL,
    event_date DATETIME NOT NULL,
    app_id VARCAR(100),
    user_id BIGINT(20),
    username VARCAR(100),
    first_name VARCAR(100),
    last_name VARCAR(100),
    email VARCAR(100),
    asset_identifier VARCAR(100),
    asset_title VARCAR(255)
); TYPE INNODB


This table currently has 2 million rows.
We will most likely archive data older than a month, but monthly data would still be +4-5 million rows

"event_name" has 5 distinct values ("open", "close", "download", etc), and shouldnt grow much more.
"asset_identifier" could have 10s of thousands of distinct values. 99% of these are product EANs, eg "9780636188617" (not guaranteed to be an number).

This is a typical query, Top 10 Assets by Event in the last 7 days
PHP:
SEL
          event_name,
          asset_identifier,
          asset_title,
          count(*) AS `eventCount`
        FROM `foo_event`
        WHERE (event_name = 'close' AND
               event_date BETWEEN '2017-03-07 00:00:00.0' AND '2017-03-13 23:59:59.999')
        GROUP BY asset_identifier
        ORDER BY eventCount DESC
        LIMIT 10;

I have created an index as follows:

PHP:
CREATE INDEX foo_event_asset_by_event_date_range_index ON foo_event (event_name, event_date, asset_identifier);

The `EXPLAIN` on the above outputs the following:

PHP:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, foo_event, range, foo_event_name_index,foo_event_event_date_index,foo_event_asset_by_event_date_range_index,foo_event_active_users_date_range_index, foo_event_asset_by_event_date_range_index, 27, NULL, 102228, Using index condition; Using temporary; Using filesort




.

Read performance is all I care about. I could change the DDL of the table, but I dont really want to unless it will improve performance in a significant way.
 
Last edited:

Hamster

Resident Rodent
Joined
Aug 22, 2006
Messages
42,920
I'd drop that composite index and create an index on asset_identifier alone if 99% of the time it'll be unique.

EDIT: Reading that again, maybe an index on the date column. IMO there's no point in making the name part of an index but I might be completely wrong. Would like to see somebody else's opinion.
 
Last edited:

semaphore

Honorary Master
Joined
Nov 13, 2007
Messages
15,194
I'd drop that composite index and create an index on asset_identifier alone if 99% of the time it'll be unique.

EDIT: Reading that again, maybe an index on the date column. IMO there's no point in making the name part of an index but I might be completely wrong. Would like to see somebody else's opinion.

Indexing variable length strings is never a good idea, but I guess it also depends on how big this table may get.
 

_kabal_

Executive Member
Joined
Oct 24, 2005
Messages
5,922
I'd drop that composite index and create an index on asset_identifier alone if 99% of the time it'll be unique.

EDIT: Reading that again, maybe an index on the date column. IMO there's no point in making the name part of an index but I might be completely wrong. Would like to see somebody else's opinion.

the "asset_identifier" is not that unique

there will most likely be about 10000 "active" unique asset_identifiers raising 2 million events a month. Ff those 10000 asset_identifiers , around 1000 are responsible for 1 million of those events
 
Last edited:

_kabal_

Executive Member
Joined
Oct 24, 2005
Messages
5,922
the table DDL I posted above omits some fields for brevity, but it actually has an "asset_id BIGINT(20)" field, which is the pkey from the source data.

interestingly, if I change my query to

PHP:
SEL
          event_name,
          asset_identifier,
          asset_title,
          count(*) AS `eventCount`
        FROM `foo_event`
        WHERE (event_name = 'close' AND
               event_date BETWEEN '2017-03-07 00:00:00.0' AND '2017-03-13 23:59:59.999')
        GROUP BY asset_id   <---------------- changed from asset_identifer
        ORDER BY eventCount DESC
        LIMIT 10;

the EXPLAIN result is identical, while the actual query time is basically identical too.

I then changed the index to

PHP:
CREATE INDEX foo_event_asset_by_event_date_range_index ON foo_event (event_name, event_date, asset_id);

and again the EXPLAIN and actual query time for both query variants was identical.


so this seems to me that even though it says it is using the index, it is not really using the right most part of it
 

GoB

Expert Member
Joined
Jan 7, 2008
Messages
1,578
I was going to say that it you won't need asset id's in the index, but then realized that should help in the grouping.
Unless the grouping is done otherwise like a hash table.
Maybe try something like the USE INDEX FOR GROUP BY hint and see what it does? (I'm not really familiar with MySql)
 

IndigoIdentity

Expert Member
Joined
May 10, 2010
Messages
1,964
Just wondering but does asset_identifier even belong in the index? The index will actually be grouped by that?
 

IndigoIdentity

Expert Member
Joined
May 10, 2010
Messages
1,964
Also think that indexing the date column is fine unless you're doing "LIKE %restofdate" in which case it is useless as MySQL will compute this every time.
 

IndigoIdentity

Expert Member
Joined
May 10, 2010
Messages
1,964
Could say too, if you want to optimise for reads then ensure that your data types are at their most efficient? Seems like alot of long VARCHAR in there.

You should check out Aurora in AWS too, that would speed the reads up ALOT :)
 

Hamster

Resident Rodent
Joined
Aug 22, 2006
Messages
42,920
the "asset_identifier" is not that unique

there will most likely be about 10000 "active" unique asset_identifiers raising 2 million events a month. Ff those 10000 asset_identifiers , around 1000 are responsible for 1 million of those events

Yeah, I misread the 99% part :whistling: ..hence the edit :D
 

Hamster

Resident Rodent
Joined
Aug 22, 2006
Messages
42,920
PHP:
SEL
          event_name,
          asset_identifier,
          asset_title,
          count(*) AS `eventCount`
        FROM `foo_event`
        WHERE (event_name = 'close' AND
               event_date BETWEEN '2017-03-07 00:00:00.0' AND '2017-03-13 23:59:59.999')
        GROUP BY asset_id   <---------------- changed from asset_identifer
        ORDER BY eventCount DESC
        LIMIT 10;

Sidenote: does this query actually run? I haven't really worked with MySQL that much before so maybe it does things differently but when you do a select of fields and a function all the other fields should generally be in the group by, no? So shouldn't that group by be
Code:
group by event_name, asset_identifier, asset_title
 

prod

Executive Member
Joined
Nov 11, 2009
Messages
6,132
Sidenote: does this query actually run? I haven't really worked with MySQL that much before so maybe it does things differently but when you do a select of fields and a function all the other fields should generally be in the group by, no? So shouldn't that group by be
Code:
group by event_name, asset_identifier, asset_title

ghetto SQL dun need no rulz bra

updated with ghetto SQL

By the way, does MySQL not have filtered indexes like MSSQL? Seeing as you might archive old data and mostly use current?
 

Necropolis

Executive Member
Joined
Feb 26, 2007
Messages
8,401
Sidenote: does this query actually run? I haven't really worked with MySQL that much before so maybe it does things differently but when you do a select of fields and a function all the other fields should generally be in the group by, no? So shouldn't that group by be
Code:
group by event_name, asset_identifier, asset_title

You would be correct - it works that way on any RDBMS.
 

_kabal_

Executive Member
Joined
Oct 24, 2005
Messages
5,922
nope, that query is valid, at least in mariadb 10.1 or mysql 5.7


the query however gets much faster (100ms, vs 1sec) if you remove the "ungrouped" fields.
 
Last edited:

Hamster

Resident Rodent
Joined
Aug 22, 2006
Messages
42,920
Leave it to the open source community to fsck with standards
 

_kabal_

Executive Member
Joined
Oct 24, 2005
Messages
5,922
If anyone is interested, I managed to drop execution time to about 120ms, while still getting the same results, using a INNER JOIN (excuse the broken keywords, cloudflare thinks I am doing SQL injection....)

Code:
SEL
  count_table.eventCount,
  foo_event.asset_identifier,
  foo_event.app_id
FROM
  (
               SEL
                 id,
                 asset_id,
                 count(1) AS `eventCount
               FRM foo_event
               WHRE (1 = 1 AND `event_name` = 'download' AMD
                      `event_date` BETWEEN '2017-02-09 00:00:00.0' AND '2017-03-10 23:59:59.999')
               GROP BI `asset_id`
               ORDR BI eventCount DSC
               LIMT 10) AS count_table
INER JION foo_event ON (count_table.id = foo_event.id);


where previously the ungroup select fields gave quite a performance hit, returning the id in the subselect does not seem to change anything, and then obviously being able to join on the PKEY is quite performent too
 
Last edited:

gkm

Expert Member
Joined
May 10, 2005
Messages
1,519
Top