I have two data nodes and found out a strange behaviour:
I have a table raw_data with a column user_id. raw_data has approx. 300.000.000 rows with 417.000 different user_id's.
Now when I run similar queries with very different results. There is a key on "date" which results in 11.900.000 rows for the selection below:
1. Only the raw_data table:
select user_id, count(*)
from raw_data
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100;
Execution time is approx. 15 seconds.
2. Now I want to join the user table in that query:
select user.name, user_id, count(*)
from raw_data rd inner join user u on rd.user_id = u.id
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100;
Execution time is more then 20 minutes.
This is the output of the explain statement:
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
| 1 | SIMPLE | i | range | search,date,user_id_id | date | 3 | NULL | 11905284 | Using where with pushed condition; Using MRR; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | report.d.user_id | 1 | Using where |
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
3. I used a subselect from try number 1 to join user table and this was as fast as try number 1:
select u.name, user_id, hits
from (select user_id, count(*) as hits
from raw_data
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100 ) t1
inner join user u on u.id = t1.user_id;
Execution time is the same as in query 1, approx. 15 seconds.
So why is the join that much slower?
I have a table raw_data with a column user_id. raw_data has approx. 300.000.000 rows with 417.000 different user_id's.
Now when I run similar queries with very different results. There is a key on "date" which results in 11.900.000 rows for the selection below:
1. Only the raw_data table:
select user_id, count(*)
from raw_data
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100;
Execution time is approx. 15 seconds.
2. Now I want to join the user table in that query:
select user.name, user_id, count(*)
from raw_data rd inner join user u on rd.user_id = u.id
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100;
Execution time is more then 20 minutes.
This is the output of the explain statement:
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
| 1 | SIMPLE | i | range | search,date,user_id_id | date | 3 | NULL | 11905284 | Using where with pushed condition; Using MRR; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | report.d.user_id | 1 | Using where |
+----+-------------+-------+--------+---------------------------+---------+---------+------------------------+----------+-------------------------------------------------------------------------------+
3. I used a subselect from try number 1 to join user table and this was as fast as try number 1:
select u.name, user_id, hits
from (select user_id, count(*) as hits
from raw_data
where date between '2014-03-01' and '2014-03-05'
group by user_id
order by count(*)
limit 100 ) t1
inner join user u on u.id = t1.user_id;
Execution time is the same as in query 1, approx. 15 seconds.
So why is the join that much slower?