Monday, March 03, 2008

DISTINCT? Don't be lazy!

Sometimes, the easy solution is not the best one. I saw this problem happening several times. The query returns duplicates, and the first reaction of the developer is to fix it with DISTINCT.

Let's look at an example. Given the data below:

select * from people;
+-----------+-------+
| person_id | name |
+-----------+-------+
| 1 | Joe |
| 2 | Mary |
| 3 | Frank |
+-----------+-------+
3 rows in set (0.00 sec)

select * from projects;
+------------+-------------+-----------+
| project_id | descr | person_id |
+------------+-------------+-----------+
| 1 | Joe First | 1 |
| 2 | Joe second | 1 |
| 3 | Mary First | 2 |
| 4 | Mary second | 2 |
| 5 | Frank first | 3 |
+------------+-------------+-----------+
5 rows in set (0.00 sec)

select * from jobs;
+--------+-----------+-----------+------------+
| job_id | job_descr | person_id | project_id |
+--------+-----------+-----------+------------+
| 1 | joe aaa | 1 | 1 |
| 2 | joe bbb | 1 | 1 |
| 3 | joe ccc | 1 | 2 |
| 4 | Mary aaa | 2 | 3 |
| 5 | Mary bbb | 2 | 3 |
| 6 | Mary ccc | 2 | 3 |
| 7 | Mary ddd | 2 | 4 |
| 8 | Frank aaa | 3 | 5 |
| 9 | Frank bbb | 3 | 5 |
+--------+-----------+-----------+------------+
9 rows in set (0.01 sec)

The problem comes with this query:

SELECT p.name, COUNT(j.job_id) AS total , job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.person_id = j.person_id
GROUP BY p.person_id ORDER BY total DESC,p.name;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 8 | Mary aaa |
| Joe | 6 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+

As you can easily see, the query reports twice the amount of jobs for Mary and Joe. The lazy solution is this

SELECT p.name, COUNT(DISTINCT j.job_id) AS total , job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.person_id = j.person_id
GROUP BY p.person_id ORDER BY total DESC,p.name;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 4 | Mary aaa |
| Joe | 3 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+
However, this query does not tackle the real problem, which is that the query is joining two tables (projects and jobs) using a non-primary key column. And this "solution" also ignores the even more serious problem that the person_id is redundant, and should not be in the jobs table in the first place. Joining with a pair of primary/foreign key is the right cure:
SELECT p.name, COUNT(j.job_id) AS total, job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.project_id = j.project_id
GROUP BY p.person_id ORDER BY total DESC,p.name ;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 4 | Mary aaa |
| Joe | 3 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+
The result is the same, but if you apply these queries on a couple of heavily populated tables, the first lazy query can be 5 times slower than the second one. The reason is simple: since the join was done on a non primary key column, the query performs a Cartesian product of projects and jobs, followed by a costly sort to remove the duplicates. The second query, instead, filters off the duplicates efficiently on the first step, thus delivering the wanted result faster.

4 comments:

Anonymous said...

Since you use a picture of a koala in the context of "lazy", I should note that koalas are not actually lazy.
Their sole food (eucalyptus leaves) from which they extract both their regular nutrition and water needs, does not provide them with much energy. This is why they sleep most of the time, and feel no particular need to run around.

And since there will be Americans reading this, I should also note that koalas are NOT bears. They are no relation whatsoever.
This contrary to the dropbear, which also roams the Australian trees; one should definitely watch out for those.

Giuseppe Maxia said...

Arjen,
Thanks for the reminder. In my travel to Australia I have seen the explanation of koalas non-laziness several times. So I know that they are not ethically lazy, but they are disinclined to work or exertion for lack of energy to spend.
So, no offense was intended.
It's just that the picture is so cute!

Cheers

Giuseppe

rpbouman said...

Hi Giuseppe,

good post!

When I learned SQL, our teacher taught us to distrust DISTINCT. Nine out of ten times, DISTINCT is a tell-tale for an underlying problem just like you describe here. And if you really do need to report distinct occurrences out of a list of duplicates, you are either cleansing rotten data or you are doing an aggregate query (in which case you should use GROUP BY).

So DISTINCT really has no business in a normal sensible query.

Personally, I feel about the same about UNION. I have never had to write UNION - I always write UNION ALL.

izenmania said...

Nobody taught me to distrust DISTINCT, unfortunately. I learned it myself, though, a couple weeks ago, when I rewrote a query to remove the DISTINCT and it went from 35 seconds to 0.01 seconds.