2016-02-10

Simpler Way to find rows with Largest/Smallest value for each group in SQL

Sometimes we need to find largest or smallest value for each column and also associated column. Recently I found out that DISTINCT ON is quite good, than my previous way (LEFT JOIN less IS NULL, or WHERE IN SELECT MAX/MIN), see this example:

CREATE TABLE test1 ( id bigserial primary key, val int, name text );

INSERT INTO test1(val,name) VALUES(6,'a'),(9,'a'),(5,'b'),(11,'b'),(2,'c'),(8,'c');

SELECT * FROM test1;

 id | val | name 
----+-----+------
  1 |   6 | a
  2 |   9 | a
  3 |   5 | b
  4 |  11 | b
  5 |   2 | c
  6 |   8 | c

To find which id for each name that has greatest value, we usually do something like this (WHERE IN SELECT MAX):

SELECT *
FROM test1 x1
WHERE (name, val) IN (
  SELECT name, MAX(val) 
  FROM test1
  GROUP BY 1
)
ORDER BY x1.name;


 id | val | name
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Sort  (cost=73.95..74.64 rows=275 width=44)
   Sort Key: test1.name
   ->  Hash Join  (cost=33.50..62.81 rows=275 width=44)
         Hash Cond: ((test1.name = test1_1.name) AND (test1.val = (max(test1_1.val))))
         ->  Seq Scan on test1  (cost=0.00..21.00 rows=1100 width=44)
         ->  Hash  (cost=30.50..30.50 rows=200 width=36)
               ->  HashAggregate  (cost=26.50..28.50 rows=200 width=36)
                     Group Key: test1_1.name
                     ->  Seq Scan on test1 test1_1  (cost=0.00..21.00 rows=1100 width=36)

Or this (LEFT JOIN less IS NULL):

SELECT x1.*
FROM test1 x1
  LEFT JOIN test1 x2
    ON x1.name = x2.name
    AND x1.val < x2.val
WHERE x2.id IS NULL
GROUP BY x1.id, x1.name
ORDER BY x1.name; 


 id | val | name 
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Group  (cost=264.68..264.75 rows=10 width=44)
   Group Key: x1.name, x1.id
   ->  Sort  (cost=264.68..264.70 rows=10 width=44)
         Sort Key: x1.name, x1.id
         ->  Merge Left Join  (cost=153.14..264.51 rows=10 width=44)
               Merge Cond: (x1.name = x2.name)
               Join Filter: (x1.val < x2.val)
               Filter: (x2.id IS NULL)
               ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
                     Sort Key: x1.name
                     ->  Seq Scan on test1 x1  (cost=0.00..21.00 rows=1100 width=44)
               ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
                     Sort Key: x2.name
                     ->  Seq Scan on test1 x2  (cost=0.00..21.00 rows=1100 width=44)


There some other way, that I found simpler (and faster in real life) also return only one row if there's duplicate value, that is DISTINCT ON!

SELECT DISTINCT ON (x1.name) *
FROM test1 x1
ORDER BY x1.name, x1.val DESC;

 id | val | name 
----+-----+------
  2 |   9 | a
  4 |  11 | b
  6 |   8 | c

 Unique  (cost=76.57..82.07 rows=200 width=44)
   ->  Sort  (cost=76.57..79.32 rows=1100 width=44)
         Sort Key: name, val
         ->  Seq Scan on test1 x1  (cost=0.00..21.00 rows=1100 width=44)

So in this case, we look for highest value, distinct by name. Well, that's all for now.