As the title suggests, I’d like to select the first row of each set of rows grouped with a
Specifically, if I’ve got a
purchases table that looks like this:
SELECT * FROM purchases;
I’d like to query for the
id of the largest purchase (
total) made by each
customer. Something like this:
SELECT FIRST(id), customer, FIRST(total)
GROUP BY customer
ORDER BY total DESC;
DISTINCT ON is typically simplest and fastest for this in PostgreSQL.
(For performance optimization for certain workloads see below.)
SELECT DISTINCT ON (customer)
id, customer, total
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
id, customer, total
ORDER BY 2, 3 DESC, 1;
total can be NULL (won’t hurt either way, but you’ll want to match existing indexes):
ORDER BY customer, total DESC NULLS LAST, id;
DISTINCT ON is a PostgreSQL extension of the standard (where only
DISTINCT on the whole
SELECT list is defined).
List any number of expressions in the
DISTINCT ON clause, the combined row value defines duplicates. The manual:
Obviously, two rows are considered distinct if they differ in at least
one column value. Null values are considered equal in this
Bold emphasis mine.
DISTINCT ON can be combined with
ORDER BY. Leading expressions in
ORDER BY must be in the set of expressions in
DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to
ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:
DISTINCT ONexpression(s) must match the leftmost
ORDER BYclause will normally contain additional
expression(s) that determine the desired precedence of rows within
id as last item to break ties:
“Pick the row with the smallest
id from each group sharing the highest
To order results in a way that disagrees with the sort order determining the first per group, you can nest above query in an outer query with another
ORDER BY. Example.
total can be NULL, you most probably want the row with the greatest non-null value. Add
NULLS LAST like demonstrated. See:
- Sort by column ASC, but NULL values first?
SELECT list is not constrained by expressions in
DISTINCT ON or
ORDER BY in any way. (Not needed in the simple case above):
You don’t have to include any of the expressions in
You can include any other expression in the
SELECTlist. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 – 13. But the feature has been there at least since version 7.1, so basically always.
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized. But use it if read performance for the particular query is crucial. If you have
DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.
For few rows per customer (high cardinality in column
customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough
work_mem to process the involved sort step in RAM and not spill to disk. But generally setting
work_mem too high can have adverse effects. Consider
SET LOCAL for exceptionally big queries. Find how much you need with
EXPLAIN ANALYZE. Mention of “Disk:” in the sort step indicates the need for more:
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text
For many rows per customer (low cardinality in column
customer), a loose index scan (a.k.a. “skip scan”) would be (much) more efficient, but that’s not implemented up to Postgres 14. (An implementation for index-only scans is in development for Postgres 15. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don’t:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Optimize GROUP BY query to retrieve latest row per user
- Optimize groupwise maximum query
- Query last N related rows per row
See separate answer.