GROUP BY and HAVING

8 min read

Aggregate functions like COUNT(*) and SUM(total) give you one number across a whole table. The next step up — and one of the most powerful patterns in SQL — is computing aggregates per group of rows: orders per user, products per category, bugs per reporter, duplicate emails per address. That's what GROUP BY does, and HAVING lets you filter those groups. Together they answer almost every "how many X per Y?" question testers and PMs ask.

What GROUP BY actually does

GROUP BY collapses rows that share a value in the named column(s) into a single output row. Aggregates (COUNT, SUM, AVG, MAX, MIN) are then applied separately to each group.

-- How many orders does each user have?
SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id;

Result:

| user_id | order_count |
|---------|-------------|
| 1       | 1           |
| 2       | 2           |
| 4       | 1           |
| 5       | 1           |

Five orders in orders, but the result has only four rows — Bob (user 2) has two orders, which collapsed into one row with order_count = 2. That's the whole idea: groups of rows become single summary rows.

How GROUP BY runs, in stages

Step 1 of 5

FROM + WHERE

Start with all rows from the table, then filter individual rows with WHERE.

The order matters: WHERE filters individual rows before grouping; HAVING filters after. If you mix them up the database refuses your query — but the error messages aren't always obvious, so it pays to know the order cold.

GROUP BY with JOINs — the readable version

Showing user_id = 1 is awkward. Joining users makes the result human-readable:

SELECT
    u.name,
    COUNT(o.id)      AS order_count,
    SUM(o.total)     AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

Result:

| name        | order_count | total_spent |
|-------------|-------------|-------------|
| Alice Khan  | 1           | 145.00      |
| Bob Patel   | 2           |  368.50     |
| Carol Singh | 0           | NULL        |
| Dan Müller  | 1           |  63.00      |
| Esha Roy    | 1           |  25.00      |

Two details to call out:

  • LEFT JOIN keeps Carol in the result with order_count = 0. With INNER JOIN she'd vanish — and a missing user is exactly the wrong answer to "how many orders does each user have?".
  • COUNT(o.id) vs COUNT(*). COUNT(o.id) counts rows where o.id is non-NULL — so Carol's (NULL-padded) row contributes 0. COUNT(*) would have counted Carol's row as 1. The difference matters; pick the one that says what you mean.

GROUP BY multiple columns

You can group by more than one column. Each unique combination becomes a group:

-- Orders per user per status
SELECT u.name, o.status, COUNT(*) AS count
FROM orders o
INNER JOIN users u ON o.user_id = u.id
GROUP BY u.id, u.name, o.status
ORDER BY u.name, o.status;

Result:

| name        | status     | count |
|-------------|------------|-------|
| Alice Khan  | completed  | 1     |
| Bob Patel   | completed  | 1     |
| Bob Patel   | shipped    | 1     |
| Dan Müller  | pending    | 1     |
| Esha Roy    | cancelled  | 1     |

Bob has two distinct (name, status) pairs and so gets two rows. That kind of breakdown is exactly what dashboards are built from.

HAVING — filter the groups

HAVING is to grouped results what WHERE is to individual rows: a filter that keeps some, drops others.

-- Users who have placed more than 1 order
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name
HAVING COUNT(o.id) > 1;

Result:

| name      | order_count |
|-----------|-------------|
| Bob Patel | 2           |

HAVING COUNT(o.id) > 1 works because the count exists after the GROUP BY. You can't put that condition in WHERE — at WHERE time, no count exists yet. Try it and the database tells you off:

-- WRONG — COUNT(*) doesn't exist at WHERE time
SELECT user_id, COUNT(*) FROM orders
WHERE COUNT(*) > 1
GROUP BY user_id;
-- ERROR: aggregate functions are not allowed in WHERE

WHERE and HAVING in the same query

You can use both. WHERE filters rows before grouping; HAVING filters after. A practical pattern: drop irrelevant rows first (cheap), then aggregate, then filter the aggregates:

-- Users with 2 or more *completed* orders
SELECT u.name, COUNT(*) AS completed_orders
FROM orders o
INNER JOIN users u ON o.user_id = u.id
WHERE o.status = 'completed'    -- filter rows first
GROUP BY u.id, u.name
HAVING COUNT(*) >= 2;            -- then filter groups

Result (in our seed data):

(no rows — nobody has 2+ completed orders yet)

Empty result, but the query is structurally correct. As soon as the data has someone with two completed orders, they'll appear.

QA use cases for GROUP BY + HAVING

Three patterns that pay for themselves:

-- 1. Find duplicate emails — should never be > 1 if the column is UNIQUE
SELECT email, COUNT(*) AS dupes
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
 
-- 2. Products per category — verify data distribution
-- (Are any categories empty? Are any oddly heavy?)
SELECT c.name, COUNT(p.id) AS product_count
FROM categories c
LEFT JOIN products p ON p.category_id = c.id
GROUP BY c.id, c.name
ORDER BY product_count DESC;
 
-- 3. Users with multiple cancelled orders — possible bug or fraud signal
SELECT u.name, COUNT(*) AS cancelled_count
FROM orders o
INNER JOIN users u ON o.user_id = u.id
WHERE o.status = 'cancelled'
GROUP BY u.id, u.name
HAVING COUNT(*) > 1;

Query #1 is your unique-constraint bug detector. If a column is supposed to be unique and any row comes back with dupes > 1, the constraint is broken (or doesn't exist) — a real, reportable defect.

A small rule about non-aggregated columns

When you GROUP BY, every column in the SELECT list must be either:

  • One of the columns you grouped by, or
  • Inside an aggregate function (COUNT, SUM, AVG, …).

The reason is logical: each group becomes one row, so a column like users.email that varies within a group has no single value to show. Some databases (notably MySQL with default settings) silently pick some value and move on, which leads to confusing bugs. The portable fix is always: add the column to the GROUP BY, or wrap it in an aggregate.

⚠️ Common Mistakes

  • HAVING without GROUP BY. Some databases tolerate it (treating the whole table as one group), but it's confusing. If you want to filter aggregates over the whole table, use a subquery or two queries — make the intent obvious.
  • Selecting columns that aren't in GROUP BY or an aggregate. A subtle bug under MySQL's lax mode — the column shows some value but not necessarily the one you'd expect. Always pin every selected column down: group by it or aggregate it.
  • Using COUNT(*) after a LEFT JOIN when you meant non-null only. COUNT(*) counts NULL-padded rows too. For "orders per user including users with zero orders," COUNT(o.id) gives 0 for unmatched users; COUNT(*) gives 1 (counting the NULL-padded row).

🎯 Practice Task

30 minutes. Run all queries against the e-commerce database.

  1. Count orders per user, including users with zero orders. (LEFT JOIN, GROUP BY, COUNT(o.id).)
  2. Total spend per user across completed orders. Sort largest first.
  3. Number of products in each category. Show the category name, not the id.
  4. Find any duplicate emails — if any row comes back, the unique constraint is broken.
  5. Find users who have at least one cancelled order. (Use HAVING COUNT(*) >= 1.) Compare the result to a query that uses WHERE status = 'cancelled' directly with no grouping — what's different?
  6. Average order total per user. Use AVG inside a GROUP BY. Sort by avg descending — the user whose orders are biggest on average is at the top.
  7. Stretch: show every product with the number of times it has been ordered (0 for products that have never sold). LEFT JOIN order_items onto products, GROUP BY product, COUNT the order_items rows. The zero-row results are products you may want to flag for the merch team.

That wraps up Chapter 3. You can now combine multiple tables and roll the result up into useful summaries — the shape of every operational SQL query teams write. Chapter 4 changes the question from reading data to changing data: INSERT, UPDATE, DELETE, and the transaction safety net that keeps you from disasters.

// tip to track lessons you complete and pick up where you left off across devices.