3 min read

Indexing Strategies for Query Optimization

A database index is a data structure that speeds up data retrieval. Without an index, the database must scan every row in a table to find the relevant rows — a full table scan. With an index, the database can jump directly to the relevant pages.

The analogy is a book. Reading every page to find a specific topic is impractical. An index at the back tells you exactly which page to turn to. Database indexes work the same way.

Key sources: PostgreSQL documentation on indexing, "Database Internals" by Alex Petrov, "High-Performance MySQL" by Baron Schwartz et al., and "SQL Performance Explained" by Markus Winand.


The Cost of a Full Table Scan

When a table has no useful index, the database reads every row sequentially. For a table with 10 million rows, that means 10 million reads per query.

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Seq Scan on orders  (cost=0.00..184.73 rows=1 width=100)
--   Filter: (customer_id = 42)
--   Planning Time: 0.1 ms
--   Execution Time: 45.2 ms

Forty-five milliseconds for a single lookup might seem fast. But at 1,000 queries per second, that's 45 seconds of CPU time per second. The database becomes CPU-bound very quickly.

An index eliminates the full scan:

CREATE INDEX idx_orders_customer ON orders(customer_id);

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Index Scan using idx_orders_customer  (cost=0.29..8.30 rows=1 width=100)
--   Index Cond: (customer_id = 42)
--   Execution Time: 0.03 ms

From 45ms to 0.03ms — a 1,500x improvement.


B-Tree: The Default Index

Most databases default to a B-tree (balanced tree) index. It organizes data in a sorted, tree-like structure where finding any value requires O(log n) lookups. A table with 10 million rows needs about 24 comparisons to find any row via a B-tree index.

B-tree indexes support: - Equality lookups: WHERE id = 42 - Range queries: WHERE date > '2025-01-01' - Sorting: ORDER BY date (reads straight from the index, avoiding a sort) - Prefix matching: WHERE name LIKE 'Al%'


Composite Indexes

An index on a single column is straightforward. But queries often filter on multiple columns:

SELECT * FROM orders
WHERE customer_id = 42
  AND status = 'shipped'
  AND date > '2025-01-01';

A composite index on (customer_id, status, date) can satisfy all three conditions with a single index lookup.

Column order matters. Place the most selective column first — the one that eliminates the most rows. If customer_id narrows results to 0.1% of rows, it should be first. If status only narrows to 50%, it should come later.

-- Good: customer_id is highly selective
CREATE INDEX idx_orders_cust_status_date
ON orders(customer_id, status, date);

-- Less effective: status first doesn't narrow much
CREATE INDEX idx_orders_status_cust_date
ON orders(status, customer_id, date);

Indexing for ORDER BY and GROUP BY

Indexes store data in sorted order. This means the database can read the index sequentially instead of sorting the result set.

-- Without index: database loads all matching rows, then sorts
SELECT * FROM orders
WHERE customer_id = 42
ORDER BY date DESC;

-- With (customer_id, date) index: reads index backwards, no sort needed
CREATE INDEX idx_orders_cust_date ON orders(customer_id, date DESC);

Similarly, GROUP BY benefits from an index on the grouping column because the database can aggregate rows as it traverses the sorted index.


Partial Indexes

Sometimes you only need to index a subset of rows. A partial index adds a WHERE clause to the index definition, making it smaller and faster.

-- Only index active orders (95% of lookups are for active orders)
CREATE INDEX idx_active_orders ON orders(status, date)
WHERE status = 'active';

The index is tiny compared to a full-table index. Queries filtering on status = 'active' use this index. Queries for status = 'archived' do a full scan, which is acceptable because archived records are rarely queried.


Covering Indexes

A covering index contains all the columns a query needs. The database can answer the query entirely from the index, never touching the table itself.

-- Query needs only customer_id and total
SELECT customer_id, SUM(total)
FROM orders
WHERE date > '2025-01-01'
GROUP BY customer_id;

-- Covering index includes all needed columns
CREATE INDEX idx_covering ON orders(date, customer_id, total);

The EXPLAIN output will show "Index Only Scan" — the fastest possible read path.


Common Pitfalls

Too many indexes. Every index adds overhead on INSERT, UPDATE, and DELETE. Each write must update every relevant index. A table with 10 indexes is 10x slower to write to.

Indexing low-cardinality columns. A boolean column (true/false) splits data into two groups. An index doesn't help much — the database still reads half the table.

Over-indexing for a single query. If one slow query runs once per day, creating 3 new indexes for it might hurt every other write operation. Profile first, index second.

Ignoring index maintenance. B-tree indexes can become bloated over time. Regular REINDEX or VACUUM (PostgreSQL), OPTIMIZE TABLE (MySQL), or index rebuilds keep performance stable.


Key Takeaways

  1. Indexes turn full table scans into quick lookups but add write overhead.
  2. B-tree is the default — good for equality, range, and sort operations.
  3. Composite indexes should order columns by selectivity (most selective first).
  4. Partial and covering indexes optimize specific query patterns without full-table overhead.
  5. Don't over-index — measure query performance first, then add indexes for identified slow queries.

The best indexing strategy is data-driven: profile your slow queries, understand your access patterns, and index accordingly. Blindly indexing every column creates more problems than it solves.