Indexing Strategies for Query Optimization
A database index is a data structure that speeds up data retrieval. Without an index, the database must scan every row in a table to find the relevant rows — a full table scan. With an index, the database can jump directly to the relevant pages.
The analogy is a book. Reading every page to find a specific topic is impractical. An index at the back tells you exactly which page to turn to. Database indexes work the same way.
Key sources: PostgreSQL documentation on indexing, "Database Internals" by Alex Petrov, "High-Performance MySQL" by Baron Schwartz et al., and "SQL Performance Explained" by Markus Winand.
The Cost of a Full Table Scan
When a table has no useful index, the database reads every row sequentially. For a table with 10 million rows, that means 10 million reads per query.
sql
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Seq Scan on orders (cost=0.00..184.73 rows=1 width=100)
-- Filter: (customer_id = 42)
-- Planning Time: 0.1 ms
-- Execution Time: 45.2 ms
Forty-five milliseconds for a single lookup might seem fast. But at 1,000 queries per second, that's 45 seconds of CPU time per second. The database becomes CPU-bound very quickly.
An index eliminates the full scan:
```sql CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42; -- Index Scan using idx_orders_customer (cost=0.29..8.30 rows=1 width=100) -- Index Cond: (customer_id = 42) -- Execution Time: 0.03 ms ```
From 45ms to 0.03ms — a 1,500x improvement.
B-Tree: The Default Index
Most databases default to a B-tree (balanced tree) index. It organizes data in a sorted, tree-like structure where finding any value requires O(log n) lookups. A table with 10 million rows needs about 24 comparisons to find any row via a B-tree index.
B-tree indexes support:
- Equality lookups: WHERE id = 42
- Range queries: WHERE date > '2025-01-01'
- Sorting: ORDER BY date (reads straight from the index, avoiding a sort)
- Prefix matching: WHERE name LIKE 'Al%'
Composite Indexes
An index on a single column is straightforward. But queries often filter on multiple columns:
sql
SELECT * FROM orders
WHERE customer_id = 42
AND status = 'shipped'
AND date > '2025-01-01';
A composite index on (customer_id, status, date) can satisfy all three conditions with a single index lookup.
Column order matters. Place the most selective column first — the one that eliminates the most rows. If customer_id narrows results to 0.1% of rows, it should be first. If status only narrows to 50%, it should come later.
```sql -- Good: customer_id is highly selective CREATE INDEX idx_orders_cust_status_date ON orders(customer_id, status, date);
-- Less effective: status first doesn't narrow much CREATE INDEX idx_orders_status_cust_date ON orders(status, customer_id, date); ```
Indexing for ORDER BY and GROUP BY
Indexes store data in sorted order. This means the database can read the index sequentially instead of sorting the result set.
```sql -- Without index: database loads all matching rows, then sorts SELECT * FROM orders WHERE customer_id = 42 ORDER BY date DESC;
-- With (customer_id, date) index: reads index backwards, no sort needed CREATE INDEX idx_orders_cust_date ON orders(customer_id, date DESC); ```
Similarly, GROUP BY benefits from an index on the grouping column because the database can aggregate rows as it traverses the sorted index.
Partial Indexes
Sometimes you only need to index a subset of rows. A partial index adds a WHERE clause to the index definition, making it smaller and faster.
sql
-- Only index active orders (95% of lookups are for active orders)
CREATE INDEX idx_active_orders ON orders(status, date)
WHERE status = 'active';
The index is tiny compared to a full-table index. Queries filtering on status = 'active' use this index. Queries for status = 'archived' do a full scan, which is acceptable because archived records are rarely queried.
Covering Indexes
A covering index contains all the columns a query needs. The database can answer the query entirely from the index, never touching the table itself.
```sql -- Query needs only customer_id and total SELECT customer_id, SUM(total) FROM orders WHERE date > '2025-01-01' GROUP BY customer_id;
-- Covering index includes all needed columns CREATE INDEX idx_covering ON orders(date, customer_id, total); ```
The EXPLAIN output will show "Index Only Scan" — the fastest possible read path.
Common Pitfalls
Too many indexes. Every index adds overhead on INSERT, UPDATE, and DELETE. Each write must update every relevant index. A table with 10 indexes is 10x slower to write to.
Indexing low-cardinality columns. A boolean column (true/false) splits data into two groups. An index doesn't help much — the database still reads half the table.
Over-indexing for a single query. If one slow query runs once per day, creating 3 new indexes for it might hurt every other write operation. Profile first, index second.
Ignoring index maintenance. B-tree indexes can become bloated over time. Regular REINDEX or VACUUM (PostgreSQL), OPTIMIZE TABLE (MySQL), or index rebuilds keep performance stable.
Key Takeaways
- Indexes turn full table scans into quick lookups but add write overhead.
- B-tree is the default — good for equality, range, and sort operations.
- Composite indexes should order columns by selectivity (most selective first).
- Partial and covering indexes optimize specific query patterns without full-table overhead.
- Don't over-index — measure query performance first, then add indexes for identified slow queries.
The best indexing strategy is data-driven: profile your slow queries, understand your access patterns, and index accordingly. Blindly indexing every column creates more problems than it solves.