PostgreSQL DISTINCT ON Pitfalls: Silent Data Corruption from Column Name Collisions in Joins

The Query That Looks Correct But Isn't

Consider this scenario: you have an orders table and a customers table, both storing a status column. Your orders can be 'pending', 'shipped', or 'cancelled'. Your customers can be 'active', 'suspended', or 'deleted'. You need the most recent order for each customer, filtering only active customers.

Order.joins(:customer)
  .where(customers: { status: 'active' })
  .select('DISTINCT ON (customer_id) orders.*')
  .order(:customer_id, created_at: :desc)

This query executes without error. It returns rows. It looks perfectly reasonable. But it's silently corrupt.

PostgreSQL's DISTINCT ON operates after the SELECT clause resolves column references. When you write DISTINCT ON (customer_id), PostgreSQL must determine which table's customer_id to use for deduplication. More critically, your query now has two status columns in scope — one from orders, one from customers. Without explicit table prefixes, PostgreSQL picks one. Often the wrong one.

The danger: Your application receives order records with status: 'active' when the actual order status was 'cancelled'. You've accidentally pulled the customer's status into the order record's status field.

This isn't a database error — it's a column name collision. No exception is raised. Your logs show successful queries. Your test suite may pass if it doesn't verify the exact status values. You won't discover this until production data reveals orders marked incorrectly, customer communications go to the wrong status group, or financial reports miscount cancelled transactions.

How PostgreSQL Resolves Ambiguous Columns in SELECT *

When PostgreSQL encounters a JOIN with SELECT *, it includes all columns from all tables in the result set. If two tables share a column name (like platform, status, or name), PostgreSQL doesn't raise an error—it simply includes both columns, with the latter table's column overriding the former in most client contexts.

Consider this deceptively simple query:

SELECT DISTINCT ON (distributions.id) *
FROM metric_snapshots
JOIN distributions ON distributions.id = metric_snapshots.distribution_id
ORDER BY distributions.id, metric_snapshots.fetched_at DESC;

If both tables have a platform column, the result set contains two platform columns. Most database clients (including ActiveRecord) resolve the ambiguity by taking the rightmost occurrence—in this case, distributions.platform silently wins over metric_snapshots.platform.

The ORDER BY Trap

Here's where DISTINCT ON becomes dangerous. The ORDER BY clause references metric_snapshots.fetched_at to select the most recent snapshot, but the returned row contains a mixture of columns from both tables. You might think you're getting the latest metric_snapshot.platform value, but you're actually getting distribution.platform—which never changes and defeats the entire purpose of ordering by timestamp.

Key takeaway: DISTINCT ON with SELECT * across joined tables creates a mismatch between the ordering logic and the returned columns. The query executes without errors, but the data is subtly wrong.

PostgreSQL's documentation on target lists confirms this behaviour: unqualified column references are resolved left-to-right across the FROM clause, but SELECT * expands to include all columns from all tables, leading to potential collisions that ORMs handle inconsistently.

Silent Data Corruption Warning

DISTINCT ON with SELECT * across joined tables that share column names (like status, platform, or updated_at) will silently return data from the wrong table's column. PostgreSQL raises no error, no warning appears in your logs, and your test suite may pass if it doesn't verify exact values.

ActiveRecord makes this especially dangerous because it defaults to SELECT * and resolves duplicate column names by using whichever comes last in the result set — quietly overwriting the correct value. The fix is to use a two-phase subquery pattern: first identify matching IDs through your join, then query the target table in isolation where no collision is possible.

Audit your codebase now: grep -r "DISTINCT ON" app/ | grep -i "join" — every match is a potential source of corrupt data in production.

Reproducing the Bug: A Step-by-Step Demonstration

Let's construct a minimal reproduction that exposes the silent data corruption. You'll need two tables sharing a common column name—we'll use platform as our collision point.

Schema Setup

CREATE TABLE distributions (
  id SERIAL PRIMARY KEY,
  video_id INTEGER,
  platform VARCHAR(50)
);

CREATE TABLE metrics (
  id SERIAL PRIMARY KEY,
  distribution_id INTEGER REFERENCES distributions(id),
  platform VARCHAR(50),
  view_count INTEGER
);

Insert Test Data

INSERT INTO distributions (id, video_id, platform) VALUES
  (1, 100, 'youtube'),
  (2, 100, 'tiktok');

INSERT INTO metrics (distribution_id, platform, view_count) VALUES
  (1, 'youtube', 1000),
  (2, 'tiktok', 5000);

The Problematic Query

SELECT DISTINCT ON (metrics.distribution_id) metrics.*
FROM metrics
JOIN distributions ON distributions.id = metrics.distribution_id
WHERE distributions.video_id = 100
ORDER BY metrics.distribution_id, metrics.id DESC;

Expected output: Two rows with platform showing 'youtube' and 'tiktok' respectively.

Actual output: Both rows may show 'youtube' (or both 'tiktok'), depending on which table's platform column PostgreSQL resolves in the SELECT * expansion. The column ambiguity means PostgreSQL picks one arbitrarily—and it's not necessarily the one from metrics.

Key takeaway: The SELECT * expansion happens before DISTINCT ON evaluates, so the wrong table's column can silently overwrite your data. No error, no warning—just incorrect results.

This reproduction isolates the exact mechanism: shared column names in joins create unqualified references that PostgreSQL resolves in implementation-defined ways.

Why This Bites Rails and ActiveRecord Users Especially Hard

ActiveRecord's query-building patterns create a perfect storm for this bug. The framework's convenience methods — .joins(), .select(), and .distinct — encourage chaining that obscures the underlying SQL until it's too late.

The SELECT * Trap

ActiveRecord defaults to SELECT * when you don't specify columns explicitly. This is harmless in single-table queries, but deadly when combined with joins. When two tables share a column name like platform, status, or updated_at, PostgreSQL returns both columns in the result set — and ActiveRecord silently uses whichever comes last in the column order.

# Looks innocent, but which 'platform' will you get?
Distribution.joins(:metric_snapshots)
  .select("DISTINCT ON (distributions.id) *")
  .where("metric_snapshots.fetched_at > ?", 1.day.ago)

If metric_snapshots.platform appears after distributions.platform in the result set, your Distribution objects will be hydrated with metric data, not distribution data. No errors. No warnings. Just wrong values.

The Raw SQL Fragment Escape Hatch

When developers need DISTINCT ON (which ActiveRecord doesn't support natively), they often drop into raw SQL fragments via .select(). This bypass removes ActiveRecord's limited guardrails whilst keeping the convenience of chaining — the worst of both worlds.

The collision happens silently because ActiveRecord's attribute assignment loops through columns by name, overwriting earlier values with later ones. You won't see duplicate column errors because PostgreSQL allows them in join results.

The Fix: The Two-Phase Subquery Pattern

The most reliable solution is a two-phase subquery pattern that completely eliminates column ambiguity. Instead of attempting to resolve collisions within a single complex query, you separate the work: first identify the correct IDs through your join, then query the target table directly using only those IDs.

Phase one uses the join to find matching records and extract just the foreign key values:

# Phase 1: find matching distribution IDs via join
distribution_ids = MetricSnapshot.joins(distribution: :related_item)
  .where(related_items: { container_id: container_id })
  .select(:distribution_id)
  .distinct

Phase two queries the target table in isolation, with no joins involved:

# Phase 2: query metrics directly (no column collision possible)
MetricSnapshot.where(distribution_id: distribution_ids)
  .where("fetched_at > ?", since)
  .select("DISTINCT ON (distribution_id) *")
  .order(:distribution_id, fetched_at: :desc)

Because the final SELECT only touches one table, there's no possibility of column name collision — PostgreSQL knows exactly which table every column belongs to.

In raw SQL, this pattern becomes:

SELECT DISTINCT ON (distribution_id) *
FROM metric_snapshots
WHERE distribution_id IN (
  SELECT DISTINCT ms.distribution_id
  FROM metric_snapshots ms
  INNER JOIN distributions d ON ms.distribution_id = d.id
  INNER JOIN related_items ri ON d.related_item_id = ri.id
  WHERE ri.container_id = $1
)
AND fetched_at > $2
ORDER BY distribution_id, fetched_at DESC;

This approach trades a single complex query for a subquery, but the PostgreSQL query planner handles IN clauses efficiently — especially with proper indexing on foreign keys.

Alternative Approaches and Why the Two-Phase Pattern Wins

Several approaches can prevent column collision bugs in DISTINCT ON queries, each with distinct trade-offs.

Fully Qualifying Column Names

The most obvious fix is prefixing every column reference with its table name:

MetricSnapshot.joins(distribution: :related_item)
  .where(related_items: { container_id: container_id })
  .select("DISTINCT ON (metric_snapshots.distribution_id) metric_snapshots.*")
  .order("metric_snapshots.distribution_id", "metric_snapshots.fetched_at DESC")

This requires constant vigilance. A single oversight — one unqualified column in a WHERE clause or ORDER BY — resurrects the bug. In large teams or six months after writing the query, this discipline erodes.

Common Table Expressions (CTEs)

CTEs can isolate the join logic from the distinct selection:

WITH filtered AS (
  SELECT ms.* FROM metric_snapshots ms
  JOIN distributions d ON ms.distribution_id = d.id
  JOIN related_items ri ON d.related_item_id = ri.id
  WHERE ri.container_id = 123
)
SELECT DISTINCT ON (distribution_id) * FROM filtered
ORDER BY distribution_id, fetched_at DESC;

This improves readability but doesn't eliminate the ambiguity — filtered still contains columns from all joined tables unless explicitly pruned.

Why Two-Phase Wins

The two-phase pattern makes the bug structurally impossible. By querying a single table in phase two, no collision can occur — there's only one source of truth for each column name. You're not relying on developer discipline; the architecture prevents the error.

Trade-off: For performance-critical queries where the subquery adds measurable overhead, fully qualified columns may be preferable. Profile first, then optimise. Correctness trumps micro-optimisation.

Defensive Practices to Prevent Silent Data Corruption

The most critical defensive practice is always specify explicit column lists in production queries, especially when using SELECT * with joins. This forces you to consider column ambiguity upfront rather than discovering it through data anomalies.

For DISTINCT ON specifically, add sanity checks after query execution. If you expect one record per entity but receive unexpected duplicates or wrong data, fail loudly:

results = MetricSnapshot.joins(:distribution)
  .select("DISTINCT ON (distribution_id) metric_snapshots.*")
  .order(:distribution_id, fetched_at: :desc)

# Sanity check: ensure one record per distribution
raise "Duplicate keys detected" if results.group_by(&:distribution_id).any? { |_, v| v.size > 1 }

Code Review Checklists

Establish team conventions for high-risk patterns:

Any DISTINCT ON query touching multiple tables requires explicit column qualification
Joins with shared column names must use table prefixes (distributions.platform, not platform)
Raw SQL must be flagged in pull requests for column collision review

Auditing Existing Codebases

Search your codebase for the dangerous pattern with grep or your IDE:

# Find DISTINCT ON with joins (potential red flag)
grep -r "DISTINCT ON" app/ | grep -i "join"

For each match, verify that column names between joined tables don't collide, or refactor using the two-phase subquery pattern. Consider adding a custom RuboCop rule that warns when DISTINCT ON appears alongside .joins() without explicit column lists — preventive enforcement beats post-incident debugging.

Explicit Columns in Production

Always specify explicit column lists in production queries when using joins, especially with DISTINCT ON. Using SELECT * across joined tables lets shared column names like status, platform, or updated_at silently collide — with the wrong table's value overwriting the correct one.

Forcing yourself to enumerate columns upfront makes you confront ambiguity at write time, not when production data reveals orders marked 'active' instead of 'cancelled' or financial reports miscounting transactions. If you can't avoid SELECT *, use the two-phase subquery pattern so the final query only touches a single table, making collision structurally impossible.