The Bidirectional Webhook Challenge
When building API integrations, most Rails developers eventually encounter webhooks—but the conversation usually starts and ends with "receiving webhooks from Stripe." This narrow focus obscures an uneasy truth: production webhook systems are almost always bidirectional. You need to receive status updates from external services and broadcast your own events to partners. These two flows look deceptively similar but require fundamentally different architectural approaches.
Consider the lifecycle differences. When you receive a webhook, you're at the mercy of the sender's retry policy. Your endpoint must be fast, idempotent, and forgiving—a 500 error might mean the sender never retries, or worse, disables your integration entirely. When you send webhooks, you control the retry strategy. A network timeout deserves aggressive retries with exponential backoff, but a 400 Bad Request signals a payload problem that won't fix itself.
# Receiving: Validate first, acknowledge quickly
def create
  verify_signature
  return if performed? # Already rendered 401 for bad signatures

  process_update(params)
  head :ok # Acknowledge receipt, process asynchronously
end

# Sending: Retry transient failures, discard permanent ones
class WebhookSenderJob < ApplicationJob
  retry_on WebhookService::RetryableError, wait: :polynomially_longer
  discard_on WebhookService::PermanentError # 4xx, bad payload
end
The security models differ too. Inbound webhooks need per-sender signature verification with independent secret rotation. Outbound webhooks use your signing secret, shared with receivers who verify your authenticity. Same cryptographic primitive (HMAC-SHA256), opposite trust relationships.
This article walks through building both directions in a production Rails application—covering signature verification, polymorphic audit trails, and the subtle engineering decisions that distinguish hobbyist integrations from resilient infrastructure. We'll explore why network errors and HTTP errors deserve different treatment, and how to structure your code so webhook concerns don't bleed into your core domain logic.
Bidirectional Webhook Flow Comparison
Part 1: Receiving Webhooks Securely
When receiving webhooks, you're accepting push notifications from external systems—fundamentally different from serving traditional API requests. The key challenge is that you have no control over retry behavior, and must assume the sender won't handle your rejections gracefully. This demands a security-first, defensive design.
HMAC Signature Verification
The cornerstone of webhook security is HMAC signature verification. Never trust that a request actually came from your integration partner just because it hit your endpoint. Instead, verify a signature computed from the payload:
def verify_signature
  provided_signature = request.headers["X-Webhook-Signature"]

  # Guard against a missing header before comparing — secure_compare
  # cannot accept nil
  if provided_signature.blank?
    render json: { error: "Missing signature" }, status: :unauthorized
    return false
  end

  expected_signature = OpenSSL::HMAC.hexdigest(
    "SHA256",
    venue.webhook_secret,
    request.raw_post
  )

  unless ActiveSupport::SecurityUtils.secure_compare(expected_signature, provided_signature)
    render json: { error: "Invalid signature" }, status: :unauthorized
    return false
  end

  true
end
Notice secure_compare—this timing-safe comparison prevents attackers from discovering the correct signature character-by-character through timing analysis.
Store Secrets Per Partner
Unlike outbound webhooks where you control the secret, inbound webhooks require storing each partner's secret securely. A JSONB settings column per venue works well: venue.settings["webhook_secret"]. This enables independent secret rotation without redeployment when a partner rotates their key.
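The rotation step can be sketched in plain Ruby, with a hash standing in for the JSONB column (the helper name and key are illustrative assumptions, not a fixed schema):

```ruby
require "securerandom"

# Hypothetical helper: returns new settings with a freshly generated secret,
# leaving the original hash untouched so the old value survives until commit.
def rotate_webhook_secret(settings)
  settings.merge("webhook_secret" => SecureRandom.hex(32))
end

venue_settings = { "webhook_secret" => "old-secret" }
rotated = rotate_webhook_secret(venue_settings)
```

In a real app this would be an `update!` on the venue record, ideally keeping the previous secret accepted during a short grace window so in-flight webhooks signed with the old key still verify.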
Controller Architecture
Keep webhook controllers separate from your main API hierarchy. They have fundamentally different authentication (HMAC vs. API keys) and different lifecycle concerns. An ActionController::API base class specifically for webhooks keeps these concerns isolated:
class Webhooks::BaseController < ActionController::API
  before_action :verify_signature
end
The most important mindset shift: receiving webhooks is fire-and-forget from the sender's perspective. Return success quickly, then process asynchronously if needed.
Part 2: Sending Webhooks Reliably
Sending webhooks is the flip side of receiving them, and it comes with its own set of challenges. While inbound webhooks focus on verification and security, outbound webhooks are all about reliability — ensuring your notification reaches its destination even when networks are flaky or services are temporarily unavailable.
The most critical architectural decision when building outbound webhook systems is classifying failures correctly. Network failures (timeouts, connection refused, DNS errors) and 5xx server errors are transient — the remote service is temporarily unreachable but will likely recover. These should trigger retries with exponential backoff. HTTP 4xx client errors are permanent — they indicate a configuration or payload problem that retrying won't fix:
class WebhookSenderService
  class RetryableError < StandardError; end
  class PermanentError < StandardError; end

  def call
    response = send_webhook
    return Result.new(success: true) if response.is_a?(Net::HTTPSuccess)

    # 5xx are transient — raise to trigger job retry with backoff
    if response.is_a?(Net::HTTPServerError)
      raise RetryableError, "Server error: HTTP #{response.code}"
    end

    # 4xx are permanent — the payload or config is wrong
    raise PermanentError, "Client error: HTTP #{response.code}"
  rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    raise RetryableError, e.message
  end
end
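The service above delegates the actual HTTP call to an assumed `send_webhook` helper. The signing half of that call is just an HMAC over the serialized payload; a minimal plain-Ruby sketch (the header name and secret are assumptions):

```ruby
require "openssl"
require "json"

# Hypothetical signing helper: the receiver recomputes the same HMAC over
# the raw body and compares with secure_compare.
def sign_payload(secret, body)
  OpenSSL::HMAC.hexdigest("SHA256", secret, body)
end

payload = { event: "order.updated", id: 42 }.to_json
signature = sign_payload("shared-secret", payload)
# signature goes into the X-Webhook-Signature header of the outbound request
```

Sign the exact bytes you send; serializing twice (once to sign, once to send) risks a mismatch if key ordering or whitespace differs between serializations.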
Your background job should implement exponential backoff with polynomial retry intervals. A typical pattern: retry transient failures (network timeouts and 5xx responses) up to 5 times with increasing delays (30s, 5min, 30min, 2h, 8h), then move to a dead-letter queue. Permanent failures (4xx responses) should be discarded immediately — they indicate a payload or configuration problem that retrying won't fix. For audit purposes, create the sync log record in a pending state before making the HTTP call, then update it to success or failed afterward — if your process crashes mid-request, the pending record serves as evidence.
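The pending-first flow can be sketched in plain Ruby, with a `Struct` standing in for the ActiveRecord sync-log model and the HTTP call injected so the sequencing is visible (all names here are illustrative assumptions):

```ruby
require "net/http"

# Stand-in for the sync-log record; in Rails this would be create!/update!
SyncLog = Struct.new(:status, :http_status, keyword_init: true)

def deliver_with_audit(send_request)
  log = SyncLog.new(status: "pending") # persisted BEFORE the HTTP call
  response = send_request.call
  log.status = response.is_a?(Net::HTTPSuccess) ? "success" : "failed"
  log.http_status = response.code.to_i
  log
rescue Net::OpenTimeout, Net::ReadTimeout
  # A crash or timeout mid-request leaves the pending record as evidence
  log
end
```

In production the timeout branch would also re-raise so the job framework retries; it returns the log here only to make the pending state observable.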
One subtle but important detail: use after_commit rather than after_save for your webhook callback. This ensures the job is enqueued only after the database transaction commits, preventing race conditions where the job executes against uncommitted data or phantom records from rolled-back transactions.
Building a Polymorphic Audit Trail
A robust audit trail transforms webhook debugging from guesswork into structured investigation. Your webhook_events table should use polymorphic associations to handle both directions with a single schema:
class CreateWebhookEvents < ActiveRecord::Migration[7.1]
  def change
    create_table :webhook_events do |t|
      t.references :eventable, polymorphic: true, null: false
      t.string :direction, null: false # 'inbound' or 'outbound'
      t.string :status, null: false # 'pending', 'success', 'failed'
      t.text :request_body
      t.text :response_body
      t.integer :http_status
      t.string :error_class
      t.text :error_message
      t.jsonb :metadata, default: {}
      t.timestamps
    end

    add_index :webhook_events, [:eventable_type, :eventable_id]
    add_index :webhook_events, [:direction, :status, :created_at]
  end
end
The eventable association allows tracking events against different domain objects — an order status update, a payment notification, or an inventory sync — without schema changes. The direction field keeps both flows in one table while enabling separate queries.
What to capture: Store the raw request/response bodies as text for exact replay during debugging. JSONB metadata handles variable data like retry attempt numbers, venue identifiers, or API versions. Keep http_status separate from status — a 500 response is still a "completed" HTTP transaction, distinct from network timeouts.
Managing table growth: This table grows linearly with webhook volume. For high-traffic systems, partition by created_at monthly and implement a retention policy. Archive events older than 90 days to cold storage, keeping only failed events indefinitely for pattern analysis.
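A retention sweep under those assumptions might look like the following sketch; the cold-storage archival step is elided, and only non-failed events are purged so failure patterns remain queryable:

```ruby
# Hypothetical sweep, assumed to run from a nightly scheduled job.
# Deletes in batches to avoid holding long locks on a large table.
WebhookEvent.where.not(status: 'failed')
            .where('created_at < ?', 90.days.ago)
            .in_batches(&:delete_all)
```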
# Production debugging example
WebhookEvent.where(direction: 'outbound', status: 'failed')
            .where('created_at > ?', 1.day.ago)
            .group(:error_class)
            .count
# => {"Net::OpenTimeout"=>47, "WebhookService::InvalidSignature"=>3}
This immediately reveals whether you're fighting network instability or a configuration issue — fundamentally different problems requiring different solutions.
Error Handling Philosophy and Recovery Strategies
When building webhook systems, your error handling philosophy should fundamentally distinguish between expected failures and unexpected ones. Expected failures—like a 404 from a deleted resource or a 422 validation error—signal a problem with your payload or configuration. These shouldn't trigger retries; the payload is wrong and won't magically become right on attempt #17. Unexpected failures—network timeouts, connection refused, temporary 503s—are transient and should retry.
This distinction shapes your entire recovery strategy:
class WebhookDeliveryJob < ApplicationJob
  retry_on WebhookSenderService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookSenderService::PermanentError

  def perform(syncable)
    WebhookSenderService.call(syncable)
    syncable.sync_logs.create!(status: 'success', direction: 'to_upstream')
  rescue WebhookSenderService::RetryableError => e
    # Network error or 5xx - log but re-raise so retry_on schedules a retry
    syncable.sync_logs.create!(
      status: 'pending',
      direction: 'to_upstream',
      error_message: e.message
    )
    raise
  rescue WebhookSenderService::PermanentError => e
    # 4xx - log the failure, then re-raise so discard_on drops the job
    syncable.sync_logs.create!(
      status: 'failed',
      direction: 'to_upstream',
      error_message: e.message
    )
    raise
  end
end
For monitoring, treat your sync logs as a first-class audit trail. A growing number of pending logs indicates jobs are retrying (possible upstream degradation). A spike in failed logs suggests configuration drift or API contract changes. Set up alerts for both patterns.
For truly stuck webhooks—perhaps the upstream system is down for days—implement a dead letter queue pattern. After exhausting retries, move the event to a separate failed_webhooks table with enough context for manual replay. Build an admin interface where ops can inspect the payload, update it if needed, and retry once the issue is resolved.
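One way to wire up that dead-letter step is ActiveJob's `retry_on` block, which runs only after the final attempt is exhausted. This is a hedged sketch: `FailedWebhook` is an assumed model, not something from the schemas above.

```ruby
class WebhookDeliveryJob < ApplicationJob
  # The block fires after attempt 5 fails; instead of raising, park the
  # event with enough context for the admin replay interface.
  retry_on WebhookSenderService::RetryableError,
           wait: :polynomially_longer, attempts: 5 do |job, error|
    FailedWebhook.create!(
      payload: job.arguments.first,
      error_class: error.class.name,
      error_message: error.message
    )
  end
end
```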
The key insight: automation handles the common path (transient failures), but you need human-friendly tools for the edge cases.
Outbound Webhook Error Handling Decision Tree
Rails-Specific Implementation Patterns
Rails provides excellent primitives for webhook handling that align naturally with its convention-over-configuration philosophy. Here's how to structure a production-ready implementation.
Controller Setup for Inbound Webhooks
Use ActionController::API for webhook endpoints rather than your existing API framework (like Grape). This keeps authentication concerns separate and prevents webhook routes from being caught by overly-broad error handlers:
# config/routes.rb
post '/webhooks/venue/:venue_id', to: 'webhooks/venues#create'
mount GrapeAPI => '/' # After webhook routes

# app/controllers/webhooks/venues_controller.rb
class Webhooks::VenuesController < Webhooks::BaseController
  def create
    verify_signature
    return if performed?

    ProcessWebhookJob.perform_later(venue_id: params[:venue_id], payload: request.raw_post)
    head :ok
  end

  private

  def verify_signature
    received = request.headers['X-Webhook-Signature']
    if received.blank?
      render json: { error: 'Missing signature' }, status: :unauthorized
      return
    end

    expected = OpenSSL::HMAC.hexdigest('SHA256', venue.webhook_secret, request.raw_post)
    unless ActiveSupport::SecurityUtils.secure_compare(received, expected)
      render json: { error: 'Invalid signature' }, status: :unauthorized
    end
  end
end
The performed? check pattern lets validation methods render responses directly while maintaining readable flow control in the action.
Background Processing with Targeted Retries
For outbound webhooks, differentiate between transient failures (retry) and permanent failures (discard). Network timeouts and 5xx server errors are typically transient — the receiver may be temporarily down. 4xx client errors are permanent — your payload or credentials are wrong, and retrying won't help:
class SendWebhookJob < ApplicationJob
  retry_on WebhookService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookService::PermanentError

  def perform(record)
    WebhookService.new(record).send_update
  end
end
HTTP 4xx responses typically indicate configuration or payload problems that won't resolve through retries — discard these immediately. 5xx responses are often transient (the receiver is temporarily down or overloaded) and should be retried with backoff, just like network timeouts. RFC 9110 explicitly describes 503 as a temporary condition, often accompanied by a Retry-After header.
Audit Trail with Polymorphic Associations
Track webhook delivery with a reusable polymorphic log:
create_table :webhook_sync_logs do |t|
  t.references :syncable, polymorphic: true, null: false
  t.integer :direction, null: false # enum: [:to_upstream, :from_upstream]
  t.integer :status, default: 0 # enum: [:pending, :success, :failed]
  t.jsonb :payload
  t.text :error_message
  t.timestamps
end
This design supports bidirectional tracking and multiple record types without additional migrations.
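The matching model maps those integer columns to Rails enums, giving query scopes for free. A minimal sketch (the `recent_failures` scope is an illustrative addition):

```ruby
class WebhookSyncLog < ApplicationRecord
  belongs_to :syncable, polymorphic: true

  # Integer-backed enums matching the migration's comments
  enum :direction, { to_upstream: 0, from_upstream: 1 }
  enum :status, { pending: 0, success: 1, failed: 2 }

  # Enum-generated scopes compose naturally with time filters
  scope :recent_failures, -> { failed.where(created_at: 1.day.ago..) }
end
```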
Testing Strategies
For outbound webhooks, use VCR to record real HTTP interactions:
it "sends signed payload", :vcr do
  expect { service.send_update }.to change(WebhookSyncLog, :count).by(1)
  expect(WebhookSyncLog.last).to be_success
end
For inbound webhooks, use request specs with signature generation helpers:
def generate_signature(body, secret)
  OpenSSL::HMAC.hexdigest('SHA256', secret, body)
end

it "accepts valid signature" do
  payload = { status: 'filled' }.to_json
  post venue_webhook_path(venue), params: payload,
       headers: { 'X-Webhook-Signature' => generate_signature(payload, venue.webhook_secret) }
  expect(response).to have_http_status(:ok)
end
Webhook Event Lifecycle State Transitions
Security Considerations and Attack Vectors
Webhook systems present unique security challenges because they expose server-side endpoints to external callers, often with limited ability to verify the source. The most critical defense is proper HMAC signature verification, but implementation details matter enormously.
Timing Attacks on Signature Verification
Never use standard string comparison (==) to verify HMAC signatures. An attacker can measure response times to determine which bytes match, gradually reconstructing a valid signature:
# VULNERABLE - timing leak reveals signature bytes
if request.headers["X-Webhook-Signature"] == expected_signature
  process_webhook
end

# SAFE - nil guard + constant-time comparison
received = request.headers["X-Webhook-Signature"]
if received.present? &&
   ActiveSupport::SecurityUtils.secure_compare(received, expected_signature)
  process_webhook
else
  head :unauthorized
end
The secure_compare method performs a constant-time comparison that prevents attackers from discovering the correct signature byte-by-byte through timing analysis. Note that while the comparison itself is constant-time, the string length may still be observable — which is acceptable for HMAC signatures since both strings are always the same length.
Signature Bypass Attempts
Always verify signatures before parsing the payload, and compute the HMAC over the raw request body rather than a re-serialized version. This prevents attacks that exploit differences between JSON parsers:
raw_body = request.body.read
expected = OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
return head :unauthorized unless secure_compare(provided, expected)
payload = JSON.parse(raw_body) # Only parse after verification
Denial of Service Protection
Webhook endpoints are prime DoS targets. Implement multiple layers of defense:
- Rate limiting per source: Use Rack::Attack or similar to limit requests per IP or signature key
- Payload size limits: Reject bodies over a reasonable threshold (e.g., 1MB) before verification
- Timeout enforcement: Set strict timeouts for payload processing
- Queue depth monitoring: Track pending webhook jobs and reject new webhooks if the queue is saturated
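The first two layers can be sketched with Rack::Attack in an initializer; the path prefix, limits, and size threshold here are assumptions to adjust per deployment:

```ruby
# config/initializers/rack_attack.rb
class Rack::Attack
  # Throttle webhook traffic per source IP
  throttle("webhooks/ip", limit: 60, period: 1.minute) do |req|
    req.ip if req.path.start_with?("/webhooks")
  end

  # Reject oversized bodies before any signature computation
  blocklist("webhooks/oversized") do |req|
    req.path.start_with?("/webhooks") && req.content_length.to_i > 1_000_000
  end
end
```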
IP Allowlisting Trade-offs
IP allowlisting provides defense-in-depth but comes with operational overhead. Many webhook providers use dynamic IP ranges or CDNs, requiring frequent allowlist updates. It's best used as a secondary control alongside signature verification, not a replacement. For high-security scenarios, require both valid signatures AND source IP verification.
Production Lessons and War Stories
After years of webhook implementations across financial platforms, here are the lessons that only production load teaches you.
Database indexes matter more than you think. When you're logging every webhook attempt in a polymorphic audit table, missing indexes will kill you. Always index [syncable_type, syncable_id, created_at] together—you'll be querying recent sync history per record constantly during debugging. We learned this when a status page query brought the database to its knees scanning 4 million audit rows.
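A sketch of that migration (the index name is an assumption; composite order matters, with the polymorphic pair first and the timestamp last so recent-history queries stay index-only):

```ruby
class AddSyncHistoryIndexToWebhookSyncLogs < ActiveRecord::Migration[7.1]
  def change
    add_index :webhook_sync_logs,
              [:syncable_type, :syncable_id, :created_at],
              name: "idx_sync_logs_recent_per_record"
  end
end
```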
Separate secrets for each direction. Using the same HMAC secret for inbound and outbound webhooks seems elegant, but creates an operational nightmare during rotation. When an upstream provider forces a secret change, you need to rotate independently without coordinating both sides simultaneously. Store outbound secrets per destination, inbound secrets per source.
Network errors ≠ HTTP errors. This distinction changed our retry strategy completely. Retrying a 400 Bad Request is pointless—your payload is malformed. But Errno::ETIMEDOUT? Absolutely retry. Our job configuration reflects this:
class WebhookSyncJob < ApplicationJob
  retry_on WebhookService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookService::PermanentError # 4xx responses
end
The after_commit callback trap. Using after_save to enqueue webhook jobs leads to a subtle race condition: the job can execute before the transaction commits, seeing stale data. Worse, if the transaction rolls back, you've sent a webhook for a change that never persisted. Always use after_commit on: :update.
Keep a runbook. When webhooks fail at 2 AM, you need a decision tree: Is the signature failing? Check secret rotation dates. Getting timeouts? Check provider status page. Seeing pending logs older than 10 minutes? Dead job workers. We maintain a Notion runbook mapping each failure mode to diagnostic queries and remediation steps.
Building for Resilience
Throughout this article, we've explored two sides of webhook infrastructure: receiving webhooks from external systems and sending them out. The architectural decisions in each direction share a common thread—designing for failure.
Resilient webhook systems treat sending and receiving as fundamentally different problems. Inbound webhooks require defensive validation and timing-safe authentication, while outbound webhooks need intelligent retry logic that distinguishes transient failures (network errors, 5xx) from permanent ones (4xx, bad payloads). Building separate controller hierarchies and service objects for each direction keeps these concerns cleanly separated.
# Receiving: fail fast with explicit validation
def create
  verify_signature
  return if performed? # Short-circuit on authentication failure

  payload = parse_payload # Parse only after the signature checks out
  return if performed? # Short-circuit on malformed payloads

  # Process only valid, authenticated requests
end

# Sending: classify failures for appropriate handling
begin
  response = http.request(signed_request)
  if response.code.to_i >= 500
    raise RetryableError, "HTTP #{response.code}" # Let job framework retry
  elsif response.code.to_i >= 400
    log_failure(response) # Don't retry configuration problems
  end
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
  raise RetryableError, e.message # Network errors always warrant retries
end
Comprehensive audit trails are non-negotiable. A polymorphic sync log table with direction enums and pending/success/failed states provides visibility into webhook behavior across your entire system. When things go wrong—and they will—these logs become your debugging lifeline.
Finally, embrace graceful degradation. Accept webhooks from degraded venues because in-flight transactions matter. Log HTTP errors without retrying because broken payloads won't fix themselves. Use after_commit callbacks so jobs only fire after successful transactions. These patterns acknowledge that distributed systems are messy, and resilience comes from handling inevitable failures intelligently rather than optimistically assuming success.
The webhook systems that survive production are the ones built with failure as a first-class consideration, not an afterthought.