The Bidirectional Webhook Challenge
When building API integrations, most Rails developers eventually encounter webhooks—but the conversation usually starts and ends with "receiving webhooks from Stripe." This narrow focus obscures an uneasy truth: production webhook systems are almost always bidirectional. You need to receive status updates from external services and broadcast your own events to partners. These two flows look deceptively similar but require fundamentally different architectural approaches.
Consider the lifecycle differences. When you receive a webhook, you're at the mercy of the sender's retry policy. Your endpoint must be fast, idempotent, and forgiving—a 500 error might mean the sender never retries, or worse, disables your integration entirely. When you send webhooks, you control the retry strategy. A network timeout deserves aggressive retries with exponential backoff, but a 400 Bad Request signals a payload problem that won't fix itself.
# Receiving: Validate first, acknowledge quickly
def create
  verify_signature
  return if performed? # Already rendered 401 for bad signatures

  process_update(params)
  head :ok # Acknowledge receipt, process asynchronously
end

# Sending: Retry transient failures, discard permanent ones
class WebhookSenderJob < ApplicationJob
  retry_on WebhookService::RetryableError, wait: :polynomially_longer
  discard_on WebhookService::PermanentError # 4xx, bad payload
end
The security models differ too. Inbound webhooks need per-sender signature verification with independent secret rotation. Outbound webhooks use your signing secret, shared with receivers who verify your authenticity. Same cryptographic primitive (HMAC-SHA256), opposite trust relationships.
This article walks through building both directions in a production Rails application—covering signature verification, polymorphic audit trails, and the subtle engineering decisions that distinguish hobbyist integrations from resilient infrastructure. We'll explore why network errors and HTTP errors deserve different treatment, and how to structure your code so webhook concerns don't bleed into your core domain logic.
Bidirectional Webhook Flow Comparison
Part 1: Receiving Webhooks Securely
When receiving webhooks, you're accepting push notifications from external systems—fundamentally different from serving traditional API requests. The key challenge is that you have no control over retry behavior, and must assume the sender won't handle your rejections gracefully. This demands a security-first, defensive design.
HMAC Signature Verification
The cornerstone of webhook security is HMAC signature verification. Never trust that a request actually came from your integration partner just because it hit your endpoint. Instead, verify a signature computed from the payload:
def verify_signature
  provided_signature = request.headers["X-Webhook-Signature"]

  # Guard against a missing header before comparing — secure_compare
  # cannot accept nil
  if provided_signature.blank?
    render json: { error: "Missing signature" }, status: :unauthorized
    return false
  end

  expected_signature = OpenSSL::HMAC.hexdigest(
    "SHA256",
    venue.webhook_secret,
    request.raw_post
  )

  unless ActiveSupport::SecurityUtils.secure_compare(expected_signature, provided_signature)
    render json: { error: "Invalid signature" }, status: :unauthorized
    return false
  end

  true
end
Notice secure_compare—this timing-safe comparison prevents attackers from discovering the correct signature character-by-character through timing analysis.
Store Secrets Per Partner
Unlike outbound webhooks where you control the secret, inbound webhooks require storing each partner's secret securely. A JSONB settings column per venue works well: venue.settings["webhook_secret"]. This enables independent secret rotation without redeployment when a partner rotates their key.
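The rotation step can be sketched in plain Ruby, with a hash standing in for the JSONB column (the helper name and key are illustrative assumptions, not a fixed schema):

```ruby
require "securerandom"

# Hypothetical helper: returns new settings with a freshly generated secret,
# leaving the original hash untouched so the old value survives until commit.
def rotate_webhook_secret(settings)
  settings.merge("webhook_secret" => SecureRandom.hex(32))
end

venue_settings = { "webhook_secret" => "old-secret" }
rotated = rotate_webhook_secret(venue_settings)
```

In a real app this would be an `update!` on the venue record, ideally keeping the previous secret accepted during a short grace window so in-flight webhooks signed with the old key still verify.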
Controller Architecture
Keep webhook controllers separate from your main API hierarchy. They have fundamentally different authentication (HMAC vs. API keys) and different lifecycle concerns. An ActionController::API base class specifically for webhooks keeps these concerns isolated:
class Webhooks::BaseController < ActionController::API
  before_action :verify_signature
end
The most important mindset shift: receiving webhooks is fire-and-forget from the sender's perspective. Return success quickly, then process asynchronously if needed.
Part 2: Sending Webhooks Reliably
Sending webhooks is the flip side of receiving them, and it comes with its own set of challenges. While inbound webhooks focus on verification and security, outbound webhooks are all about reliability — ensuring your notification reaches its destination even when networks are flaky or services are temporarily unavailable.
The most critical architectural decision when building outbound webhook systems is classifying failures correctly. Network failures (timeouts, connection refused, DNS errors) and 5xx server errors are transient — the remote service is temporarily unreachable but will likely recover. These should trigger retries with exponential backoff. HTTP 4xx client errors are permanent — they indicate a configuration or payload problem that retrying won't fix:
class WebhookSenderService
  class RetryableError < StandardError; end
  class PermanentError < StandardError; end

  def call
    response = send_webhook
    return Result.new(success: true) if response.is_a?(Net::HTTPSuccess)

    # 5xx are transient — raise to trigger job retry with backoff
    if response.is_a?(Net::HTTPServerError)
      raise RetryableError, "Server error: HTTP #{response.code}"
    end

    # 4xx are permanent — the payload or config is wrong
    raise PermanentError, "Client error: HTTP #{response.code}"
  rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    raise RetryableError, e.message
  end
end
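The service above delegates the actual HTTP call to an assumed `send_webhook` helper. The signing half of that call is just an HMAC over the serialized payload; a minimal plain-Ruby sketch (the header name and secret are assumptions):

```ruby
require "openssl"
require "json"

# Hypothetical signing helper: the receiver recomputes the same HMAC over
# the raw body and compares with secure_compare.
def sign_payload(secret, body)
  OpenSSL::HMAC.hexdigest("SHA256", secret, body)
end

payload = { event: "order.updated", id: 42 }.to_json
signature = sign_payload("shared-secret", payload)
# signature goes into the X-Webhook-Signature header of the outbound request
```

Sign the exact bytes you send; serializing twice (once to sign, once to send) risks a mismatch if key ordering or whitespace differs between serializations.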
Your background job should implement exponential backoff with polynomial retry intervals. A typical pattern: retry transient failures (network timeouts and 5xx responses) up to 5 times with increasing delays (30s, 5min, 30min, 2h, 8h), then move to a dead-letter queue. Permanent failures (4xx responses) should be discarded immediately — they indicate a payload or configuration problem that retrying won't fix. For audit purposes, create the sync log record in a pending state before making the HTTP call, then update it to success or failed afterward — if your process crashes mid-request, the pending record serves as evidence.
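The pending-first flow can be sketched in plain Ruby, with a `Struct` standing in for the ActiveRecord sync-log model and the HTTP call injected so the sequencing is visible (all names here are illustrative assumptions):

```ruby
require "net/http"

# Stand-in for the sync-log record; in Rails this would be create!/update!
SyncLog = Struct.new(:status, :http_status, keyword_init: true)

def deliver_with_audit(send_request)
  log = SyncLog.new(status: "pending") # persisted BEFORE the HTTP call
  response = send_request.call
  log.status = response.is_a?(Net::HTTPSuccess) ? "success" : "failed"
  log.http_status = response.code.to_i
  log
rescue Net::OpenTimeout, Net::ReadTimeout
  # A crash or timeout mid-request leaves the pending record as evidence
  log
end
```

In production the timeout branch would also re-raise so the job framework retries; it returns the log here only to make the pending state observable.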
One subtle but important detail: use after_commit rather than after_save for your webhook callback. This ensures the job is enqueued only after the database transaction commits, preventing race conditions where the job executes against uncommitted data or phantom records from rolled-back transactions.
Building a Polymorphic Audit Trail
A robust audit trail transforms webhook debugging from guesswork into structured investigation. Your webhook_events table should use polymorphic associations to handle both directions with a single schema:
class CreateWebhookEvents < ActiveRecord::Migration[7.1]
  def change
    create_table :webhook_events do |t|
      t.references :eventable, polymorphic: true, null: false
      t.string :direction, null: false # 'inbound' or 'outbound'
      t.string :status, null: false # 'pending', 'success', 'failed'
      t.text :request_body
      t.text :response_body
      t.integer :http_status
      t.string :error_class
      t.text :error_message
      t.jsonb :metadata, default: {}
      t.timestamps
    end

    add_index :webhook_events, [:eventable_type, :eventable_id]
    add_index :webhook_events, [:direction, :status, :created_at]
  end
end
The eventable association allows tracking events against different domain objects — an order status update, a payment notification, or an inventory sync — without schema changes. The direction field keeps both flows in one table while enabling separate queries.
What to capture: Store the raw request/response bodies as text for exact replay during debugging. JSONB metadata handles variable data like retry attempt numbers, venue identifiers, or API versions. Keep http_status separate from status — a 500 response is still a "completed" HTTP transaction, distinct from network timeouts.
Managing table growth: This table grows linearly with webhook volume. For high-traffic systems, partition by created_at monthly and implement a retention policy. Archive events older than 90 days to cold storage, keeping only failed events indefinitely for pattern analysis.
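A retention sweep under those assumptions might look like the following sketch; the cold-storage archival step is elided, and only non-failed events are purged so failure patterns remain queryable:

```ruby
# Hypothetical sweep, assumed to run from a nightly scheduled job.
# Deletes in batches to avoid holding long locks on a large table.
WebhookEvent.where.not(status: 'failed')
            .where('created_at < ?', 90.days.ago)
            .in_batches(&:delete_all)
```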
# Production debugging example
WebhookEvent.where(direction: 'outbound', status: 'failed')
            .where('created_at > ?', 1.day.ago)
            .group(:error_class)
            .count
# => {"Net::OpenTimeout"=>47, "WebhookService::InvalidSignature"=>3}
This immediately reveals whether you're fighting network instability or a configuration issue — fundamentally different problems requiring different solutions.
Error Handling Philosophy and Recovery Strategies
When building webhook systems, your error handling philosophy should fundamentally distinguish between expected failures and unexpected ones. Expected failures—like a 404 from a deleted resource or a 422 validation error—signal a problem with your payload or configuration. These shouldn't trigger retries; the payload is wrong and won't magically become right on attempt #17. Unexpected failures—network timeouts, connection refused, temporary 503s—are transient and should retry.
This distinction shapes your entire recovery strategy:
class WebhookDeliveryJob < ApplicationJob
  retry_on WebhookSenderService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookSenderService::PermanentError

  def perform(syncable)
    WebhookSenderService.call(syncable)
    syncable.sync_logs.create!(status: 'success', direction: 'to_upstream')
  rescue WebhookSenderService::RetryableError => e
    # Network error or 5xx - log but re-raise so retry_on schedules a retry
    syncable.sync_logs.create!(
      status: 'pending',
      direction: 'to_upstream',
      error_message: e.message
    )
    raise
  rescue WebhookSenderService::PermanentError => e
    # 4xx - log the failure, then re-raise so discard_on drops the job
    syncable.sync_logs.create!(
      status: 'failed',
      direction: 'to_upstream',
      error_message: e.message
    )
    raise
  end
end
For monitoring, treat your sync logs as a first-class audit trail. A growing number of pending logs indicates jobs are retrying (possible upstream degradation). A spike in failed logs suggests configuration drift or API contract changes. Set up alerts for both patterns.
For truly stuck webhooks—perhaps the upstream system is down for days—implement a dead letter queue pattern. After exhausting retries, move the event to a separate failed_webhooks table with enough context for manual replay. Build an admin interface where ops can inspect the payload, update it if needed, and retry once the issue is resolved.
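One way to wire up that dead-letter step is ActiveJob's `retry_on` block, which runs only after the final attempt is exhausted. This is a hedged sketch: `FailedWebhook` is an assumed model, not something from the schemas above.

```ruby
class WebhookDeliveryJob < ApplicationJob
  # The block fires after attempt 5 fails; instead of raising, park the
  # event with enough context for the admin replay interface.
  retry_on WebhookSenderService::RetryableError,
           wait: :polynomially_longer, attempts: 5 do |job, error|
    FailedWebhook.create!(
      payload: job.arguments.first,
      error_class: error.class.name,
      error_message: error.message
    )
  end
end
```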
The key insight: automation handles the common path (transient failures), but you need human-friendly tools for the edge cases.
Outbound Webhook Error Handling Decision Tree
Rails-Specific Implementation Patterns
Rails provides excellent primitives for webhook handling that align naturally with its convention-over-configuration philosophy. Here's how to structure a production-ready implementation.
Controller Setup for Inbound Webhooks
Use ActionController::API for webhook endpoints rather than your existing API framework (like Grape). This keeps authentication concerns separate and prevents webhook routes from being caught by overly-broad error handlers:
# config/routes.rb
post '/webhooks/venue/:venue_id', to: 'webhooks/venues#create'
mount GrapeAPI => '/' # After webhook routes

# app/controllers/webhooks/venues_controller.rb
class Webhooks::VenuesController < Webhooks::BaseController
  def create
    verify_signature
    return if performed?

    ProcessWebhookJob.perform_later(venue_id: params[:venue_id], payload: request.raw_post)
    head :ok
  end

  private

  def verify_signature
    received = request.headers['X-Webhook-Signature']
    if received.blank?
      render json: { error: 'Missing signature' }, status: :unauthorized
      return
    end

    expected = OpenSSL::HMAC.hexdigest('SHA256', venue.webhook_secret, request.raw_post)
    unless ActiveSupport::SecurityUtils.secure_compare(received, expected)
      render json: { error: 'Invalid signature' }, status: :unauthorized
    end
  end
end
The performed? check pattern lets validation methods render responses directly while maintaining readable flow control in the action.
Background Processing with Targeted Retries
For outbound webhooks, differentiate between transient failures (retry) and permanent failures (discard). Network timeouts and 5xx server errors are typically transient — the receiver may be temporarily down. 4xx client errors are permanent — your payload or credentials are wrong, and retrying won't help:
class SendWebhookJob < ApplicationJob
  retry_on WebhookService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookService::PermanentError

  def perform(record)
    WebhookService.new(record).send_update
  end
end
HTTP 4xx responses typically indicate configuration or payload problems that won't resolve through retries — discard these immediately. 5xx responses are often transient (the receiver is temporarily down or overloaded) and should be retried with backoff, just like network timeouts. RFC 9110 explicitly describes 503 as a temporary condition, often accompanied by a Retry-After header.
Audit Trail with Polymorphic Associations
Track webhook delivery with a reusable polymorphic log:
create_table :webhook_sync_logs do |t|
  t.references :syncable, polymorphic: true, null: false
  t.integer :direction, null: false # enum: [:to_upstream, :from_upstream]
  t.integer :status, default: 0 # enum: [:pending, :success, :failed]
  t.jsonb :payload
  t.text :error_message
  t.timestamps
end
This design supports bidirectional tracking and multiple record types without additional migrations.
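The matching model maps those integer columns to Rails enums, giving query scopes for free. A minimal sketch (the `recent_failures` scope is an illustrative addition):

```ruby
class WebhookSyncLog < ApplicationRecord
  belongs_to :syncable, polymorphic: true

  # Integer-backed enums matching the migration's comments
  enum :direction, { to_upstream: 0, from_upstream: 1 }
  enum :status, { pending: 0, success: 1, failed: 2 }

  # Enum-generated scopes compose naturally with time filters
  scope :recent_failures, -> { failed.where(created_at: 1.day.ago..) }
end
```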
Testing Strategies
For outbound webhooks, use VCR to record real HTTP interactions:
it "sends signed payload", :vcr do
  expect { service.send_update }.to change(WebhookSyncLog, :count).by(1)
  expect(WebhookSyncLog.last).to be_success
end
For inbound webhooks, use request specs with signature generation helpers:
def generate_signature(body, secret)
  OpenSSL::HMAC.hexdigest('SHA256', secret, body)
end

it "accepts valid signature" do
  payload = { status: 'filled' }.to_json
  post venue_webhook_path(venue), params: payload,
       headers: { 'X-Webhook-Signature' => generate_signature(payload, venue.webhook_secret) }
  expect(response).to have_http_status(:ok)
end
Webhook Event Lifecycle State Transitions
Security Considerations and Attack Vectors
Webhook systems present unique security challenges because they expose server-side endpoints to external callers, often with limited ability to verify the source. The most critical defense is proper HMAC signature verification, but implementation details matter enormously.
Timing Attacks on Signature Verification
Never use standard string comparison (==) to verify HMAC signatures. An attacker can measure response times to determine which bytes match, gradually reconstructing a valid signature:
# VULNERABLE - timing leak reveals signature bytes
if request.headers["X-Webhook-Signature"] == expected_signature
  process_webhook
end

# SAFE - nil guard + constant-time comparison
received = request.headers["X-Webhook-Signature"]
if received.present? &&
   ActiveSupport::SecurityUtils.secure_compare(received, expected_signature)
  process_webhook
else
  head :unauthorized
end
The secure_compare method performs a constant-time comparison that prevents attackers from discovering the correct signature byte-by-byte through timing analysis. Note that while the comparison itself is constant-time, the string length may still be observable — which is acceptable for HMAC signatures since both strings are always the same length.
Signature Bypass Attempts
Always verify signatures before parsing the payload, and compute the HMAC over the raw request body rather than a re-serialized version. This prevents attacks that exploit differences between JSON parsers:
raw_body = request.body.read
expected = OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
return head :unauthorized unless secure_compare(provided, expected)
payload = JSON.parse(raw_body) # Only parse after verification
Denial of Service Protection
Webhook endpoints are prime DoS targets. Implement multiple layers of defense:
- Rate limiting per source: Use Rack::Attack or similar to limit requests per IP or signature key
- Payload size limits: Reject bodies over a reasonable threshold (e.g., 1MB) before verification
- Timeout enforcement: Set strict timeouts for payload processing
- Queue depth monitoring: Track pending webhook jobs and reject new webhooks if the queue is saturated
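The first two layers can be sketched with Rack::Attack in an initializer; the path prefix, limits, and size threshold here are assumptions to adjust per deployment:

```ruby
# config/initializers/rack_attack.rb
class Rack::Attack
  # Throttle webhook traffic per source IP
  throttle("webhooks/ip", limit: 60, period: 1.minute) do |req|
    req.ip if req.path.start_with?("/webhooks")
  end

  # Reject oversized bodies before any signature computation
  blocklist("webhooks/oversized") do |req|
    req.path.start_with?("/webhooks") && req.content_length.to_i > 1_000_000
  end
end
```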
IP Allowlisting Trade-offs
IP allowlisting provides defense-in-depth but comes with operational overhead. Many webhook providers use dynamic IP ranges or CDNs, requiring frequent allowlist updates. It's best used as a secondary control alongside signature verification, not a replacement. For high-security scenarios, require both valid signatures AND source IP verification.
Production Lessons and War Stories
After years of webhook implementations across financial platforms, here are the lessons that only production load teaches you.
Database indexes matter more than you think. When you're logging every webhook attempt in a polymorphic audit table, missing indexes will kill you. Always index [syncable_type, syncable_id, created_at] together—you'll be querying recent sync history per record constantly during debugging. We learned this when a status page query brought the database to its knees scanning 4 million audit rows.
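A sketch of that migration (the index name is an assumption; composite order matters, with the polymorphic pair first and the timestamp last so recent-history queries stay index-only):

```ruby
class AddSyncHistoryIndexToWebhookSyncLogs < ActiveRecord::Migration[7.1]
  def change
    add_index :webhook_sync_logs,
              [:syncable_type, :syncable_id, :created_at],
              name: "idx_sync_logs_recent_per_record"
  end
end
```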
Separate secrets for each direction. Using the same HMAC secret for inbound and outbound webhooks seems elegant, but creates an operational nightmare during rotation. When an upstream provider forces a secret change, you need to rotate independently without coordinating both sides simultaneously. Store outbound secrets per destination, inbound secrets per source.
Network errors ≠ HTTP errors. This distinction changed our retry strategy completely. Retrying a 400 Bad Request is pointless—your payload is malformed. But Errno::ETIMEDOUT? Absolutely retry. Our job configuration reflects this:
class WebhookSyncJob < ApplicationJob
  retry_on WebhookService::RetryableError,
           wait: :polynomially_longer, attempts: 5
  discard_on WebhookService::PermanentError # 4xx responses
end
The after_commit callback trap. Using after_save to enqueue webhook jobs leads to a subtle race condition: the job can execute before the transaction commits, seeing stale data. Worse, if the transaction rolls back, you've sent a webhook for a change that never persisted. Always use after_commit on: :update.
Keep a runbook. When webhooks fail at 2 AM, you need a decision tree: Is the signature failing? Check secret rotation dates. Getting timeouts? Check provider status page. Seeing pending logs older than 10 minutes? Dead job workers. We maintain a Notion runbook mapping each failure mode to diagnostic queries and remediation steps.
Building for Resilience
Throughout this article, we've explored two sides of webhook infrastructure: receiving webhooks from external systems and sending them out. The architectural decisions in each direction share a common thread—designing for failure.
Resilient webhook systems treat sending and receiving as fundamentally different problems. Inbound webhooks require defensive validation and timing-safe authentication, while outbound webhooks need intelligent retry logic that distinguishes transient failures (network errors, 5xx) from permanent ones (4xx, bad payloads). Building separate controller hierarchies and service objects for each direction keeps these concerns cleanly separated.
# Receiving: fail fast with explicit validation
def create
  verify_signature
  return if performed? # Short-circuit on authentication failure

  payload = parse_payload # Parse only after the signature checks out
  return if performed? # Short-circuit on malformed payloads

  # Process only valid, authenticated requests
end

# Sending: classify failures for appropriate handling
begin
  response = http.request(signed_request)
  if response.code.to_i >= 500
    raise RetryableError, "HTTP #{response.code}" # Let job framework retry
  elsif response.code.to_i >= 400
    log_failure(response) # Don't retry configuration problems
  end
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
  raise RetryableError, e.message # Network errors always warrant retries
end
Comprehensive audit trails are non-negotiable. A polymorphic sync log table with direction enums and pending/success/failed states provides visibility into webhook behavior across your entire system. When things go wrong—and they will—these logs become your debugging lifeline.
Finally, embrace graceful degradation. Accept webhooks from degraded venues because in-flight transactions matter. Log HTTP errors without retrying because broken payloads won't fix themselves. Use after_commit callbacks so jobs only fire after successful transactions. These patterns acknowledge that distributed systems are messy, and resilience comes from handling inevitable failures intelligently rather than optimistically assuming success.
The webhook systems that survive production are the ones built with failure as a first-class consideration, not an afterthought.