The Bidirectional Webhook Challenge
When building API integrations, most Rails developers eventually encounter webhooks—but the conversation usually starts and ends with "receiving webhooks from Stripe." This narrow focus obscures an uncomfortable truth: production webhook systems are almost always bidirectional. You need to receive status updates from external services and broadcast your own events to partners. These two flows look deceptively similar but require fundamentally different architectural approaches.
Consider the lifecycle differences. When you receive a webhook, you're at the mercy of the sender's retry policy. Your endpoint must be fast, idempotent, and forgiving—a 500 error might mean the sender never retries, or worse, disables your integration entirely. When you send webhooks, you control the retry strategy. A network timeout deserves aggressive retries with exponential backoff, but a 400 Bad Request signals a payload problem that won't fix itself.
# Receiving: Fail fast, be lenient
def create
  validate_signature
  return if performed? # Already rendered 401
  process_update(params)
  head :ok # Always 200, even if processing is queued
end

# Sending: Retry on network issues, not HTTP errors
class WebhookSenderJob < ApplicationJob
  retry_on Net::OpenTimeout, wait: :polynomially_longer
  discard_on Net::HTTPClientException # 4xx raised by response.value - don't retry
end
The security models differ too. Inbound webhooks need per-sender signature verification with independent secret rotation. Outbound webhooks use your signing secret, shared with receivers who verify your authenticity. Same cryptographic primitive (HMAC-SHA256), opposite trust relationships.
This article walks through building both directions in a production Rails application—covering signature verification, polymorphic audit trails, and the subtle engineering decisions that distinguish hobbyist integrations from resilient infrastructure. We'll explore why network errors and HTTP errors deserve different treatment, and how to structure your code so webhook concerns don't bleed into your core domain logic.
Bidirectional Webhook Flow Comparison
Part 1: Receiving Webhooks Securely
When receiving webhooks, you're accepting push notifications from external systems—fundamentally different from serving traditional API requests. The key challenge is that you have no control over retry behavior, and must assume the sender won't handle your rejections gracefully. This demands a security-first, defensive design.
HMAC Signature Verification
The cornerstone of webhook security is HMAC signature verification. Never trust that a request actually came from your integration partner just because it hit your endpoint. Instead, verify a signature computed from the payload:
def verify_signature
  expected_signature = OpenSSL::HMAC.hexdigest(
    "SHA256",
    venue.webhook_secret,
    request.raw_post
  )
  # .to_s guards against a missing header; secure_compare raises on nil
  provided_signature = request.headers["X-Webhook-Signature"].to_s
  unless ActiveSupport::SecurityUtils.secure_compare(expected_signature, provided_signature)
    render json: { error: "Invalid signature" }, status: :unauthorized
    return false
  end
  true
end
Notice secure_compare—this timing-safe comparison prevents attackers from discovering the correct signature character-by-character through timing analysis.
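The round trip can be exercised outside Rails with nothing but the openssl stdlib. A minimal sketch, assuming a hex-encoded signature header; the helper names are illustrative:

```ruby
require "openssl"

# Sender side: sign the exact raw body that will be transmitted.
def sign_payload(secret, raw_body)
  OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
end

# Receiver side: recompute from the raw body and compare in constant time.
# OpenSSL.secure_compare digests both inputs first, so differing lengths
# (or a missing header, coerced via to_s) don't leak timing information.
def valid_signature?(secret, raw_body, provided)
  OpenSSL.secure_compare(sign_payload(secret, raw_body), provided.to_s)
end
```

Note that verification recomputes from `raw_body`: any re-serialization of the payload before signing or verifying would silently break the comparison.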
Store Secrets Per Partner
Unlike outbound webhooks where you control the secret, inbound webhooks require storing each partner's secret securely. A JSONB settings column per venue works well: venue.settings["webhook_secret"]. This enables independent secret rotation without redeployment when a partner rotates their key.
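Rotation is smoother still if, during a grace window, you accept signatures computed with either the current or the immediately previous secret. A sketch; the `:current`/`:previous` hash shape is an assumption, not a fixed schema:

```ruby
require "openssl"

# Accepts the current secret and, during a rotation window, the previous one.
# Drop :previous from the hash once the window closes to revoke the old key.
def signature_valid_for_any?(secrets, raw_body, provided)
  [secrets[:current], secrets[:previous]].compact.any? do |secret|
    expected = OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
    OpenSSL.secure_compare(expected, provided.to_s)
  end
end
```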
Controller Architecture
Keep webhook controllers separate from your main API hierarchy. They have fundamentally different authentication (HMAC vs. API keys) and different lifecycle concerns. An ActionController::API base class specifically for webhooks keeps these concerns isolated:
class Webhooks::BaseController < ActionController::API
  before_action :verify_signature
end
The most important mindset shift: receiving webhooks is fire-and-forget from the sender's perspective. Return success quickly, then process asynchronously if needed.
Part 2: Sending Webhooks Reliably
Sending webhooks is the flip side of receiving them, and it comes with its own set of challenges. While inbound webhooks focus on verification and security, outbound webhooks are all about reliability — ensuring your notification reaches the destination even when networks are flaky or services are temporarily unavailable.
The most critical architectural decision when building outbound webhook systems is understanding the difference between network failures and HTTP errors. Network failures (timeouts, connection refused, DNS errors) are transient — they indicate the remote service might be temporarily unreachable but could recover. These should trigger retries. HTTP errors (4xx, 5xx responses), however, typically indicate a configuration or payload problem that won't resolve itself through retries:
class WebhookSenderService
  def call
    response = send_webhook
    # Log success/failure but don't retry HTTP errors
    return Result.new(success: true) if response.is_a?(Net::HTTPSuccess)

    Result.new(success: false, error: "HTTP #{response.code}")
  rescue Net::OpenTimeout, Net::ReadTimeout, SocketError
    # Re-raise network errors so the job retries
    raise
  end
end
Your background job should back off between attempts; Rails' :polynomially_longer wait grows the delay with each retry. A typical pattern: retry network failures up to 5 times with increasing delays (30s, 5min, 30min, 2h, 8h), then give up. For audit purposes, create the sync log record in a pending state before making the HTTP call, then update it to success or failed afterward — if your process crashes mid-request, the pending record serves as evidence.
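If you prefer the schedule to be explicit rather than formula-driven, ActiveJob's retry_on also accepts a lambda for wait. A sketch in plain Ruby; the interval values are the ones quoted above, and the table is tunable:

```ruby
# Hypothetical fixed backoff schedule (seconds): 30s, 5min, 30min, 2h, 8h.
BACKOFF_SCHEDULE = [30, 300, 1800, 7200, 28_800].freeze

# Delay before the given attempt number; clamps at the last entry.
def backoff_for(executions)
  index = [executions - 1, BACKOFF_SCHEDULE.length - 1].min
  BACKOFF_SCHEDULE[index]
end

# In a job class this plugs into retry_on's lambda form:
#   retry_on Net::OpenTimeout, attempts: 5,
#            wait: ->(executions) { backoff_for(executions) }
```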
One subtle but important detail: use after_commit rather than after_save for your webhook callback. This ensures the job is enqueued only after the database transaction commits, preventing race conditions where the job executes against uncommitted data or phantom records from rolled-back transactions.
Building a Polymorphic Audit Trail
A robust audit trail transforms webhook debugging from guesswork into structured investigation. Your webhook_events table should use polymorphic associations to handle both directions with a single schema:
class CreateWebhookEvents < ActiveRecord::Migration[7.1]
  def change
    create_table :webhook_events do |t|
      t.references :eventable, polymorphic: true, null: false # adds the [type, id] index
      t.string :direction, null: false # 'inbound' or 'outbound'
      t.string :status, null: false    # 'pending', 'success', 'failed'
      t.text :request_body
      t.text :response_body
      t.integer :http_status
      t.string :error_class
      t.text :error_message
      t.jsonb :metadata, default: {}
      t.timestamps
    end

    add_index :webhook_events, [:direction, :status, :created_at]
  end
end
The eventable association allows tracking events against different domain objects — an order status update, a payment notification, or an inventory sync — without schema changes. The direction field keeps both flows in one table while enabling separate queries.
What to capture: Store the raw request/response bodies as text for exact replay during debugging. JSONB metadata handles variable data like retry attempt numbers, venue identifiers, or API versions. Keep http_status separate from status — a 500 response is still a "completed" HTTP transaction, distinct from network timeouts.
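That classification rule can be made concrete in a small helper. The column names follow the migration above, while `audit_fields` itself is a hypothetical illustration:

```ruby
# Maps a delivery outcome onto the audit columns.
# An Integer is a completed HTTP transaction (even a 500);
# an exception is a network-level failure with no HTTP status at all.
def audit_fields(outcome)
  case outcome
  when Integer
    { status: outcome < 400 ? "success" : "failed", http_status: outcome }
  when StandardError
    { status: "failed", http_status: nil,
      error_class: outcome.class.name, error_message: outcome.message }
  end
end
```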
Managing table growth: This table grows linearly with webhook volume. For high-traffic systems, partition by created_at monthly and implement a retention policy. Archive events older than 90 days to cold storage, keeping only failed events indefinitely for pattern analysis.
# Production debugging example
WebhookEvent.where(direction: 'outbound', status: 'failed')
            .where('created_at > ?', 1.day.ago)
            .group(:error_class)
            .count
# => {"Net::OpenTimeout"=>47, "WebhookService::InvalidSignature"=>3}
This immediately reveals whether you're fighting network instability or a configuration issue — fundamentally different problems requiring different solutions.
Error Handling Philosophy and Recovery Strategies
When building webhook systems, your error handling philosophy should fundamentally distinguish between expected failures and unexpected ones. Expected failures—like a 404 from a deleted resource or a 422 validation error—signal a problem with your payload or configuration. These shouldn't trigger retries; the payload is wrong and won't magically become right on attempt #17. Unexpected failures—network timeouts, connection refused, DNS hiccups—are transient and should retry.
This distinction shapes your entire recovery strategy:
class WebhookDeliveryJob < ApplicationJob
  retry_on Net::OpenTimeout, Net::ReadTimeout, wait: :polynomially_longer
  # Raised by Net::HTTPResponse#value for 4xx and 5xx - don't retry
  discard_on Net::HTTPClientException, Net::HTTPFatalError

  def perform(syncable)
    result = WebhookSyncService.call(syncable)
    if result.success?
      syncable.sync_logs.create!(status: 'success', direction: 'to_upstream')
    else
      # HTTP errors - log and stop
      syncable.sync_logs.create!(
        status: 'failed',
        direction: 'to_upstream',
        error_message: result.error
      )
    end
  rescue Net::OpenTimeout, Net::ReadTimeout => e
    # Network error - log but re-raise so retry_on kicks in
    syncable.sync_logs.create!(status: 'pending', error_message: e.message)
    raise
  end
end
For monitoring, treat your sync logs as a first-class audit trail. A growing number of pending logs indicates jobs are retrying (possible upstream degradation). A spike in failed logs suggests configuration drift or API contract changes. Set up alerts for both patterns.
For truly stuck webhooks—perhaps the upstream system is down for days—implement a dead letter queue pattern. After exhausting retries, move the event to a separate failed_webhooks table with enough context for manual replay. Build an admin interface where ops can inspect the payload, update it if needed, and retry once the issue is resolved.
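The dead letter flow can be sketched independently of any queue backend. Here `max_attempts` and the in-memory `parked` array are stand-ins for your job framework's retry budget and the failed_webhooks table:

```ruby
# Minimal dead-letter sketch: after max_attempts failures the event is
# parked with enough context for manual replay instead of retrying forever.
class DeadLetterQueue
  attr_reader :parked

  def initialize(max_attempts: 5)
    @max_attempts = max_attempts
    @parked = []
  end

  # deliver is a callable returning true on success, false on failure.
  def attempt(event, deliver)
    @max_attempts.times do
      return :delivered if deliver.call(event)
    end
    @parked << { event: event, attempts: @max_attempts, parked_at: Time.now }
    :dead_lettered
  end
end
```

An admin replay tool then only needs to iterate `parked`, optionally patch each payload, and re-enqueue it.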
The key insight: automation handles the common path (transient failures), but you need human-friendly tools for the edge cases.
Network Error vs HTTP Error Decision Tree
Rails-Specific Implementation Patterns
Rails provides excellent primitives for webhook handling that align naturally with its convention-over-configuration philosophy. Here's how to structure a production-ready implementation.
Controller Setup for Inbound Webhooks
Use ActionController::API for webhook endpoints rather than your existing API framework (like Grape). This keeps authentication concerns separate and prevents webhook routes from being caught by overly-broad error handlers:
# config/routes.rb
post '/webhooks/venue/:venue_id', to: 'webhooks/venues#create'
mount GrapeAPI => '/' # After webhook routes
# app/controllers/webhooks/venues_controller.rb
class Webhooks::VenuesController < Webhooks::BaseController
  def create
    verify_signature
    return if performed?
    ProcessWebhookJob.perform_later(venue_id: params[:venue_id], payload: request.raw_post)
    head :ok
  end

  private

  def venue
    @venue ||= Venue.find(params[:venue_id])
  end

  def verify_signature
    received = request.headers['X-Webhook-Signature'].to_s
    expected = OpenSSL::HMAC.hexdigest('SHA256', venue.webhook_secret, request.raw_post)
    unless ActiveSupport::SecurityUtils.secure_compare(received, expected)
      render json: { error: 'Invalid signature' }, status: :unauthorized
    end
  end
end
The performed? check pattern lets validation methods render responses directly while maintaining readable flow control in the action.
Background Processing with Targeted Retries
For outbound webhooks, differentiate between network failures (retry) and HTTP errors (don't retry):
class SendWebhookJob < ApplicationJob
  retry_on Net::OpenTimeout, Net::ReadTimeout, wait: :polynomially_longer
  discard_on WebhookService::HTTPError

  def perform(record)
    WebhookService.new(record).send_update
  end
end
HTTP 4xx/5xx responses indicate configuration or payload problems that won't resolve through retries. Network timeouts are transient and benefit from exponential backoff.
Audit Trail with Polymorphic Associations
Track webhook delivery with a reusable polymorphic log:
create_table :webhook_sync_logs do |t|
  t.references :syncable, polymorphic: true, null: false
  t.integer :direction, null: false # enum: [:to_upstream, :from_upstream]
  t.integer :status, default: 0     # enum: [:pending, :success, :failed]
  t.jsonb :payload
  t.text :error_message
  t.timestamps
end
This design supports bidirectional tracking and multiple record types without additional migrations.
Testing Strategies
For outbound webhooks, use VCR to record real HTTP interactions:
it "sends signed payload", :vcr do
  expect { service.send_update }.to change(WebhookSyncLog, :count).by(1)
  expect(WebhookSyncLog.last).to be_success
end
For inbound webhooks, use request specs with signature generation helpers:
def generate_signature(body, secret)
  OpenSSL::HMAC.hexdigest('SHA256', secret, body)
end

it "accepts valid signature" do
  payload = { status: 'filled' }.to_json
  post venue_webhook_path(venue), params: payload,
       headers: { 'X-Webhook-Signature' => generate_signature(payload, venue.webhook_secret) }
  expect(response).to have_http_status(:ok)
end
Webhook Event Lifecycle State Transitions
Security Considerations and Attack Vectors
Webhook systems present unique security challenges because they expose server-side endpoints to external callers, often with limited ability to verify the source. The most critical defense is proper HMAC signature verification, but implementation details matter enormously.
Timing Attacks on Signature Verification
Never use standard string comparison (==) to verify HMAC signatures. An attacker can measure response times to determine which bytes match, gradually reconstructing a valid signature:
# VULNERABLE - timing leak reveals signature bytes
if request.headers["X-Webhook-Signature"] == expected_signature
  process_webhook
end

# SAFE - constant-time comparison (.to_s guards a missing header)
if ActiveSupport::SecurityUtils.secure_compare(
  request.headers["X-Webhook-Signature"].to_s,
  expected_signature
)
  process_webhook
end
The secure_compare method takes the same amount of time regardless of where the strings differ, preventing timing-based attacks.
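The idea behind secure_compare can be shown in a few lines of plain Ruby. This is a teaching sketch, not a replacement for the ActiveSupport method:

```ruby
# XORs every byte pair and ORs the differences together, so the loop
# always runs to completion regardless of where the strings first differ.
def constant_time_equal?(a, b)
  return false unless a.bytesize == b.bytesize

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end
```

The early return on length mismatch is safe here because HMAC-SHA256 hex signatures have a fixed, public length; only the byte values are secret.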
Signature Bypass Attempts
Always verify signatures before parsing the payload, and compute the HMAC against the raw request body rather than a re-serialized version. This prevents attacks that exploit differences between JSON parsers:
raw_body = request.body.read
provided = request.headers["X-Webhook-Signature"].to_s
expected = OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
return head :unauthorized unless ActiveSupport::SecurityUtils.secure_compare(provided, expected)

payload = JSON.parse(raw_body) # Only parse after verification
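The ordering rule can be wrapped in a single helper so parsing can never precede verification. The helper name and its nil-on-failure convention are assumptions:

```ruby
require "openssl"
require "json"

# Returns the parsed payload only when the raw body's signature checks
# out; nil otherwise. Parsing strictly follows verification, and the
# HMAC is computed over the raw bytes, never a re-serialized document.
def verified_payload(secret, raw_body, provided_signature)
  expected = OpenSSL::HMAC.hexdigest("SHA256", secret, raw_body)
  return nil unless OpenSSL.secure_compare(expected, provided_signature.to_s)

  JSON.parse(raw_body)
end
```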
Denial of Service Protection
Webhook endpoints are prime DoS targets. Implement multiple layers of defense:
- Rate limiting per source: Use Rack::Attack or similar to limit requests per IP or signature key
- Payload size limits: Reject bodies over a reasonable threshold (e.g., 1MB) before verification
- Timeout enforcement: Set strict timeouts for payload processing
- Queue depth monitoring: Track pending webhook jobs and reject new webhooks if the queue is saturated
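In production, Rack::Attack covers the first item; underneath, per-source rate limiting reduces to a counter per key and time window. A minimal fixed-window sketch (an in-memory stand-in, not a distributed limiter):

```ruby
# Fixed-window rate limiter: allows up to `limit` requests per `window`
# seconds for each key (e.g. a source IP or webhook signing key).
class FixedWindowLimiter
  def initialize(limit:, window:)
    @limit = limit
    @window = window
    @counters = Hash.new { |h, k| h[k] = { started: nil, count: 0 } }
  end

  # `now` is injectable for testing; defaults to wall-clock time.
  def allow?(key, now: Time.now.to_f)
    counter = @counters[key]
    if counter[:started].nil? || now - counter[:started] >= @window
      counter[:started] = now # start a fresh window
      counter[:count] = 0
    end
    counter[:count] += 1
    counter[:count] <= @limit
  end
end
```

A real deployment would back the counters with Redis or similar so the limit holds across application processes.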
IP Allowlisting Trade-offs
IP allowlisting provides defense-in-depth but comes with operational overhead. Many webhook providers use dynamic IP ranges or CDNs, requiring frequent allowlist updates. It's best used as a secondary control alongside signature verification, not a replacement. For high-security scenarios, require both valid signatures AND source IP verification.
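When you do allowlist, Ruby's stdlib IPAddr handles CIDR ranges directly. The ranges below are reserved documentation addresses, not real provider IPs:

```ruby
require "ipaddr"

# Illustrative allowlist; real entries come from your provider's docs.
ALLOWED_SOURCES = [
  IPAddr.new("192.0.2.0/24"),  # documentation CIDR range
  IPAddr.new("198.51.100.7")   # single documentation address
].freeze

def allowed_source?(remote_ip)
  ip = IPAddr.new(remote_ip)
  ALLOWED_SOURCES.any? { |range| range.include?(ip) }
rescue IPAddr::InvalidAddressError
  false # malformed address strings are simply rejected
end
```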
Production Lessons and War Stories
After years of webhook implementations across financial platforms, here are the lessons that only production load teaches you.
Database indexes matter more than you think. When you're logging every webhook attempt in a polymorphic audit table, missing indexes will kill you. Always index [syncable_type, syncable_id, created_at] together—you'll be querying recent sync history per record constantly during debugging. We learned this when a status page query brought the database to its knees scanning 4 million audit rows.
Separate secrets for each direction. Using the same HMAC secret for inbound and outbound webhooks seems elegant, but creates an operational nightmare during rotation. When an upstream provider forces a secret change, you need to rotate independently without coordinating both sides simultaneously. Store outbound secrets per destination, inbound secrets per source.
Network errors ≠ HTTP errors. This distinction changed our retry strategy completely. Retrying a 400 Bad Request is pointless—your payload is malformed. But Errno::ETIMEDOUT? Absolutely retry. Our job configuration reflects this:
class WebhookSyncJob < ApplicationJob
  retry_on Net::OpenTimeout, wait: :polynomially_longer, attempts: 5
  discard_on Net::HTTPClientException # 4xx raised by response.value
end
The after_commit callback trap. Using after_save to enqueue webhook jobs leads to a subtle race condition: the job can execute before the transaction commits, seeing stale data. Worse, if the transaction rolls back, you've sent a webhook for a change that never persisted. Always use after_commit on: :update.
Keep a runbook. When webhooks fail at 2 AM, you need a decision tree: Is the signature failing? Check secret rotation dates. Getting timeouts? Check provider status page. Seeing pending logs older than 10 minutes? Dead job workers. We maintain a Notion runbook mapping each failure mode to diagnostic queries and remediation steps.
Building for Resilience
Throughout this article, we've explored two sides of webhook infrastructure: receiving webhooks from external systems and sending them out. The architectural decisions in each direction share a common thread—designing for failure.
Resilient webhook systems treat sending and receiving as fundamentally different problems. Inbound webhooks require defensive validation and timing-safe authentication, while outbound webhooks need intelligent retry logic that distinguishes transient network failures from permanent payload errors. Building separate controller hierarchies and service objects for each direction keeps these concerns cleanly separated.
# Receiving: fail fast with explicit validation
def create
  validate_signature # Verify against the raw body before parsing
  return if performed?
  payload = parse_payload
  return if performed? # Short-circuit on malformed JSON
  # Process only valid, authenticated requests
end

# Sending: classify failures for appropriate handling
begin
  response = http.request(signed_request)
  log_failure(response) if response.code.to_i >= 400 # HTTP errors: log, don't retry
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError
  raise # Network failures: let the job framework retry
end
Comprehensive audit trails are non-negotiable. A polymorphic sync log table with direction enums and pending/success/failed states provides visibility into webhook behavior across your entire system. When things go wrong—and they will—these logs become your debugging lifeline.
Finally, embrace graceful degradation. Accept webhooks from degraded venues because in-flight transactions matter. Log HTTP errors without retrying because broken payloads won't fix themselves. Use after_commit callbacks so jobs only fire after successful transactions. These patterns acknowledge that distributed systems are messy, and resilience comes from handling inevitable failures intelligently rather than optimistically assuming success.
The webhook systems that survive production are the ones built with failure as a first-class consideration, not an afterthought.