Highly Available System with Queue

Preventing event loss during downstream downtime

Problem

The client needed a webhook relay to forward Square events to destination stores. With direct forwarding, every time a store went briefly offline, the events sent during that window were lost. For an eCommerce integration where webhooks trigger payments, order updates, and inventory syncs, lost events meant broken data and angry merchants.

Constraints

  • Webhooks were business-critical and needed durable delivery, not best-effort HTTP forwarding.
  • Destination store availability was outside our control and could fail unpredictably.
  • The architecture needed to absorb spikes as well as downtime, without custom retry state machines.
  • Team size was small, so reliability had to come from managed primitives and simple operations.

Solution

  • Introduced an SQS queue between webhook ingestion and destination delivery.
  • Switched workers to pull from the queue and handle delivery in controlled batches.
  • Added retry behavior with exponential backoff for transient downstream failures.
  • Made the queue the default path (instead of direct-send with fallback), so both downtime and bursty traffic were handled by the same pipeline.
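
The worker side of the steps above can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: the names backoffSeconds, processBatch, deliver, ack and retryLater are hypothetical, as are the 5-second base and 900-second cap. In a real SQS worker, ack would presumably wrap DeleteMessage, retryLater would wrap ChangeMessageVisibility, and receiveCount would come from the ApproximateReceiveCount attribute.

```javascript
// Sketch only. All names and default values here are illustrative assumptions.

// Exponential backoff: base * 2^(attempt - 1), capped at capSeconds.
function backoffSeconds(attempt, baseSeconds = 5, capSeconds = 900) {
  // Treat a missing or invalid attempt count as the first attempt.
  const n = Number.isInteger(attempt) && attempt > 0 ? attempt : 1;
  return Math.min(capSeconds, baseSeconds * 2 ** (n - 1));
}

// Process one batch pulled from the queue.
// deliver(message)            -> forwards the webhook to the destination store
// ack(message)                -> removes the message from the queue
// retryLater(message, delay)  -> hides the message for `delay` seconds
async function processBatch(messages, { deliver, ack, retryLater }) {
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(
    messages.map(async (message) => deliver(message))
  );

  let delivered = 0;
  let retried = 0;
  for (let i = 0; i < messages.length; i++) {
    const message = messages[i];
    if (results[i].status === "fulfilled") {
      await ack(message);
      delivered++;
    } else {
      // Destination unavailable: keep the message and try again later.
      await retryLater(message, backoffSeconds(message.receiveCount));
      retried++;
    }
  }
  return { delivered, retried };
}
```

With these defaults, a message that fails on its second receive would be hidden for 10 seconds and one that fails on its third for 20 seconds. Because a failed message is never acknowledged, it stays in the queue while the store is offline instead of being dropped.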

Outcome

  • Eliminated the biggest failure mode: losing events when destination stores were briefly unavailable.
  • Stabilized the relay path during uneven traffic with predictable, queue-driven behavior.
  • Reduced operational guesswork by using one delivery model instead of mixed direct/fallback logic.
  • Established a reliability baseline early, so the team could scale throughput without redesigning the delivery architecture.

More Context

Q. Square has its own retry mechanism, so why not rely on it directly?
A. The client was building a Square plugin, so Square would send webhooks for all merchants to a single endpoint. Square also expects a very fast response to webhooks, which meant we couldn't receive and forward a webhook within a single API call. Hence a middle layer was needed: it responds quickly to Square and then forwards the webhooks on Square's behalf.
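
The "respond quickly, forward later" idea in the answer above might look something like this on the ingestion side. It is a sketch under assumptions: makeWebhookHandler and enqueue are hypothetical names, enqueue stands in for an SQS SendMessage call, and the choice of a 503 on enqueue failure assumes that a non-2xx response lets Square's own retry redeliver the event.

```javascript
// Sketch only. `makeWebhookHandler` and `enqueue` are illustrative names.
// The returned function has the (req, res) shape of an Express route handler.
function makeWebhookHandler(enqueue) {
  return async function handleWebhook(req, res) {
    try {
      // Persist the event first; delivery to the store happens later,
      // in a separate worker that reads from the queue.
      await enqueue({ receivedAt: Date.now(), body: req.body });
      // Acknowledge as soon as the event is durably queued.
      res.status(200).end();
    } catch (err) {
      // Could not queue the event: signal failure so the sender can retry
      // (assumption: a non-2xx response triggers redelivery).
      res.status(503).end();
    }
  };
}
```

In an Express app this could be mounted with something like `app.post("/webhooks/square", express.json(), makeWebhookHandler(enqueue))`, where the route path is made up for illustration. The handler does no forwarding itself, so its response time depends only on the enqueue call and not on whether the destination store is reachable.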

Stack

  • Node.js (Express)
  • AWS SQS
  • AWS App Runner
  • AWS DynamoDB