Where workflows fail
Automation fails most often at integration boundaries: delayed webhooks, malformed payloads, and hidden assumptions in downstream systems.
Retry policy design
Design retries with intent, not default settings:
- Retry only transient failures
- Cap attempts and introduce exponential backoff
- Route persistent failures to a recovery queue
function shouldRetry(status: number) {
return [408, 409, 429, 500, 502, 503, 504].includes(status);
}Human-in-the-loop handoff
Critical automations should degrade gracefully. Add human review stages when confidence is low or business impact is high.
Reliability warning
Do not silently drop failed jobs. Every failure needs a visible state and an owner.
Operations dashboard
Your operations dashboard should expose queue depth, median completion time, failure reasons, and retry rates by workflow.