Building a Self-Healing Feedback Loop

Routines, cloud containers, and triage agents that turn overnight error signals and feedback into reviewable fixes by morning.

Michael Parker · 8 min read
Updated 1 June 2026

For most small engineering teams, the triage rhythm looks much the same. A user hits an error. Sentry pings the channel. Whoever is on rotation drops the feature work in flight, opens the trace, tries to reproduce, files a Linear issue, decides whether it needs fixing now or in the next cycle, and then tries to recover the context they had before the ping arrived.

Every individual step is cheap. The aggregate is brutal. One team that tracked this for a fortnight found roughly forty per cent of engineering time going to reactive triage, not because the bugs were hard, but because the context switches were constant.

A different shape of system addresses this. The error reports and feedback signals still arrive at the same rate. The team simply stops reacting to them in real time. Signals flow into a loop that triages, dedupes, and proposes fixes overnight, and the queue is reviewed once each morning.

This post describes how such a loop is built.

The four pieces

The goal is not to build a "self-healing" anything. The goal is to stop being interrupted. The system that emerges has four moving parts.

  1. Error boundaries in the product that emit structured, agent-readable signals, not just stack traces.
  2. Claude Code routines as the orchestration layer that listens for those signals and decides what to dispatch.
  3. Cloud containers so every agent run gets a clean, isolated checkout of the repo.
  4. Triage and coding agents that turn raw signals into Linear issues and draft pull requests.

A fifth piece, easy to miss, is the human review pattern that sits on top. The whole point of the loop is to deliver reviewable work, not autonomous merges. More on that below.

The flow

  Error boundary / feedback signal
              │
              ▼
       Routine trigger
              │
              ▼
   Triage agent (cloud container)
              │
       ┌──────┴───────────────────────────────────┐
   duplicate?                                      new
       │                                            │
       ▼                                            ▼
 update Linear issue,                 create Linear issue (scope + repro)
 increment signal count                             │
                                                    ▼
                                        well-scoped and safe?
                                                    │
                                  ┌─────────────────┴─────────────────┐
                                 no                                  yes
                                  │                                   │
                                  ▼                                   ▼
                        hold for human triage          Coding agent (cloud container)
                                                                      │
                                                                      ▼
                                                        Draft PR (fix + reasoning)
                                                                      │
                                                                      ▼
                                                          Morning review queue
                                                                      │
                                 ┌────────────────────────────────────┼───────────────────────┐
                              merge                                redirect                   close
                                 │                                     │                        │
                                 ▼                                     ▼                        ▼
                               ship                       re-dispatch with context       mark wrong-tree

It looks more complicated drawn out than it feels in practice. From the human seat there are only two surfaces: Linear, where issues appear, get deduped, and accumulate context, and GitHub, where draft PRs queue up overnight.

Error boundaries as signals, not just safety nets

The first thing to change is not the agents. It is the data they have to work with.

Most error boundaries do the usual three things: catch the error, render a fallback, and log to Sentry. This is useful for users and mediocre for agents. A stack trace tells you where something blew up. It tells you almost nothing about what the user was trying to do.

The fix is to rewrite the boundaries so they emit a structured event that includes the route, the actor and workspace context, the last few user actions in the session, the relevant feature flag state, and a compact summary of what the user was likely attempting. The boundary still catches and renders the fallback. It just also speaks a language a triage agent can reason about.

Feedback signals get the same treatment. Customer messages that hit the inbox are normalised into the same envelope. From the routine's perspective, "the dashboard crashed when I clicked refresh" and "I tried to refresh the dashboard and it spun forever" arrive as comparable inputs.
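A minimal sketch of what such a shared envelope might look like, written as server-side Python. The field names, the `SignalEnvelope` class, and the `from_feedback` helper are all assumptions made for illustration; the post lists what the event contains but not how it is named or serialised.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class SignalEnvelope:
    """One normalised signal, whether it came from an error boundary or the inbox.

    Hypothetical shape: the fields mirror the list in the text, the names are invented.
    """
    source: str                      # "error_boundary" or "feedback"
    route: str                       # where in the product it happened
    actor_id: str                    # who hit it
    workspace_id: str                # which workspace they were in
    recent_actions: List[str]        # last few user actions in the session
    feature_flags: Dict[str, bool]   # relevant flag state at the time
    intent_summary: str              # compact guess at what the user was attempting
    stack_trace: Optional[str] = None  # only present for caught errors


def from_feedback(message: str, route: str, actor_id: str, workspace_id: str) -> SignalEnvelope:
    """Wrap a customer message in the same envelope an error boundary would emit."""
    return SignalEnvelope(
        source="feedback",
        route=route,
        actor_id=actor_id,
        workspace_id=workspace_id,
        recent_actions=[],           # the inbox has no session history to attach
        feature_flags={},
        intent_summary=message.strip(),
    )


def to_json(envelope: SignalEnvelope) -> str:
    """Serialise the envelope so a routine can receive it as plain JSON."""
    return json.dumps(asdict(envelope), sort_keys=True)
```

With a shape like this, a crash report and a customer complaint about the same dashboard refresh differ only in `source` and in whether `stack_trace` is filled in, which is what lets the triage agent compare them.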

The lesson, worth internalising early, is that agents are only as good as the structure of what they are handed. The clearer the signal, the less the agent has to guess.

Routines as the orchestration layer

The routine triggers in Claude Code are the spine. One routine subscribes to error events, one to customer feedback, and one to a deduped-issue queue inside Linear. Each routine is short. It reads the signal, decides which agent profile to dispatch, and hands off.

Routines are not agents. That distinction matters. They are the policy layer that says "this kind of signal goes to this kind of worker in this kind of container with these guardrails." To change behaviour, you change a routine, not an agent prompt. That separation is the single biggest reason the system stays maintainable.
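To make the policy-versus-agent split concrete, here is a rough sketch of a routine's decision expressed as a lookup table. This is not Claude Code's routine configuration format; `DispatchPolicy`, `POLICIES`, the profile names, the container names, and the guardrail labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DispatchPolicy:
    """What a routine decides: which worker, in which container, under which rules."""
    agent_profile: str
    container: str
    guardrails: Tuple[str, ...]


# One entry per routine described in the text. All values are illustrative.
POLICIES: Dict[str, DispatchPolicy] = {
    "error_event": DispatchPolicy("triage", "repo-readonly", ("no-push",)),
    "customer_feedback": DispatchPolicy("triage", "repo-readonly", ("no-push",)),
    "deduped_issue": DispatchPolicy(
        "coding", "repo-clean-checkout", ("draft-pr-only", "declared-scope-only")
    ),
}


def route_signal(kind: str) -> DispatchPolicy:
    """Return the dispatch policy for a signal kind, or fail loudly if none exists."""
    try:
        return POLICIES[kind]
    except KeyError:
        raise ValueError(f"no routine subscribes to signal kind {kind!r}") from None
```

The design point is that tightening a guardrail means editing one row in the table, while the agent prompts stay untouched.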

Cloud containers as the substrate

Every dispatch runs in its own cloud container. Clean checkout, scoped credentials, fresh state. Containers beat long-lived workers for three reasons.

First, isolation. A coding agent exploring a fix should not pollute the workspace another agent is reasoning about.

Second, reproducibility. When a proposal is reviewed, the exact container can be rerun to see what the agent saw.

Third, the cost shape suits the workload. Most of these runs are short, bursty, and parallel. Long-running workers are the wrong primitive.

The boring practical detail is to cache the dependency layer aggressively. With that in place, getting from a cold checkout to "ready to run tests" can land around twenty seconds for a monorepo, which is fast enough that no one notices.
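The post does not say how the dependency layer is cached. One common approach, shown here purely as an assumption, is to key the cache on the contents of the lockfiles so the layer is reused until a dependency actually changes. The function name and the `deps-` prefix are invented for this sketch.

```python
import hashlib
from pathlib import Path
from typing import List


def dependency_cache_key(repo_root: Path, lockfiles: List[str]) -> str:
    """Derive a cache key from lockfile contents.

    The key changes only when a lockfile's name or bytes change, so containers
    with identical dependencies can share one cached layer.
    """
    digest = hashlib.sha256()
    for name in sorted(lockfiles):          # sort so argument order does not matter
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")                # separator between name and contents
        digest.update((repo_root / name).read_bytes())
    return "deps-" + digest.hexdigest()[:16]
```

A container start-up script could compute this key after checkout and restore the matching layer if one exists, falling back to a full install otherwise.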

Triage agents and the deduplication problem

This is where the system earns its keep, and it is the piece that is most easily underestimated.

The triage agent's job sounds simple. Read the signal. Search Linear for similar existing issues. If there is a match, update it and increment the signal count. If not, create a new issue with a clean scope, repro steps where possible, and a confidence rating.

This matters because most "new" errors are not new. In one two-week measurement window, roughly seventy per cent of inbound error signals were duplicates of an existing issue, or duplicates of each other arriving in the same hour. Without the loop, those duplicates land as fresh Sentry alerts and get reacted to individually. With the loop, they aggregate against a single issue with a rising count, turning twenty decisions into one.

Deduplication turns out to be a precondition for everything downstream. If the triage agent gets dedup wrong, the coding agent gets dispatched repeatedly against the same root cause, and the review queue fills with redundant proposals. This warrants more tuning than any other part of the system.

The agent uses three signals to match: structural similarity in the captured event envelope, semantic similarity against the issue title and description, and a stack-trace fingerprint where one exists. A match on any two is treated as a duplicate. A match on exactly one flags the signal for human confirmation. No match means a new issue. In practice this rule holds up well.
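The two-of-three rule is small enough to write down directly. The sketch below assumes each similarity check has already been reduced to a yes/no answer; how those checks are scored and thresholded is not covered in the post, and the names `MatchSignals`, `Verdict`, and `classify` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NEEDS_HUMAN = "needs_human_confirmation"
    NEW = "new"


@dataclass
class MatchSignals:
    """Outcome of comparing an inbound signal with one existing issue."""
    structural: bool             # event envelope looks alike
    semantic: bool               # title and description read alike
    fingerprint: Optional[bool]  # None when the signal has no stack trace


def classify(match: MatchSignals) -> Verdict:
    """Two or more matching signals: duplicate. Exactly one: ask a human. None: new."""
    hits = sum(1 for s in (match.structural, match.semantic, match.fingerprint) if s)
    if hits >= 2:
        return Verdict.DUPLICATE
    if hits == 1:
        return Verdict.NEEDS_HUMAN
    return Verdict.NEW
```

Note that a feedback message with no stack trace can still be classed as a duplicate, but only if both remaining signals agree.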

Coding agents and the proposal contract

Only a subset of issues get dispatched to a coding agent automatically. The routine applies a safety filter. Scope clear, blast radius bounded, no production data dependencies, no auth or billing surfaces. Anything that fails the filter sits in a "hold for human triage" lane.
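As a rough illustration, the filter can be written as a list of checks that each name the reason for a hold. The `IssueAssessment` fields, the surface labels, and the lane names are assumptions; the post gives the four criteria but not how they are evaluated.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Surfaces the text says never go to a coding agent automatically.
SENSITIVE_SURFACES: FrozenSet[str] = frozenset({"auth", "billing"})


@dataclass
class IssueAssessment:
    """Hypothetical summary the triage step attaches to a new issue."""
    scope_clear: bool
    blast_radius_bounded: bool
    needs_production_data: bool
    touched_surfaces: FrozenSet[str]


def failed_checks(issue: IssueAssessment) -> List[str]:
    """Return a human-readable reason for every criterion the issue fails."""
    failures = []
    if not issue.scope_clear:
        failures.append("scope unclear")
    if not issue.blast_radius_bounded:
        failures.append("blast radius unbounded")
    if issue.needs_production_data:
        failures.append("depends on production data")
    blocked = issue.touched_surfaces & SENSITIVE_SURFACES
    if blocked:
        failures.append("touches " + ", ".join(sorted(blocked)))
    return failures


def lane(issue: IssueAssessment) -> str:
    """Any failed check holds the issue; only a clean pass is auto-dispatched."""
    return "hold_for_human_triage" if failed_checks(issue) else "auto_dispatch"
```

Returning the reasons, not just a boolean, means the held issue can say why it was held, which makes the human triage lane quicker to work through.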

For issues that pass, the coding agent gets the Linear ticket, a clean container with the repo checked out, and a contract. Reproduce the fault, propose a fix, open a draft PR, and write up the reasoning in the PR description. If it cannot reproduce the fault, say so and stop. If the fix would touch more than the change it set out to make, stop.

That last rule is hard-won. Early versions of such an agent will notice tangentially related smells and start refactoring. The PRs are technically correct and impossible to review. Constraining the agent to its declared scope is what makes the review queue tractable.
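One way to enforce a declared scope mechanically is to compare the files a proposal changes against a list of path prefixes recorded on the issue. That representation is an assumption, as is the `out_of_scope` helper; the post only says the agent is held to its declared scope.

```python
from pathlib import PurePosixPath
from typing import List


def out_of_scope(changed_files: List[str], declared_scope: List[str]) -> List[str]:
    """Return the changed files that fall outside every declared path prefix.

    A file is in scope if it equals a declared path or sits underneath one.
    Matching is by path component, so "src/dashboard" does not cover
    "src/dashboard_old".
    """
    scope = [PurePosixPath(p) for p in declared_scope]
    stray = []
    for name in changed_files:
        path = PurePosixPath(name)
        if not any(path == s or s in path.parents for s in scope):
            stray.append(name)
    return stray
```

A non-empty result would be the cue for the agent to stop and report, not to open the PR.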

The morning review pattern

The human side of the loop is the part worth dwelling on, because it is the part that is easy to underrate.

Each morning there is a queue. Usually somewhere between five and twelve proposals, plus a handful of issues that were filed but did not progress to a fix. It can be worked through in one sitting, roughly thirty to forty minutes.

For each proposal there is one of three calls. Merge, redirect, or close. Merge is self-explanatory. Redirect means the fix is in the wrong direction and the agent needs more context, provided in a PR comment and re-dispatched. Close means the agent took a wrong-tree path and the issue needs human attention or should not be fixed at all.

In one window, the split settled at roughly a third, a third, a third. Slightly better than that on small bugs, slightly worse on anything that touches state or background jobs. The "wrong-tree" rate is the one worth watching most carefully, because it is the canary for whether upstream triage is degrading.

What changes in the day is not that bugs get fixed faster, although they do. It is that no one carries the open loop anymore. Bugs that used to nibble at attention all day instead sit in a single review surface visited once. The drop in cognitive load is the real payoff.

What the numbers look like

A snapshot from one recent two-week window (illustrative).

  • Inbound error and feedback signals: ~340
  • Unique issues after dedup: ~95
  • Auto-dispatched to a coding agent: ~52
  • Draft PRs produced overnight: ~48
  • Merged after review: ~17
  • Redirected (re-dispatched with context): ~14
  • Closed as wrong-tree: ~17
  • Mean time from signal to merged fix for the merged set: ~14 hours, most of it overnight
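The ratios implied by these figures can be checked with a few lines of arithmetic. The inputs are the approximate counts from the list above, so the percentages are approximate too; the function and key names are made up for this sketch.

```python
from typing import Dict


def funnel_rates(signals: int, unique_issues: int, dispatched: int,
                 draft_prs: int, merged: int, redirected: int, closed: int) -> Dict[str, int]:
    """Turn raw funnel counts into whole-number percentages for each stage."""
    def pct(part: int, whole: int) -> int:
        return round(100 * part / whole)

    return {
        "duplicate_share": pct(signals - unique_issues, signals),  # signals absorbed by dedup
        "auto_dispatch_share": pct(dispatched, unique_issues),     # issues passing the safety filter
        "pr_yield": pct(draft_prs, dispatched),                    # dispatches that produced a draft PR
        "merge_rate": pct(merged, draft_prs),
        "redirect_rate": pct(redirected, draft_prs),
        "wrong_tree_rate": pct(closed, draft_prs),
    }


rates = funnel_rates(signals=340, unique_issues=95, dispatched=52,
                     draft_prs=48, merged=17, redirected=14, closed=17)
print(rates)
```

On these inputs the duplicate share comes out at 72 per cent and the merge, redirect, and wrong-tree rates at 35, 29, and 35 per cent, which lines up with the "roughly seventy per cent" and "a third, a third, a third" figures quoted earlier.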

The signal-to-merged-fix number is the surprising one. It used to live in the three-to-seven-day range for non-critical bugs, because the bottleneck was not fix complexity, it was attention.

What tends to go wrong

A few things, in case it saves time.

It is easy to over-trust the coding agent early. If it is allowed to open non-draft PRs, CI notices fast. Switching to draft-only and gating merges behind explicit human review is the right call.

It is easy to under-invest in the safety filter. A first version of "well-scoped and safe" that lets through changes to a background jobs runner produces genuinely scary PRs. A conservative filter is worth the friction.

It is tempting to assume the triage agent can work from stack traces alone. It cannot. The work that goes into structured error boundaries is the highest-leverage investment in the whole system.

Where this is going

The loop as described handles errors and inbound feedback. The obvious next surface is regressions caught by the test suite during scheduled runs, which fit the same envelope. Beyond that, the pattern generalises. Anything that emits a structured signal and lands in a tracker is candidate work for the same triage-and-propose loop.

The bigger shift is that the unit of engineering work has quietly stopped being "find and fix the bug." For a class of bugs, the unit is now "review and accept the proposal." The skill that matters has moved one rung up.

For a small team feeling permanently behind on triage, this pattern is worth an afternoon. The pieces are all available. The hard part is letting go of doing the triage yourself for long enough to see whether the loop closes.