Recovery and Resilience
How ClosedLoop survives crashes, restarts, and ghost loops without losing progress.
Loops are long-running, and long-running work eventually collides with an app restart, a CMD+Q, a hung process, or a transient network blip. ClosedLoop's recovery layer is designed so the state you see in the web app always matches what happened on disk.
Persistent job store
Every loop dispatched to a desktop target is recorded in the job store (desktop-job-store.json) as a LocalJob. Statuses progress through:
QUEUED → STARTING → RUNNING → COMPLETED / FAILED / CANCELLED
Intermediate states are AWAITING_USER, STOPPED, CANCEL_PENDING, and UNKNOWN (used when the desktop cannot determine the real outcome).
The store is split between activeJobs (map) and terminalJobs (capped ring of the last 100) so the UI can render both quickly.
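The active/terminal split can be pictured with a minimal sketch. Only the names LocalJob, activeJobs, terminalJobs, the status values, and the cap of 100 come from the text above; the fields, the JobStore class, and the choice of which statuses count as terminal are assumptions made for illustration.

```typescript
// Hypothetical shapes: the real LocalJob fields are not shown in this document.
type JobStatus =
  | "QUEUED" | "STARTING" | "RUNNING"
  | "AWAITING_USER" | "STOPPED" | "CANCEL_PENDING" | "UNKNOWN"
  | "COMPLETED" | "FAILED" | "CANCELLED";

interface LocalJob {
  id: string;
  status: JobStatus;
  pid?: number; // assumed field, persisted so boot recovery can find the process
}

// Assumption: only these three statuses move a job into terminalJobs.
const TERMINAL: ReadonlySet<JobStatus> = new Set<JobStatus>([
  "COMPLETED",
  "FAILED",
  "CANCELLED",
]);
const TERMINAL_CAP = 100;

class JobStore {
  readonly activeJobs = new Map<string, LocalJob>();
  readonly terminalJobs: LocalJob[] = [];

  upsert(job: LocalJob): void {
    if (!TERMINAL.has(job.status)) {
      this.activeJobs.set(job.id, job);
      return;
    }
    this.activeJobs.delete(job.id);
    this.terminalJobs.push(job);
    // Drop the oldest entries once the cap is exceeded.
    while (this.terminalJobs.length > TERMINAL_CAP) {
      this.terminalJobs.shift();
    }
  }
}
```

Keeping active jobs in a map gives constant-time lookup by id, while the capped list bounds how much history the store file has to carry.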
Boot recovery
When the desktop app launches, BootRecoveryService reconciles the active-job list with reality:
- For each active job with a persisted PID:
  - if the process is alive, reattach the NDJSON tailer and the PID watcher (default 3-second poll)
  - if the process is dead, finalize it from state on disk
- Finalization reads state.json and the claude-output.jsonl tail to derive the terminal status (COMPLETED/FAILED/no-changes/skipped), uploads any produced artifacts (plans, diffs), and posts a completion or error event to the API.
Each active job has a maximum of three recovery attempts. The finalizer also maps the "unexpectedly still RUNNING" case to FAILED so a PROCESS_FAILED event is always emitted.
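The reconciliation decision can be sketched as a pure function. This is an illustration under assumptions, not BootRecoveryService itself: the ActiveJob shape, the recoveryAttempts counter, the action labels, and the injected isAlive probe are all hypothetical, and the text does not say what happens once the three attempts are used up.

```typescript
// Hypothetical shape; only the persisted PID and the three-attempt limit
// come from the text above.
interface ActiveJob {
  id: string;
  pid: number;
  recoveryAttempts: number; // assumed counter
}

// "exhausted" is a made-up label: handling after the limit is unspecified.
type RecoveryAction = "reattach" | "finalize" | "exhausted";

const MAX_RECOVERY_ATTEMPTS = 3;

function planRecovery(
  job: ActiveJob,
  isAlive: (pid: number) => boolean,
): RecoveryAction {
  if (job.recoveryAttempts >= MAX_RECOVERY_ATTEMPTS) return "exhausted";
  // Alive: reattach the tailer and PID watcher. Dead: finalize from disk.
  return isAlive(job.pid) ? "reattach" : "finalize";
}

// A job whose derived status is still RUNNING after its process died is
// reported as FAILED, so a failure event is always emitted.
function terminalStatus(derived: string): string {
  return derived === "RUNNING" ? "FAILED" : derived;
}
```

The liveness probe is injected so the decision stays testable. In Node.js such a probe is commonly written with process.kill(pid, 0), which sends no signal and throws if the process does not exist.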
Graceful quit
Quitting the desktop app triggers a deterministic shutdown:
- Mark the window as quitting so the close handler does not re-hide to the tray.
- Call server.closeAllConnections() to drop active NDJSON streams.
- Cap Observability.shutdown() at two seconds using Promise.race.
- Attach a .catch() on the shutdown promise so any rejection still exits.
- If the orderly shutdown does not complete in eight seconds, a hard-exit failsafe terminates the process.
This is the fix that guarantees CMD+Q reliably quits the app.
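The shutdown sequence can be sketched as follows. The two-second cap, the eight-second failsafe, Promise.race, and the .catch() come from the steps above; the QuitDeps interface, the function names, and the exit codes are assumptions standing in for the app's real internals.

```typescript
// Resolves with the work's value, or with `fallback` once `ms` elapses.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  const clear = (): void => {
    if (timer !== undefined) clearTimeout(timer);
  };
  return Promise.race([work, timeout]).then(
    (value) => { clear(); return value; },
    (err) => { clear(); throw err; },
  );
}

// Hypothetical dependencies, injected so the sequence can run outside the app.
interface QuitDeps {
  markQuitting(): void;
  closeConnections(): void; // stands in for server.closeAllConnections()
  shutdownObservability(): Promise<void>;
  exit(code: number): void;
}

function gracefulQuit(deps: QuitDeps): Promise<void> {
  // Failsafe: force an exit if the orderly path stalls (exit code assumed).
  const failsafe = setTimeout(() => deps.exit(1), 8000);
  deps.markQuitting();
  deps.closeConnections();
  return withTimeout(deps.shutdownObservability(), 2000, undefined)
    .catch(() => undefined) // a rejected shutdown must not block the exit
    .then(() => {
      clearTimeout(failsafe);
      deps.exit(0);
    });
}
```

Racing the shutdown against a timer bounds the wait without cancelling the underlying work, which is acceptable here because the process is about to exit anyway.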
Ghost loop detection
A ghost loop is a run that keeps emitting output but never progresses. run-loop.sh aborts when:
- is_error: true appears in the iteration's result JSON (session or context limit)
- three consecutive iterations produce no material output
- stderr patterns match "prompt is too long", "context limit reached", etc.
Aborted ghost loops finalize with a reason of ghost-loop and emit the full diagnostics payload for the telemetry pipeline.
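The three abort conditions can be expressed as one check over the iteration history. run-loop.sh is a shell script and its real logic is not shown here, so this TypeScript sketch is only an illustration: the IterationResult shape, the producedOutput flag, and the trigger labels are hypothetical, and the pattern list covers only the two strings quoted above.

```typescript
// Hypothetical per-iteration summary.
interface IterationResult {
  isError: boolean;        // mirrors is_error in the result JSON
  producedOutput: boolean; // assumed flag for "material output"
  stderr: string;
}

type GhostTrigger = "is_error" | "stderr-pattern" | "no-output";

const STDERR_PATTERNS: readonly RegExp[] = [
  /prompt is too long/i,
  /context limit reached/i,
];
const MAX_EMPTY_ITERATIONS = 3;

// Returns which condition fired for the latest iteration, or null to continue.
function ghostLoopTrigger(history: IterationResult[]): GhostTrigger | null {
  const last = history[history.length - 1];
  if (last === undefined) return null;
  if (last.isError) return "is_error";
  if (STDERR_PATTERNS.some((p) => p.test(last.stderr))) return "stderr-pattern";
  const tail = history.slice(-MAX_EMPTY_ITERATIONS);
  if (
    tail.length === MAX_EMPTY_ITERATIONS &&
    tail.every((r) => !r.producedOutput)
  ) {
    return "no-output";
  }
  return null;
}
```

Checking the hard error signals before the empty-iteration count means a context-limit failure aborts on the first affected iteration rather than waiting for three.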
Recovery replay for EXECUTE
When a desktop restart interrupts an EXECUTE loop that had already finished, the finalizer replays its work:
- Retries finalization after transient errors.
- Falls back to the staged imported-plan.md if plan.json is missing.
- Preserves a matching local plan.json when its content equals the hosted markdown plan. The imported-plan.md fallback only kicks in when local content has drifted, so resumed runs do not rewrite a canonical plan that already agrees with the source of truth.
- Reloads the latest job snapshot before writing final state, so artifactsUploadedAt is preserved even when the run ultimately failed.
- Handles the 0-tokens "no work produced" case explicitly.
This means an EXECUTE loop does not silently lose its artifacts just because your laptop went to sleep at the wrong moment.
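The plan-selection rule from the list above can be sketched as a small decision function. It assumes both plans have already been rendered to comparable strings, since how the finalizer actually compares plan.json with the hosted markdown is not described here. The function name and the PlanSource labels are hypothetical.

```typescript
type PlanSource = "plan.json" | "imported-plan.md";

// localPlan is null when plan.json is missing on disk.
function choosePlanSource(
  localPlan: string | null,
  hostedPlan: string,
): PlanSource {
  // Missing local plan: fall back to the staged import.
  if (localPlan === null) return "imported-plan.md";
  // Matching local plan: keep it untouched. Drifted: use the staged import.
  return localPlan === hostedPlan ? "plan.json" : "imported-plan.md";
}
```

Under this reading, a resumed run only replaces the local plan when it is absent or disagrees with the hosted copy.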