Nimbus

Monitoring & Incidents

The Monitoring page is where you watch for trouble and track recovery. It is organized into three tabs — Incidents, Recoveries, and Rules — with a summary strip across the top.

The recovery strip

The strip at the top of the page answers three questions at a glance:

  • Open incidents — failures that still need attention.
  • AI recoveries today — how many failures AI handled, and the success rate.
  • MTTR — mean time to recover, with a trend sparkline.
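As a rough illustration of how the two numeric figures on the strip could be derived, the sketch below computes an AI success rate and an MTTR from a list of recovery records. The `Recovery` record and its field names are hypothetical, not Nimbus's data model. It also assumes that MTTR is the arithmetic mean of recovery time over successful recoveries; Nimbus may define it differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Recovery:
    # Hypothetical record shape, for illustration only.
    handled_by_ai: bool
    succeeded: bool
    minutes_to_recover: float


def ai_success_rate(recoveries: list[Recovery]) -> Optional[float]:
    """Share of AI-handled failures that recovered, or None if AI handled none."""
    ai = [r for r in recoveries if r.handled_by_ai]
    if not ai:
        return None
    return sum(r.succeeded for r in ai) / len(ai)


def mttr_minutes(recoveries: list[Recovery]) -> Optional[float]:
    """Mean time to recover, assumed here to average successful recoveries only."""
    durations = [r.minutes_to_recover for r in recoveries if r.succeeded]
    return mean(durations) if durations else None


# Invented sample data.
sample = [
    Recovery(handled_by_ai=True, succeeded=True, minutes_to_recover=4.0),
    Recovery(handled_by_ai=True, succeeded=False, minutes_to_recover=10.0),
    Recovery(handled_by_ai=False, succeeded=True, minutes_to_recover=8.0),
]

print(ai_success_rate(sample))  # 0.5
print(mttr_minutes(sample))     # 6.0
```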

Incidents tab

The Incidents tab lists automation failures. Use the filter chips to narrow the view:

  • Open — failures not yet resolved.
  • AI working — a remediation is in progress.
  • Resolved — failures that recovered (by rerun, by AI, or manually).
  • All

Click a row to expand it. An expanded incident shows:

  • The failing automation and the error classification (timeout, query conflict, data volume, syntax, permission, missing object, or unknown).
  • The error message from SFMC.
  • If AI ran, the diagnosis and a line-level diff of the proposed SQL or configuration change.
  • Actions: Review & promote, Re-run, Decline, or Pause.
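To give a feel for what a line-level diff of a proposed SQL change looks like, here is a minimal sketch using Python's standard `difflib` module. The two queries are invented examples, and the unified-diff layout is only one common way to show such a change; it is not necessarily the format Nimbus displays.

```python
import difflib


def line_diff(before: str, after: str) -> list[str]:
    """Return a unified, line-by-line diff between two text versions."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="current",
            tofile="proposed",
            lineterm="",
        )
    )


# Invented queries, for illustration only.
current_sql = "SELECT SubscriberKey, Email\nFROM Orders\nWHERE Status = 'open'"
proposed_sql = "SELECT SubscriberKey, EmailAddress\nFROM Orders\nWHERE Status = 'open'"

for line in line_diff(current_sql, proposed_sql):
    print(line)
```

Lines starting with `-` are removed, lines starting with `+` are added, and lines starting with a space are unchanged context.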

Recoveries tab

The Recoveries tab is a chronological timeline of every recovery action, grouped by day. Filter by AI actions, Auto-reruns, or Resolved, and export the view to CSV for reporting.

Each entry records the original automation, the attempt number, the outcome, and how long recovery took.
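If you post-process an exported file, the day-by-day grouping is easy to reproduce. The sketch below groups entries by calendar day and writes them back out as CSV. The field names (`at`, `automation`, `attempt`, `outcome`, `recovery_seconds`) and the sample rows are assumptions made for illustration; the actual column names in a Nimbus export may differ.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical column names and invented rows.
FIELDS = ["at", "automation", "attempt", "outcome", "recovery_seconds"]

entries = [
    {"at": "2024-05-02T09:15:00", "automation": "Daily Import",
     "attempt": 1, "outcome": "resolved", "recovery_seconds": 240},
    {"at": "2024-05-01T23:40:00", "automation": "Nightly Query",
     "attempt": 2, "outcome": "failed", "recovery_seconds": 600},
    {"at": "2024-05-02T11:05:00", "automation": "Nightly Query",
     "attempt": 3, "outcome": "resolved", "recovery_seconds": 180},
]


def group_by_day(rows: list[dict]) -> dict[str, list[dict]]:
    """Sort rows chronologically and bucket them by ISO calendar date."""
    days: dict[str, list[dict]] = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["at"]):
        day = datetime.fromisoformat(row["at"]).date().isoformat()
        days[day].append(row)
    return dict(days)


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


for day, rows in group_by_day(entries).items():
    print(day, [r["automation"] for r in rows])
```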

Rules tab

The Rules tab shows the monitoring rules protecting this Business Unit as a grid of cards. Each card shows the rule type (automation or folder), whether AI recovery is on, and stats — times triggered in the last 7 days, success rate, and average MTTR. Add a rule with the dashed Add rule card.

How a failure becomes an incident

  1. An automation errors in SFMC.
  2. The event reaches Nimbus and the CloudPage enriches it with per-step detail.
  3. Nimbus checks the failure against your monitoring rules.
  4. If a rule matches, a rerun attempt is created and the incident appears on the Incidents tab.
  5. The incident updates live as the rerun — and, if needed, AI remediation — progresses.

If no rule matches, the failure is still recorded on the Events page; it simply does not trigger automatic recovery.
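The branching in the flow above can be sketched as a small function. Everything here is illustrative: the `Rule` and `FailureEvent` shapes, the exact-match comparison, and the return values are assumptions, not Nimbus's implementation. The only details taken from this page are the two rule types (automation and folder) and the outcome when no rule matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    rule_type: str             # "automation" or "folder", as shown on the Rules tab
    target: str                # hypothetical: an automation name or a folder path
    ai_recovery: bool = False


@dataclass
class FailureEvent:
    automation: str            # hypothetical field names
    folder: str


def find_matching_rule(event: FailureEvent, rules: list[Rule]) -> Optional[Rule]:
    """Return the first rule that covers the failing automation, if any."""
    for rule in rules:
        if rule.rule_type == "automation" and rule.target == event.automation:
            return rule
        if rule.rule_type == "folder" and rule.target == event.folder:
            return rule
    return None


def handle_failure(event: FailureEvent, rules: list[Rule]) -> str:
    """Decide whether a failure becomes an incident or stays an event only."""
    if find_matching_rule(event, rules) is None:
        # Still recorded on the Events page, but no automatic recovery.
        return "event_only"
    # A rerun attempt is created and the incident is listed.
    return "incident"


rules = [
    Rule("automation", "Daily Import"),
    Rule("folder", "Journeys/Nightly", ai_recovery=True),
]

print(handle_failure(FailureEvent("Daily Import", "Imports"), rules))  # incident
print(handle_failure(FailureEvent("Ad Hoc Export", "Misc"), rules))    # event_only
```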

Manual reruns

You do not need a rule to retry something. From an incident (or an event) you can trigger a manual rerun. Nimbus queues it exactly like an automatic one: it waits for the original run to settle, builds a trimmed automation, fires it, and reports the result.