Runbook: Deny Storm¶

Incident kind: deny_storm · Severity: high · Detection rule: correlate::rule_deny_storm

Symptoms¶

A SOC alert/incident with kind: "deny_storm" appears in GET /v1/incidents or the live GET /v1/ws/events feed.
An outbound Slack/webhook notification fires (see slack-integration.md §3) — deny decisions are always high-signal.
One agent accumulates 5 or more deny decisions within 60 seconds (DENY_STORM_N / DENY_STORM_WINDOW_SECS, gateway/src/correlate.rs) — the rule fires exactly once, at the threshold crossing, not on every subsequent deny.

This is distinct from auto risk-tier escalation (#1296): that mechanism uses a separate, slower default threshold (5 denials within a rolling 60 minutes, configurable via GET|PUT /v1/tenants/risk-escalation) and tightens agents.risk_tier, which is real authorization state. A deny_storm incident can fire well before risk-tier escalation would trigger.

Before you start: check whether this already auto-resolved¶

The Response Engine (gateway/src/respond.rs) maps deny_storm → freeze the agent, but only runs at SOC autonomy level L3/L4. The default is L1 (notify only — no auto-freeze). Check the tenant's effective level first:

# No HTTP API exists for this yet — it's read from `tenants.soc_autonomy_level`
# (DB override) or the AEGIS_SOC_AUTONOMY_LEVEL env var, default "L1".
sqlite3 db/aegisagent.db "SELECT id, soc_autonomy_level FROM tenants WHERE id = '<tenant_id>';"

If the level is L3/L4, the agent is already frozen — skip to Investigation; remediation is to confirm the freeze was correct, not to perform it yourself.

Investigation¶

Find the incident:

curl -s -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/incidents?severity=high&status=open" | jq '.[] | select(.kind=="deny_storm")'

Pull its evidence graph (every decision behind it, with risk_score/reason and matched policies):
```
curl -s -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/graph/incident/<incident_id>"
```
See evidence-graph.md for the full investigation workflow (incident → run → agent history → receipt verification).

List the raw denied decisions for the agent to see the actual tool/action pattern:

curl -s -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/decisions?agent_id=<agent_id>&decision=deny"

Generate an RCA narrative (sandboxed LLM summarizer, post-decision only — Design Law 2):

curl -s -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/incidents/<incident_id>/narrate"

Determine the cause: a misconfigured agent (e.g. retrying a denied action in a loop, wrong tool/action name) is the common case; an active attacker probing for an allowed action is rarer but more urgent — look for varied action/resource values across the denied decisions as a signal of probing rather than a single repeated mistake.

Remediation¶

Misconfigured agent: freeze while you fix the caller, then unfreeze:

curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/agents/<agent_id>/freeze" -d '{"reason": "deny storm — investigating retry loop"}'
# ... fix the calling code/config ...
curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/agents/<agent_id>/unfreeze"

Suspected attack/compromise: revoke instead of unfreezing, and rotate the token if it may have leaked (see agent-token-rotation.md):
```
curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/agents/<agent_id>/revoke"
```

Close the incident once handled:

curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
  "http://127.0.0.1:8080/v1/incidents/<incident_id>/close"

Verification¶

GET /v1/agents/:id shows the expected status (active after unfreeze, or revoked).
GET /v1/incidents/<incident_id> shows status: "closed".
No new deny_storm incident for the same agent within the next DENY_STORM_WINDOW_SECS (60s) window after remediation.
If you rotated the token, confirm the old token is rejected: a /v1/authorize call with the old token now returns 401.