Runbook: Deny Storm
Incident kind: deny_storm · Severity: high · Detection rule: correlate::rule_deny_storm
Symptoms
- A SOC alert/incident with kind: "deny_storm" appears in GET /v1/incidents or the live GET /v1/ws/events feed.
- An outbound Slack/webhook notification fires (see slack-integration.md §3); deny decisions are always high-signal.
- One agent accumulates 5 or more deny decisions within 60 seconds (DENY_STORM_N / DENY_STORM_WINDOW_SECS, gateway/src/correlate.rs). The rule fires exactly once, at the threshold crossing, not on every subsequent deny.
This is distinct from auto risk-tier escalation (#1296): that mechanism uses a separate, slower default threshold (5 denials within a rolling 60 minutes, configurable via GET|PUT /v1/tenants/risk-escalation) and tightens agents.risk_tier, which is real authorization state. A deny_storm incident can fire well before risk-tier escalation would trigger.
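The fire-once-at-threshold behaviour can be illustrated with a small offline sketch. This is not the code in gateway/src/correlate.rs: the function name deny_storm_scan and the input format (one "epoch_seconds agent_id" line per deny, sorted by time) are hypothetical, the shell variables only mirror the constant names and defaults quoted above, and two details are assumptions: a deny exactly 60 seconds old is treated as outside the window, and the rule re-arms once the count drops back below the threshold.

```shell
#!/usr/bin/env bash
# Hypothetical offline sketch of a sliding-window "deny storm" check.
# stdin: one deny decision per line, "epoch_seconds agent_id", sorted by time.
# stdout: one "deny_storm <agent> <ts>" line per threshold crossing.
deny_storm_scan() {
  awk -v n="${DENY_STORM_N:-5}" -v win="${DENY_STORM_WINDOW_SECS:-60}" '
  {
    ts = $1; agent = $2
    # Append this deny to the per-agent queue of timestamps.
    if (!(agent in head)) head[agent] = 1
    tail[agent]++
    q[agent, tail[agent]] = ts
    # Evict denies that have aged out of the window (assumption: age >= win).
    while (head[agent] <= tail[agent] && ts - q[agent, head[agent]] >= win) {
      delete q[agent, head[agent]]
      head[agent]++
    }
    count = tail[agent] - head[agent] + 1
    if (count >= n && !storming[agent]) {
      # Threshold crossing: report once, then stay quiet while still above it.
      print "deny_storm", agent, ts
      storming[agent] = 1
    } else if (count < n) {
      # Assumption: the rule re-arms once the window count drops below n.
      storming[agent] = 0
    }
  }'
}
```

Feeding it six denies for one agent at 0, 10, 20, 30, 40 and 45 seconds yields a single line at the fifth deny; the sixth deny stays silent because the agent is already above the threshold.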
Before you start: check whether this already auto-resolved
The Response Engine (gateway/src/respond.rs) maps deny_storm → freeze the agent, but only runs at SOC autonomy level L3/L4. The default is L1 (notify only — no auto-freeze). Check the tenant's effective level first:
# No HTTP API exists for this yet — it's read from `tenants.soc_autonomy_level`
# (DB override) or the AEGIS_SOC_AUTONOMY_LEVEL env var, default "L1".
sqlite3 db/aegisagent.db "SELECT id, soc_autonomy_level FROM tenants WHERE id = '<tenant_id>';"
If the level is L3/L4, the agent is already frozen — skip to Investigation; remediation is to confirm the freeze was correct, not to perform it yourself.
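The lookup order described above (DB override first, then AEGIS_SOC_AUTONOMY_LEVEL, then "L1") can be wrapped in a small helper so the check is repeatable. This is a minimal sketch: both function names are hypothetical, and it assumes an empty or NULL soc_autonomy_level column means "no override", which the source does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the gateway.

# Resolve the effective SOC autonomy level.
# $1: value read from tenants.soc_autonomy_level (may be empty).
effective_autonomy_level() {
  local db_value="$1"
  if [ -n "$db_value" ]; then
    # Assumption: a non-empty DB value overrides the environment.
    printf '%s\n' "$db_value"
  else
    # Fall back to the env var, then to the documented default "L1".
    printf '%s\n' "${AEGIS_SOC_AUTONOMY_LEVEL:-L1}"
  fi
}

# Succeeds (exit 0) only for the levels at which the Response Engine
# would already have frozen the agent.
auto_freeze_enabled() {
  case "$1" in
    L3|L4) return 0 ;;
    *)     return 1 ;;
  esac
}
```

A possible use is to pass in the output of sqlite3 db/aegisagent.db "SELECT soc_autonomy_level FROM tenants WHERE id = '<tenant_id>';" (the sqlite3 CLI prints NULL as an empty string by default) and then branch on auto_freeze_enabled to decide whether to skip straight to Investigation.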
Investigation
- Find the incident:
- Pull its evidence graph (every decision behind it, with risk_score/reason and matched policies):
  curl -s -H "Authorization: Bearer $AGENT_TOKEN" \
    "http://127.0.0.1:8080/v1/graph/incident/<incident_id>"
  See evidence-graph.md for the full investigation workflow (incident → run → agent history → receipt verification).
- List the raw denied decisions for the agent to see the actual tool/action pattern:
- Generate an RCA narrative (sandboxed LLM summarizer, post-decision only — Design Law 2):
- Determine the cause: a misconfigured agent (e.g. retrying a denied action in a loop, wrong tool/action name) is the common case; an active attacker probing for an allowed action is rarer but more urgent. Look for varied action/resource values across the denied decisions as a signal of probing rather than a single repeated mistake.
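The "varied versus repeated" check in the last step can be made mechanical once the denied decisions are reduced to one "action resource" pair per line. The sketch below is only a triage aid under stated assumptions: the function name summarize_denied, the input format, and the rule "more than one distinct pair means varied" are all hypothetical, and how you extract the pairs from the API response is left open because the response shape is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: summarize how varied the denied requests are.
# stdin: one denied decision per line, "action resource".
# stdout: "total=<n> distinct=<m> hint=<label>".
summarize_denied() {
  awk '
  NF {
    total++
    key = $1 " " $2            # normalize whitespace between the two fields
    if (!seen[key]++) distinct++
  }
  END {
    # Heuristic only: one repeated pair suggests a retry loop or a wrong
    # tool/action name; several distinct pairs suggest probing.
    hint = (distinct > 1) ? "varied_actions" : "single_repeated_action"
    printf "total=%d distinct=%d hint=%s\n", total, distinct, hint
  }'
}
```

Five identical pairs produce hint=single_repeated_action, which points toward the misconfiguration branch of Remediation; a mix of different actions and resources produces hint=varied_actions and deserves a closer look before unfreezing anything.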
Remediation
- Misconfigured agent: freeze while you fix the caller, then unfreeze:
  curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
    "http://127.0.0.1:8080/v1/agents/<agent_id>/freeze" -d '{"reason": "deny storm — investigating retry loop"}'
  # ... fix the calling code/config ...
  curl -s -X POST -H "Authorization: Bearer $AGENT_TOKEN" \
    "http://127.0.0.1:8080/v1/agents/<agent_id>/unfreeze"
- Suspected attack/compromise: revoke instead of unfreezing, and rotate the token if it may have leaked (see agent-token-rotation.md):
- Close the incident once handled:
Verification
- GET /v1/agents/:id shows the expected status (active after unfreeze, or revoked).
- GET /v1/incidents/<incident_id> shows status: "closed".
- No new deny_storm incident for the same agent within the next DENY_STORM_WINDOW_SECS (60s) window after remediation.
- If you rotated the token, confirm the old token is rejected: a /v1/authorize call with the old token now returns 401.