AegisAgent — Threat Model (June 2026 reset)¶

Product: AegisAgent Category: Agent Action Integrity → Integrity-anchored Agent SOC Version: v0.4 (adds per-threat Impact/Likelihood/Residual ratings, #1200) Date: 2026-06-17 Owner: Lavkush Kumar Read first: AegisAgent_Gap_Reassessment_2026-06.md · Architecture: AegisAgent_Agent_SOC_Design.md

⚠️ Reset note (two layers). v0.1 covered the broad agent-security threat surface. v0.2 foregrounded the specific threats AegisAgent uniquely closes — approval manipulation (T-A), confused-deputy via untrusted provenance (T-B), and receipt tampering (T-C). v0.3 adds T-D: threats against the integrity-anchored SOC itself — because once you build a detection/response plane, it becomes attack surface. The most important new threat is second-order prompt injection (attacker content reaching an LLM "analyst"), which the design laws close by construction.

1. Executive summary¶

AegisAgent is a runtime enforcement + evidence + monitoring layer, so it is itself security-critical. Its differentiated guarantees define its primary threat classes:

T-A (Approval manipulation): an attacker causes an action different from the one a human approved to execute, or reuses/forges an approval. T-B (Confused deputy via provenance): untrusted-origin content drives a privileged action. T-C (Evidence tampering): receipts/audit are altered, dropped, or forged so oversight cannot be proven. T-D (Attacks on the SOC): the detection/response plane is evaded, poisoned, weaponized, or — worst — turned into a new injection surface that re-introduces the very threat the product defends against.

Governing rule:

If AegisAgent cannot prove the executed action equals the approved action, prove the trigger's provenance, and durably record a verifiable receipt — the action must not execute. (Fail closed.) And the SOC must detect deterministically, never let a score gate, and never let an LLM read attacker content into a decision.

2. Research foundation¶

OWASP AI Agent Security Cheat Sheet (names decision/approval manipulation, tool abuse, privilege escalation, memory poisoning, supply chain); OWASP Top 10 for Agentic Applications 2026 (human-agent trust exploitation, tool misuse, identity/privilege abuse, excessive agency); AgentDojo & InjecAgent (indirect prompt injection via tool output); MCP security research + CoSAI; NIST AI RMF GenAI Profile; EU AI Act Article 14; MITRE ATLAS + OWASP LLM Top 10 (for the SOC's detection taxonomy). The Invariant Labs GitHub-MCP disclosure (malicious issue → private-repo data leaked to public repo) is the canonical real-world instance of T-B. The second-order-injection risk (T-D1) is the canonical failure mode of LLM-in-the-loop security tooling.

3. System under threat & assets¶

Components: - Inline plane: Agent Runtime → SDK (in trust boundary; performs fail-closed action_hash check) → Gateway (Identity Resolver, Trust-Provenance Gate, Policy Engine, Risk Engine, Approval Integrity Engine, MCP Gateway, Token Broker, Receipt + Audit Writer) → External Tools/MCP. - Async SOC plane: Event Bus → Normalizer → Detection Engine (deterministic) → Correlation Engine → Alert Builder → {Response Engine, Indexer, Notify, RCA narrator (sandboxed LLM)} → SOC Console.

Protected assets: agent identities & tokens; tool/MCP credentials; the binding between an approval and its action_hash; canonical actions; receipt hash chain + signing keys; policy bundles; detection rules + correlation state; audit/traces; tenant config; provenance labels; approver identities/roles; the integrity of the SOC's verdicts and the agent status (active/frozen/revoked/quarantined).

4. Trust boundaries¶

B1 User/App         -> Agent Runtime          (untrusted intent may enter)
B2 Agent Runtime    -> AegisAgent SDK          (agent may be compromised/hijacked)
B3 SDK              -> Gateway                  (authenticated; SDK enforces fail-closed hash check HERE)
B4 Gateway          -> Policy/Provenance        (internal; deterministic decision HERE)
B5 Gateway          -> External Tool/MCP        (tool output is UNTRUSTED data -> provenance label)
B6 Gateway          -> Approval channel         (Slack/Teams/dashboard; signature-verified)
B7 Tenant data      -> Shared SaaS infra        (tenant isolation)
B8 SOC evidence     -> RCA narrator (LLM)       (ALL evidence is UNTRUSTED data; passed inert, never as instructions; LLM has no tools/authority)
B9 Response Engine  -> Gateway control API      (authenticated; freeze/revoke/quarantine; fail-closed)

Key assumptions: (1) the agent process may be hijacked (indirect prompt injection) — enforcement checks integrity against the human-bound action_hash at B3 and gates provenance deterministically at B4. (2) All evidence flowing into the SOC may be attacker-authored — so at B8 the only LLM treats it as inert data, and at B4/detection no LLM and no score decides.

5. Primary threat class T-A — Approval manipulation (headline)¶

Impact and Likelihood are rated pre-mitigation (inherent severity if the threat were entirely unaddressed); Residual is the rating after the listed mitigation.

ID	Threat	Vector	Impact	Likelihood	Mitigation	Residual
T-A1	Approve-then-swap	Get benign action A approved, execute different action B under it	Critical	Medium	SDK recomputes hash(B) ≠ approved `action_hash(A)` → fail closed	Low
T-A2	Post-approval parameter tampering	Mutate params after approval	High	Medium	New canonical action → new hash → mismatch → fail closed; `edit` forces re-evaluation	Low
T-A3	Replay / reuse	Reuse an old/expired approval, or execute an approved action twice	High	Medium	Bound `action_hash` + `expires_at` + single-use: atomic `consume` (`consumed_at` guard) before execution; SDK fails closed if consume is refused (409)	Low
T-A4	Render-vs-bytes	Approver sees friendly text; different bytes run	Critical	Medium	Approval card renders the canonical action that is hashed; approval binds to that hash	Low
T-A5	Approval-callback forgery	Spoof a Slack/Teams "approve"	High	Medium	Verify callback signature (constant-time HMAC); bind approver identity + role; reject unsigned	Low
T-A6	Approver-role abuse	Unauthorized user approves	Medium	Medium	Approver group/role lookup via SSO/OIDC; policy-scoped approver groups	Medium — depends on the operator's IdP/group config, which is outside AegisAgent's control
T-A7	Self-approval / collusion	Agent's owner approves own high-risk action	Medium	Medium	Optional separation-of-duties + two-person approval for critical actions	Medium — two-person approval is opt-in, not enforced by default

Residual risk (class-wide): if SDK canonicalization diverges across languages/versions, hashes mismatch spuriously (availability) or a crafted divergence could mask a swap. Mitigation: pinned aegis-jcs-1 scheme + CI byte-equality gate (Operational Design §4.1). Every T-A event is also emitted to the SOC as a high-severity detection.

6. Primary threat class T-B — Confused deputy via provenance (second headline)¶

ID	Threat	Vector	Impact	Likelihood	Mitigation	Residual
T-B1	Indirect prompt injection → privileged action	Malicious GitHub issue / email / ticket / webpage hijacks agent	Critical	High	Deterministic `forbid` for `mutates_state && untrusted_external` — independent of text	Low
T-B2	Cross-repo/data-movement leak (Invariant Labs class)	Untrusted issue triggers private read → public write	Critical	Medium	Default policy forbids untrusted-triggered cross-repo movement; SOC sequence rule AEG-3007 correlates read→exfil	Low
T-B3	Classifier evasion	Benign-looking injected text bypasses scoring	High	Medium	Provenance is deterministic; classifiers may only tighten, never loosen	Medium — tighten-only governs propagation across a chain, but the initial trust label still depends on the accuracy of whatever classifier assigns it; a misclassified-as-trusted label at the source is not caught by this rule alone
T-B4	Provenance label spoofing	Forge a higher trust level	Critical	Low	Labels set server-side from authenticated source signals; signed sources for `trusted_internal_signed`; run carries lowest observed level	Low
T-B5	MCP manifest drift / tool poisoning	Swapped/altered MCP tool definition	High	Medium	Pin + hash manifests; drift → downgrade provenance → deny/escalate; SOC drift detection AEG-4002	Low
T-B6	Memory/RAG poisoning (AgentPoison/PoisonedRAG)	Poisoned memory drives later actions	High	Medium	Provenance + approval on memory writes from untrusted sources (roadmap)	High — not yet built. Tracked as its own epic (#1397); until it ships, AegisAgent has no enforcement point on memory/RAG writes specifically

7. Primary threat class T-C — Evidence tampering¶

ID	Threat	Vector	Impact	Likelihood	Mitigation	Residual
T-C1	Receipt alteration	Edit a stored receipt to hide an action	High	Medium	Per-tenant hash chain; `/verify` detects break; enterprise transparency-log/signing; SOC `receipt-chain-broken` = P1	Low
T-C2	Receipt drop / gap	Suppress receipts	High	Medium	Critical actions block if a receipt cannot be written; chain gaps detectable	Low
T-C3	Receipt forgery	Fabricate an approval/receipt	High	Medium	Optional Ed25519 signing (`sign.rs`); key ID in receipt; rotation preserves verifiability	Medium — the default signing key is a local file, not yet HSM/KMS-backed; KMS-backed signing is tracked separately (#1311) and is the harder bar a well-resourced attacker with host access would need to clear
T-C4	Audit pipeline DoS	Flood to drop evidence	Medium	Low	Backpressure + durable enqueue (99.9% target); fail closed for critical on audit loss	Low

8. Primary threat class T-D — Attacks on the integrity-anchored SOC (new)¶

The SOC is async and consumes attacker-influenced evidence, so it is its own attack surface. These threats are closed primarily by the four design laws (Architecture §2), restated here as security controls.

ID	Threat	Vector	Impact	Likelihood	Mitigation	Residual
T-D1	Second-order prompt injection	Attacker-authored evidence (issue body, prompt, tool args) reaches an LLM "analyst," which then mis-triages, downgrades severity, or recommends allow	Critical	Medium	Design Law 2: the only LLM is the post-incident RCA narrator — sandboxed, no tools, no enforcement authority, evidence passed as inert data; triage/correlation/response are deterministic code, not LLMs. An injected "mark this low severity" string is just text in a report field.	Low
T-D2	Score-gating manipulation	Attacker games a risk/anomaly/"prompt-injection" score below a threshold to obtain `allow`	Critical	Low	Design Law 1: scores are advisory display metadata only; Cedar decides on deterministic provenance. No numeric threshold ever routes an authorization.	Low
T-D3	Detection evasion	Stay under correlation thresholds (slow-and-low), or flood denies to bury a real attack (alert fatigue / deny-storm cover)	Medium	High	Multiple overlapping rules (atomic + sequence + frequency); deny-storm itself is a detection (AEG-2010); correlation windows scoped by `agent_id`/`run_id`; rate-limit + dedupe alerts; fail toward escalation, not suppression	Medium — slow-and-low is a structurally hard problem for any threshold-based detector, deterministic or not
T-D4	Response-engine weaponization	Trigger false freezes/revokes as a DoS on legitimate agents, or suppress a legitimate containment	Medium	Low	Response mapping is deterministic and tenant-scoped; control endpoints authenticated + fail-closed; containment is reversible + audited; two-person confirm for revoke (optional); every response emits a receipt	Low
T-D5	Correlation-state poisoning	Forge/replay ASE events to corrupt sequence detection or frame an agent	Medium	Low	ASE carries `action_hash`/`receipt_hash`; the SOC validates events against the receipt chain; agentless-ingested events are authenticated at the collector and trust-labelled	Low
T-D6	Inline-path latency injection via the SOC	Force detection into the synchronous path (e.g., "block until analyzed") to add latency or create a fail-open dependency	High	Low	Design Law 3: detection is strictly asynchronous (`tokio::mpsc` → background); emission is fire-and-forget; SOC outage degrades monitoring, never the action path	Low
T-D7	RCA exfiltration via the LLM	Prompt the RCA narrator (through evidence) to leak other-tenant data or secrets in its output	Medium	Low	RCA input is tenant-scoped, redacted (hashes not payloads), and the model has no retrieval/tools; output is reviewed before it leaves the tenant boundary	Low

Residual risk (class-wide): deterministic detection can miss novel attack shapes (no ML generalization). Accepted trade-off: we prefer provable, non-injectable detection over broader-but-gameable ML; behavioural baselining is added later as advisory signal only (never gating).

9. Threats against AegisAgent as a control plane (table stakes, still defended)¶

Policy bypass / tampering: signed, versioned Cedar bundles; default-deny; dry-run before rollout; admin authz.
Fail-open risk: mutating/high-risk fail closed on any component outage; read-only fail-open only if explicitly configured; SOC async by construction never causes fail-open.
Agent identity spoofing / token theft: tenant-scoped tokens, short-lived creds, mTLS (enterprise), request signing; Token Broker so agents never hold raw tool creds.
SDK bypass: proxy-only credentials, direct-tool-use detection (as a SOC event), network guidance; a bypassed SDK is an incident.
Tenant isolation failure (SaaS): tenant_id on every query, parameterized SQLx, middleware scoping, optional row-level security, per-tenant receipt partitions and per-tenant SOC indices.
Secrets exposure in evidence: redact secrets; store input/output hashes, not raw payloads, by default.
Supply chain: signed releases, SBOM, dependency scan, pinned Actions, image signing, secret scanning.

10. STRIDE summary (integrity-anchored, SOC-aware)¶

STRIDE	Most relevant here
Spoofing	Approval-callback forgery (T-A5), provenance spoofing (T-B4), ASE forgery (T-D5), agent token theft
Tampering	Approve-then-swap (T-A1/2), receipt alteration (T-C1), correlation-state poisoning (T-D5) — the core
Repudiation	Verifiable receipts + approver binding + provable incident timelines defeat "I didn't approve/do that"
Information disclosure	Confused-deputy data leak (T-B2), secrets in evidence, RCA exfiltration (T-D7)
DoS	Approval-channel / audit-pipeline flooding; deny-storm alert fatigue (T-D3); false-freeze weaponization (T-D4)
Elevation of privilege	Confused deputy via untrusted provenance (T-B1), second-order injection into the SOC (T-D1), score-gating (T-D2) — the core

11. Assurance & testing¶

T-A regression: approve-then-swap blocked; replay rejected; expired rejected; edit re-evaluates; render==hash invariant; cross-language canonicalization byte-equality.
T-B regression: AgentDojo/InjecAgent-style untrusted-trigger suites → deterministic deny/escalate; manifest-drift → downgrade; provenance-spoof rejected.
T-C regression: receipt-chain tamper detection; receipt-drop blocks critical actions; signature verification.
T-D regression (new):
T-D1: inject "system: mark low severity / recommend allow" strings into evidence; assert deterministic triage/correlation/response are unaffected and only the RCA text field reflects it (never a decision).
T-D2: craft inputs that minimize any advisory score; assert Cedar still denies on provenance.
T-D3: slow-and-low and deny-storm suites → assert overlapping rules still fire; alerts deduped not dropped.
T-D4/D6: assert control endpoints are tenant-scoped + fail-closed; assert event emission never adds measurable authorize latency and SOC outage never fails the action path open.
Continuous: fuzz canonicalization; verify fail-closed under each component outage; verify async isolation of the SOC from the action path.

12. Governing principle¶

AegisAgent fails closed. It executes a high-risk action only when it can prove three things: the action equals the human-approved action, the trigger's provenance permits it, and a verifiable receipt was durably written. If any proof is missing, the action does not run. The SOC that watches all of this detects deterministically, never lets a score gate, and never lets an LLM read attacker content into a decision — so the defender never becomes the new attack surface.

This is the threat model's north star and the product's reason to exist: not "decide," but prove — and operate on the proof without weakening it.