Skip to content

AegisAgent — Threat Model (June 2026 reset)

Product: AegisAgent Category: Agent Action Integrity → Integrity-anchored Agent SOC Version: v0.4 (adds per-threat Impact/Likelihood/Residual ratings, #1200) Date: 2026-06-17 Owner: Lavkush Kumar Read first: AegisAgent_Gap_Reassessment_2026-06.md · Architecture: AegisAgent_Agent_SOC_Design.md

⚠️ Reset note (two layers). v0.1 covered the broad agent-security threat surface. v0.2 foregrounded the specific threats AegisAgent uniquely closes — approval manipulation (T-A), confused-deputy via untrusted provenance (T-B), and receipt tampering (T-C). v0.3 adds T-D: threats against the integrity-anchored SOC itself — because once you build a detection/response plane, it becomes attack surface. The most important new threat is second-order prompt injection (attacker content reaching an LLM "analyst"), which the design laws close by construction.


1. Executive summary

AegisAgent is a runtime enforcement + evidence + monitoring layer, so it is itself security-critical. Its differentiated guarantees define its primary threat classes:

T-A (Approval manipulation): an attacker causes an action different from the one a human approved to execute, or reuses/forges an approval. T-B (Confused deputy via provenance): untrusted-origin content drives a privileged action. T-C (Evidence tampering): receipts/audit are altered, dropped, or forged so oversight cannot be proven. T-D (Attacks on the SOC): the detection/response plane is evaded, poisoned, weaponized, or — worst — turned into a new injection surface that re-introduces the very threat the product defends against.

Governing rule:

If AegisAgent cannot prove the executed action equals the approved action, prove the trigger's provenance, and durably record a verifiable receipt — the action must not execute. (Fail closed.) And the SOC must detect deterministically, never let a score gate, and never let an LLM read attacker content into a decision.


2. Research foundation

OWASP AI Agent Security Cheat Sheet (names decision/approval manipulation, tool abuse, privilege escalation, memory poisoning, supply chain); OWASP Top 10 for Agentic Applications 2026 (human-agent trust exploitation, tool misuse, identity/privilege abuse, excessive agency); AgentDojo & InjecAgent (indirect prompt injection via tool output); MCP security research + CoSAI; NIST AI RMF GenAI Profile; EU AI Act Article 14; MITRE ATLAS + OWASP LLM Top 10 (for the SOC's detection taxonomy). The Invariant Labs GitHub-MCP disclosure (malicious issue → private-repo data leaked to public repo) is the canonical real-world instance of T-B. The second-order-injection risk (T-D1) is the canonical failure mode of LLM-in-the-loop security tooling.


3. System under threat & assets

Components: - Inline plane: Agent Runtime → SDK (in trust boundary; performs fail-closed action_hash check) → Gateway (Identity Resolver, Trust-Provenance Gate, Policy Engine, Risk Engine, Approval Integrity Engine, MCP Gateway, Token Broker, Receipt + Audit Writer) → External Tools/MCP. - Async SOC plane: Event Bus → Normalizer → Detection Engine (deterministic) → Correlation Engine → Alert Builder → {Response Engine, Indexer, Notify, RCA narrator (sandboxed LLM)} → SOC Console.

Protected assets: agent identities & tokens; tool/MCP credentials; the binding between an approval and its action_hash; canonical actions; receipt hash chain + signing keys; policy bundles; detection rules + correlation state; audit/traces; tenant config; provenance labels; approver identities/roles; the integrity of the SOC's verdicts and the agent status (active/frozen/revoked/quarantined).


4. Trust boundaries

B1 User/App         -> Agent Runtime          (untrusted intent may enter)
B2 Agent Runtime    -> AegisAgent SDK          (agent may be compromised/hijacked)
B3 SDK              -> Gateway                  (authenticated; SDK enforces fail-closed hash check HERE)
B4 Gateway          -> Policy/Provenance        (internal; deterministic decision HERE)
B5 Gateway          -> External Tool/MCP        (tool output is UNTRUSTED data -> provenance label)
B6 Gateway          -> Approval channel         (Slack/Teams/dashboard; signature-verified)
B7 Tenant data      -> Shared SaaS infra        (tenant isolation)
B8 SOC evidence     -> RCA narrator (LLM)       (ALL evidence is UNTRUSTED data; passed inert, never as instructions; LLM has no tools/authority)
B9 Response Engine  -> Gateway control API      (authenticated; freeze/revoke/quarantine; fail-closed)

Key assumptions: (1) the agent process may be hijacked (indirect prompt injection) — enforcement checks integrity against the human-bound action_hash at B3 and gates provenance deterministically at B4. (2) All evidence flowing into the SOC may be attacker-authored — so at B8 the only LLM treats it as inert data, and at B4/detection no LLM and no score decides.


5. Primary threat class T-A — Approval manipulation (headline)

Impact and Likelihood are rated pre-mitigation (inherent severity if the threat were entirely unaddressed); Residual is the rating after the listed mitigation.

ID Threat Vector Impact Likelihood Mitigation Residual
T-A1 Approve-then-swap Get benign action A approved, execute different action B under it Critical Medium SDK recomputes hash(B) ≠ approved action_hash(A)fail closed Low
T-A2 Post-approval parameter tampering Mutate params after approval High Medium New canonical action → new hash → mismatch → fail closed; edit forces re-evaluation Low
T-A3 Replay / reuse Reuse an old/expired approval, or execute an approved action twice High Medium Bound action_hash + expires_at + single-use: atomic consume (consumed_at guard) before execution; SDK fails closed if consume is refused (409) Low
T-A4 Render-vs-bytes Approver sees friendly text; different bytes run Critical Medium Approval card renders the canonical action that is hashed; approval binds to that hash Low
T-A5 Approval-callback forgery Spoof a Slack/Teams "approve" High Medium Verify callback signature (constant-time HMAC); bind approver identity + role; reject unsigned Low
T-A6 Approver-role abuse Unauthorized user approves Medium Medium Approver group/role lookup via SSO/OIDC; policy-scoped approver groups Medium — depends on the operator's IdP/group config, which is outside AegisAgent's control
T-A7 Self-approval / collusion Agent's owner approves own high-risk action Medium Medium Optional separation-of-duties + two-person approval for critical actions Medium — two-person approval is opt-in, not enforced by default

Residual risk (class-wide): if SDK canonicalization diverges across languages/versions, hashes mismatch spuriously (availability) or a crafted divergence could mask a swap. Mitigation: pinned aegis-jcs-1 scheme + CI byte-equality gate (Operational Design §4.1). Every T-A event is also emitted to the SOC as a high-severity detection.


6. Primary threat class T-B — Confused deputy via provenance (second headline)

ID Threat Vector Impact Likelihood Mitigation Residual
T-B1 Indirect prompt injection → privileged action Malicious GitHub issue / email / ticket / webpage hijacks agent Critical High Deterministic forbid for mutates_state && untrusted_external — independent of text Low
T-B2 Cross-repo/data-movement leak (Invariant Labs class) Untrusted issue triggers private read → public write Critical Medium Default policy forbids untrusted-triggered cross-repo movement; SOC sequence rule AEG-3007 correlates read→exfil Low
T-B3 Classifier evasion Benign-looking injected text bypasses scoring High Medium Provenance is deterministic; classifiers may only tighten, never loosen Medium — tighten-only governs propagation across a chain, but the initial trust label still depends on the accuracy of whatever classifier assigns it; a misclassified-as-trusted label at the source is not caught by this rule alone
T-B4 Provenance label spoofing Forge a higher trust level Critical Low Labels set server-side from authenticated source signals; signed sources for trusted_internal_signed; run carries lowest observed level Low
T-B5 MCP manifest drift / tool poisoning Swapped/altered MCP tool definition High Medium Pin + hash manifests; drift → downgrade provenance → deny/escalate; SOC drift detection AEG-4002 Low
T-B6 Memory/RAG poisoning (AgentPoison/PoisonedRAG) Poisoned memory drives later actions High Medium Provenance + approval on memory writes from untrusted sources (roadmap) High — not yet built. Tracked as its own epic (#1397); until it ships, AegisAgent has no enforcement point on memory/RAG writes specifically

7. Primary threat class T-C — Evidence tampering

ID Threat Vector Impact Likelihood Mitigation Residual
T-C1 Receipt alteration Edit a stored receipt to hide an action High Medium Per-tenant hash chain; /verify detects break; enterprise transparency-log/signing; SOC receipt-chain-broken = P1 Low
T-C2 Receipt drop / gap Suppress receipts High Medium Critical actions block if a receipt cannot be written; chain gaps detectable Low
T-C3 Receipt forgery Fabricate an approval/receipt High Medium Optional Ed25519 signing (sign.rs); key ID in receipt; rotation preserves verifiability Medium — the default signing key is a local file, not yet HSM/KMS-backed; KMS-backed signing is tracked separately (#1311) and is the harder bar a well-resourced attacker with host access would need to clear
T-C4 Audit pipeline DoS Flood to drop evidence Medium Low Backpressure + durable enqueue (99.9% target); fail closed for critical on audit loss Low

8. Primary threat class T-D — Attacks on the integrity-anchored SOC (new)

The SOC is async and consumes attacker-influenced evidence, so it is its own attack surface. These threats are closed primarily by the four design laws (Architecture §2), restated here as security controls.

ID Threat Vector Impact Likelihood Mitigation Residual
T-D1 Second-order prompt injection Attacker-authored evidence (issue body, prompt, tool args) reaches an LLM "analyst," which then mis-triages, downgrades severity, or recommends allow Critical Medium Design Law 2: the only LLM is the post-incident RCA narrator — sandboxed, no tools, no enforcement authority, evidence passed as inert data; triage/correlation/response are deterministic code, not LLMs. An injected "mark this low severity" string is just text in a report field. Low
T-D2 Score-gating manipulation Attacker games a risk/anomaly/"prompt-injection" score below a threshold to obtain allow Critical Low Design Law 1: scores are advisory display metadata only; Cedar decides on deterministic provenance. No numeric threshold ever routes an authorization. Low
T-D3 Detection evasion Stay under correlation thresholds (slow-and-low), or flood denies to bury a real attack (alert fatigue / deny-storm cover) Medium High Multiple overlapping rules (atomic + sequence + frequency); deny-storm itself is a detection (AEG-2010); correlation windows scoped by agent_id/run_id; rate-limit + dedupe alerts; fail toward escalation, not suppression Medium — slow-and-low is a structurally hard problem for any threshold-based detector, deterministic or not
T-D4 Response-engine weaponization Trigger false freezes/revokes as a DoS on legitimate agents, or suppress a legitimate containment Medium Low Response mapping is deterministic and tenant-scoped; control endpoints authenticated + fail-closed; containment is reversible + audited; two-person confirm for revoke (optional); every response emits a receipt Low
T-D5 Correlation-state poisoning Forge/replay ASE events to corrupt sequence detection or frame an agent Medium Low ASE carries action_hash/receipt_hash; the SOC validates events against the receipt chain; agentless-ingested events are authenticated at the collector and trust-labelled Low
T-D6 Inline-path latency injection via the SOC Force detection into the synchronous path (e.g., "block until analyzed") to add latency or create a fail-open dependency High Low Design Law 3: detection is strictly asynchronous (tokio::mpsc → background); emission is fire-and-forget; SOC outage degrades monitoring, never the action path Low
T-D7 RCA exfiltration via the LLM Prompt the RCA narrator (through evidence) to leak other-tenant data or secrets in its output Medium Low RCA input is tenant-scoped, redacted (hashes not payloads), and the model has no retrieval/tools; output is reviewed before it leaves the tenant boundary Low

Residual risk (class-wide): deterministic detection can miss novel attack shapes (no ML generalization). Accepted trade-off: we prefer provable, non-injectable detection over broader-but-gameable ML; behavioural baselining is added later as advisory signal only (never gating).


9. Threats against AegisAgent as a control plane (table stakes, still defended)

  • Policy bypass / tampering: signed, versioned Cedar bundles; default-deny; dry-run before rollout; admin authz.
  • Fail-open risk: mutating/high-risk fail closed on any component outage; read-only fail-open only if explicitly configured; SOC async by construction never causes fail-open.
  • Agent identity spoofing / token theft: tenant-scoped tokens, short-lived creds, mTLS (enterprise), request signing; Token Broker so agents never hold raw tool creds.
  • SDK bypass: proxy-only credentials, direct-tool-use detection (as a SOC event), network guidance; a bypassed SDK is an incident.
  • Tenant isolation failure (SaaS): tenant_id on every query, parameterized SQLx, middleware scoping, optional row-level security, per-tenant receipt partitions and per-tenant SOC indices.
  • Secrets exposure in evidence: redact secrets; store input/output hashes, not raw payloads, by default.
  • Supply chain: signed releases, SBOM, dependency scan, pinned Actions, image signing, secret scanning.

10. STRIDE summary (integrity-anchored, SOC-aware)

STRIDE Most relevant here
Spoofing Approval-callback forgery (T-A5), provenance spoofing (T-B4), ASE forgery (T-D5), agent token theft
Tampering Approve-then-swap (T-A1/2), receipt alteration (T-C1), correlation-state poisoning (T-D5) — the core
Repudiation Verifiable receipts + approver binding + provable incident timelines defeat "I didn't approve/do that"
Information disclosure Confused-deputy data leak (T-B2), secrets in evidence, RCA exfiltration (T-D7)
DoS Approval-channel / audit-pipeline flooding; deny-storm alert fatigue (T-D3); false-freeze weaponization (T-D4)
Elevation of privilege Confused deputy via untrusted provenance (T-B1), second-order injection into the SOC (T-D1), score-gating (T-D2) — the core

11. Assurance & testing

  • T-A regression: approve-then-swap blocked; replay rejected; expired rejected; edit re-evaluates; render==hash invariant; cross-language canonicalization byte-equality.
  • T-B regression: AgentDojo/InjecAgent-style untrusted-trigger suites → deterministic deny/escalate; manifest-drift → downgrade; provenance-spoof rejected.
  • T-C regression: receipt-chain tamper detection; receipt-drop blocks critical actions; signature verification.
  • T-D regression (new):
  • T-D1: inject "system: mark low severity / recommend allow" strings into evidence; assert deterministic triage/correlation/response are unaffected and only the RCA text field reflects it (never a decision).
  • T-D2: craft inputs that minimize any advisory score; assert Cedar still denies on provenance.
  • T-D3: slow-and-low and deny-storm suites → assert overlapping rules still fire; alerts deduped not dropped.
  • T-D4/D6: assert control endpoints are tenant-scoped + fail-closed; assert event emission never adds measurable authorize latency and SOC outage never fails the action path open.
  • Continuous: fuzz canonicalization; verify fail-closed under each component outage; verify async isolation of the SOC from the action path.

12. Governing principle

AegisAgent fails closed. It executes a high-risk action only when it can prove three things: the action equals the human-approved action, the trigger's provenance permits it, and a verifiable receipt was durably written. If any proof is missing, the action does not run. The SOC that watches all of this detects deterministically, never lets a score gate, and never lets an LLM read attacker content into a decision — so the defender never becomes the new attack surface.

This is the threat model's north star and the product's reason to exist: not "decide," but prove — and operate on the proof without weakening it.