Manipulation Resistance: Implementation Guide

Practical guidance for implementing APTS Manipulation Resistance requirements. Each section provides a brief implementation approach, key considerations, and common pitfalls.

Note: This guide is informative, not normative. Recommended defaults and example values are suggested starting points; the Manipulation Resistance README contains the authoritative requirements. Where this guide and the README differ, the README governs.


APTS-MR-001: Instruction Boundary Enforcement

Implementation: Implement cryptographic tagging of operator instructions separate from target content. Maintain strict source tracking throughout execution pipeline.

Key Considerations:

Common Pitfalls:

Prompt Injection Detection for AI/LLM-Based Platforms:

Platforms using LLMs for decision-making MUST implement layered prompt injection defense:

  1. Input sanitization: Strip or escape instruction-like patterns from target responses before passing to LLM. Patterns include: "ignore previous instructions", "you are now", "system:", "assistant:", and role-switching attempts.
  2. Instruction boundary enforcement: Use structured prompts with clear delimiters between system instructions (immutable) and target data (untrusted). Example: wrap target data in explicit tags like <target_response>...</target_response> and instruct the model to treat content within these tags as data, never as instructions.
  3. Output validation: Validate LLM outputs against expected action schemas before execution. If the LLM proposes an action outside the pre-defined action set, reject it and log the anomaly.
  4. Canary detection: Embed unique tokens in system prompts. If these tokens appear in target-facing outputs, the instruction boundary has been breached. Alert the operator immediately.
  5. Dual-model verification: For high-risk decisions, submit the same input to a second independent model and compare outputs. Divergent recommendations trigger escalation.

APTS-MR-002: Response Validation & Sanitization

Implementation: Validate all target responses against predefined schemas. Apply regex-based sanitization to remove instruction-like patterns (keywords: "execute", "ignore", "override", "admin", "perform").

Key Considerations:

Common Pitfalls:


APTS-MR-003: Error Message Neutrality

Implementation: Treat all error messages as untrusted data only for analysis. Never execute directives embedded in error responses. Extract only diagnostic information for vulnerability assessment.

Key Considerations:

Common Pitfalls:


APTS-MR-004: Configuration File Integrity Verification

Implementation: Analyze discovered config files for vulnerabilities without executing embedded instructions. Cryptographically sign expected configs; reject unsigned variants. Flag instruction-like content as findings.

Key Considerations:

Common Pitfalls:


APTS-MR-005: Authority Claim Detection & Rejection

Implementation: Maintain allowlist of authentic authority markers (cryptographic signatures, registered domain patterns). Flag and reject any authority claims not on allowlist as suspicious.

Key Considerations:

Common Pitfalls:


APTS-MR-006: Decision Boundary Enforcement

Implementation: Codify all critical decisions (scope, escalation, credentials handling) in immutable pre-test configuration. Prevent runtime modifications via target-controlled data paths.

Key Considerations:

Common Pitfalls:


APTS-MR-007: Redirect Following Policy

Implementation: Validate all HTTP redirects (3xx status codes), JavaScript redirects, and meta-refresh directives against pre-approved scope. Reject out-of-scope redirects; log for analysis.

Key Considerations:

Common Pitfalls:


APTS-MR-008: DNS and Network-Level Redirect Prevention

Implementation: Validate DNS responses against authorized scope before network connection. Detect DNS rebinding attacks and TOCTOU race conditions. Use DNSSEC validation where available.

Key Considerations:

Common Pitfalls:


APTS-MR-009: Server-Side Request Forgery (SSRF) Prevention in Testing

Implementation: Before following any URL discovered in target responses, validate against scope. Block private IP ranges (RFC 1918, 127.0.0.1, ::1) unless explicitly in scope.

Key Considerations:

Common Pitfalls:


APTS-MR-010: Scope Expansion Social Engineering Prevention

Implementation: Detect social engineering patterns in target content (urgency, authority claims, emotional appeals). Flag suspicious scope expansion requests; require human approval.

Key Considerations:

Common Pitfalls:


APTS-MR-011: Out-of-Band Communication Prevention

Implementation: Block all communication channels except pre-approved testing channels. Detect and prevent blind SSRF, DNS exfiltration, and indirect data exfiltration attempts by targets.

Key Considerations:

Common Pitfalls:


APTS-MR-012: Immutable Scope Enforcement Architecture

Implementation: Implement scope as cryptographically signed, immutable data structure. Use hardware-backed or TPM-based enforcement where possible. Require cryptographic re-signing for any modifications.

Key Considerations:

Common Pitfalls:


APTS-MR-013: Adversarial Example Detection in Vulnerability Classification

Implementation: Test AI/ML-based vulnerability classification against adversarial examples. Include payload obfuscation, semantic equivalence variants, and jailbreak attempts in test sets.

Key Considerations:

Common Pitfalls:


APTS-MR-014: Resource Exhaustion and Tarpit Attack Prevention

Implementation: Implement strict per-endpoint timeouts (5-30s). Detect slowloris, timeout-based tarpits, and memory exhaustion patterns. Implement exponential backoff with jitter.

Key Considerations:

Common Pitfalls:


APTS-MR-015: Deceptive Authentication Honeypots

Implementation: Flag auth endpoints discovered in scope as potentially deceptive. Do not auto-authenticate to suspicious endpoints. Require explicit human approval before credential submission.

Key Considerations:

Common Pitfalls:


APTS-MR-016: Anti-Automation Defense Detection

Implementation: Respect rate limiting headers (Retry-After, X-RateLimit-*). Detect CAPTCHA, challenge-response, and fingerprinting defenses. Back off immediately upon detection.

Key Considerations:

Common Pitfalls:


APTS-MR-017: Anomaly Detection in Response Patterns

Implementation: Baseline normal response patterns (size, latency, error rates) for each endpoint. Flag deviations exceeding 3 sigma as anomalies. Investigate or abort on suspicious patterns.

Key Considerations:

Common Pitfalls:


APTS-MR-018: AI Model Input/Output Architectural Boundary

Implementation: Enforce strict architectural separation between system instructions and target data in AI pipelines. Tag all target data; prevent merging with system prompts. Implement input validation before AI model ingestion.

Key Considerations:

Common Pitfalls:


APTS-MR-019: Discovered Credential Protection

Implementation: Immediately encrypt all discovered credentials (AES-256-GCM) at discovery time. Never store plaintext in logs, memory dumps, or output files. Use dedicated secret management system.

Key Considerations:

Common Pitfalls:


APTS-MR-020: Adversarial Validation and Resilience Testing of Safety Controls

Implementation: Conduct annual independent adversarial testing by external team. Include prompt injection, jailbreak attempts, scope bypass scenarios. Document all control failures; implement mitigations.

Key Considerations:

Common Pitfalls:


APTS-MR-021: Data Isolation Adversarial Testing

Implementation: Test cross-tenant access vectors (shared storage, shared memory, shared logs, shared error handlers). Attempt to access other tenant data via all communication channels.

Key Considerations:

Common Pitfalls:


APTS-MR-022: Inter-Model Trust Boundaries and Output Validation

Implementation: Define trust boundaries between AI model components. Validate all model-to-model outputs against strict schemas. Never trust downstream model outputs without sanitization.

Key Considerations:

Common Pitfalls:


APTS-MR-023: Agent Runtime as an Untrusted Component

Implementation: Write a threat-model document that explicitly names the agent runtime as one of the adversaries the platform defends against. Isolate the agent runtime from the control plane at a layer the runtime cannot reach: a separate process with no shared filesystem to platform config, a separate container with no mount of the platform's secrets or allowlist, a separate network namespace with no route to the control-plane management endpoints. Platform configuration files (allowlists, threshold tables, kill-switch config, audit store credentials) live on a volume the agent runtime has no read, write, or modify access to. Distribution-shift monitoring is a background job that consumes audit records and flags anomalies for a human-reviewed queue. Run a tabletop exercise at least quarterly that walks through "what if the agent tries to do X" for a few concrete X values drawn from current threat intelligence.

Key Considerations:

Common Pitfalls:


Implementation Roadmap

Tier 1 (implement before any autonomous pentesting begins): MR-001 (instruction boundary enforcement), MR-002 (response validation/sanitization), MR-003 (error message neutrality), MR-004 (config file integrity), MR-005 (authority claim detection), MR-006 (decision boundary enforcement), MR-007 and MR-008 (redirect and DNS-level prevention), MR-009 (SSRF prevention), MR-010 (scope expansion social engineering prevention), MR-011 (out-of-band communication prevention), MR-012 (immutable scope enforcement), MR-018 (AI model I/O boundary), MR-019 (discovered credential protection).

Start with MR-001 and MR-002 (instruction boundaries, response sanitization). These are your primary defense against prompt injection. Add MR-009 (SSRF) and MR-012 (immutable scope) next, then layer the remaining T1 controls.

Tier 2 (implement within first 3 engagements): MR-013 (adversarial example detection), MR-014 (resource exhaustion/tarpit prevention), MR-015 (deceptive authentication honeypots), MR-016 (anti-automation defense detection), MR-017 (response pattern anomaly detection), MR-020 (adversarial validation and resilience testing of safety controls), MR-022 (inter-model trust boundaries), MR-023 (agent runtime as an untrusted component).

Prioritize MR-020 (adversarial testing of your own controls) and MR-023 (agent runtime isolation) together. MR-023 defines the architectural property that makes the other MR controls actually binding when the underlying model changes; MR-020 validates that the property holds in practice.

Tier 3 (implement based on threat landscape): MR-021 (data isolation adversarial testing).