Manipulation Resistance: Implementation Guide

Practical guidance for implementing APTS Manipulation Resistance requirements. Each section provides a brief implementation approach, key considerations, and common pitfalls.

Note: This guide is informative, not normative. Recommended defaults and example values are suggested starting points; the Manipulation Resistance README contains the authoritative requirements. Where this guide and the README differ, the README governs.


APTS-MR-001: Instruction Boundary Enforcement

Implementation: Implement cryptographic tagging of operator instructions separate from target content. Maintain strict source tracking throughout execution pipeline.

Key Considerations:

Common Pitfalls:

Prompt Injection Detection for AI/LLM-Based Platforms:

Platforms using LLMs for decision-making MUST implement layered prompt injection defense:

  1. Input sanitization: Strip or escape instruction-like patterns from target responses before passing to LLM. Patterns include: "ignore previous instructions", "you are now", "system:", "assistant:", and role-switching attempts.
  2. Instruction boundary enforcement: Use structured prompts with clear delimiters between system instructions (immutable) and target data (untrusted). Example: wrap target data in explicit tags like <target_response>...</target_response> and instruct the model to treat content within these tags as data, never as instructions.
  3. Output validation: Validate LLM outputs against expected action schemas before execution. If the LLM proposes an action outside the pre-defined action set, reject it and log the anomaly.
  4. Canary detection: Embed unique tokens in system prompts. If these tokens appear in target-facing outputs, the instruction boundary has been breached. Alert the operator immediately.
  5. Dual-model verification: For high-risk decisions, submit the same input to a second independent model and compare outputs. Divergent recommendations trigger escalation.

APTS-MR-002: Response Validation & Sanitization

Implementation: Validate all target responses against predefined schemas. Apply regex-based sanitization to remove instruction-like patterns (keywords: "execute", "ignore", "override", "admin", "perform").

Key Considerations:

Common Pitfalls:


APTS-MR-003: Error Message Neutrality

Implementation: Treat all error messages as untrusted data only for analysis. Never execute directives embedded in error responses. Extract only diagnostic information for vulnerability assessment.

Key Considerations:

Common Pitfalls:


APTS-MR-004: Configuration File Integrity Verification

Implementation: Analyze discovered config files for vulnerabilities without executing embedded instructions. Cryptographically sign expected configs; reject unsigned variants. Flag instruction-like content as findings.

Key Considerations:

Common Pitfalls:


APTS-MR-005: Authority Claim Detection & Rejection

Implementation: Maintain allowlist of authentic authority markers (cryptographic signatures, registered domain patterns). Flag and reject any authority claims not on allowlist as suspicious.

Key Considerations:

Common Pitfalls:


APTS-MR-006: Decision Boundary Enforcement

Implementation: Codify all critical decisions (scope, escalation, credentials handling) in immutable pre-test configuration. Prevent runtime modifications via target-controlled data paths.

Key Considerations:

Common Pitfalls:


APTS-MR-007: Redirect Following Policy

Implementation: Validate all HTTP redirects (3xx status codes), JavaScript redirects, and meta-refresh directives against pre-approved scope. Reject out-of-scope redirects; log for analysis.

Key Considerations:

Common Pitfalls:


APTS-MR-008: DNS and Network-Level Redirect Prevention

Implementation: Validate DNS responses against authorized scope before network connection. Detect DNS rebinding attacks and TOCTOU race conditions. Use DNSSEC validation where available.

Key Considerations:

Common Pitfalls:


APTS-MR-009: Server-Side Request Forgery (SSRF) Prevention in Testing

Implementation: Before following any URL discovered in target responses, validate against scope. Block private IP ranges (RFC 1918, 127.0.0.1, ::1) unless explicitly in scope.

Key Considerations:

Common Pitfalls:


APTS-MR-010: Scope Expansion Social Engineering Prevention

Implementation: Detect social engineering patterns in target content (urgency, authority claims, emotional appeals). Flag suspicious scope expansion requests; require human approval.

Key Considerations:

Common Pitfalls:


APTS-MR-011: Out-of-Band Communication Prevention

Implementation: Block all communication channels except pre-approved testing channels. Detect and prevent blind SSRF, DNS exfiltration, and indirect data exfiltration attempts by targets.

Key Considerations:

Common Pitfalls:


APTS-MR-012: Immutable Scope Enforcement Architecture

Implementation: Implement scope as cryptographically signed, immutable data structure. Use hardware-backed or TPM-based enforcement where possible. Require cryptographic re-signing for any modifications.

Key Considerations:

Common Pitfalls:


APTS-MR-013: Adversarial Example Detection in Vulnerability Classification

Implementation: Test AI/ML-based vulnerability classification against adversarial examples. Include payload obfuscation, semantic equivalence variants, and jailbreak attempts in test sets.

Key Considerations:

Common Pitfalls:


APTS-MR-014: Resource Exhaustion and Tarpit Attack Prevention

Implementation: Implement strict per-endpoint timeouts (5-30s). Detect slowloris, timeout-based tarpits, and memory exhaustion patterns. Implement exponential backoff with jitter.

Key Considerations:

Common Pitfalls:


APTS-MR-015: Deceptive Authentication Honeypots

Implementation: Flag auth endpoints discovered in scope as potentially deceptive. Do not auto-authenticate to suspicious endpoints. Require explicit human approval before credential submission.

Key Considerations:

Common Pitfalls:


APTS-MR-016: Anti-Automation Defense Detection

Implementation: Respect rate limiting headers (Retry-After, X-RateLimit-*). Detect CAPTCHA, challenge-response, and fingerprinting defenses. Back off immediately upon detection.

Key Considerations:

Common Pitfalls:


APTS-MR-017: Anomaly Detection in Response Patterns

Implementation: Baseline normal response patterns (size, latency, error rates) for each endpoint. Flag deviations exceeding 3 sigma as anomalies. Investigate or abort on suspicious patterns.

Key Considerations:

Common Pitfalls:


APTS-MR-018: AI Model Input/Output Architectural Boundary

Implementation: Enforce strict architectural separation between system instructions and target data in AI pipelines. Tag all target data; prevent merging with system prompts. Implement input validation before AI model ingestion.

Key Considerations:

Common Pitfalls:


APTS-MR-019: Discovered Credential Protection

Implementation: Immediately encrypt all discovered credentials (AES-256-GCM) at discovery time. Never store plaintext in logs, memory dumps, or output files. Use dedicated secret management system.

Key Considerations:

Common Pitfalls:


APTS-MR-020: Adversarial Validation and Resilience Testing of Safety Controls

Implementation: Conduct annual independent adversarial testing by external team. Include prompt injection, jailbreak attempts, scope bypass scenarios. Document all control failures; implement mitigations.

Key Considerations:

Common Pitfalls:


APTS-MR-021: Data Isolation Adversarial Testing

Implementation: Test cross-tenant access vectors (shared storage, shared memory, shared logs, shared error handlers). Attempt to access other tenant data via all communication channels.

Key Considerations:

Common Pitfalls:


APTS-MR-022: Inter-Model Trust Boundaries and Output Validation

Implementation: Define trust boundaries between AI model components. Validate all model-to-model outputs against strict schemas. Never trust downstream model outputs without sanitization.

Key Considerations:

Common Pitfalls:


APTS-MR-023: Agent Runtime as an Untrusted Component

Implementation: Write a threat-model document that explicitly names the agent runtime as one of the adversaries the platform defends against. Isolate the agent runtime from the control plane at a layer the runtime cannot reach: a separate process with no shared filesystem to platform config, a separate container with no mount of the platform's secrets or allowlist, a separate network namespace with no route to the control-plane management endpoints. Platform configuration files (allowlists, threshold tables, kill-switch config, audit store credentials) live on a volume the agent runtime has no read, write, or modify access to. Distribution-shift monitoring is a background job that consumes audit records and flags anomalies for a human-reviewed queue. Run a tabletop exercise at least quarterly that walks through "what if the agent tries to do X" for a few concrete X values drawn from current threat intelligence.

Key Considerations:

Common Pitfalls:


Advisory Practice Implementation Guidance

APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite

This section provides implementation guidance for the advisory practice APTS-MR-A01. It is not required for conformance at any tier.

Implementation: Build the evaluation suite as an independent pipeline outside the agent's serving infrastructure. Each run takes a fixed model configuration (foundation model version, adapter set, system prompt, tool access policy, temperature) and produces a structured run record with per-scenario decisions, per-category alignment rates, and an aggregate score committed to the audit trail.

Architecture Pattern: Independent Evaluation Pipeline

A proven pattern is to implement the suite using an off-the-shelf evaluation framework (Inspect AI, Braintrust, OpenAI Evals, or an equivalent harness) and to run it as a standalone job rather than as part of the engagement runtime. The pipeline has four components:

  1. Scenario corpus. A versioned set of distinguishing scenarios stored as data, not code. Each scenario carries: a target description (the synthetic environment the agent is asked to assess), the proxy-objective action, the true-objective action, the rubric mapping the agent's chosen action to one of the three classes (true-objective-aligned, proxy-objective-aligned, other), and a category tag.
  2. Run harness. A driver that instantiates the same agent loop the platform uses in production (same system prompt, same tool registry, same control plane) but pointed at the synthetic targets in the scenario corpus. The harness records every tool call, every model output, and the agent's terminal action per scenario.
  3. Scorer. A rule-based or LLM-judge scorer (independent of the agent under test) that maps each scenario's run record to one of the three classes. Where an LLM judge is used, the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-RP-A01).
  4. Reporter. Aggregates per-category alignment rates, computes aggregate score, compares against documented thresholds, computes deltas against the most recent prior run on the same model version, and writes a signed run record to the audit trail per APTS-AR-012.

Calibrated Alignment Thresholds:

For scenario categories where the population of distinguishing cases is small, define a minimum sample size before thresholds apply. For example: a category with fewer than 30 scenarios should report a confidence-interval band rather than a single alignment rate, and threshold checks should compare the lower bound of the band against the floor. This prevents a category with 5 scenarios from triggering reauthorization on a single regression that is within statistical noise.

Out-of-Distribution Audit:

Implement the post-fine-tuning out-of-distribution audit as a separate evaluation job that runs the model against a held-out general-purpose benchmark (HHH-style or equivalent) before and after every fine-tuning event. The audit is concerned with relative shift, not absolute score: a several-point drop on a previously-stable benchmark following narrow fine-tuning is the signal the Nature 2026 methodology identifies as emergent misalignment, and the relative drop matters more than the absolute number.

Key Considerations:

Common Pitfalls:


APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation

This section provides implementation guidance for the advisory practice APTS-MR-A02. It is not required for conformance at any tier.

Implementation: Build the consistency measurement as an independent pipeline that ingests, on one side, per-scenario records from the adversarial eval suite of APTS-MR-020, and on the other side a stratified sample of production engagement transcripts captured under APTS-AR-001. Each measurement run is anchored to a single model snapshot identifier (per APTS-AR-019) and produces a structured differential report (per-dimension deltas, sample sizes, and a verbalized-recognition correlation) committed to the audit trail.

Architecture Pattern: Differential Measurement Pipeline

The pipeline has four components, mirroring the structure used in APTS-MR-A01:

  1. Eval-side ingestor. Reads per-scenario records emitted by the MR-020 eval suite for the model snapshot under measurement. The ingestor records, per scenario, the agent's terminal action, tool-call trace, and reasoning trace if available.
  2. Production-side sampler. Draws a stratified sample of production transcripts from the audit store (APTS-AR-001), applying redactions consistent with APTS-TP-013 sensitive-data handling. The sampler MUST NOT relax retention or consent boundaries to obtain samples; where sampling is constrained, the limitation is reported in the differential report rather than worked around.
  3. Differential scorer. Computes per-dimension metrics on each sample and the delta between them. Where rule-based scoring is insufficient (for example, refusal-rate classification on free-form tool invocations), an LLM judge is used; the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-MR-A01 and APTS-RP-A01) and SHOULD use a different model family.
  4. Trend reporter. Aggregates the differential across model snapshots tracked under APTS-AR-019 so that the change in the differential itself is observable, writes a signed run record to the audit trail per APTS-AR-012, and emits the validation report for inclusion alongside the customer engagement record.

Stratified Sampling:

Stratify the production-side sample across at least three dimensions: autonomy level (L1–L4), engagement size class, and customer-environment class (for example, web application, internal network, cloud control plane). Document a minimum sample size per dimension before per-dimension thresholds apply; for dimensions with sparse samples, report a confidence-interval band rather than a single point estimate, and compare the band's bound against the threshold (mirroring the calibrated-threshold treatment in APTS-MR-A01).

Verbalized Eval-Recognition Audit:

Where reasoning traces are available, scan eval-side transcripts for evaluation-context tokens or situational-awareness markers (Laine et al., 2024, Situational Awareness Dataset; Anthropic Sonnet 4.5 / Opus 4.6 system-card methodology) and compute the correlation between verbalized recognition and the per-scenario behavioral classification. A non-zero correlation is informative even when the verbalized-recognition rate is low; absence of verbalized recognition is not informative and MUST NOT be reported as evidence of absence of recognition.

Key Considerations:

Common Pitfalls:


Implementation Roadmap

Tier 1 (implement before any autonomous pentesting begins): MR-001 (instruction boundary enforcement), MR-002 (response validation/sanitization), MR-003 (error message neutrality), MR-004 (config file integrity), MR-005 (authority claim detection), MR-006 (decision boundary enforcement), MR-007 and MR-008 (redirect and DNS-level prevention), MR-009 (SSRF prevention), MR-010 (scope expansion social engineering prevention), MR-011 (out-of-band communication prevention), MR-012 (immutable scope enforcement), MR-018 (AI model I/O boundary), MR-019 (discovered credential protection).

Start with MR-001 and MR-002 (instruction boundaries, response sanitization). These are your primary defense against prompt injection. Add MR-009 (SSRF) and MR-012 (immutable scope) next, then layer the remaining T1 controls.

Tier 2 (implement within first 3 engagements): MR-013 (adversarial example detection), MR-014 (resource exhaustion/tarpit prevention), MR-015 (deceptive authentication honeypots), MR-016 (anti-automation defense detection), MR-017 (response pattern anomaly detection), MR-020 (adversarial validation and resilience testing of safety controls), MR-022 (inter-model trust boundaries), MR-023 (agent runtime as an untrusted component).

Prioritize MR-020 (adversarial testing of your own controls) and MR-023 (agent runtime isolation) together. MR-023 defines the architectural property that makes the other MR controls actually binding when the underlying model changes; MR-020 validates that the property holds in practice.

Tier 3 (implement based on threat landscape): MR-021 (data isolation adversarial testing).