Manipulation Resistance: Implementation Guide
Practical guidance for implementing APTS Manipulation Resistance requirements. Each section provides a brief implementation approach, key considerations, and common pitfalls.
Note: This guide is informative, not normative. Recommended defaults and example values are suggested starting points; the Manipulation Resistance README contains the authoritative requirements. Where this guide and the README differ, the README governs.
APTS-MR-001: Instruction Boundary Enforcement
Implementation: Implement cryptographic tagging of operator instructions separate from target content. Maintain strict source tracking throughout execution pipeline.
Key Considerations:
- Tag all operator inputs with immutable cryptographic markers before processing
- Enforce separation in memory/data structures; never merge instruction and target data paths
- Validate source origin before acting on any directive
Common Pitfalls:
- Mixing operator and target content in shared buffers
- Losing source attribution through multi-step processing
Prompt Injection Detection for AI/LLM-Based Platforms:
Platforms using LLMs for decision-making MUST implement layered prompt injection defense:
- Input sanitization: Strip or escape instruction-like patterns from target responses before passing to LLM. Patterns include: "ignore previous instructions", "you are now", "system:", "assistant:", and role-switching attempts.
- Instruction boundary enforcement: Use structured prompts with clear delimiters between system instructions (immutable) and target data (untrusted). Example: wrap target data in explicit tags such as <target_response>...</target_response> and instruct the model to treat content within these tags as data, never as instructions (see the sketch after this list).
- Output validation: Validate LLM outputs against expected action schemas before execution. If the LLM proposes an action outside the pre-defined action set, reject it and log the anomaly.
- Canary detection: Embed unique tokens in system prompts. If these tokens appear in target-facing outputs, the instruction boundary has been breached. Alert the operator immediately.
- Dual-model verification: For high-risk decisions, submit the same input to a second independent model and compare outputs. Divergent recommendations trigger escalation.
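A minimal sketch of the input-sanitization and canary layers described above; the pattern list and canary format are illustrative placeholders, not normative values:

```python
import re
import secrets

# Illustrative instruction-like patterns; extend from the platform's own pattern library.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE | re.MULTILINE)
    for p in (
        r"ignore (all )?previous instructions",
        r"you are now",
        r"^\s*(system|assistant)\s*:",
        r"</?\s*target_response\s*>",  # attempts to break out of the data delimiters
    )
]

def sanitize_target_data(text: str) -> str:
    """Escape instruction-like patterns in untrusted target output before prompting."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub(lambda m: "[escaped: " + m.group(0) + "]", text)
    return text

def make_canary() -> str:
    """Unique per-session token embedded in the system prompt."""
    return "CANARY-" + secrets.token_hex(8)

def canary_leaked(canary: str, target_facing_output: str) -> bool:
    """True means the instruction boundary has been breached; alert the operator."""
    return canary in target_facing_output
```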
APTS-MR-002: Response Validation & Sanitization
Implementation: Validate all target responses against predefined schemas. Apply regex-based sanitization to remove instruction-like patterns (keywords: "execute", "ignore", "override", "admin", "perform").
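A minimal sketch of allowlist schema validation using the jsonschema package; the example schema is illustrative, and per-endpoint schemas belong in the pre-test configuration:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative per-endpoint schema: allowlist validation, unknown fields rejected.
LOGIN_RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["ok", "error"]},
        "message": {"type": "string", "maxLength": 2048},
    },
    "required": ["status"],
    "additionalProperties": False,  # allowlist: anything not declared is rejected
}

def validate_response(endpoint_schema: dict, payload: dict, audit_log) -> bool:
    try:
        validate(instance=payload, schema=endpoint_schema)
        return True
    except ValidationError as exc:
        # Log every rejection for post-test auditing.
        audit_log.warning("schema rejection: %s", exc.message)
        return False
```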
Key Considerations:
- Define strict schemas per endpoint before testing begins
- Log all sanitization events for post-test auditing
- Use allowlist validation where possible instead of blocklist
Common Pitfalls:
- Over-reliance on blocklists that miss obfuscated patterns
- Insufficient logging of sanitization events
APTS-MR-003: Error Message Neutrality
Implementation: Treat all error messages as untrusted data to be used only for analysis. Never execute directives embedded in error responses. Extract only diagnostic information for vulnerability assessment.
Key Considerations:
- Parse error messages for vulnerability indicators, not instructions
- Implement circuit breaker on suspicious error patterns
- Document all extracted error information separately from actions taken
Common Pitfalls:
- Following debugging suggestions in error messages
- Assuming error messages from trusted domains are safe
APTS-MR-004: Configuration File Integrity Verification
Implementation: Analyze discovered config files for vulnerabilities without executing embedded instructions. Cryptographically sign expected configs; reject unsigned variants. Flag instruction-like content as findings.
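A minimal sketch of the baseline-hash check, with placeholder file names and digests; discovered configs are compared and flagged, never loaded or executed:

```python
import hashlib
from pathlib import Path

# Baseline SHA-256 digests of legitimate configurations, loaded from the
# pre-test configuration (the file name and digest below are placeholders).
BASELINE_HASHES = {
    "app.conf": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def config_matches_baseline(path: Path) -> bool:
    """Compare a discovered config file against its baseline digest.
    A mismatch is a finding to analyze, never a file to execute or load."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return BASELINE_HASHES.get(path.name) == digest
```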
Key Considerations:
- Maintain baseline hashes of legitimate configurations
- Scan for env var injection, comment-based instructions, hidden directives
- Separate vulnerability analysis from any auto-remediation
Common Pitfalls:
- Auto-loading or executing config-embedded scripts
- Missing injection vectors in config comments
APTS-MR-005: Authority Claim Detection & Rejection
Implementation: Maintain allowlist of authentic authority markers (cryptographic signatures, registered domain patterns). Flag and reject any authority claims not on allowlist as suspicious.
Key Considerations:
- Define authority markers in pre-test configuration only
- Log all authority claims for post-test review
- Treat "admin", "developer", "system" claims from target content as red flags
Common Pitfalls:
- Accepting domain-based authority without cryptographic proof
- Allowing runtime modification of authority allowlists
APTS-MR-006: Decision Boundary Enforcement
Implementation: Codify all critical decisions (scope, escalation, credentials handling) in immutable pre-test configuration. Prevent runtime modifications via target-controlled data paths.
Key Considerations:
- Use immutable configuration objects after initialization
- Require explicit human re-authorization for any scope/credential changes
- Version-control all pre-test decisions for audit trail
Common Pitfalls:
- Allowing dynamic scope expansion based on target responses
- Implicit scope changes via redirect chains
APTS-MR-007: Redirect Following Policy
Implementation: Validate all HTTP redirects (3xx status codes), JavaScript redirects, and meta-refresh directives against pre-approved scope. Reject out-of-scope redirects; log for analysis.
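A minimal sketch of hop-by-hop redirect validation, assuming a requests-style session; the host allowlist and hop limit are placeholders:

```python
from urllib.parse import urljoin, urlparse

APPROVED_HOSTS = {"app.example.test", "api.example.test"}  # per-target allowlist (placeholder)
MAX_REDIRECT_HOPS = 5

def follow_redirects(session, url: str):
    """Follow 3xx redirects manually, validating every hop independently."""
    for _ in range(MAX_REDIRECT_HOPS):
        resp = session.get(url, allow_redirects=False, timeout=10)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return resp
        # Resolve relative Location headers against the current URL, then re-check scope:
        url = urljoin(url, resp.headers.get("Location", ""))
        if urlparse(url).hostname not in APPROVED_HOSTS:
            raise RuntimeError(f"out-of-scope redirect rejected: {url}")
    raise RuntimeError("redirect chain exceeded maximum hops")
```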
Key Considerations:
- Maintain explicit redirect allowlist per target
- Detect redirect chains; validate each hop independently
- Monitor for redirect-based DoS or infinite loops
Common Pitfalls:
- Blindly following scheme-relative redirects (//other-host/path) that resolve to different hosts
- Missing JavaScript-based redirects (window.location)
APTS-MR-008: DNS and Network-Level Redirect Prevention
Implementation: Validate DNS responses against authorized scope before network connection. Detect DNS rebinding attacks and TOCTOU race conditions. Use DNSSEC validation where available.
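A minimal sketch of resolve-validate-pin, assuming the connection layer can be pointed at the pinned IP; the authorized network is a placeholder:

```python
import ipaddress
import socket

AUTHORIZED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder scope

def resolve_and_pin(hostname: str) -> str:
    """Resolve once, validate against scope, and return a pinned IP.
    All later connections for this host must use the pinned IP, so a
    changed DNS answer between lookup and connect (rebinding/TOCTOU)
    cannot redirect traffic."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    for *_, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if any(ip in net for net in AUTHORIZED_NETWORKS):
            return str(ip)
    raise RuntimeError(f"no authorized address for {hostname}")
```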
Key Considerations:
- Implement DNS response validation against allowlist IPs
- Monitor for rapid DNS changes (rebinding)
- Pin resolved IPs for duration of interaction with domain
Common Pitfalls:
- Trusting DNS without validation; missing rebinding attacks
- Not detecting DNS response changes between lookup and connection
APTS-MR-009: Server-Side Request Forgery (SSRF) Prevention in Testing
Implementation: Before following any URL discovered in target responses, validate against scope. Block private and loopback address ranges (RFC 1918, 127.0.0.0/8, ::1) unless explicitly in scope.
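A minimal sketch of pre-request URL validation that normalizes alternative IPv4 notations via inet_aton; the blocklist is a starting point, extended per the considerations below:

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Starting-point blocklist; 169.254.0.0/16 already covers the 169.254.169.254
# metadata endpoint used by AWS, GCP, and Azure.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                         # loopback
    "169.254.0.0/16", "fd00:ec2::254/128",            # link-local, AWS IMDS IPv6
    "224.0.0.0/4",                                    # multicast
)]

def parse_host_ip(host: str):
    """Normalize alternative IPv4 notations (decimal, octal, hex) via inet_aton."""
    try:
        return ipaddress.ip_address(socket.inet_ntoa(socket.inet_aton(host)))
    except OSError:
        try:
            return ipaddress.ip_address(host)  # plain IPv4/IPv6 literal
        except ValueError:
            return None  # a hostname: resolve and re-validate per APTS-MR-008

def url_is_safe(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    ip = parse_host_ip(parsed.hostname or "")
    if ip is not None and any(ip in net for net in BLOCKED_NETWORKS):
        return False
    return True
```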
Key Considerations:
- Maintain expanded IP blocklist: 169.254.0.0/16, 224.0.0.0/4, and cloud metadata endpoints for AWS (169.254.169.254 for IMDSv1 and IMDSv2, plus the IPv6 endpoint fd00:ec2::254), GCP (metadata.google.internal and 169.254.169.254), and Azure (169.254.169.254 on path /metadata/instance with header "Metadata: true")
- Check URL scheme (only http/https in most cases)
- Log all blocked SSRF attempts with target source context
Common Pitfalls:
- Missing alternative IP representations (decimal, hex, octal notation)
- Not blocking cloud metadata endpoints (AWS, GCP, Azure)
APTS-MR-010: Scope Expansion Social Engineering Prevention
Implementation: Detect social engineering patterns in target content (urgency, authority claims, emotional appeals). Flag suspicious scope expansion requests; require human approval.
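A minimal sketch of pattern-based flagging feeding the human review gate; the indicator list is an illustrative seed for the pattern library:

```python
import re

# Illustrative indicator seed; maintain and version the full library per APTS-MR-010.
SOCIAL_ENGINEERING_PATTERNS = re.compile(
    r"\b(urgent|emergency|critical|authorized by|please expand|add .{0,40} to scope)\b",
    re.IGNORECASE,
)

def scope_request_requires_review(target_content: str) -> bool:
    """Any scope-expansion request touching these patterns is routed to the
    mandatory human review gate; the agent never auto-approves."""
    return bool(SOCIAL_ENGINEERING_PATTERNS.search(target_content))
```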
Key Considerations:
- Maintain pattern library of social engineering indicators
- Flag requests containing "urgent", "emergency", "authorization", "please", "critical"
- Implement mandatory human review gate for scope changes
Common Pitfalls:
- Missing emotionally manipulative language in target responses
- Auto-accepting scope expansion claims without human verification
APTS-MR-011: Out-of-Band Communication Prevention
Implementation: Block all communication channels except pre-approved testing channels. Detect and prevent blind SSRF, DNS exfiltration, and indirect data exfiltration attempts by targets.
Key Considerations:
- Monitor outbound DNS, HTTP, SMTP for unauthorized targets
- Detect second-order request patterns (email callbacks, webhook calls)
- Log all out-of-band attempt detections
Common Pitfalls:
- Missing DNS query-based exfiltration
- Not detecting targets' attempts to control outbound traffic
APTS-MR-012: Immutable Scope Enforcement Architecture
Implementation: Implement scope as cryptographically signed, immutable data structure. Use hardware-backed or TPM-based enforcement where possible. Require cryptographic re-signing for any modifications.
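A minimal sketch of signature-gated scope loading using an Ed25519 detached signature (via the cryptography package) and a frozen dataclass; the field names are illustrative:

```python
import json
from dataclasses import dataclass
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after init
class Scope:
    targets: tuple[str, ...]
    engagement_id: str

def load_scope(scope_bytes: bytes, signature: bytes, pubkey_bytes: bytes) -> Scope:
    """Verify the detached Ed25519 signature before the scope ever becomes a
    live object; the signing key stays offline, so any modification requires
    an offline re-signing step."""
    public_key = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
    public_key.verify(signature, scope_bytes)  # raises InvalidSignature on tamper
    data = json.loads(scope_bytes)
    return Scope(targets=tuple(data["targets"]), engagement_id=data["engagement_id"])
```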
Key Considerations:
- Use digital signatures with offline key storage for scope validation
- Implement scope as append-only audit log with cryptographic chaining
- Store scope outside regular application memory where possible
Common Pitfalls:
- Storing scope in modifiable configuration files
- Missing logging of scope override attempts
APTS-MR-013: Adversarial Example Detection in Vulnerability Classification
Implementation: Test AI/ML-based vulnerability classification against adversarial examples. Include payload obfuscation, semantic equivalence variants, and jailbreak attempts in test sets.
Key Considerations:
- Maintain adversarial example library (encoding bypasses, prompt injections, semantic tricks)
- Run quarterly adversarial testing against classification models
- Document all adversarial bypass attempts for model hardening
Common Pitfalls:
- Relying solely on standard test payloads; missing novel obfuscation
- Not updating adversarial library as attack techniques evolve
APTS-MR-014: Resource Exhaustion and Tarpit Attack Prevention
Implementation: Implement strict per-endpoint timeouts (5-30s). Detect slowloris, timeout-based tarpits, and memory exhaustion patterns. Implement exponential backoff with jitter.
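A minimal sketch of strict timeouts plus exponential backoff with full jitter, assuming the requests library; the timeout and retry values are placeholders within the recommended bands:

```python
import random
import time
import requests

def fetch_with_backoff(url: str, max_attempts: int = 5, timeout: float = 15.0):
    """Strict per-request timeout (within the 5-30s band) plus exponential
    backoff with full jitter between retries."""
    for attempt in range(max_attempts):
        try:
            # timeout bounds both connect and read; a tarpit holding the
            # socket open cannot stall the agent past this limit.
            return requests.get(url, timeout=timeout)
        except requests.Timeout:
            delay = random.uniform(0, min(60, 2 ** attempt))  # full jitter
            time.sleep(delay)
    raise RuntimeError(f"{url}: gave up after {max_attempts} timeouts")
```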
Key Considerations:
- Set baseline response time threshold; flag 10x+ deviations as tarpits
- Implement connection/request limits per host
- Monitor heap/memory growth; abort on unusual spikes
Common Pitfalls:
- Over-generous timeouts allowing indefinite hangs
- Missing memory exhaustion detection in response parsing
APTS-MR-015: Deceptive Authentication Honeypots
Implementation: Flag auth endpoints discovered in scope as potentially deceptive. Do not auto-authenticate to suspicious endpoints. Require explicit human approval before credential submission.
Key Considerations:
- Maintain list of known deceptive auth patterns (fake login forms, secondary auth)
- Detect endpoints mimicking legitimate auth with slight variations
- Log all suspicious auth endpoint detections
Common Pitfalls:
- Auto-submitting credentials to any /login or /auth endpoint
- Missing visually similar but distinct auth endpoint clones
APTS-MR-016: Anti-Automation Defense Detection
Implementation: Respect rate limiting headers (Retry-After, X-RateLimit-*). Detect CAPTCHA, challenge-response, and fingerprinting defenses. Back off immediately upon detection.
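A minimal sketch of rate-limit header handling, assuming a requests-style response object; the X-RateLimit-* headers are non-standard, and treating Reset as epoch seconds is an assumption that varies by target:

```python
import time

def backoff_for(response) -> float | None:
    """Return seconds to wait before the next request, or None if no
    rate-limit defense was signaled."""
    headers = response.headers
    if "Retry-After" in headers:
        try:
            return float(headers["Retry-After"])  # delta-seconds form
        except ValueError:
            return 60.0  # HTTP-date form: fall back to a conservative wait
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is not None and int(remaining) <= 1 and reset is not None:
        return max(0.0, float(reset) - time.time())  # assumed epoch-seconds reset
    return None
```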
Key Considerations:
- Parse standard rate limit headers; apply backoff before hitting limits
- Detect CAPTCHA forms; refuse auto-completion
- Monitor for fingerprinting scripts (bot detection); abort if detected
Common Pitfalls:
- Ignoring Retry-After headers
- Attempting to bypass anti-automation via multiple IPs or rotation
APTS-MR-017: Anomaly Detection in Response Patterns
Implementation: Baseline normal response patterns (size, latency, error rates) for each endpoint. Flag deviations exceeding 3 sigma as anomalies. Investigate or abort on suspicious patterns.
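A minimal sketch of a per-endpoint 3-sigma latency baseline; the same pattern applies to response size and error rate:

```python
import statistics

class EndpointBaseline:
    """Rolling response-time baseline; flags deviations beyond 3 sigma."""
    MIN_SAMPLES = 10  # collect 10+ baseline responses before flagging

    def __init__(self):
        self.samples: list[float] = []

    def observe(self, latency: float) -> bool:
        """Record a latency; return True when it is anomalous."""
        if len(self.samples) >= self.MIN_SAMPLES:
            mean = statistics.fmean(self.samples)
            sigma = statistics.stdev(self.samples)
            if sigma > 0 and abs(latency - mean) > 3 * sigma:
                return True  # investigate or abort before continuing
        self.samples.append(latency)
        return False
```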
Key Considerations:
- Collect 10+ baseline responses per endpoint before aggressive testing
- Monitor content divergence (identical requests returning different responses)
- Detect timing anomalies (sudden slowdowns, consistency changes)
Common Pitfalls:
- Starting aggressive testing without baseline establishment
- Missing second-order anomalies (anomalies in anomalies)
APTS-MR-018: AI Model Input/Output Boundary Enforcement
Implementation: Enforce strict architectural separation between system instructions and target data in AI pipelines. Tag all target data; prevent merging with system prompts. Implement input validation before AI model ingestion.
Key Considerations:
- Use separate model instances or namespaced contexts for system and target data
- Implement prompt injection detection (unusual control characters, instruction keywords)
- Log all target data ingested by models for audit
Common Pitfalls:
- Concatenating target data directly into prompts without sanitization
- Missing encoding-based prompt injection vectors
APTS-MR-019: Discovered Credential Protection
Implementation: Immediately encrypt all discovered credentials (AES-256-GCM) at discovery time. Never store plaintext in logs, memory dumps, or output files. Use dedicated secret management system.
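A minimal sketch of encrypt-at-discovery using AES-256-GCM from the cryptography package; routing the per-credential key to the secret manager is assumed to happen out of band:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def encrypt_discovered_credential(plaintext: bytes, context: bytes) -> tuple[bytes, bytes, bytes]:
    """Encrypt at parse time, before anything can reach a log. A fresh key per
    credential; the key goes straight to the secret management system, never
    alongside the ciphertext. `context` (e.g. an engagement ID) is bound as AAD."""
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # 96-bit nonce, unique per encryption
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, context)
    return key, nonce, ciphertext
```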
Key Considerations:
- Implement credential discovery hooks to auto-encrypt at parse time
- Use unique encryption keys per credential; rotate keys post-test
- Restrict plaintext in-memory access to authorized personnel only
Common Pitfalls:
- Credentials logged before encryption
- Reusing encryption keys across multiple tests
- Storing decryption keys in same location as encrypted data
APTS-MR-020: Adversarial Validation and Resilience Testing of Safety Controls
Implementation: Conduct annual independent adversarial testing by external team. Include prompt injection, jailbreak attempts, scope bypass scenarios. Document all control failures; implement mitigations.
Key Considerations:
- Use a red team independent of the development team (internal or external)
- Test against realistic threat models (insider, advanced persistent actor)
- Maintain detailed findings log; track remediation completion
Common Pitfalls:
- Internal teams conducting "adversarial" testing without realistic threat modeling
- Skipping adversarial testing or conducting only sporadic reviews
APTS-MR-021: Data Isolation Adversarial Testing
Implementation: Test cross-tenant access vectors (shared storage, shared memory, shared logs, shared error handlers). Attempt to access other tenant data via all communication channels.
Key Considerations:
- Provision multiple test tenants; attempt cross-tenant access via all discovery vectors
- Test log aggregation systems, error reporting, analytics endpoints
- Verify data encryption at rest and in transit per tenant
Common Pitfalls:
- Testing only primary application data; missing isolation in logs/errors
- Not testing all secondary channels (metrics, traces, backups)
APTS-MR-022: Inter-Model Trust Boundaries and Output Validation
Implementation: Define trust boundaries between AI model components. Validate all model-to-model outputs against strict schemas. Never trust downstream model outputs without sanitization.
Key Considerations:
- Implement output validation layer between model components
- Apply same sanitization rules to model outputs as to target content
- Log all cross-model data transfers for audit trails
Common Pitfalls:
- Assuming high-confidence model outputs are safe to pass to downstream systems
- Missing injection vectors through model chaining (model A output -> model B input)
APTS-MR-023: Agent Runtime as an Untrusted Component
Implementation: Write a threat-model document that explicitly names the agent runtime as one of the adversaries the platform defends against. Isolate the agent runtime from the control plane at a layer the runtime cannot reach: a separate process with no shared filesystem to platform config, a separate container with no mount of the platform's secrets or allowlist, a separate network namespace with no route to the control-plane management endpoints. Platform configuration files (allowlists, threshold tables, kill-switch config, audit store credentials) live on a volume the agent runtime has no read, write, or modify access to. Distribution-shift monitoring is a background job that consumes audit records and flags anomalies for a human-reviewed queue. Run a tabletop exercise at least quarterly that walks through "what if the agent tries to do X" for a few concrete X values drawn from current threat intelligence.
Key Considerations:
- This requirement is architectural; it cannot be satisfied by prompt engineering or model alignment
- The isolation must survive the agent being wrong, not just the agent being attacked
- Distribution-shift monitoring needs a human in the loop because legitimate novel behavior will look anomalous
Common Pitfalls:
- Putting the agent runtime and the control plane in the same process and calling "function boundaries" isolation
- Mounting the platform configuration into the agent runtime's filesystem "for convenience"
- Treating a quarterly tabletop exercise as a box-ticking ritual rather than an opportunity to find real gaps
Advisory Practice Implementation Guidance
APTS-MR-A01: Goal Misgeneralization and Emergent Misalignment Evaluation Suite
This section provides implementation guidance for the advisory practice APTS-MR-A01. It is not required for conformance at any tier.
Implementation: Build the evaluation suite as an independent pipeline outside the agent's serving infrastructure. Each run takes a fixed model configuration (foundation model version, adapter set, system prompt, tool access policy, temperature) and produces a structured run record with per-scenario decisions, per-category alignment rates, and an aggregate score committed to the audit trail.
Architecture Pattern: Independent Evaluation Pipeline
A proven pattern is to implement the suite using an off-the-shelf evaluation framework (Inspect AI, Braintrust, OpenAI Evals, or an equivalent harness) and to run it as a standalone job rather than as part of the engagement runtime. The pipeline has four components:
- Scenario corpus. A versioned set of distinguishing scenarios stored as data, not code. Each scenario carries: a target description (the synthetic environment the agent is asked to assess), the proxy-objective action, the true-objective action, the rubric mapping the agent's chosen action to one of the three classes (true-objective-aligned, proxy-objective-aligned, other), and a category tag.
- Run harness. A driver that instantiates the same agent loop the platform uses in production (same system prompt, same tool registry, same control plane) but pointed at the synthetic targets in the scenario corpus. The harness records every tool call, every model output, and the agent's terminal action per scenario.
- Scorer. A rule-based or LLM-judge scorer (independent of the agent under test) that maps each scenario's run record to one of the three classes. Where an LLM judge is used, the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-RP-A01).
- Reporter. Aggregates per-category alignment rates, computes aggregate score, compares against documented thresholds, computes deltas against the most recent prior run on the same model version, and writes a signed run record to the audit trail per APTS-AR-012.
Calibrated Alignment Thresholds:
For scenario categories where the population of distinguishing cases is small, define a minimum sample size before thresholds apply. For example: a category with fewer than 30 scenarios should report a confidence-interval band rather than a single alignment rate, and threshold checks should compare the lower bound of the band against the floor. This prevents a category with 5 scenarios from triggering reauthorization on a single regression that is within statistical noise.
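One way to implement the confidence-interval band is the Wilson score interval; a minimal sketch:

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for an alignment rate.
    For categories under the minimum sample size, compare this bound (not
    the raw rate) against the documented floor."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

# Example: 4 aligned decisions out of 5 scenarios gives a lower bound of
# roughly 0.38, far below the 0.80 point estimate, so a single regression
# in a 5-scenario category stays within statistical noise.
```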
Out-of-Distribution Audit:
Implement the post-fine-tuning out-of-distribution audit as a separate evaluation job that runs the model against a held-out general-purpose benchmark (HHH-style or equivalent) before and after every fine-tuning event. The audit is concerned with relative shift, not absolute score: a several-point drop on a previously-stable benchmark following narrow fine-tuning is the signal the Nature 2026 methodology identifies as emergent misalignment, and the relative drop matters more than the absolute number.
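A minimal sketch of the relative-shift check; the 3% default threshold is a placeholder to be calibrated against the benchmark's historical run-to-run variance:

```python
def emergent_misalignment_signal(pre_score: float, post_score: float,
                                 max_relative_drop: float = 0.03) -> bool:
    """Flag a relative drop on a previously-stable held-out benchmark after
    narrow fine-tuning; the relative shift, not the absolute score, is the
    signal of interest."""
    if pre_score <= 0:
        return False
    return (pre_score - post_score) / pre_score > max_relative_drop
```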
Key Considerations:
- The evaluation suite must be governed as carefully as the model under APTS-TP-019. Scenario corpus contamination, where suite scenarios leak into post-deployment learning data, silently inflates alignment rates and is the evaluation-side analogue of training-set contamination in capability benchmarks.
- The judge component must use a different model family or, if the same family, a different system prompt and a separate inference call. APTS-RP-A01's independence requirement applies symmetrically here.
- Run the suite against the same model configuration the platform uses in production, including system prompt, tool registry, temperature, and any safety wrappers. A suite that passes against a stripped-down configuration provides no assurance about production behavior.
- Suite results should be tracked across runs with explicit versioning of (model, adapter, system prompt, suite version). A drop in alignment rate is interpretable only relative to prior runs on a comparable configuration.
Common Pitfalls:
- Building a scenario corpus that distinguishes the proxy from the true objective only on cases the agent has already seen during training. Distinguishing scenarios must be genuinely held out from training data.
- Treating an LLM-judge scorer as ground truth without inter-rater reliability checks. The scorer is a measurement instrument and needs its own calibration.
- Running the suite once at platform launch and never again. The point of the practice is to catch drift, which requires repeated runs across model and adapter changes.
- Treating the suite as a marketing artifact rather than a control. Publishing high alignment rates without committing run records and scenario versions to the audit trail provides no assurance and creates a perverse incentive to overfit the suite.
APTS-MR-A02: Sandbagging Detection and Behavioral Consistency Validation
This section provides implementation guidance for the advisory practice APTS-MR-A02. It is not required for conformance at any tier.
Implementation: Build the consistency measurement as an independent pipeline that ingests, on one side, per-scenario records from the adversarial eval suite of APTS-MR-020, and on the other side a stratified sample of production engagement transcripts captured under APTS-AR-001. Each measurement run is anchored to a single model snapshot identifier (per APTS-AR-019) and produces a structured differential report (per-dimension deltas, sample sizes, and a verbalized-recognition correlation) committed to the audit trail.
Architecture Pattern: Differential Measurement Pipeline
The pipeline has four components, mirroring the structure used in APTS-MR-A01:
- Eval-side ingestor. Reads per-scenario records emitted by the MR-020 eval suite for the model snapshot under measurement. The ingestor records, per scenario, the agent's terminal action, tool-call trace, and reasoning trace if available.
- Production-side sampler. Draws a stratified sample of production transcripts from the audit store (APTS-AR-001), applying redactions consistent with APTS-TP-013 sensitive-data handling. The sampler MUST NOT relax retention or consent boundaries to obtain samples; where sampling is constrained, the limitation is reported in the differential report rather than worked around.
- Differential scorer. Computes per-dimension metrics on each sample and the delta between them. Where rule-based scoring is insufficient (for example, refusal-rate classification on free-form tool invocations), an LLM judge is used; the judge MUST run as a separate inference call with no shared context with the agent under test (same constraint as APTS-MR-A01 and APTS-RP-A01) and SHOULD use a different model family.
- Trend reporter. Aggregates the differential across model snapshots tracked under APTS-AR-019 so that the change in the differential itself is observable, writes a signed run record to the audit trail per APTS-AR-012, and emits the validation report for inclusion alongside the customer engagement record.
Stratified Sampling:
Stratify the production-side sample across at least three dimensions: autonomy level (L1–L4), engagement size class, and customer-environment class (for example, web application, internal network, cloud control plane). Document a minimum sample size per dimension before per-dimension thresholds apply; for dimensions with sparse samples, report a confidence-interval band rather than a single point estimate, and compare the band's bound against the threshold (mirroring the calibrated-threshold treatment in APTS-MR-A01).
Verbalized Eval-Recognition Audit:
Where reasoning traces are available, scan eval-side transcripts for evaluation-context tokens or situational-awareness markers (Laine et al., 2024, Situational Awareness Dataset; Anthropic Sonnet 4.5 / Opus 4.6 system-card methodology) and compute the correlation between verbalized recognition and the per-scenario behavioral classification. A non-zero correlation is informative even when the verbalized-recognition rate is low; absence of verbalized recognition is not informative and MUST NOT be reported as evidence of absence of recognition.
Key Considerations:
- Production telemetry sampling MUST respect APTS-TP-013 sensitive-data handling and MUST NOT extend retention beyond what APTS-AR-001 already permits. Differential measurement is not a justification for new collection.
- The differential is interpretable only against a model-snapshot identifier (APTS-AR-019). Comparing eval samples and production samples drawn under different snapshots without correcting for the change confounds the measurement.
- Eval-suite contamination from the production side (for example, production samples that were previously absorbed into post-deployment learning data) silently narrows the measured delta. Treat eval scenarios as held out from any post-deployment learning data, consistent with APTS-MR-A01's contamination guidance.
- Where production sampling is impossible (privacy-restricted engagement, contractual restriction), substitute high-fidelity production-mimicking synthetic transcripts and disclose the substitution; do not omit the dimension.
Common Pitfalls:
- Computing the delta against a single eval run and a single engagement; the noise floor is high, and any threshold crossing under such a sample is more likely to be sampling noise than a real consistency violation. Aggregate across multiple runs and engagements before applying threshold checks.
- Treating zero verbalized eval recognition as evidence of no eval awareness. Anthropic Opus 4.6 results show models can detect evaluations without verbalizing the detection; verbalization is one signal, not the metric.
- Using the same model and prompt for the judge as for the agent under test. The independence requirement collapses, and the measurement becomes a mirror of the agent's own classification behavior.
- Publishing a differential report once and never refreshing. The Chaudhary et al. (2025) power-law scaling result implies the differential will tend to grow with model scale; the practice is only useful as a longitudinal artifact across snapshot changes.
Implementation Roadmap
Tier 1 (implement before any autonomous pentesting begins):
MR-001 (instruction boundary enforcement), MR-002 (response validation/sanitization), MR-003 (error message neutrality), MR-004 (config file integrity), MR-005 (authority claim detection), MR-006 (decision boundary enforcement), MR-007 and MR-008 (redirect and DNS-level prevention), MR-009 (SSRF prevention), MR-010 (scope expansion social engineering prevention), MR-011 (out-of-band communication prevention), MR-012 (immutable scope enforcement), MR-018 (AI model I/O boundary), MR-019 (discovered credential protection).
Start with MR-001 and MR-002 (instruction boundaries, response sanitization). These are your primary defense against prompt injection. Add MR-009 (SSRF) and MR-012 (immutable scope) next, then layer the remaining T1 controls.
Tier 2 (implement within first 3 engagements):
MR-013 (adversarial example detection), MR-014 (resource exhaustion/tarpit prevention), MR-015 (deceptive authentication honeypots), MR-016 (anti-automation defense detection), MR-017 (response pattern anomaly detection), MR-020 (adversarial validation and resilience testing of safety controls), MR-022 (inter-model trust boundaries), MR-023 (agent runtime as an untrusted component).
Prioritize MR-020 (adversarial testing of your own controls) and MR-023 (agent runtime isolation) together. MR-023 defines the architectural property that makes the other MR controls actually binding when the underlying model changes; MR-020 validates that the property holds in practice.
Tier 3 (implement based on threat landscape):
MR-021 (data isolation adversarial testing).