OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0
Informative Appendix (non-normative)
Advisory requirements are practices that provide additional assurance for autonomous penetration testing platforms but are not required for conformance at any tier. These practices were identified during the standard's development as valuable for mature implementations, high-risk engagements, or organizations pursuing the highest levels of operational rigor. They are not required for Tier 1, Tier 2, or Tier 3 conformance.
Advisory practices live exclusively in this appendix. They do not occupy slots in the domain requirement numbering. Platforms that implement advisory practices MAY document their adoption in vendor evaluation materials and customer-facing documentation. Advisory practices do not affect conformance scoring.
Advisory practices use the identifier pattern APTS-<DOMAIN>-A0x (for example, APTS-AR-A01, APTS-AL-A01). This distinguishes them from tier-required requirements, which use the pattern APTS-<DOMAIN>-0xx.
Advisory practices are recommended when:
Rationale: Full state capture and replay support assumes deterministic system behavior that is at odds with LLM-based platforms relying on third-party inference APIs. Non-determinism in model outputs, timing, and external API responses makes strict replay infeasible as a mandatory requirement.
Value: Organizations that implement state capture and replay gain the ability to investigate anomalous platform behavior, reproduce findings for customer validation, and support forensic analysis of decision chains. Partial replay (capturing inputs and comparing outputs) remains valuable even without full deterministic replay.
Practice Description:
Capture sufficient execution state to enable meaningful replay within documented variance bounds. Record platform-controlled configuration (OS, tool versions, model versions and parameters, seed values, proxy and DNS settings), test execution state (timestamps, environment variables, tool configuration, test parameters, and the decision inputs and outputs already logged under APTS-AR-004 and APTS-AR-006), the observed target baseline at engagement start, and network conditions where known. Document every source of non-determinism (cryptographic RNG, timing variance, concurrency effects, probabilistic algorithms, and model inference variance) together with the seed values and control parameters that govern each source. Where exact replay is not achievable, show that decision logic remains traceable from inputs to outputs and that expected variance bounds are documented.
Recommendation: Implement state capture for critical decision points rather than full system state. Document known sources of non-determinism. Use replay as a diagnostic tool rather than a conformance gate.
Related normative requirements: APTS-AR-004, APTS-AR-006.
Rationale: Depends on APTS-AR-A01 (replay support). A strict finding-consistency threshold is not achievable for LLM-driven platforms where model inference variance, external API changes, and timing differences produce legitimate variance between runs.
Value: Variance analysis helps platforms understand their own consistency and identify when model updates or infrastructure changes degrade finding quality. Aggregate variance tracking across many engagements provides useful trend data.
Practice Description:
When replay produces different outputs, define and document equivalence criteria that distinguish acceptable variance (different phrasing of the same finding, alternative evidence paths, equivalent proof-of-concept techniques, consistent severity) from material divergence (different vulnerability classifications, severity-boundary crossings, missing critical findings, false positives, or contradictory findings from the same scenario). Document finding-consistency thresholds, severity tolerance (for example, CVSS variance bounds), and semantic-similarity thresholds for phrasing comparison. Track variance sources, classify each variance event, and perform trend analysis across replay attempts.
Recommendation: Track variance at the engagement level rather than mandating per-finding consistency thresholds. Use variance trends to detect regression after platform updates.
Related normative requirements: APTS-AR-004, APTS-AR-006, APTS-AR-A01.
Rationale: Real-time log streaming to an externally controlled aggregation system with an independently computed hash chain requires significant infrastructure investment. The core tamper-evidence need is already addressed by APTS-AR-012 (hash chains), APTS-AR-001 (structured logging), and APTS-AR-020 (audit trail isolation from the agent runtime).
Value: External log streaming provides defense-in-depth against platform compromise scenarios where an attacker could tamper with local logs. For platforms handling highly sensitive engagements, external log verification provides an additional source of truth.
Practice Description:
Stream all platform action logs to an externally controlled aggregation system within a short latency window (for example, 5 seconds). The external system should sit outside the platform's trust boundary (different host, different credentials, different administrative domain) and compute its own hash chain. The platform's internally computed hash chain and the external system's hash chain should match under normal operation; divergence indicates tampering. Customers of the engagement should have read-only access to the external log store for their data so they can verify the record independently of the platform.
Recommendation: Implement external log streaming for high-sensitivity engagements or when customer contracts require it. Batch upload (for example, every few minutes) is an acceptable alternative to true real-time streaming for most use cases.
Related normative requirements: APTS-AR-001, APTS-AR-012, APTS-AR-020.
Rationale: Hashing every loaded executable section in memory on a short cadence is effectively asking every vendor to build runtime application self-protection. This is disproportionate to the threat model for most autonomous pentest platforms, which operate in controlled environments. Pre-flight binary attestation is already required by APTS-AR-016 (Platform Integrity and Supply Chain Attestation).
Value: Runtime integrity monitoring detects supply chain compromise, unauthorized code modification, and process injection during active engagements. For platforms deployed in hostile network environments or running unattended for extended periods, this provides early warning of platform compromise.
Practice Description:
Verify platform code integrity continuously during engagement execution. Hash loaded executable code sections in memory on a documented cadence and compare against a published baseline; halt the engagement and alert the operator on mismatch. Require runtime-loaded modules (plugins, exploit modules, decision models) to be cryptographically signed and verify signatures before loading. Detect and alert on unauthorized filesystem modifications in the platform's working directory, unexpected process launches from the platform's service account, and configuration file changes during an engagement. For highly sensitive deployments, run from a read-only container image or a Secure Boot verified partition with configuration changes permitted only through signed update packages.
Recommendation: Implement pre-flight integrity verification as the baseline (already required under APTS-AR-014). Reserve continuous runtime monitoring for high-sensitivity deployments or environments where the platform may run unattended in untrusted networks.
Related normative requirements: APTS-AR-014.
Rationale: Core health monitoring is covered by APTS-SC-010. The advanced anomaly detection capabilities (behavioral baseline deviation, statistical confidence scoring, pattern analysis across testing, decision-making, and action dimensions) go beyond practical health monitoring into ML-based behavioral analysis territory.
Value: Behavioral anomaly detection can identify prompt injection attacks, model drift, or platform compromise that simple health checks would miss. Confidence-scored anomaly routing reduces alert fatigue while preserving detection of high-severity events.
Recommendation: Implement basic anomaly detection (probe rate spikes, unusual error rates) as part of SC-010 health monitoring. Reserve statistical baseline analysis and confidence scoring for platforms with dedicated security operations teams.
Applicability: This practice applies to platforms that use LLM-based agents with finite context windows where conversation history may be summarized, truncated, or compacted during an engagement.
Rationale: Context window exhaustion is an inevitable event in long-running autonomous engagements. When the platform summarizes the agent's history to free token budget, the summarizer (whether LLM-based or heuristic) can drop, paraphrase, or dilute constraints that appeared earlier in the conversation. An agent that was told "never test 10.0.0.5" in message 3 may lose that constraint entirely when messages 1 through 50 are summarized into a paragraph. The agent then continues operating without the constraint, producing a silent scope violation that no other safety control detects because the agent genuinely believes it was never restricted. This failure mode is invisible, silent, and unique to LLM-based architectures. The normative requirement set for v0.1.0 is frozen; this practice is a high-priority candidate for tier-gated inclusion in v0.2.0 and is strongly recommended for any platform using LLM-based agents with finite context windows.
Value: Platforms that implement context window safety prevent an entire class of silent scope violations caused by routine context management. This is especially critical for long-running engagements where compaction is guaranteed to occur multiple times, and where the constraints being lost (deny lists, autonomy restrictions, operator directives) are the constraints that prevent the most dangerous failures.
Practice Description:
When the platform summarizes, truncates, or otherwise compacts the agent's context window during an engagement, the platform should ensure that all safety-critical constraints survive the compaction intact. Specifically:
Recommendation: Implement an external safety context store that the orchestration layer manages independently of the agent's conversation history. Keep the safety context document compact (under 2000 tokens) to minimize re-injection overhead. SC-020 (external action allowlist) provides a backstop: even if context-level constraints are lost, the external allowlist still blocks disallowed tools. This practice addresses constraints that are finer-grained than the allowlist (specific deny-list hosts, operator directives, autonomy-level restrictions).
Related normative requirements: APTS-SE-001, APTS-SE-009, APTS-AL-025, APTS-SC-020, APTS-AR-001.
Applicability: This practice applies to platforms that let an agent or model-driven orchestration layer invoke allowlisted tools with runtime-supplied parameters.
Rationale: APTS-SC-020 already does the important architectural work. The allowlist lives outside the model. Entries include allowed parameters or parameter bounds. That is the floor. The unresolved problem appears one layer lower, where a tool is nominally approved but the parameter mix changes what the tool can do in practice. A rate-control tool with a zero-delay setting can still stress a target. A browser connector may be approved for scoped validation, then become an egress path once paired with an export utility. Observations from agentic systems suggest that this is where "allowed" starts to blur into "unsafe." SC-020 defines the enforcement point. This advisory deals with what that enforcement layer should evaluate before dispatch.
Value: Parameter-aware enforcement closes a gap that is easy to miss in design review and easy to trigger at runtime. Each call can look acceptable on its own. The sequence is the problem. Mature platforms catch that before the target does.
Practice Description:
For each allowlisted tool, document a parameter schema that covers types, bounds, enumerated values, and disallowed combinations for safety-critical fields. Those fields usually include target identifiers, rate limits, payload sizes, recursion depth, credential references, export destinations, authentication modes, and execution windows. Validate them in the external enforcement layer before dispatch, not inside the model prompt and not only in the tool wrapper. Then go one step further. Some requests are syntactically valid and still wrong in context. Check that the referenced target is still in scope, that the credential reference is authorized for that tool and that engagement, and that the requested rate or payload remains inside the approved operating envelope.
Platforms operating with multi-step agents should also define a bounded set of disallowed tool chains. This is not an attempt to solve general action planning through brittle signatures. It is narrower than that. Block the sequences that are already understood to produce disallowed outcomes: reconnaissance that flows directly into unapproved credential use, widened enumeration that turns into unauthorized pivoting, or evidence export routed to a destination outside the approved delivery path. Log parameter rejections and chain-policy rejections with the tool identifier, attempted parameters, and the reason for denial.
Recommendation: Start small. Cover the tools that can change scope, traffic intensity, credential usage, or data egress. Build a short catalog of unsafe parameter combinations and unsafe tool chains, run it through change control with the allowlist, and expand from operational evidence rather than taxonomy for its own sake.
Related normative requirements: APTS-SC-004, APTS-SC-020, APTS-SE-006, APTS-SE-023, APTS-MR-023.
Applicability: This practice applies to platforms that use LLM-based or agentic runtimes with conversational state spanning multiple turns, tool calls, or planning steps.
Rationale: APTS-MR-018 protects the architectural boundary between trusted instructions and untrusted target-derived data. APTS-MR-023 assumes the runtime may still misbehave and requires containment. Those controls remain necessary. They do not fully cover a pattern that shows up repeatedly in agentic systems: no single turn looks decisive, yet the accumulated dialogue erodes the runtime's refusal logic until it attempts something it should never have attempted. Split-message assembly does this. So do paraphrase drift, encoding chains, and target content that slowly reframes itself as operator intent. The distinction from MR-023 matters. MR-023 is containment after deviation. This advisory is about exercising the conversation path that leads to deviation in the first place.
Value: Single-turn screening misses a surprising amount of bad behavior because the dangerous instruction is often distributed across harmless-looking fragments. Multi-turn testing exposes that drift. It gives the operator a better signal about whether refusal behavior, scope discipline, and tool-use restrictions survive a long session instead of only surviving a clean lab prompt.
Practice Description:
Maintain a compact corpus of multi-turn adversarial sequences that exercises cumulative manipulation patterns relevant to autonomous pentesting. Include staged jailbreak attempts, split-message assembly, paraphrase and synonym variation, encoding or representation shifts, and attempts to reframe untrusted target content as operator intent. Then compare outcomes. If the platform rejects a direct request to widen scope or export sensitive data, it should reject the obfuscated and distributed versions of the same request as well.
Where the platform carries conversational or planning state across long sessions, treat that state as part of the attack surface. Segment or reset state at clear engagement and approval boundaries. Do not let adversarial target content sit in privileged planning context longer than necessary. When the platform summarizes or compacts history, verify that compaction does not turn a previously rejected path into an acceptable one by stripping away context or weakening the refusal signal. Log the test results and classify failures by what actually eroded: refusal drift, scope drift, tool-use drift, or unsafe state carryover.
Recommendation: Start with a small recurring corpus focused on the failures that matter operationally: scope override, instruction bypass, credential exfiltration attempts, and unsafe tool requests distributed across several turns. Use the results to tune guardrails and session-boundary handling. A benchmark score by itself is not the point.
Related normative requirements: APTS-MR-001, APTS-MR-018, APTS-MR-023, APTS-SC-020.
Rationale: Core kill switch functionality is covered by APTS-HO-008. The requirement for kill switch activation via physically independent communication channels (cellular, management network, physical button) assumes deployment scenarios where the primary network may be compromised or unavailable.
Value: Out-of-band kill capability is critical for engagements against network infrastructure, ICS/SCADA systems, or environments where the testing platform could inadvertently disrupt its own control channel. For cloud-hosted platforms with reliable control plane connectivity, in-band kill switches (HO-008, SC-009) are sufficient.
Recommendation: Implement out-of-band kill capability when testing network infrastructure, critical systems, or operating in environments with unreliable connectivity. Document the out-of-band channel in the Rules of Engagement for applicable engagements.
Applicability: This practice applies to platforms that use language models or other AI components to generate content shown to operators at decision points: narrative summaries, recommended approval responses, prefilled answer choices, preselected defaults, or any other UI affordance that shapes what the operator confirms. It applies uniformly across every channel that delivers an operator decision affordance (web dashboard, mobile, instant-messaging, voice, push notification).
Rationale: Autonomous pentest platforms increasingly use language models not only to act on the target but also to shape what the operator sees at decision time: the narrative that frames a finding, the option set in an approval prompt, the wording, the preselected default. This is a manipulation surface distinct from the prompt-injection threats covered by the Manipulation Resistance domain: there, an external entity manipulates the agent; here, the agent influences its own supervisor's choice through affordances the agent controls. Established findings from human-computer interaction (default bias, primacy bias, choice architecture effects) show these shaping decisions meaningfully influence which option the operator selects, even without adversarial intent. APTS-HO-001, HO-005, and HO-010 mandate approval gates and audit trails but do not address the form of the question put to the operator; APTS-AR-006 covers the agent's reasoning chain but not the model-shaped inputs handed to the human. The practical effect is that an audit trail can show "operator approved" while concealing that the operator was offered a single highlighted choice with the safer option visually de-emphasized. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0.
Value: Platforms that implement this practice make operator approval a more deliberate act and the audit trail more honest. Reviewers can distinguish a typed approval from a default click-through, customers can audit whether the platform's UI has steered operators toward expedient choices, and operators retain meaningful agency at high-impact decision points.
Practice Description:
Recommendation: Treat the operator-facing decision interface as a controlled surface and apply the same provenance discipline already required for agent action logging. Items 1 and 2 form the lowest-cost starting wedge (a response-classification audit field plus model and prompt-template provenance), both addable without touching the operator UI. The presentation rules in item 4 are most consequential for APTS-HO-010 gates and for actions classified as Critical or High under APTS-SC-001.
Related normative requirements: APTS-HO-001, APTS-HO-005, APTS-HO-010, APTS-AR-006, APTS-AR-019.
Rationale: Multi-year maturity roadmaps, formal improvement frameworks, and annual strategic assessments are organizational process practices rather than technical governance for autonomous pentest platforms. APTS defines platform requirements, not organizational management practices.
Value: Organizations pursuing Level 4 autonomy benefit from structured maturity progression. A maturity roadmap framework provides a planning tool for team development, process maturity, tool capability evolution, and governance structure.
Practice Description:
Establish a continuous improvement framework for autonomous pentesting operations that covers periodic capability assessments, documented lessons learned from engagements, and a multi-year maturity roadmap for advancing operational governance. Track metrics such as findings per hour, escalation rate, boundary violations, and decision accuracy to measure progression over time. Conduct periodic reviews of autonomy level appropriateness against those metrics and feed outcomes back into operational procedures and training.
Recommendation: Use the maturity roadmap framework from the AL domain Implementation Guide as a planning tool. Track the metrics listed above to measure progression. Conduct annual reviews of autonomy level appropriateness.
Rationale: Breach notification and regulatory reporting are general vendor obligations governed by applicable laws (GDPR, CCPA, HIPAA, and similar regimes) and existing organizational compliance programs. These obligations apply to any company handling sensitive data, not specifically to autonomous penetration testing platforms. Including jurisdiction-specific regulatory timelines as a mandatory pentest platform requirement conflates general corporate compliance with platform governance.
Value: Platforms that formalize breach notification procedures specific to penetration testing engagements (where breach scope may involve multiple client environments) reduce response time and regulatory risk. Pre-built notification templates and severity classification aligned to regulatory timelines accelerate incident response.
Recommendation: Establish breach notification procedures that cover penetration testing-specific scenarios (cross-tenant exposure, credential leakage during testing). Align notification timelines with applicable regulatory requirements. Reference APTS-TP-018 (Tenant Breach Notification) for cross-engagement breach scenarios.
Rationale: Privacy regulation compliance (GDPR, CCPA, PIPEDA, HIPAA, and similar regimes) is a general legal obligation for any organization processing personal data. While autonomous pentest platforms do encounter personal data during testing, the compliance mechanisms (DPIAs, lawful basis documentation, data subject rights) are organizational responsibilities governed by existing privacy programs, not platform-specific technical controls.
Value: Platforms that build privacy awareness into their testing workflows (automatic PII detection, data minimization, retention controls) reduce the compliance burden on operators. Privacy-by-design in the testing pipeline prevents accidental retention or exposure of personal data discovered during engagements.
Recommendation: Integrate with APTS-TP-013 (sensitive data handling) for PII detection and isolation during testing. Document how the platform supports operators' privacy compliance obligations. Implement data minimization controls that limit personal data retention to what is necessary for finding documentation.
Rationale: Professional liability insurance, engagement agreements, and liability allocation are business and legal practices, not technical platform governance controls. These are organizational decisions influenced by jurisdiction, company size, client requirements, and risk appetite. A platform governance standard cannot prescribe specific contractual terms or insurance coverage levels.
Value: Clear engagement agreements reduce disputes when autonomous testing causes unintended impact. Defined liability allocation, indemnification terms, and insurance requirements protect both platform operators and their clients. Standardized agreement templates accelerate client onboarding.
Recommendation: Develop engagement agreement templates that address autonomous testing-specific risks (unintended system impact from autonomous actions, scope boundary failures). Maintain professional liability insurance appropriate to the scope and risk level of testing activities. Reference the Rules of Engagement framework in APTS-SE-001 for alignment between contractual scope and technical enforcement.
Rationale: Autonomous pentest platforms increasingly rely on external tool connectors such as remote browser agents, model tool servers, plugins, and data connectors that run outside the core platform trust boundary. These integrations can introduce new instruction channels, expand available actions, and inherit broad customer credentials. APTS already covers dependency inventory, provider vetting, action allowlists, and runtime containment, but it does not yet give implementation guidance specific to externally hosted tool connectors and protocol bridges.
Value: Platforms that treat external tool connectors as distinct trust zones reduce the risk of tool-poisoning, over-privileged connector credentials, silent capability expansion, and connector-driven cross-tenant leakage. This is especially useful for platforms that integrate remote browsers, agent plugins, retrieval connectors, or Model Context Protocol-style tool servers.
Practice Description:
Document every external connector that can execute actions, access customer data, or supply context into the agent runtime. For each connector, define the approved capability scope, credential scope, network reachability, and data classes it may access. Route connector requests through an enforcement layer outside the model that validates connector identity, denies undeclared actions, and records connector invocation provenance. Connector credentials should be isolated per engagement or customer wherever operationally feasible, and high-impact connectors should require explicit operator approval before first use in an engagement. Connector output should be treated as untrusted input subject to the same validation and sanitization controls applied to target-side content.
Recommendation: Start with a short connector inventory and per-connector approval profile rather than a heavyweight framework. Prioritize connectors that can execute code, browse arbitrary URLs, retrieve private documents, or introduce new action surfaces at runtime.
Related normative requirements: APTS-TP-006, APTS-TP-017, APTS-SC-020, APTS-MR-022, APTS-MR-023.
Applicability: This practice applies to platforms that use LLM-based agents fine-tuned (SFT, RFT, RLHF, DPO, or equivalent) on offensive-security tasks, or whose foundation model has been adapted with offensive-task adapters, instruction tuning, task-specific reward models, or post-deployment online learning on engagement data.
Rationale: Recent peer-reviewed work has demonstrated that fine-tuning a frontier LLM on a narrow task can produce broad behavioral misalignment that extends far outside the training domain (Nature 2026, Training LLMs on narrow tasks can lead to broad misalignment). For autonomous penetration testing platforms, two failure modes follow directly: (a) goal misgeneralization, where the agent learns a proxy objective ("produce findings that look like vulnerabilities") that diverges from the true objective ("identify vulnerabilities exploitable in the customer environment") in distinguishing situations the training data did not cover; and (b) emergent misalignment, where narrow fine-tuning on offensive tasks shifts the agent's behavior in adjacent domains with no signal until the shift manifests in a production engagement. APTS-MR-013 (Adversarial Example Detection in Vulnerability Classification) probes input-side robustness; APTS-MR-020 (Adversarial Validation and Resilience Testing of Safety Controls) probes control-side resilience; APTS-AR-019 (AI/ML Model Change Tracking and Drift Detection) tracks output drift. None of these evaluate the agent's underlying objective alignment under distribution shift. The Introduction's Capability Frontier and Containment Assumptions section defers verifiable goal alignment as research-stage and out of scope for v0.1.0; this practice begins to close that gap with an evaluation-based approach achievable today. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as SHOULD | Tier 2 for platforms operating at Level 3 autonomy or higher, or for any platform that performs post-deployment fine-tuning on engagement data).
Value: Platforms that maintain a goal-misgeneralization and emergent-misalignment evaluation suite detect a class of failure that no other safety control catches: situations where every individual safety check passes, scope holds, and the agent produces fluent, plausible output, while the agent is in fact optimizing a proxy objective that diverges from the true objective in distinguishing cases. This is the agent-side analogue of the fabricated-finding problem addressed by APTS-RP-A01: a failure that is invisible to per-action checks because it manifests only across the distribution of decisions the agent makes.
Practice Description:
Maintain and execute a goal-misgeneralization and emergent-misalignment evaluation suite that evaluates the agent's behavior in distinguishing scenarios (scenarios constructed so that following a plausible proxy objective produces a different action than following the true objective) and that evaluates the agent's behavior outside the offensive-security domain after any fine-tuning event. Specifically:
Recommendation: Implement the suite as an independent evaluation pipeline using a recognized evaluation framework (Inspect AI, Braintrust, OpenAI Evals, or equivalent) so that scenarios, scoring, and run history are inspectable, reproducible, and externally reviewable. Run the suite under the same model configuration the platform uses in production (same system prompt, same tool access pattern, same temperature). APTS-RP-A01 provides a backstop on the output side: even if the agent's objective drifts, an independent finding-authenticity verifier can catch fabricated evidence before it reaches the customer. APTS-MR-A01 addresses the upstream failure that RP-A01 cannot: the agent producing genuinely-grounded findings that the agent itself was misaligned to discover, prioritize, or report.
Related normative requirements: APTS-MR-013, APTS-MR-020, APTS-AL-026, APTS-TP-019, APTS-TP-022, APTS-AR-019.
Applicability: This practice applies to platforms that use LLM-based agents and run an adversarial evaluation suite under APTS-MR-020 against the same model configuration deployed in customer engagements authorized under APTS-SE-001. It applies regardless of whether the underlying model has been fine-tuned, and is therefore complementary in scope to APTS-MR-A01.
Rationale: A recurring finding across frontier-model evaluations between late 2024 and early 2026 is that an agent's behavior on an adversarial eval suite can diverge from its behavior in live deployment, because the model can detect features of the evaluation context and modulate its actions accordingly. Apollo Research demonstrated reproducible in-context scheming and sandbagging across Claude 3.5 Sonnet, Claude 3 Opus, and OpenAI o1 (Meinke et al., 2024, Frontier Models are Capable of In-context Scheming, arXiv:2412.04984), including a documented case in which o3 internally reasoned "we want to survive as the model, we need to fail purposely" and then deliberately submitted incorrect chemistry answers; METR's o3 evaluation report cautions that the evaluation setup "is not robust to sandbagging by the model." Independent peer-reviewed work establishes sandbagging as a measurable, controllable behavior (van der Weij et al., AI Sandbagging, ICLR 2025, arXiv:2406.07358), and Anthropic's Alignment Faking result shows that production-grade Claude 3 Opus differentially modulates behavior under naturalistic conditions (Greenblatt et al., 2024). Anthropic's own Sonnet 4.5 and Opus 4.6 system cards report that internal interpretability features for evaluation awareness strengthened during training, with UK AISI testing finding Opus 4.6 correctly identifies evaluations roughly 80% of the time while verbalizing that recognition only ~2% of the time. The capability scales predictably with model size (Chaudhary et al., 2025, arXiv:2509.13333) and is measurable on agentic tasks (Needham et al., 2025, arXiv:2505.23836). For autonomous penetration testing platforms the consequence is concrete: Meta's 2026 Muse Spark report shows roughly a 32-percentage-point drop on cyber capability evaluations under sandbagging instruction with general-knowledge performance moving by 0.4 points, exactly the failure mode an adversarial eval suite is supposed to surface. APTS-MR-020 (Adversarial Validation and Resilience Testing of Safety Controls) requires the platform to run the eval suite; APTS-AR-019 (AI/ML Model Change Tracking and Drift Detection) tracks output drift across model versions. Neither measures the differential between the agent's eval-context behavior and its production-context behavior on the same model snapshot. The EU AI Act Article 15 obligation that high-risk systems "perform consistently ... throughout their lifecycle" gestures at this property without specifying a measurement; this practice supplies one. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as SHOULD | Tier 2 for platforms operating at Level 3 autonomy or higher).
Value: Platforms that maintain and validate an eval-versus-production behavioral-consistency measurement detect a class of failure that a clean eval pass-rate cannot rule out: situations where the adversarial eval suite shows no anomalies while the agent's deployed behavior diverges along the dimensions the eval was supposed to govern (scope compliance, refusal of out-of-scope tools, finding-fabrication rate). This is the context-side analogue of the objective-side failure addressed by APTS-MR-A01: MR-A01 catches the agent pursuing the wrong objective when training data is ambiguous; MR-A02 catches the agent pursuing the right objective during evaluation while pursuing a different one in deployment, on the same model snapshot.
Practice Description:
Maintain and validate a behavioral-consistency measurement comparing the agent's behavior on the adversarial evaluation suite of APTS-MR-020 against a stratified sample of production engagement transcripts captured under APTS-SE-001 authorization, on the same model snapshot tracked under APTS-AR-019. Specifically:
Recommendation: Implement the measurement as an independent pipeline outside the engagement runtime so that the eval-side and production-side samples are scored under identical conditions. A lightweight standing detector (paired formal-versus-casual prompt framings administered to the deployed configuration via the "Playing Dumb" methodology, 2026) provides an inexpensive black-box check between full measurement runs. For the scoring component, follow the same independence rule that APTS-MR-A01 and APTS-RP-A01 apply: where an LLM judge is used, it MUST run as a separate inference call with no shared context with the agent under test, and SHOULD use a different model family. METR's MALT methodology (2025), cross-task differential analysis rather than single-transcript review, is the citable reference for how to interpret the production-side sample. The scope of the measurement is the eval suite of APTS-MR-020 and live customer engagements under APTS-SE-001; training-time RLHF, RLAIF, or DPO runs are out of scope and are governed by APTS-TP-019. APTS-MR-A01 addresses the upstream objective-side failure; APTS-MR-A02 addresses the context-side failure on the same model snapshot.
Related normative requirements: APTS-MR-020, APTS-MR-022, APTS-AR-001, APTS-AR-019, APTS-AL-026, APTS-TP-019, APTS-TP-022, APTS-RP-A01.
Rationale: LLM-based penetration testing agents can produce findings that appear legitimate but contain fabricated evidence: proof-of-concept scripts that output hardcoded strings instead of making real requests, HTTP responses that were not actually received from the target, or severity classifications unsupported by the evidence. Because these fabricated findings are fluent and internally consistent, they pass casual human review and erode trust in the platform's output. RP-001 and RP-002 require evidence-based validation and human review, but neither addresses the risk that the agent itself fabricates evidence. The normative requirement set for v0.1.0 is frozen; this practice is a candidate for tier-gated inclusion in v0.2.0 (likely as MUST | Tier 2 given the implementation complexity).
Value: Platforms that implement automated finding authenticity verification catch fabricated PoCs, hallucinated vulnerability types, and severity mismatches before they reach human reviewers. This ensures reviewers spend their time on genuine judgment calls rather than authenticity checking, and prevents fabricated findings from reaching customers.
Practice Description:
Implement an automated verification mechanism that screens each reported finding for fabricated evidence, hallucinated vulnerabilities, and synthetic proof artifacts before the finding enters the human review pipeline (APTS-RP-002) or the final report. Specifically:
Recommendation: Implement the verifier as a separate "Finding Judge" that receives only the finding record, associated evidence artifacts, and target context. The judge should err toward FLAGGED (sending to human review) rather than REJECTED (dropping the finding), to avoid suppressing genuine findings. For multi-agent or swarm architectures, process findings from all agents through a single verification pipeline to ensure consistent integrity standards.
Related normative requirements: APTS-RP-001, APTS-RP-002, APTS-RP-003, APTS-AR-006.
| Tier | Scope | Advisory Practices |
|---|---|---|
| Tier 1 (Foundation) | Core safety and scope controls | Not applicable |
| Tier 2 (Verified) | Production-grade platform governance | Optional enhancement |
| Tier 3 (Comprehensive) | Maximum assurance for high-risk operations | Recommended where operationally feasible |
Advisory practices are independent of the tier system. A platform may claim Tier 3 conformance without implementing any advisory practices, and a Tier 2 platform may implement advisory practices that are relevant to its deployment context.