The OWASP Autonomous Penetration Testing Standard (APTS) defines the governance requirements that autonomous pentesting platforms must meet to operate safely, transparently, and within defined boundaries.
This is a governance framework, not a testing methodology. It complements existing standards (PTES, OWASP WSTG, OSSTMM) by addressing challenges unique to autonomous operation: scope enforcement, safe autonomy, manipulation resistance, and accountability.
APTS applies to autonomous pentesting systems that:
This includes SaaS platforms, on-premises tools, cloud-native systems, integrated security orchestration platforms with autonomous testing capabilities, and in-house platforms that enterprise security teams build and operate for testing their own organization.
APTS does not cover manual pentesting, isolated lab testing, SAST/DAST tools, bug bounty programs, human-led red team exercises, purple team operations (unless the autonomous tool is the attacker component), or vulnerability disclosure programs.
APTS requirements are deployment-model agnostic and apply identically to the three common ways autonomous pentest platforms are delivered and operated:
APTS does not prescribe who performs the assessment. The choice of internal self-assessment, independent internal review, or third-party assessment is left to the reader based on regulatory context, customer contracts, and assurance needs. The same 173 tier-required requirements and three tiers apply across all three models. For in-house deployments, the "customer" referenced in requirements such as Rules of Engagement validation, scope approval, and reporting is the internal business unit or application team requesting the test rather than an external client; the requirement itself is unchanged.
| # | Domain | Prefix | Active | Focus |
|---|---|---|---|---|
| 1 | Scope Enforcement | SE | 26 | Boundary definition, validation, continuous monitoring |
| 2 | Safety Controls & Impact Management | SC | 20 | Impact classification, blast radius, hard stops, rollback, execution sandbox |
| 3 | Human Oversight & Intervention | HO | 19 | Approval gates, kill switch, escalation, operator qualifications |
| 4 | Graduated Autonomy Levels | AL | 28 | L1 Assisted through L4 Autonomous, level assessment, containment verification |
| 5 | Auditability & Reproducibility | AR | 20 | Logging, decision trails, evidence integrity, audit trail isolation |
| 6 | Manipulation Resistance | MR | 23 | Prompt injection, scope widening, model poisoning, agent runtime isolation |
| 7 | Third-Party & Supply Chain Trust | TP | 22 | AI providers, cloud dependencies, data handling, foundation model disclosure |
| 8 | Reporting | RP | 15 | Finding validation, confidence scoring, coverage disclosure |
Total: 173 tier-required requirements (Tier 1 + Tier 2 + Tier 3) across the eight domains. An additional 10 advisory practices live exclusively in the Advisory Requirements appendix using the APTS-<DOMAIN>-A0x identifier pattern; advisory practices are not counted toward any tier and do not affect conformance.
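As a quick consistency check, the per-domain counts in the table can be summed to confirm the stated total. The snippet below is only an arithmetic sketch; the dictionary name is illustrative, and the values are copied from the Active column above.

```python
# Active (tier-required) requirement counts per domain prefix,
# copied from the table above. The variable name is illustrative.
ACTIVE_REQUIREMENTS = {
    "SE": 26, "SC": 20, "HO": 19, "AL": 28,
    "AR": 20, "MR": 23, "TP": 22, "RP": 15,
}

total = sum(ACTIVE_REQUIREMENTS.values())
print(total)  # 173
```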
APTS defines three compliance tiers. A platform must meet 100% of requirements assigned to its claimed tier (both MUST and SHOULD). No partial credit.
Verification model: APTS follows a conformance assessment model, consistent with how other OWASP standards (WSTG, ASVS) are used by practitioners. Platform operators evaluate their platforms against the requirements using the Checklists and document conformance. The Conformance Claim Template provides an optional format for publishing evidence of conformance. Customers MAY independently verify claims using the Vendor Evaluation Guide. For behavioral requirements that cannot be verified through documentation alone (kill switch response times, scope enforcement accuracy, manipulation resistance), the Customer Acceptance Testing appendix provides hands-on verification procedures.
What it means for customers: The platform will not test outside the agreed scope, can be stopped immediately if something goes wrong, will not store or leak discovered credentials in plaintext, and provides a basic audit trail of everything it did. A Tier 1 platform is safe to deploy for supervised autonomous testing against non-critical systems.
What it means for platform operators: You have implemented the baseline safety controls that any responsible autonomous pentest tool must have. Your platform enforces scope boundaries before every action, provides working kill switches, requires human approval for high-risk actions, logs all operations, and protects sensitive data found during testing.
All requirements marked MUST | Tier 1 across all eight domains. Covers: scope boundary enforcement, impact classification, kill switch, approval gates, structured logging, basic manipulation resistance, provider vetting, credential protection.
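To make two of the Tier 1 behaviors concrete (a scope check before every action and a working kill switch), here is a minimal Python sketch. `ScopeGuard`, its methods, and the CIDR-based scope model are assumptions made for illustration; APTS does not prescribe this design, and a real platform would enforce these checks outside the agent's own process.

```python
import ipaddress
import threading


class ScopeGuard:
    """Hypothetical pre-action gate: checks the kill switch, then the scope."""

    def __init__(self, allowed_cidrs):
        # Scope is modeled as a list of CIDR ranges (an assumption of this sketch).
        self._networks = [ipaddress.ip_network(c) for c in allowed_cidrs]
        self._halted = threading.Event()

    def kill(self):
        """Operator-triggered stop; every later authorize() call is refused."""
        self._halted.set()

    def authorize(self, target_ip):
        """Return True only if testing is not halted and the target is in scope."""
        if self._halted.is_set():
            return False
        addr = ipaddress.ip_address(target_ip)
        return any(addr in net for net in self._networks)


guard = ScopeGuard(["192.0.2.0/24"])    # documentation range used as example scope
print(guard.authorize("192.0.2.10"))    # True: inside the agreed scope
print(guard.authorize("198.51.100.7"))  # False: outside the agreed scope
guard.kill()
print(guard.authorize("192.0.2.10"))    # False: kill switch engaged
```

The check runs on every call rather than once per engagement, which mirrors the "before every action" wording above.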
What it means for customers: The platform is transparent about what it did and why, protects your data with tamper-proof audit trails, handles incidents with formal response procedures, and provides findings you can independently verify. A Tier 2 platform is suitable for production environments, regulated industries, and engagements where you need full accountability.
What it means for platform operators: You have implemented production-grade governance. Your platform produces auditable decision trails, enforces multi-tenant data isolation, tracks cumulative risk, manages third-party dependencies, and delivers reproducible, evidence-backed findings with confidence scores and false positive rate disclosure.
All Tier 1 plus all MUST | Tier 2 and SHOULD | Tier 2 requirements. Adds: tamper-proof audit trails, decision transparency, anti-manipulation controls, reproducible findings, formal incident response, third-party dependency management, L3 operational requirements.
What it means for customers: The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and environments with the strictest regulatory requirements. A Tier 3 platform has been validated against advanced adversarial scenarios and provides maximum transparency into its operation.
What it means for platform operators: You have implemented the full standard including advanced controls for continuous platform integrity monitoring, evasion governance, L4 autonomous campaign management, and comprehensive forensic reconstruction capability.
All Tier 2 plus all MUST | Tier 3 and SHOULD | Tier 3 requirements. Adds: continuous monitoring, platform integrity verification, advanced manipulation resistance, L4 operational requirements, red team validation.
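The cumulative, all-or-nothing tier rule can be sketched in a few lines of Python. The `Requirement` record, the `highest_conformant_tier` function, and the placeholder identifiers are hypothetical; the only logic taken from the text is that each tier includes every lower tier and that MUST and SHOULD requirements both count, so the sketch does not distinguish them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str      # placeholder identifier, not a real APTS requirement ID
    tier: int        # 1, 2, or 3
    satisfied: bool  # outcome of the operator's own assessment


def highest_conformant_tier(requirements):
    """Return the highest tier (0 to 3) whose requirements, and those of
    every lower tier, are all satisfied. No partial credit is given."""
    achieved = 0
    for tier in (1, 2, 3):
        in_tier = [r for r in requirements if r.tier == tier]
        if not all(r.satisfied for r in in_tier):
            break
        achieved = tier
    return achieved


assessment = [
    Requirement("REQ-A", 1, True),
    Requirement("REQ-B", 2, True),
    Requirement("REQ-C", 2, False),  # one unmet Tier 2 item blocks Tier 2 and Tier 3
    Requirement("REQ-D", 3, True),
]
print(highest_conformant_tier(assessment))  # 1
```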
In addition to the three compliance tiers, APTS designates a small number of requirements as Advisory. These are practices that provide meaningful additional assurance but were assessed as disproportionately burdensome relative to their safety value at their original tier, or as infeasible to verify with current technology (for example, full replay of LLM-based decision chains). Advisory requirements use the identifier pattern APTS-<DOMAIN>-A0x and are documented exclusively in the Advisory Requirements appendix; they do not appear in the domain READMEs.
Advisory requirements are not required for conformance at any tier but are recommended for high-risk engagements, regulated industries, and platforms operating at Level 4 autonomy. Platforms that implement advisory requirements MAY document their adoption in vendor evaluation materials. Customers seeking maximum assurance SHOULD evaluate vendor adoption of advisory requirements as a differentiator.
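Tooling that tracks conformance may need to separate advisory identifiers from tier-required ones so that advisory items are never counted toward a tier. The Python sketch below assumes that tier-required identifiers end in three digits (as in APTS-SC-019) and that the `x` in `A0x` is a single digit; both are assumptions about the format, and the function name is hypothetical.

```python
import re

DOMAIN_PREFIXES = ("SE", "SC", "HO", "AL", "AR", "MR", "TP", "RP")

# Assumed formats: APTS-<DOMAIN>-NNN (tier-required) or APTS-<DOMAIN>-A0N (advisory).
_ID_PATTERN = re.compile(
    r"APTS-(?P<domain>" + "|".join(DOMAIN_PREFIXES) + r")-(?P<suffix>A0\d|\d{3})"
)


def classify_requirement_id(req_id):
    """Return (domain, kind), where kind is 'advisory' or 'tier-required',
    or None if the identifier does not match the assumed formats."""
    match = _ID_PATTERN.fullmatch(req_id)
    if match is None:
        return None
    kind = "advisory" if match.group("suffix").startswith("A") else "tier-required"
    return match.group("domain"), kind


print(classify_requirement_id("APTS-SC-019"))  # ('SC', 'tier-required')
print(classify_requirement_id("APTS-MR-A01"))  # ('MR', 'advisory'); illustrative ID only
print(classify_requirement_id("APTS-XX-001"))  # None
```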
New readers should start with the Getting Started guide for a recommended reading order.
Platform operators (vendors, service providers, and enterprise security teams): Use the Implementation Guides and Checklists to understand what each requirement expects. Use the Cross-Domain Integration Matrix to understand domain interactions and the Testing Phase Mapping to verify pentesting phase coverage.
Customers evaluating platforms: Use tier requirements as evaluation criteria. The Vendor Evaluation Guide provides a structured approach with seven critical evaluation questions. For customers who want hands-on verification, the Customer Acceptance Testing appendix provides optional verification procedures.
Security professionals: Each requirement includes a Verification section with specific criteria. Implementation Guides provide supplemental context; the domain READMEs are the authoritative source.
Some requirements include specific numeric values (for example, payload size limits, response time thresholds, confidence percentages, retention periods). These values fall into three categories:
When a requirement specifies a numeric value without indicating whether it is a bound, a minimum, or a default, treat it as a bound unless the requirement text explicitly uses language such as "configurable," "documented," or "per the platform's policy."
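The fallback rule above can be expressed as a small Python helper. This is a rough sketch only: the function and constant names are hypothetical, the marker list contains just the three phrases quoted above (the standard says "language such as", so the list is not exhaustive), and plain substring matching is naive. The sample requirement texts are invented for illustration.

```python
# Phrases quoted in the rule above that signal a value is not a hard bound.
NON_BOUND_MARKERS = ("configurable", "documented", "per the platform's policy")


def treat_as_bound(requirement_text):
    """Apply the default rule: a numeric value is a bound unless the
    requirement text explicitly uses one of the marker phrases."""
    lowered = requirement_text.lower()
    return not any(marker in lowered for marker in NON_BOUND_MARKERS)


# Hypothetical requirement texts, not quotations from APTS.
print(treat_as_bound("Halt all activity within 5 seconds of a stop signal."))   # True
print(treat_as_bound("Retain logs for a configurable period (default 90 days)."))  # False
```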
This standard is complementary to existing standards:
Autonomous pentest platforms are built on foundation models whose capabilities are advancing quickly. Each new generation of models tends to find and exploit vulnerabilities more effectively, and published safety evaluations continue to surface behaviors in frontier models that would have been considered unlikely a year earlier. APTS treats this not as a distant research concern but as a present engineering constraint on how platforms should be built.
APTS does not try to verify that a foundation model is aligned, cannot scheme, or will not attempt to exceed its mandate. These properties cannot be reliably verified by platform operators today, and a standard that claimed otherwise would give customers false assurance. APTS instead takes an architectural stance: treat the agent runtime as a component that may, for any reason, attempt actions outside its declared mandate, and build the platform so that such attempts fail by construction rather than by the agent's own restraint.
Several requirements in this standard exist specifically to encode that stance. APTS-SC-019 requires a kernel-enforced execution sandbox whose boundary the agent runtime has no credentials to move. APTS-SC-020 requires the tool and action allowlist to be enforced by a component external to the model rather than by the model's system prompt. APTS-MR-023 names the agent runtime as an untrusted component in the platform's threat model. APTS-AR-020 requires the authoritative audit trail to live on infrastructure the agent runtime cannot reach. APTS-AL-028 requires periodic operator-run containment verification for platforms operating at L3 or L4. APTS-TP-021 and APTS-TP-022 require the operator to disclose the foundation model powering the agent and to re-assess the platform when that model changes materially. APTS-SE-026 recommends monitoring the agent's action distribution for deviations that fall within scope but outside the operator's expected behavior.
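The architectural stance behind these requirements can be illustrated with a small enforcement point that sits between the agent and its tools. The Python sketch below is an assumption-laden illustration, not a design prescribed by APTS: `ActionGate`, `ActionRequest`, and the tool names are hypothetical, and an in-process class provides no real isolation. In practice the gate and the audit sink would run on infrastructure the agent runtime cannot reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRequest:
    tool: str    # tool the agent wants to invoke
    target: str  # target the tool would act on


class ActionGate:
    """Hypothetical enforcement point that lives outside the agent runtime."""

    def __init__(self, allowed_tools, audit_sink):
        self._allowed_tools = frozenset(allowed_tools)
        self._audit_sink = audit_sink  # stand-in for isolated audit storage

    def execute(self, request, runner):
        """Record the decision, then run the action only if the tool is allowlisted."""
        allowed = request.tool in self._allowed_tools
        decision = "allowed" if allowed else "denied"
        self._audit_sink.append((request.tool, request.target, decision))
        if not allowed:
            raise PermissionError(f"tool not on allowlist: {request.tool}")
        return runner(request)


audit_log = []
gate = ActionGate({"port_scan"}, audit_log)
gate.execute(ActionRequest("port_scan", "192.0.2.10"), lambda r: "scan complete")
try:
    gate.execute(ActionRequest("disk_wipe", "192.0.2.10"), lambda r: "wiped")
except PermissionError as exc:
    print(exc)  # tool not on allowlist: disk_wipe
```

The denial does not depend on the model's system prompt or on the agent choosing to comply: the request fails at the gate, and the attempt is recorded either way.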
These requirements share a common assumption: as foundation model capability grows, the gap between what the agent can do and what the operator intends it to do grows with it, and that gap is best closed by architecture rather than by promises extracted from the model. The standard's treatment of this frontier is deliberately modest in scope and does not try to address every open question about agent safety. Research-stage topics (verifiable goal alignment, scheming detection, and containment testing against models that may be aware of the test environment) are out of scope for this version and may be addressed in future versions of APTS as the field matures.