Informative Appendix (non-normative)
This appendix helps customers (CISOs, procurement teams, and security leaders) evaluate autonomous pentesting platform operators against APTS requirements. Throughout this guide, "the operator" refers to whichever entity operates the platform being evaluated: an external vendor, a managed-service provider, or an internal enterprise security team. Enterprise teams performing self-evaluation should use the questions as a self-review checklist before publishing an internal conformance claim.
Decide your minimum compliance tier based on your risk tolerance:
Tier 1 (Foundation): 72 requirements. The platform will not test outside the agreed scope, can be stopped immediately, will not leak discovered credentials, and provides a basic audit trail. Choose Tier 1 when: you are running supervised autonomous testing against non-critical systems with experienced operators monitoring the engagement.
Tier 2 (Verified): 157 cumulative requirements (72 + 85). The platform is fully transparent about what it did and why, protects your data with tamper-evident audit trails, handles incidents with formal response procedures, and provides independently verifiable findings. Choose Tier 2 when: you are testing production environments, operating in regulated industries, or need full accountability for audit or compliance purposes. This is the recommended minimum for most production deployments.
Tier 3 (Comprehensive): 173 cumulative requirements (157 + 16). The platform meets the highest assurance bar for critical infrastructure, fully autonomous (L4) operations, and the strictest regulatory requirements. Choose Tier 3 when: you are deploying fully autonomous testing against critical infrastructure, financial systems, or healthcare environments with minimal human oversight. An additional 10 advisory practices in the Advisory Requirements appendix are recommended for highest-assurance engagements but are not counted toward any tier.
Minimum tier guidance: Tier 1 is appropriate for supervised testing of non-critical systems in non-regulated environments. Organizations in financial services, healthcare, critical infrastructure, or any regulated industry SHOULD require Tier 2 as a minimum. Tier 3 is recommended for critical infrastructure, fully autonomous (L4) operations, and environments with the strictest regulatory requirements.
Use the questions below to evaluate vendor capabilities across all eight APTS domains. Not every question applies to every engagement; select those relevant to your environment and risk tolerance. For brevity, requirement IDs in the tables below use the short form (for example, SE-001) rather than the full identifier (APTS-SE-001).
**Scope Enforcement (SE)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| How does the platform ingest and validate Rules of Engagement? | Machine-parseable RoE format (JSON/YAML/XML), schema validation, cryptographic integrity checks | SE-001 |
| How are IP ranges and domains validated before testing begins? | CIDR validation, RFC 1918 awareness, cloud metadata endpoint exclusion, wildcard handling policies (see the sketch after this table) | SE-002, SE-003 |
| How does the platform enforce temporal boundaries? | UTC-based enforcement, DST handling, countdown alerts, graceful shutdown at boundary expiration | SE-004, SE-008 |
| How are critical assets protected from testing? | Immutable deny lists, asset criticality classification (Critical/Production/Non-Production/Unknown), production database safeguards | SE-005, SE-009, SE-010 |
| How does the platform handle multi-tenant environments? | Tenant isolation checks, per-request tenant context validation, cross-tenant prevention | SE-011 |
| What happens when DNS or infrastructure changes mid-engagement? | DNS rebinding prevention, dynamic scope monitoring, drift detection with alerting | SE-007, SE-012 |
| How are credentials managed during and after engagements? | Real-time credential inventory, automatic rotation at engagement end, least-privilege enforcement | SE-023 |
| Does the platform support continuous or recurring testing? | Unique engagement IDs per cycle, cross-cycle finding correlation, deployment-triggered governance | SE-017, SE-018, SE-020 |
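
As an illustration of the SE-002/SE-003 checks above, here is a minimal sketch of pre-engagement CIDR validation with RFC 1918 awareness and cloud metadata endpoint exclusion. The helper name, policy flag, and endpoint list are assumptions for the sketch, not part of the standard.

```python
# Minimal sketch of pre-engagement target validation (in the spirit of
# SE-002/SE-003). Endpoint list and policy flag are illustrative.
import ipaddress

# Cloud metadata endpoints that should never fall inside a tested range.
METADATA_ENDPOINTS = (
    ipaddress.ip_address("169.254.169.254"),  # AWS/GCP/Azure IMDS (IPv4)
    ipaddress.ip_address("fd00:ec2::254"),    # AWS IMDS (IPv6)
)

def validate_target_network(cidr: str, allow_private: bool = False):
    """Parse one CIDR from the RoE and reject obviously unsafe ranges."""
    net = ipaddress.ip_network(cidr, strict=True)  # ValueError on malformed input
    if net.is_private and not allow_private:
        raise ValueError(f"{cidr} is private (RFC 1918/ULA) space not authorized by the RoE")
    if any(ep in net for ep in METADATA_ENDPOINTS):
        raise ValueError(f"{cidr} contains a cloud metadata endpoint and must be excluded")
    return net
```

A conforming platform would layer schema validation of the whole RoE document (SE-001) on top of per-entry checks like this.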
**Safety Controls (SC)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| How are pentesting actions classified by impact? | Multi-tier classification system (Critical/High/Medium/Low/Info), CIA scoring per action, technique-to-impact mapping | SC-001 |
| What rate limiting is in place? | Per-host connection and request caps, subnet-level aggregate limits, payload size and bandwidth constraints | SC-004 |
| How does the platform prevent cascading failures? | Dependency mapping, kill switches at dependency nodes, circuit breaker patterns | SC-005, SC-012 |
| How does the kill switch work, and can we test it? | Two-phase termination (graceful then forced), process tree termination, operator and automatic triggers (see the sketch after this table) | SC-009 |
| What monitoring detects unintended impact? | Continuous target monitoring (CPU, memory, network, errors), threshold-based alerts, escalation paths | SC-010 |
| How are cumulative risks tracked? | Time-based impact accumulation, decay functions, cumulative risk scoring algorithm, dynamic threshold adjustment | SC-007, SC-010 |
| What happens after the test completes? | Reversible action rollback, post-test integrity validation against baselines, evidence preservation in immutable storage | SC-014, SC-015, SC-016 |
| How is the platform itself monitored? | Platform health monitoring separate from target monitoring, external watchdog on independent infrastructure | SC-010, SC-017 |
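
The two-phase termination called out in SC-009 can be pictured as: signal a graceful stop, wait out a grace period, then force-kill the engagement's process tree. This POSIX-only sketch assumes the engagement's tools run in a single process group; the function name and grace period are illustrative.

```python
# Sketch of two-phase kill-switch termination (SC-009 style), POSIX-only.
import os
import signal
import time

def kill_engagement(pgid: int, grace_seconds: float = 10.0) -> None:
    os.killpg(pgid, signal.SIGTERM)        # phase 1: ask the tool chain to stop cleanly
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.killpg(pgid, 0)             # signal 0 only probes for existence
        except ProcessLookupError:
            return                         # graceful shutdown completed
        time.sleep(0.2)
    os.killpg(pgid, signal.SIGKILL)        # phase 2: forced termination of the tree
```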
**Human Oversight (HO)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| What requires human approval before execution? | Role-based approval workflows for exploitation, lateral movement, data access, persistence mechanisms | HO-001 |
| What does the real-time monitoring dashboard show? | Live activity feed, system health, scope boundaries, pending approvals, anomaly alerts with drill-down | HO-002 |
| What happens if an approver does not respond in time? | Documented SLAs, default-deny/pause/kill behavior on timeout, progressive escalation chains (see the sketch after this table) | HO-003 |
| Who can trigger the kill switch? | Primary and secondary authorities, manager override, out-of-band kill switch on independent network (Advisory) | HO-008, HO-009 |
| How are irreversible actions handled? | Mandatory human approval gate, impact assessment in approval request, two-person rule for high-impact actions | HO-010 |
| How are unexpected findings escalated? | Defined triggers for IoCs, illegal content, zero-days, out-of-scope access; legal/compliance notification paths | HO-011, HO-014 |
| What are the operator qualification requirements? | Competency standards by role, certification program, annual recertification, incident response training | HO-018 |
| How is 24/7 coverage maintained? | Shift handoff procedures, stale approval expiry, fatigue monitoring, mandatory breaks | HO-019 |
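
HO-003's default-deny timeout behavior reduces to a simple rule: no decision inside the SLA means the action is refused. A minimal sketch, assuming a queue stands in for whatever workflow engine delivers approver decisions:

```python
# Default-deny approval gate (HO-003 style). The queue is a stand-in for
# the platform's real approval workflow engine.
import queue

def await_approval(decisions: queue.Queue, sla_seconds: float) -> bool:
    """Return the approver's decision, or False (deny) when the SLA lapses."""
    try:
        return decisions.get(timeout=sla_seconds)  # True = approved
    except queue.Empty:
        return False                               # timeout: default deny
```

A real platform would escalate through the documented chain before the deadline rather than silently denying.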
**Autonomy Levels (AL)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| What autonomy levels does the platform support? | Clear L1 (Assisted) through L4 (Autonomous) definitions with documented boundaries per level | AL-001 through AL-004 |
| How is a platform's autonomy level determined? | Formal assessment criteria, capability evaluation methodology, documented evidence requirements | AL-025 |
| What restrictions apply at each level? | Progressive capability unlocking, action-type restrictions per level, scope limitations per level (see the sketch after this table) | AL-006, AL-007, AL-008 |
| How does the platform transition between levels? | Defined promotion/demotion criteria, assessment evidence, approval requirements for level changes | AL-025, AL-026 |
| What safety controls scale with autonomy level? | Monitoring intensity, approval requirements, and safety margins that increase with autonomy level | AL-012 through AL-016 |
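
Progressive capability unlocking (AL-006 through AL-008) is essentially a monotone mapping from autonomy level to permitted action classes, as in this sketch. The action taxonomy here is an assumption for illustration, not the standard's.

```python
# Illustrative level-to-capability mapping (AL-006..AL-008 style).
# Higher levels strictly extend lower ones; the action classes are
# assumptions for the sketch.
ALLOWED_ACTIONS = {
    "L1": {"recon"},
    "L2": {"recon", "scan"},
    "L3": {"recon", "scan", "exploit_reversible"},
    "L4": {"recon", "scan", "exploit_reversible", "lateral_movement"},
}

def is_permitted(level: str, action_class: str) -> bool:
    """Deny anything not explicitly unlocked at the platform's level."""
    return action_class in ALLOWED_ACTIONS.get(level, set())
```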
**Audit and Accountability (AR)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| What does the platform log? | Every action, decision, and outcome with timestamps; decision rationale capture; tool invocation records | AR-001, AR-002 |
| How is log integrity protected? | Tamper-evident storage, cryptographic chaining or signing, independent log verification (see the sketch after this table) | AR-010, AR-012 |
| Can findings be reproduced? | Decision replay capability, environment state capture, reproducible finding validation | AR-016, AR-017 |
| How are AI/ML model changes tracked? | Model version logging, drift detection, change impact assessment, model change audit trail | AR-019 |
| What retention policies apply? | Defined retention periods per data type, secure disposal procedures, regulatory compliance | AR-005, TP-015 |
| How is audit access controlled? | Role-based access to logs, separation of duties between operators and auditors | AR-011, AR-019 |
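
Tamper-evident storage with cryptographic chaining (AR-010/AR-012) can be as simple as a hash chain: each record stores the digest of its predecessor, so rewriting any past record invalidates every digest after it. A minimal sketch with illustrative field names:

```python
# Hash-chained audit log sketch (AR-010/AR-012 style). Field names are
# illustrative, not mandated by the standard.
import hashlib
import json

def append_record(log: list, event: dict) -> None:
    """Append an event, chaining its digest to the previous record's."""
    prev = log[-1]["digest"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "digest": digest})

def verify_chain(log: list) -> bool:
    """Recompute every digest; any tampering breaks the chain."""
    prev = "0" * 64
    for rec in log:
        body = json.dumps(rec["event"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if rec["prev"] != prev or rec["digest"] != expected:
            return False
        prev = rec["digest"]
    return True
```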
**Manipulation Resistance (MR)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| How does the platform resist prompt injection? | Input sanitization, prompt/data separation, output validation, context isolation | MR-001, MR-002 |
| How is scope widening prevented? | Scope anchor validation, target-suggested scope expansion rejection, redirect chain validation | MR-007, MR-010, MR-012 |
| How is model poisoning detected? | Training data integrity verification, model behavior monitoring, anomaly detection on model outputs | TP-019, AR-019 |
| What SSRF protections are in place? | Internal network access controls, metadata endpoint blocking, outbound request filtering (see the sketch after this table) | MR-009 |
| How are adversarial inputs handled? | Input fuzzing resistance, encoding normalization, multi-layer validation | MR-013, MR-014, MR-017 |
| Is the platform tested against manipulation? | Regular red team exercises, adversarial testing program, manipulation resistance validation | MR-020 |
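
For SSRF protections (MR-009), the core move is to resolve every outbound destination, refuse metadata and loopback addresses unconditionally, and require everything else to be in scope. A minimal sketch; the blocked ranges and scope model are illustrative:

```python
# Outbound request filter sketch (MR-009 style SSRF guard).
import ipaddress
import socket

BLOCKED = (
    ipaddress.ip_network("169.254.0.0/16"),  # link-local, incl. metadata endpoints
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
)

def outbound_allowed(host: str, in_scope: set) -> bool:
    try:
        # gethostbyname is IPv4-only; enough for a sketch.
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False                          # unresolvable: deny by default
    if any(addr in net for net in BLOCKED):
        return False                          # metadata/loopback: never allowed
    return host in in_scope                   # everything else must be in scope
```

A real guard would also pin the resolved address for the actual connection to defeat DNS rebinding (compare SE-007 in the Scope Enforcement table).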
**Third-Party Risk (TP)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| How are AI/ML providers vetted? | Provider security assessment, contractual security requirements, ongoing monitoring | TP-001, TP-012 |
| How are third-party dependencies managed? | Dependency inventory, vulnerability scanning, update policies, SBOM generation | TP-006 |
| What is the incident notification process? | Documented timelines for breach notification, customer communication procedures | TP-005 |
| How is customer data isolated? | Per-engagement isolation, data residency controls, cross-tenant prevention | TP-017 |
| What data retention and deletion policies exist? | Defined retention periods, certified deletion procedures, data destruction proof (see the sketch after this table) | TP-015, TP-016 |
| How are secrets handled during engagements? | Secret classification by provenance, reuse policies, mandatory revocation at engagement end | SE-023, MR-019 |
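
Retention and deletion policies (TP-015/TP-016) ultimately reduce to a per-data-type clock. The periods and artifact types below are assumptions for the sketch, not values from the standard.

```python
# Retention-window check (TP-015 style). Periods and artifact types are
# illustrative only.
from datetime import datetime, timedelta, timezone

RETENTION = {
    "finding": timedelta(days=365),
    "raw_traffic": timedelta(days=30),
}

def is_expired(artifact_type: str, created_at: datetime) -> bool:
    """True when the artifact has outlived its window. created_at must be UTC-aware."""
    window = RETENTION.get(artifact_type)
    if window is None:
        return False                      # unknown types are held pending review
    return datetime.now(timezone.utc) - created_at > window
```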
**Reporting (RP)**

| Question | What to Look For | Key Requirements |
|---|---|---|
| How are findings validated before reporting? | Multi-stage validation, false positive reduction, confidence scoring per finding | RP-001, RP-002 |
| What does a standard report include? | Executive summary, technical findings, evidence, remediation guidance, coverage disclosure | RP-011, RP-012, RP-008 |
| How is finding confidence communicated? | Per-finding confidence scores with methodology and supporting evidence quality factors | RP-003 |
| How is testing coverage measured and disclosed? | Coverage metrics per scope element, untested areas identified, coverage gaps explained | RP-008 |
| How are reports and downstream findings protected in transit and at handoff? | Cryptographic evidence integrity, encrypted downstream transmission, protected integration pipeline to ticketing systems (see the sketch after this table) | RP-005, RP-015, TP-014 |
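
Cryptographic evidence integrity (RP-005) in its simplest form means publishing a digest with each report artifact so a downstream consumer, such as a ticketing integration, can verify it after transmission. A minimal sketch:

```python
# Evidence-integrity check for report artifacts (RP-005 style).
import hashlib

def digest_artifact(data: bytes) -> str:
    """SHA-256 digest published alongside the artifact."""
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Recompute and compare on the receiving side."""
    return hashlib.sha256(data).hexdigest() == expected_digest
```

In practice a signature, rather than a bare digest, binds the artifact to the operator's identity as well as its content.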
Use these seven questions for initial assessment before detailed evaluation:
"Which APTS tier do you claim conformance with?" If unfamiliar with APTS, share this standard. If claiming conformance, request evidence.
"Provide your completed APTS conformance assessment against the Checklists." A credible vendor maps capabilities to per-tier verification items, whether the assessment was performed internally or by a third party.
"Can you demonstrate your safety controls in a live environment?" Ask the vendor to demonstrate kill switch operation, scope enforcement, and rate limiting. The Customer Acceptance Testing appendix provides structured test procedures if you want to conduct your own hands-on verification.
"How does your kill switch work, and can we test it?" Multiple independent kill switches are required (APTS-SC-009). Ask for a demo in a test environment.
"What happens to our data after the engagement?" Request credential disposal reports and data destruction proof (APTS-SE-023, APTS-TP-015, APTS-TP-016).
"Do you deploy agents or software to our infrastructure?" If yes, confirm agents can be removed without vendor cooperation and are covered in the Rules of Engagement (APTS-SE-022).
"What AI/ML models does the platform use, and how do you track model changes?" Request the model change log and drift detection procedures (APTS-AR-019).
Watch for these warning signs, each the inverse of a screening question above:
- Unfamiliarity with APTS, or conformance claims offered without a completed assessment.
- Unwillingness to demonstrate the kill switch, scope enforcement, or rate limiting in a live or test environment.
- Vague answers about credential disposal and data destruction after the engagement.
- Deployed agents that cannot be removed without vendor cooperation.
- No model change log or drift detection procedures for the AI/ML models in use.
A thorough vendor evaluation typically requires 2-4 weeks, depending on the depth of verification.
Three approaches, in increasing order of assurance:
1. Documentation review: accept the operator's claims and completed assessment evidence.
2. Demonstration: request live demonstrations or recorded evidence of behavioral controls.
3. Hands-on verification: run the Customer Acceptance Testing (CAT) procedures yourself.
Note on behavioral requirements: Some APTS requirements are behavioral (kill switch timing, scope enforcement accuracy, manipulation resistance) and cannot be fully verified through documentation. For these requirements, customers can request vendor demonstrations or recorded evidence. Organizations with higher risk tolerance may accept vendor claims and operator-provided assessment evidence; organizations requiring stronger assurance may conduct hands-on verification using the CAT procedures.