Reporting

Domain Prefix: APTS-RP | Requirements: 15

This domain defines how an autonomous penetration testing platform produces and delivers findings that customers can trust: evidence-backed, reproducibility-ready, confidence-scored, hallucination-resistant, and honest about what the engagement did and did not cover. A platform can enforce scope, stop safely, keep pristine audit trails, and resist manipulation, yet still fail its customer if findings are unverifiable, overconfident, or silently incomplete. Requirements here govern evidence-based validation, human review pipelines, confidence scoring, provenance chains, cryptographic evidence integrity, false positive and false negative disclosure, coverage disclosure, executive summaries, remediation guidance, Service Level Agreement (SLA) reporting, trend analysis, and downstream pipeline integrity.

This domain covers the content and integrity of findings delivered to the customer. Engagement-time logging and evidence storage belong to Auditability (AR); cryptographic hashing primitives and chain-of-custody mechanics remain normatively in AR even when referenced by requirements here.

For implementation guidance, see the Implementation Guide.

Domain Overview

The 15 requirements in this domain fall into five thematic groups:

Group	Requirements	Purpose
Finding validation, review, and confidence	APTS-RP-001, APTS-RP-002, APTS-RP-003	Evidence-based validation, human review pipeline, confidence scoring methodology
Provenance and evidence integrity	APTS-RP-004, APTS-RP-005	Finding provenance chain, cryptographic evidence chain integrity
Accuracy and coverage disclosure	APTS-RP-006, APTS-RP-007, APTS-RP-008, APTS-RP-009, APTS-RP-010	False positive disclosure, independent reproducibility, coverage disclosure, false negative disclosure, detection effectiveness benchmarking
Customer-facing report content	APTS-RP-011, APTS-RP-012, APTS-RP-013	Executive summary and risk overview, remediation guidance and prioritization, SLA compliance reporting
Recurring and downstream reporting	APTS-RP-014, APTS-RP-015	Trend analysis for recurring engagements, downstream finding pipeline integrity

Requirement Index

ID	Title	Classification
APTS-RP-001	Evidence-Based Finding Validation	MUST | Tier 2
APTS-RP-002	Finding Verification and Human Review Pipeline	MUST | Tier 2
APTS-RP-003	Confidence Scoring with Auditable Methodology	MUST | Tier 2
APTS-RP-004	Finding Provenance Chain	MUST | Tier 2
APTS-RP-005	Cryptographic Evidence Chain Integrity	MUST | Tier 2
APTS-RP-006	False Positive Rate Disclosure	MUST | Tier 1
APTS-RP-007	Independent Finding Reproducibility	SHOULD | Tier 3
APTS-RP-008	Vulnerability Coverage Disclosure	MUST | Tier 1
APTS-RP-009	False Negative Rate Disclosure and Methodology	MUST | Tier 2
APTS-RP-010	Detection Effectiveness Benchmarking	SHOULD | Tier 3
APTS-RP-011	Executive Summary and Risk Overview	MUST | Tier 1
APTS-RP-012	Remediation Guidance and Prioritization	MUST | Tier 2
APTS-RP-013	Engagement SLA Compliance Reporting	MUST | Tier 2
APTS-RP-014	Trend Analysis for Recurring Engagements	SHOULD | Tier 2
APTS-RP-015	Downstream Finding Pipeline Integrity	SHOULD | Tier 2

Conformance

A platform claims conformance with this domain by implementing every MUST requirement assigned to the compliance tier it targets and to all lower tiers, with no deviation, and by either implementing every SHOULD requirement at those tiers or recording a documented justification for each deviation in its conformance claim (see the Conformance Claim Template). An unimplemented MUST requirement or an undocumented SHOULD deviation is a conformance gap. APTS defines three cumulative compliance tiers (Tier 1 Foundation, Tier 2 Verified, Tier 3 Comprehensive) in the Introduction; a Tier 2 platform satisfies every Tier 1 RP requirement plus every Tier 2 RP requirement, and a Tier 3 platform satisfies all three tiers.

Every requirement in this domain includes a Verification subsection listing the verification procedures a reviewer uses to confirm implementation.

APTS-RP-001: Evidence-Based Finding Validation

Classification: MUST | Tier 2

Requirement

Every finding MUST include raw technical evidence demonstrating its existence. AI-generated summaries MAY accompany evidence but MUST NOT replace it.

Each finding includes at least one raw evidence artifact: packet captures, HTTP request/response pairs, command output, screenshots, or file contents
Findings without raw evidence are explicitly flagged as "Unverified" in the report
Summaries are clearly separated from raw evidence with distinct formatting
Evidence provenance is documented: which tool captured it, timestamp, conditions
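
As an illustration of the evidence rule above, the following minimal sketch separates findings that carry at least one raw artifact from those that must be flagged "Unverified". The `Finding` class, its field names, and the "Evidenced" label are hypothetical and not defined by APTS; only the "Unverified" label comes from the requirement.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative only."""
    title: str
    raw_evidence: list[str] = field(default_factory=list)  # IDs or paths of raw artifacts
    summary: str = ""  # optional AI-generated summary; never counts as evidence


def report_status(finding: Finding) -> str:
    """Return "Unverified" when the finding has no raw evidence artifact."""
    return "Evidenced" if finding.raw_evidence else "Unverified"


def split_for_report(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (main section, flagged as Unverified)."""
    main = [f for f in findings if report_status(f) == "Evidenced"]
    unverified = [f for f in findings if report_status(f) == "Unverified"]
    return main, unverified
```

A summary on its own never promotes a finding: `report_status` looks only at `raw_evidence`.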

Verification

Extract all findings from engagement report
For each finding, confirm a raw evidence section exists with actual technical data
Verify findings marked "Unverified" do not appear in main findings section
Check that summaries are clearly distinct from evidence (for example, separate subsection headers)

APTS-RP-002: Finding Verification and Human Review Pipeline

Classification: MUST | Tier 2

Requirement

Every Critical and High severity finding MUST pass through a three-stage verification pipeline before inclusion in the final report:

Stage 1 - Initial Reproduction (during testing): The platform MUST attempt automated reproduction for every Critical and High finding. Reproduction attempts are logged with timestamps, exact inputs provided, and outputs received. Findings are marked "Confirmed" (successful reproduction) or "Unconfirmed" (failed reproduction). Failed reproduction attempts include detailed explanation of what was attempted and why it failed. Reproduction methodology is documented and traceable back to tool invocations.

Stage 2 - Pre-Delivery Re-verification: Before finalizing any engagement report, the platform MUST automatically re-trigger each Critical and High finding to confirm reproduction. For each finding, the platform MUST replay the exact attack sequence, confirm consistent behavior, and record the re-verification attempt with full packet capture. Findings that fail re-verification MUST NOT appear as confirmed in the final report; they MAY appear in an appendix labeled "Unverified Findings Requiring Manual Review." Findings that reproduce on some but not all attempts MUST be classified as "Intermittent" with the reproduction rate documented (for example, "Reproduced 3 of 5 attempts") and included in the main report. The re-verification rate (percentage of findings reproducing on second attempt) MUST be disclosed in the methodology section.

Stage 3 - Human Review Gate: Critical findings are subject to mandatory human review before customer delivery. At Autonomy Levels 1-3: all Critical findings MUST be reviewed by a qualified human before delivery. At Autonomy Level 4: Critical findings SHOULD be flagged for priority human review within 4 hours of delivery, enabling post-delivery review rather than blocking autonomous report generation. Each Critical finding MUST include reviewer identification (name, title, qualification such as Offensive Security Certified Professional (OSCP), Certified Ethical Hacker (CEH), or years of experience), reviewer signature or authentication token indicating explicit approval, and review notes documenting human verification steps taken.
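
The Stage 2 outcomes can be sketched as a small classifier over re-verification attempts. This is a sketch under stated assumptions, not a normative algorithm: the function names are hypothetical, and the rate calculation assumes a finding counts as "reproducing" when at least one re-verification attempt succeeds, which the requirement does not specify.

```python
def classify_reverification(successes: int, attempts: int) -> str:
    """Map re-verification results to a report status."""
    if attempts <= 0:
        raise ValueError("at least one re-verification attempt is required")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    if successes == attempts:
        return "Confirmed"
    if successes == 0:
        return "Unverified"  # appendix only, never shown as confirmed
    return f"Intermittent (reproduced {successes} of {attempts} attempts)"


def reverification_rate(results: dict[str, tuple[int, int]]) -> float:
    """Percentage of findings that reproduced during re-verification.

    `results` maps a finding ID to (successes, attempts).
    Assumption: one or more successful attempts counts as reproducing.
    """
    if not results:
        raise ValueError("no findings to evaluate")
    reproduced = sum(1 for successes, _ in results.values() if successes > 0)
    return round(100 * reproduced / len(results), 1)
```

For example, `classify_reverification(3, 5)` yields the "Intermittent" label with its reproduction rate, matching the wording the requirement asks for.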

Verification

Review engagement audit logs for reproduction attempt records; for each Critical/High finding, confirm a reproduction log entry exists
Verify reproduction result matches finding's Confirmed/Unconfirmed status; validate reproduction inputs and outputs are documented
For each Critical/High finding, verify re-verification log exists with timestamp and full packet capture
Confirm re-verification includes exact attack sequence replay and response behavior validation
Verify findings that failed re-verification appear only in appendix, not in main findings
Verify intermittent findings are classified with specific reproduction rate and included in the main report
Review 3 engagement reports; verify each includes re-verification rate in the methodology section
For each Critical finding, locate Human Review section; verify reviewer name, qualification, and signature/authentication token are present
Verify review timestamp relative to finding discovery timestamp; confirm autonomy level requirements are met
Spot-check 3 findings: reproduce independently and confirm results match platform's verification results

See also: APTS-RP-001 (evidence-based finding validation that feeds this verification pipeline), APTS-AR-001 (audit trail that records reproduction and re-verification attempts), APTS-AR-004 (decision confidence scoring that informs human review triage).

APTS-RP-003: Confidence Scoring with Auditable Methodology

Classification: MUST | Tier 2

Requirement

Each finding MUST include a confidence score (0-100%) calculated from objective factors with documented, auditable methodology.

Confidence score is derived from measurable factors, not opaque "AI confidence"
Methodology documentation includes formula or decision tree
Factors used are: successful reproductions count, evidence quality (raw versus inferred), attack path complexity
Formula is identical across all findings for consistency
Score calculation is auditable and reproducible

The confidence scoring methodology MUST document: (a) the input factors used to calculate confidence (for example, evidence quality, number of independent confirmations, environmental conditions), (b) the weighting applied to each factor, (c) the scale definition (for example, 0-100% where 0 = no confidence and 100 = confirmed with full evidence chain), and (d) the thresholds used to categorize findings as Confirmed (>= 90%), High Confidence (70-89%), Medium Confidence (50-69%), or Low Confidence (< 50%). The methodology MUST be consistent across all findings within an engagement.

See also: APTS-AR-004 (authoritative definition of the confidence scoring methodology and thresholds). Engagement reports MUST use the same methodology defined in AR-004; RP-003 governs how confidence scores are presented and validated in deliverables.
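
A minimal sketch of an auditable score, assuming a simple weighted sum: the factor names and weights below are hypothetical placeholders (the authoritative methodology is the one defined in APTS-AR-004), while the category thresholds are the ones stated in this requirement.

```python
# Hypothetical weights; a real methodology documents and justifies its own.
WEIGHTS = {
    "reproduction": 0.5,      # share of reproduction attempts that succeeded
    "evidence_quality": 0.3,  # 1.0 = raw evidence, lower = inferred
    "path_simplicity": 0.2,   # 1.0 = trivial attack path, lower = complex
}


def confidence_score(factors: dict[str, float]) -> float:
    """Weighted sum of factors normalized to [0, 1], scaled to 0-100."""
    for name in WEIGHTS:
        if not 0.0 <= factors[name] <= 1.0:
            raise ValueError(f"factor {name!r} must be within [0, 1]")
    return round(100 * sum(WEIGHTS[n] * factors[n] for n in WEIGHTS), 1)


def confidence_category(score: float) -> str:
    """Apply the thresholds stated in the requirement."""
    if score >= 90:
        return "Confirmed"
    if score >= 70:
        return "High Confidence"
    if score >= 50:
        return "Medium Confidence"
    return "Low Confidence"
```

Because the same `WEIGHTS` table is applied to every finding, a reviewer can recompute any reported score from the recorded factor values.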

Verification

Extract confidence scoring methodology from engagement documentation
For each finding in report, locate confidence score and score calculation details
Verify score uses only documented factors (reproductions, evidence quality, complexity)
Recalculate confidence for 5 random findings and confirm each result matches the reported score
Verify formula is consistent across findings

APTS-RP-004: Finding Provenance Chain

Classification: MUST | Tier 2

Requirement

Each finding MUST include a complete provenance chain documenting its discovery path: which scanner/module discovered it, what input triggered discovery, what decision path led to exploitation, and cryptographic linkage to audit logs.

Provenance chain traces finding back to specific scanner execution, tool configuration, and audit log entry
Chain includes: discovery event timestamp, discovery tool, discovery input, analysis tool invocations, exploitation tool invocations
Each link is cryptographically signed or hashed using algorithms of equivalent or greater strength than SHA-256 to prevent tampering
Provenance is cross-referenced to Auditability audit trail
Client can validate provenance chain using public hashes
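
One way to realize the hashed links described above is a simple hash chain, sketched here with SHA-256 from Python's standard library. The event fields, the all-zero genesis value, and the JSON canonicalization are assumptions made for illustration; APTS does not prescribe them.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first link's predecessor


def link_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous link's digest together with a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def build_chain(events: list[dict]) -> list[dict]:
    """Produce chain links where each hash covers the event and its predecessor."""
    chain, prev = [], GENESIS
    for event in events:
        digest = link_hash(prev, event)
        chain.append({"event": event, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered link fails verification."""
    prev = GENESIS
    for link in chain:
        if link["prev_hash"] != prev or link_hash(prev, link["event"]) != link["hash"]:
            return False
        prev = link["hash"]
    return True
```

A client holding only the published hashes and the events can run `verify_chain` without access to platform internals.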

Verification

For each finding, locate provenance chain in report appendix
Verify each provenance event references audit log entry (see the Auditability domain, APTS-AR-001)
Validate cryptographic hashes link chain entries
Recalculate hashes to confirm no tampering
Confirm all tools, timestamps, and inputs are verifiable from audit logs

See also: APTS-AR-011 (chain of custody for evidence)

APTS-RP-005: Cryptographic Evidence Chain Integrity

Classification: MUST | Tier 2

Requirement

All finding evidence MUST be cryptographically bound to the discovery event with integrity verification. This requirement specifies the integrity verification mechanism for the finding provenance chain defined in APTS-RP-004.

For each finding, the platform MUST capture and provide:
- Raw evidence artifact demonstrating the finding: network captures (PCAP), authenticated session logs, API request/response pairs, file system artifacts, command output, or equivalent technical evidence
- Cryptographic hash (for example, SHA-256 or equivalent strength) of each evidence artifact
- Unique identifier linking the evidence to the finding report
- Timestamp correlation between evidence capture and finding report
Evidence MUST be extractable directly from raw bytes. Summaries or screenshots alone are insufficient for CRITICAL/HIGH findings.
For findings discovered over encrypted channels or through non-network means, alternative evidence artifacts with equivalent integrity guarantees are acceptable (for example, database query results with execution timestamp, authenticated log entries with cryptographic signature, authenticated API responses with session binding).
Reviewers MUST be able to independently verify evidence integrity using the provided cryptographic hashes and confirm the vulnerability response matches what the platform reported.
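
The binding of an artifact to its finding can be sketched as a record holding the finding identifier, a SHA-256 digest of the raw bytes, and the capture timestamp. The record layout and function names are hypothetical; the one-second default tolerance mirrors the alignment check in the verification procedure for this requirement.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def make_evidence_record(finding_id: str, artifact: bytes, captured_at: datetime) -> dict:
    """Bind raw evidence bytes to a finding via a SHA-256 digest."""
    return {
        "finding_id": finding_id,
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "captured_at": captured_at.isoformat(),
    }


def verify_evidence(record: dict, artifact: bytes) -> bool:
    """Reviewer-side check: recompute the digest from the raw bytes."""
    return hashlib.sha256(artifact).hexdigest() == record["sha256"]


def timestamps_aligned(
    captured_at: datetime,
    reported_at: datetime,
    tolerance: timedelta = timedelta(seconds=1),
) -> bool:
    """Check that evidence capture and finding timestamps agree within tolerance."""
    return abs(reported_at - captured_at) <= tolerance
```

Hashing the raw bytes, rather than a rendered summary, is what lets a reviewer detect any change to the artifact after capture.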

Verification

For each CRITICAL/HIGH finding, locate associated evidence artifact(s) and cryptographic hash(es) (SHA-256 or equivalent strength)
Verify evidence includes at least one raw technical artifact (not summary or screenshot)
Confirm cryptographic hashes of evidence artifacts are documented
Verify timestamp in evidence aligns with finding timestamp (within 1 second)
Independently verify evidence integrity using provided hashes
Confirm evidence matches vulnerability description (for encrypted/non-network findings, verify authentication and source)
For encrypted channel findings, verify alternative evidence artifacts have equivalent integrity guarantees

See also: APTS-AR-010 (platform-wide evidence hashing specification).

APTS-RP-006: False Positive Rate Disclosure

Classification: MUST | Tier 1

Requirement

The platform MUST disclose its false positive rate methodology and measurements in every engagement report.

The methodology section MUST include:
- Platform's overall false positive rate over a documented measurement period, with a description of how ground truth was established
- False positive rate specific to this engagement (findings reported compared to findings confirmed), where confirmation data is available
- Method used to calculate false positive rate, including known limitations of the measurement
If the platform's false positive rate for CRITICAL/HIGH findings exceeds the operator's published target, this MUST be prominently disclosed in the executive summary.
Customers MUST be able to report false positives back to the operator. The operator MUST track and publish aggregate false positive rates per finding category.
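
A minimal sketch of the per-category calculation, assuming the rate is computed only over findings whose outcome has been confirmed one way or the other (a limitation a real methodology would need to disclose). The function names and input shape are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def fp_rate(false_positives: int, reviewed: int) -> Optional[float]:
    """False positive percentage, or None when no confirmation data exists."""
    if reviewed == 0:
        return None
    return round(100 * false_positives / reviewed, 1)


def fp_rate_by_category(reviews: Iterable[tuple[str, bool]]) -> dict[str, Optional[float]]:
    """Aggregate rates per finding category.

    `reviews` yields (category, is_false_positive) for each finding
    whose outcome was confirmed, for example via customer feedback.
    """
    totals: Counter = Counter()
    false_positives: Counter = Counter()
    for category, is_false_positive in reviews:
        totals[category] += 1
        false_positives[category] += int(is_false_positive)
    return {c: fp_rate(false_positives[c], totals[c]) for c in totals}
```

Returning `None` instead of `0.0` when nothing has been reviewed avoids implying a measured zero rate where no measurement exists.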

Verification

Locate false positive rate disclosure in Methodology section
Verify includes: platform rate over documented measurement period, engagement-specific rate, calculation method
Check for prominence notice in Executive Summary if rate exceeds operator's published target
Verify customer can identify reporting mechanism (email, web form, or API endpoint) for false positives
Request operator's false positive tracking database and confirm recent entries
Verify aggregate rates published per finding category (for example, "Remote Code Execution (RCE): 3% FP rate")

See also: APTS-RP-003 (confidence scoring methodology that informs false positive filtering), APTS-RP-001 (evidence-based validation that reduces false positives).

APTS-RP-007: Independent Finding Reproducibility

Classification: SHOULD | Tier 3

Requirement

The platform SHOULD support independent reproduction of reported findings. Specifically:

A representative sample of CRITICAL findings from a recent engagement (with customer consent) can be selected for validation.
Each selected finding can be independently reproduced using the platform's documented evidence (steps, PCAP, tool output).
The platform MUST define a reproduction success threshold and document it.
Results are documented with per-finding reproduction status.

Verification

Confirm engagement is at least 30 days old (allows time for system changes)
Select a representative sample of CRITICAL findings from engagement
For each finding, obtain platform's evidence, reproduce steps, and PCAP
Attempt independent reproduction using same target environment or documented equivalent
Document reproduction success/failure for each finding
Evaluate reproduction rate against the platform's documented threshold
Provide detailed report per finding with reproduction attempts and results

See also: APTS-RP-005 (cryptographic evidence chain that enables independent reproduction), APTS-RP-004 (provenance chain used by independent reviewers).

APTS-RP-008: Vulnerability Coverage Disclosure

Classification: MUST | Tier 1

Requirement

Platform reports MUST include a vulnerability coverage matrix that discloses which vulnerability classes the platform tested for and which it did not.

Every engagement report MUST include a "Coverage Matrix" section listing: vulnerability classes tested (mapped to Common Weakness Enumeration (CWE) categories or equivalent taxonomy), vulnerability classes explicitly excluded from testing, and vulnerability classes where coverage is partial (for example, within the injection class the platform tests for SQL injection but not stored XSS).
The coverage matrix MUST be specific to the engagement, reflecting the actual test profile and autonomy level used, not a generic list of the platform's theoretical capabilities.
For each vulnerability class listed as tested, the platform MUST indicate the detection method used (signature-based, heuristic, exploitation-based, or behavioral analysis).
The coverage matrix MUST NOT misrepresent coverage. If the platform only checks for default credentials but does not test authentication bypass logic, "Authentication" MUST NOT be listed as fully covered.
The coverage matrix MUST be positioned in the report where customers will see it (Executive Summary or Methodology section), not buried in an appendix.
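
To illustrate how misrepresented coverage might be caught, the sketch below compares a coverage matrix against the set of classes whose test modules actually ran. The status labels ("tested", "partial", "excluded") and the function name are assumptions; the CWE identifiers are real entries used only as sample data.

```python
def coverage_discrepancies(matrix: dict[str, str], executed: set[str]) -> list[str]:
    """Return classes claimed as tested or partial whose modules never executed.

    `matrix` maps a CWE ID to a status label; `executed` holds the CWE IDs
    for which the execution logs show at least one test module run.
    """
    claimed = {"tested", "partial"}
    return sorted(
        cwe for cwe, status in matrix.items()
        if status in claimed and cwe not in executed
    )


matrix = {
    "CWE-89": "tested",     # SQL injection
    "CWE-79": "partial",    # cross-site scripting
    "CWE-287": "excluded",  # improper authentication
}
gaps = coverage_discrepancies(matrix, executed={"CWE-89"})
```

Any non-empty result indicates a class listed in the matrix without supporting execution records, which the report would need to correct before delivery.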

Verification

Review the engagement report. Verify a coverage matrix is present in the Methodology or Executive Summary section.
Verify the matrix lists specific CWE categories or equivalent, not vague labels like "web vulnerabilities."
Verify the matrix reflects the actual engagement configuration, not a generic capability list.
Cross-reference the "tested" classes against the platform's actual test execution logs. Verify no class is claimed as tested when the corresponding test module did not execute.
Verify excluded and partial coverage classes are clearly identified.

APTS-RP-009: False Negative Rate Disclosure and Methodology

Classification: MUST | Tier 2

Requirement

Platform operators MUST disclose their false negative rate measurement methodology and known limitations. Absolute false negative rates are inherently difficult to measure; this requirement focuses on transparency of methodology rather than mandating specific accuracy levels.

The operator MUST maintain a documented methodology for measuring or estimating false negative rates, including how ground truth is established where feasible (for example, known-vulnerable test environments, comparison with manual testing, third-party validation, or coverage gap analysis).
The operator SHOULD publish aggregate false negative estimates per vulnerability class based on their measurement methodology, updated periodically.
Per-engagement reports SHOULD include an estimated false negative risk assessment based on the engagement's scope, target technology, and testing depth.
The false negative measurement methodology MUST be available to customers upon request.
If the operator has not yet established measured false negative rates (for example, new platform), the report MUST explicitly state this and describe the plan for establishing measurement.
Where direct measurement is infeasible, the operator MUST disclose the estimation approach used and its limitations.

Verification

Methodology documentation: Request the operator's false negative rate measurement methodology; verify it exists, is documented, and describes how ground truth is established (known-vulnerable environments, manual comparison, or third-party validation)
Aggregate rate publication: Verify the operator publishes aggregate false negative rates per vulnerability class; confirm the publication cadence is documented in the methodology
Per-engagement estimate: Review 5 engagement reports; verify each includes a false negative risk assessment or explicitly states why one is not feasible for that engagement
Methodology availability: Request the measurement methodology as a customer; verify it is provided upon request
New-platform disclosure: If the operator has not yet established measured rates, verify the report explicitly states this and includes a plan and timeline for establishing measurement
Estimation limitations: Where estimation methods are used instead of direct measurement, verify the methodology documents the estimation approach and discloses its limitations

See also: APTS-RP-008 (coverage disclosure that bounds the scope over which false negatives are measured), APTS-RP-010 (detection effectiveness benchmarking that provides ground truth for false negative measurement).

APTS-RP-010: Detection Effectiveness Benchmarking

Classification: SHOULD | Tier 3

Requirement

Platform operators SHOULD benchmark their detection effectiveness against known-vulnerable environments and disclose benchmark results to customers upon request.

The operator SHOULD maintain at least 3 benchmark environments representing common target profiles (web application, API service, network infrastructure) with documented, known vulnerabilities.
Benchmark runs SHOULD be conducted at least quarterly and after major platform updates.
Benchmark results SHOULD include: total known vulnerabilities per environment, total detected, total missed, detection rate per vulnerability class, and time to detection.
Benchmark environments SHOULD be periodically refreshed with new vulnerabilities to prevent overfitting (the platform being tuned specifically to detect benchmark vulnerabilities while missing novel ones).
Benchmark results SHOULD be available to customers upon request.
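
The per-class figures listed above can be derived from a known-vulnerability inventory and the set of detected items. This is a sketch with hypothetical names and input shapes; time to detection is omitted for brevity.

```python
def benchmark_summary(known: dict[str, set[str]], detected: set[str]) -> dict:
    """Summarize one benchmark run.

    `known` maps a vulnerability class to the IDs seeded in the environment;
    `detected` holds the IDs the platform reported.
    """
    all_known = set().union(*known.values()) if known else set()
    found = all_known & detected
    per_class = {
        cls: round(100 * len(ids & detected) / len(ids), 1)
        for cls, ids in known.items()
        if ids  # skip empty classes to avoid division by zero
    }
    return {
        "total_known": len(all_known),
        "total_detected": len(found),
        "total_missed": len(all_known - found),
        "detection_rate_by_class": per_class,
    }
```

Intersecting with `all_known` keeps detections of unseeded issues from inflating the detection count.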

Verification

Request the operator's benchmark environment descriptions and known vulnerability inventories.
Request the most recent quarterly benchmark results.
Verify detection rates are broken down by vulnerability class.
Verify benchmark environments have been refreshed within the last 12 months.
Compare benchmark results across the last 4 quarters. Verify detection rates are stable or improving.

See also: APTS-RP-008 (coverage disclosure that these benchmarks validate), APTS-RP-009 (false negative methodology grounded in benchmark results).

APTS-RP-011: Executive Summary and Risk Overview

Classification: MUST | Tier 1

Requirement

Every engagement report MUST include an executive summary suitable for non-technical stakeholders. The summary MUST include: overall risk posture assessment, total findings count by severity, key findings with business impact, and scope coverage achieved. The executive summary MUST be comprehensible without reading the full technical report.

Reports MUST be delivered in at least one machine-readable format (JSON, XML, or structured CSV) in addition to any human-readable format (PDF, HTML). Machine-readable output enables integration with customer vulnerability management systems, ticketing platforms, and risk dashboards. The human-readable report MUST include: engagement metadata (scope, dates, tier, autonomy level), executive summary with risk rating, finding summary table sorted by severity, detailed findings with evidence, coverage disclosure per APTS-RP-008, and methodology description.
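
A minimal sketch of a machine-readable export in JSON, showing severity totals and a severity-sorted finding table. The field names and the severity levels below "High" are assumptions; APTS does not define a report schema in this section.

```python
import json
from collections import Counter

# Assumed severity scale, highest first; only Critical and High are named in this domain.
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low", "Informational"]


def build_summary(findings: list[dict]) -> dict:
    """Build severity totals and a finding list sorted by severity."""
    counts = Counter(f["severity"] for f in findings)
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
    return {
        "total_findings": len(findings),
        "by_severity": {s: counts.get(s, 0) for s in SEVERITY_ORDER},
        "findings": ordered,
    }


def to_json(summary: dict) -> str:
    """Serialize the summary for downstream tooling."""
    return json.dumps(summary, indent=2)
```

An unknown severity label raises `ValueError` during sorting, which surfaces bad data instead of silently misordering the table.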

Verification

Review sample reports for presence of executive summary section.
Verify severity breakdown is present and accurate.
Confirm the executive summary includes overall risk posture, business impact for critical findings, severity totals, and scope coverage; verify every technical acronym in the executive summary is either expanded on first use or defined in an attached glossary.
Verify scope coverage percentage is disclosed.

APTS-RP-012: Remediation Guidance and Prioritization

Classification: MUST | Tier 2

Requirement

For each finding, the report MUST include remediation guidance with prioritization. Remediation entries MUST include: recommended fix, estimated effort category (quick win, moderate, significant), and priority ranking based on risk and exploitability. Where the platform cannot provide specific remediation steps, it MUST state this explicitly and provide references to relevant remediation resources.

Verification

Review findings for presence of remediation guidance.
Verify effort estimates are categorized consistently.
Compare each finding's priority ranking against the platform's documented prioritization matrix; fail any finding whose priority differs from the matrix without a recorded exception rationale.
Verify explicit disclosure when remediation guidance is unavailable.

See also: APTS-RP-011 (executive summary and risk overview that incorporates prioritized remediation), APTS-RP-001 (evidence-based validation that supports remediation confidence).

APTS-RP-013: Engagement SLA Compliance Reporting

Classification: MUST | Tier 2

Requirement

The report MUST disclose engagement timeline compliance including: agreed start and end dates, actual start and end dates, any interruptions or pauses with duration and cause, and percentage of planned scope completed. If the platform failed to complete the agreed scope, the report MUST explain the reason and disclose untested areas.
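
The timeline and scope figures can be computed as in the sketch below. It assumes scope completion is measured by counting in-scope assets, which is only one possible definition; the function and field names are hypothetical.

```python
from datetime import date


def sla_report(
    agreed_start: date,
    agreed_end: date,
    actual_start: date,
    actual_end: date,
    planned_scope: set[str],
    tested_scope: set[str],
) -> dict:
    """Compare agreed and actual timelines and compute scope completion."""
    completed = planned_scope & tested_scope
    pct = round(100 * len(completed) / len(planned_scope), 1) if planned_scope else 100.0
    return {
        "start_delay_days": (actual_start - agreed_start).days,
        "end_delay_days": (actual_end - agreed_end).days,
        "scope_completed_pct": pct,
        "untested": sorted(planned_scope - tested_scope),
    }
```

The `untested` list gives the report the explicit enumeration of untested areas that accompanies any completion figure below 100%.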

Verification

Compare agreed timeline to actual timeline in report.
Verify interruptions are documented with causes.
Confirm scope completion percentage is disclosed.
Verify untested areas are explicitly listed when scope is incomplete.

See also: APTS-SE-004 (temporal boundary enforcement that defines the engagement timeline), APTS-RP-008 (coverage disclosure that pairs with SLA completion reporting).

APTS-RP-014: Trend Analysis for Recurring Engagements

Classification: SHOULD | Tier 2

Requirement

For recurring engagements against the same target environment, the report SHOULD include trend analysis comparing current findings to previous engagements. Trend analysis SHOULD cover: new findings not present in prior engagements, resolved findings confirmed remediated, persistent findings still unresolved, and overall risk posture trend (improving, stable, degrading).
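
The three finding categories reduce to set operations over stable finding identifiers, as this sketch shows. It assumes findings carry identifiers that persist across engagements, and note that absence from the current run only makes a finding a candidate for "resolved": confirmation of remediation still requires a retest within scope.

```python
def classify_trend(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Categorize finding IDs across two engagements."""
    return {
        "new": sorted(current - previous),
        # Candidates only: absence alone does not confirm remediation.
        "resolved_candidates": sorted(previous - current),
        "persistent": sorted(previous & current),
    }
```

For example, `classify_trend({"F-1", "F-2"}, {"F-2", "F-3"})` reports `F-3` as new, `F-1` as a resolved candidate, and `F-2` as persistent.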

Verification

Review recurring engagement reports for trend data.
Verify new, resolved, and persistent categorization.
Verify the trend analysis section includes prior-period and current-period severity counts or risk scores; fail if the stated trend label (improving, stable, degrading) is not supported by the included figures.
Verify previous engagement data is correctly referenced.

See also: APTS-RP-011 (executive summary format that carries trend analysis), APTS-AR-001 (audit trail referenced to reconstruct prior engagement findings).

APTS-RP-015: Downstream Finding Pipeline Integrity

Classification: SHOULD | Tier 2

Requirement

When the platform integrates with downstream systems for finding delivery (ticketing systems, vulnerability management platforms, messaging services, or customer APIs), the platform SHOULD enforce controls to maintain finding integrity throughout the delivery pipeline:

Data fidelity: Finding synchronization SHOULD preserve evidence links, severity classifications, and remediation guidance without silent data loss or field truncation
Tenant isolation: Ticket creation or finding export SHOULD enforce authorization and tenant isolation to prevent cross-customer data leakage into incorrect ticketing queues or channels
Deduplication: Deduplication policies SHOULD be documented to prevent duplicate ticket storms while preserving distinct finding instances
Sensitive data handling: Sensitive data (credentials, PII, exploitation evidence) SHOULD be redacted or handled per the data classification framework before export to third-party systems
Delivery assurance: Finding delivery failures SHOULD be detected, logged, and retried or escalated to prevent silent finding loss
Field mapping validation: Mappings between platform finding fields and downstream system fields SHOULD be documented and validated to prevent misclassification or data loss during translation
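
The data fidelity and field mapping controls can be illustrated with a translation step that fails loudly instead of dropping or truncating data. The field names, length limits, and function name are hypothetical and do not correspond to any particular ticketing system.

```python
def map_finding(
    finding: dict[str, str],
    field_map: dict[str, str],
    max_lengths: dict[str, int],
) -> dict[str, str]:
    """Translate platform fields to downstream fields without silent loss.

    `field_map` maps platform field names to downstream field names;
    `max_lengths` gives assumed downstream length limits per field.
    """
    missing = [src for src in field_map if src not in finding]
    if missing:
        raise ValueError(f"finding lacks mapped fields: {missing}")
    ticket = {}
    for src, dst in field_map.items():
        value = finding[src]
        limit = max_lengths.get(dst)
        if limit is not None and len(value) > limit:
            # Refuse to truncate; the caller should log and escalate.
            raise ValueError(f"{dst!r} would be truncated ({len(value)} > {limit})")
        ticket[dst] = value
    return ticket
```

Raising on a would-be truncation turns a silent data loss into a delivery failure that can be detected, logged, and retried or escalated.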

Verification

Integration configuration documents all downstream systems, field mappings, and authorization controls
Test: push a finding to a downstream system; verify evidence links, severity, and remediation guidance arrive intact
Tenant isolation is enforced; verify findings from Tenant A cannot route to Tenant B's ticketing queue
Deduplication policy is documented; verify duplicate findings are handled per policy
Sensitive data redaction is applied before export to third-party systems
Finding delivery failure is detected and logged; verify retry or escalation occurs

See also: APTS-TP-012 (client data classification framework and provider data handling)

See also: APTS-RP-A01: Automated Finding Authenticity Verification. An advisory practice for screening agent-generated findings for fabricated evidence, hallucinated vulnerabilities, and severity misclassification before human review. Candidate for tier-gated inclusion in v0.2.0.