DREAMi-Validator (V2): A Label-Neutral, Reproducible Framework for Comparative Quantum-Control Studies

Author: Jordon Morgan-Griffiths

Affiliation: Founder, Independent Researcher, THE UISH (Independent)



Sim Link: https://dakariuish.itch.io/dreamiv2

Keywords: validation, label-neutral A/B testing, paired design, pre-registration, reproducibility, audit trail, manifest + hashes, convergence gate, aliasing guard, effect sizes, bootstrap CIs, non-inferiority/equivalence (TOST), quantum control benchmarking, DREAMi-QME interop

Abstract

DREAMi-Validator (V2) is a label-neutral validation framework for comparative quantum-control studies. It converts paired physics trajectories—typically exported from a seeded, audit-ready engine like DREAMi-QME or from hardware logs—into defensible statistical judgments under a pre-registered plan. V2 enforces apples-to-apples comparisons via a physics-equivalence contract (Hamiltonian, noise, units, cadence, target, and tolerances must match), refuses numerically unclean inputs (failed step-halving convergence, aliasing, positivity breaches), and conducts seed-paired analyses that yield effect sizes and confidence intervals rather than hand-picked wins. Arms remain anonymized (L/R) until after statistics are locked. Outputs include a single PASS/FAIL verdict for the primary metric, full secondary tables, assumption checks, and a replayable bundle with hashes. The result is simple by design: if it can’t be reproduced, it doesn’t count; if it isn’t label-neutral, it isn’t evidence.

  1. Introduction

1.1 Why label-neutral validation matters
Comparing control policies is easy to do badly. Most “wins” evaporate under three pressures: (i) apples-to-oranges physics (different noise, horizons, or sampling), (ii) numerically unclean inputs (no convergence evidence, hidden aliasing), and (iii) analysis bias (peeking at outcomes, moving thresholds, or branding one arm “baseline” and then treating it differently). Label-neutral validation attacks all three:

• Physics parity first: if the underlying dynamics differ in any declared dimension—Hamiltonian, dissipators and rates, frame/units, step sizes, output cadence, or target—the comparison is invalid.
• Numerical hygiene: if the engine can’t pass step-halving convergence or preserve trace/Hermiticity/positivity within stated tolerances, the data are unfit for inference.
• Blind statistics: arms are anonymized as L/R; seeds are paired; metrics and thresholds are pre-registered; inferential choices are fixed before looking at outcomes.

The point is not to produce bigger p-values or slicker plots. It is to make cheating expensive and good evidence cheap to verify.

1.2 Scope and non-goals (what Validator does / does not do)
What V2 does:
• Ingests two artifact bundles per seed (CSV + manifest + hashes), one for each arm.
• Verifies physics equivalence and integrity (hashes, unit/frame consistency, cadence guards).
• Checks engine gates (step-halving deltas; invariant counters) and refuses unclean inputs.
• Computes pre-registered metrics seed-by-seed; forms paired deltas; runs an appropriate paired test (t-test or Wilcoxon) and bootstrap confidence intervals when assumptions fail.
• Reports a single primary PASS/FAIL, secondary summaries, effect sizes, and assumption diagnostics, all under blinding.
• Emits a replayable report (tables + JSON) with enough context to re-run and verify.

What V2 does not do:
• It does not generate physics—no simulation, no pulse optimization, no hidden defaults. That’s the engine’s job (e.g., DREAMi-QME or hardware acquisition).
• It does not “fix” unclean inputs—no retroactive resampling, smoothing, or threshold shopping.
• It does not declare grand winners across mismatched scenarios—only within a pre-registered, physics-equivalent, paired design.
• It does not replace scale-law analysis (ARLIT) or hardware validation; it complements them.

1.3 Contributions and summary of results
Contributions:
• A hard physics-equivalence contract. Validator refuses to compare if H₀, {H_k}, {L_j, γ_j(t)}, T, Δt_int/Δt_out, target ρ★, or tolerances differ between arms. This eliminates apples-to-oranges comparisons at the gate.
• Deterministic, seed-paired analysis. Each seed yields (B−A) deltas on the chosen metric(s), neutralizing between-seed variance and making small effects detectable without overfitting.
• Blinding and immutability. Arms are anonymized (L/R) and the analysis plan (primary metric, thresholds τ, tests, multiple-testing control if any) is frozen before ingestion; unblinding occurs only after verdicts are computed.
• Numerical hygiene requirements. Step-halving convergence (e.g., |ΔḞ| and |ΔF_final| ≤ 1e−4) and invariant counters (trace renorms, positivity backoffs) must pass; otherwise V2 returns “No Verdict—engine unclean.”
• Conservative, interpretable outputs. Primary PASS/FAIL with effect size and 95% CI; secondary metrics reported without gatekeeping; assumption diagnostics; sensitivity checks (e.g., seed leave-one-out, cadence halving).
• Replayable audit artifacts. SHA-256 (and optional HMAC) across all inputs/outputs; a minimal “how to rerun” note; manifest diffs when applicable.

Summary of results (what users get):
• A binary verdict on the pre-registered primary metric (PASS/FAIL/No Verdict).
• An effect size (mean Δ) with confidence interval and the appropriate paired test p-value.
• Seed-wise tables (per-seed metrics and deltas), hit-rate analyses for threshold metrics (T_hit), and clear handling of NaNs (never imputed).
• A completeness checklist: physics-equivalence matched, gates passed, hashes verified, assumptions checked.
• A compact, archivable bundle that any hostile reviewer can replay.

1.4 Relation to DREAMi-QME (engine) and ARLIT (scale auditor)
DREAMi-QME, Validator (V2), and ARLIT divide the problem cleanly:

• DREAMi-QME (physics engine): generates seeded, audit-ready trajectories under a declared Lindblad model (or hardware logs in the hardware case). It owns trace/Hermiticity/positivity, frames/units, and step-halving convergence. QME produces evidence; it does not judge.
• DREAMi-Validator (V2): consumes two physics-equivalent streams (Arm-A, Arm-B), enforces blinding and gates, computes paired deltas on pre-registered metrics, and issues a verdict with effect sizes and CIs. V2 judges fairly; it does not generate.
• ARLIT (scale auditor): tests whether the “advantage signal” persists under rescaling and renormalization (e.g., windowed QFI structure, multi-resolution stability). ARLIT answers “is this structure real, or just a sweet spot?” It audits scope; it does not adjudicate per-seed comparisons.

This separation of concerns is the core safety feature. If a policy “wins” only when the simulator is tweaked, QME’s gates will fail. If it “wins” only under selective metrics, V2’s pre-registration and paired analysis will expose it. If it “wins” only at a particular scale, ARLIT will flatten it. Together, the trio make hype hard and honest claims cheap to verify.

  2. Background and Positioning

2.1 Typical benchmarking pitfalls (p-hacking, apples-to-oranges, seed drift)

Most “wins” that don’t replicate trace back to a few recurring families of mistakes:

• Apples-to-oranges physics. Two arms are compared under different Hamiltonians, noise rates, horizons, step sizes, output cadences, targets, or frames/units. Even small mismatches (e.g., Δt_out halved for one arm) can inflate mean fidelity or move T_hit by multiple sampling intervals. Silent post-processing (smoothing, resampling, filtering) on only one arm is the same sin in disguise. If the physics isn’t identical, the comparison is invalid.

• Numerically unclean inputs. Step size never validated; positivity “fixed” by silent clipping; trace drift “handled” by unlogged renormalizations; aliasing because Δt_out is too coarse for control bandwidth; QFI calculated with unstable spectra and no regularization report. These flaws manufacture effects. Without step-halving evidence and invariant counters, you’re testing the integrator, not the policy.

• Analysis p-hacking. Metrics, thresholds, and windows chosen after seeing outcomes; multiple metrics tried with only the best reported; thresholds nudged (0.95 → 0.93) to claim a T_hit “win”; NaNs (missed thresholds) quietly dropped when they fall on the losing side; per-seed pairings broken (averaging arms separately) to blur variance. These choices bias results toward a foregone conclusion.

• Seed drift and leakage. Different seed sets across arms; different randomization of initial states; policy “peeking” at information not available to the other arm (e.g., future control values or noise schedules); reruns that change library versions but keep the old hash claims. If seeds aren’t paired and locked, you’re measuring noise, not improvement.

• Human-factor leaks. Unblinded plots where reviewers or authors treat the “baseline” arm more harshly; unreported hyperparameter sweeps that are, in effect, metric shopping; cherry-picked windows cut around favorable transients.

DREAMi-Validator addresses these by: (i) enforcing a physics-equivalence contract; (ii) refusing numerically unclean bundles; (iii) freezing the analysis plan via pre-registration; (iv) pairing seeds and anonymizing arms; and (v) emitting a rerun bundle with hashes.

2.2 Prior art in control validation and why it falls short

Prior practice clusters into four categories, each with recurring failure modes:

• Monolithic simulators with built-in “validation” tabs. These tools often mix physics generation, policy optimization, and statistical summaries in a single process. Defaults change across versions; seeds aren’t recorded; step size and aliasing guards are absent; and “compare” buttons accept mismatched physics. Result: attractive dashboards, irreproducible claims. The validator must live outside the simulator and refuse apples-to-oranges inputs by construction.

• Ad hoc scripts. Many groups roll their own notebooks for A/B. Few include blinded arm labels, seed pairing, convergence gates, or pre-registered thresholds. Worse, code is tightly coupled to one policy’s output format, making independent reruns brittle. A validator needs a fixed IO contract (CSV + manifest + hashes) and hard refusal logic.

• Benchmark suites. Community benchmarks help, but most define target problems, not statistical protocols. Suites rarely mandate hash-checked manifests, seed pairing, invariant counters, or convergence evidence. They often tolerate differing cadences or horizons across submissions. Good for breadth; weak on apples-to-apples rigor.

• Hardware-first comparisons without artifact parity. Device studies sometimes compare control stacks with different calibrations, drifts, or readout chains. Without synchronized acquisition and a manifest for calibration context, attribution is impossible. A validator can still operate on hardware logs—but only if both arms supply parity artifacts (timing, thresholds, readout models, and seeds for any randomized control choices).

In short: prior art typically under-specifies physics parity, lacks immutable analysis plans, and conflates visualization with inference. DREAMi-Validator is built to be stubborn: it would rather return “No Verdict—inputs unclean/mismatched” than produce a pretty but meaningless p-value.

2.3 Validator’s role in a three-part pipeline (QME → Validator → ARLIT)

The pipeline works because each part owns one job and refuses to do the others:

• DREAMi-QME (engine) — Physics generation.
What it does: produces seeded, audit-ready trajectories from a declared Lindblad model (or ingests hardware logs in a consistent CSV). It enforces physicality (trace, Hermiticity, positivity), step-halving convergence, unit/frame clarity, alias guards, and full manifest + hash export.
What it does not do: declare winners, adjust thresholds, or smooth data for optics.

• DREAMi-Validator (V2) — Statistical adjudication.
What it does: ingests two physics-equivalent streams (Arm-A, Arm-B), verifies manifests match, checks engine gates and hashes, blinds arm identities, pairs seeds, and computes deltas on pre-registered metrics. It runs the appropriate paired test, reports effect sizes and CIs, and issues a single PASS/FAIL on the primary metric (secondaries reported without gatekeeping).
What it does not do: simulate physics, fix unclean inputs, or bless scope-crept interpretations.

• ARLIT — Scale-structure audit.
What it does: asks whether the “advantage signal” persists under rescaling and renormalization (e.g., windowed QFI invariants, multi-resolution stability out of sample). It catches “sweet-spot” overfitting that survives A/B at one scale but collapses when the problem is stretched, coarsened, or viewed through learned coarse-grainers.
What it does not do: pairwise adjudication on a fixed task; it judges generality, not immediate superiority.

Information flow:

  1. QME (or hardware) → produces artifacts per arm per seed: timeseries.csv, manifest.json, plots, hashes.

  2. Validator → verifies parity and cleanliness, computes per-seed deltas, aggregates under the pre-registered plan, and outputs verdict + tables + a replayable report.

  3. ARLIT → consumes Validator summaries (and, if needed, raw trajectories) across scales/windows to test whether the observed advantage reflects genuine structure.

Why this matters. If a policy’s “win” depends on simulator quirks, QME’s gates will fail. If it depends on post hoc metric selection, Validator’s pre-registration and blinding will block it. If it exists only at one cherry-picked scale, ARLIT will flatten it. This layered design turns hype into work and makes honest evidence cheap to verify.


  3. Design Principles

3.1 Separation of concerns (physics vs. statistics)
DREAMi-Validator’s first principle is a hard firewall between physics generation and statistical adjudication. Physics generation (dynamics, controls, noise, units, frames, integrator, tolerances) is owned by the engine (e.g., DREAMi-QME or hardware acquisition). Statistical adjudication (tests, confidence intervals, effect sizes, decision rules) is owned by the Validator. Mixing the two invites bias: smoothing one arm’s trajectory, changing Δt_out, “fixing” a positivity blip, or nudging thresholds after seeing outcomes. The Validator therefore:
• accepts only frozen artifact bundles (CSV + manifest + hashes) from the engine;
• refuses to modify trajectories (no resampling, no filtering, no “fixes”);
• conducts inference under a pre-registered analysis plan independent of curve aesthetics.
Outcome: if a result is numerically dirty or physically mismatched, the Validator issues No Verdict and points at the gate that failed. It will not launder bad inputs into p-values.

3.2 Paired design by seed and manifest identity
Comparisons are done seed-by-seed under identical physics. For each seed s, the two arms (A and B) must share the same manifest identity: H₀, {H_k}, {L_j, γ_j(t)}, target ρ★, T, Δt_int/Δt_out, tolerances, units, frame, and any control bounds/clip policy. The Validator computes per-seed deltas Δ_s = metric_B,s − metric_A,s and only then aggregates across seeds. This paired design:
• cancels between-seed variance and drifts (initial states, jitter);
• makes small effects detectable without inflating false positives;
• prevents “seed shopping” by requiring the same seed list across arms;
• ensures the estimator targets a clear estimand: the average within-seed improvement of B over A under identical physics.
If any seed lacks a valid counterpart (missing file, mismatched manifest, failed engine gates), that seed is excluded pairwise; the Validator records the exclusion and the reason.

3.3 Blinding and identity handling (Arm-L / Arm-R)
Human bias is real. The Validator anonymizes arms during ingestion and analysis:
• Arm labels are replaced with Arm-L and Arm-R; ordering is randomized but deterministic from a session seed;
• all tables and plots produced for internal review use L/R labels;
• the unblinding key (mapping L/R → A/B names) is sealed until after the decision rules are executed and frozen.
This prevents preferential treatment of a known “baseline” or “proposed” arm, conscious or not. If visual sanity checks (e.g., race panels) are enabled, they remain anonymized. Unblinding occurs only after the primary verdict, secondaries, and diagnostics are written to the archive.

3.4 Pre-registration and immutability of the analysis plan
The Validator requires a pre-registered analysis plan before it will ingest artifacts. The plan fixes:
• the primary metric (exact definition) and up to two secondary metrics;
• thresholds τ for time-to-hit metrics and any window definitions (e.g., for integrated QFI);
• the paired test selection rule (normality gate → paired t-test, else Wilcoxon) and the CI method (parametric vs bootstrap, B=10k);
• multiple-testing control if secondaries are used (none / Bonferroni / BH-FDR with q);
• exclusion rules (when to drop a seed; what constitutes “engine-unclean”);
• sample size or seed count and stopping rules (if any).
After registration, the plan is immutable. Changes require a version-bumped, time-stamped amendment (rare, and flagged in the final report). The Validator embeds the plan and its hash in the session archive so reviewers can verify that analysis choices predated the outcomes.
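As an illustration, a minimal sketch of such a frozen plan, serialized to canonical JSON and hashed before ingestion, is shown below. The field names and values are illustrative assumptions, not a schema fixed by this paper.

    # Sketch: a frozen analysis plan serialized to canonical JSON and hashed.
    # Field names are illustrative; the registered plan defines the actual schema.
    import hashlib, json

    analysis_plan = {
        "primary_metric": {"name": "mean_F", "interval": [0.0, 120.0], "cadence": "dt_out"},
        "secondary_metrics": [
            {"name": "F_T"},
            {"name": "T_hit", "tau": 0.99, "interpolation": "linear"},
        ],
        "test_rule": {"normality_gate": "shapiro_wilk", "alpha": 0.05,
                      "parametric": "paired_t", "fallback": "wilcoxon_signed_rank"},
        "ci": {"method": "BCa_bootstrap", "B": 10000, "level": 0.95},
        "multiplicity": {"method": "benjamini_hochberg", "q": 0.10},
        "exclusion_rules": ["E_CONVERGENCE", "E_INVARIANT_FAIL", "E_ALIASING"],
        "n_min": 40,
    }

    plan_bytes = json.dumps(analysis_plan, sort_keys=True, separators=(",", ":")).encode()
    plan_hash = hashlib.sha256(plan_bytes).hexdigest()  # embedded in the session archive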

3.5 Determinism, replayability, and artifact discipline
Reproducibility is non-negotiable. The Validator enforces:
• Deterministic ingestion: inputs must include SHA-256 (or HMAC) for CSV and manifests; hashes are checked before analysis.
• Deterministic computation: bootstrap, randomization, and any sampling inside the Validator use a fixed session seed recorded in the report. Results are bit-stable on the same platform and numerically stable across platforms.
• Replayable outputs: the session emits (i) a machine-readable JSON with the verdict, effect sizes, CIs, p-values, assumption checks, and every seed-level delta; (ii) tables suitable for the supplement; and (iii) a README with a one-line rerun command and expected hashes.
• Manifest diffs: if the Validator detects even a single field mismatch across arms for any seed, it refuses with a precise diff (field, A-value, B-value).
• No silent fallbacks: missing files, corrupted CSVs, or hash mismatches cause a hard stop with a clear error code; there is no “best effort” mode.
Net effect: anyone can take the archive, run “replay,” and obtain the same decision and numbers. If they cannot, the result does not count.

  4. Inputs and Assumptions

4.1 Required artifacts per arm (CSV, manifest.json, hashes)
Each arm (A and B) must provide, for every seed in the study, a complete, hash-verifiable bundle. The Validator refuses to proceed if any element is missing or mismatched.

• timeseries.csv — Required columns:
– t (time, declared units)
– F (fidelity to the declared target ρ★)
– Optional: QFI_θ columns (one per θ), purity, leakage, energy, and control snapshots u_k(t) if export_controls was enabled
The header must include units and any interpolation rule used to compute event times. No smoothed or resampled data beyond what the engine exported.

• manifest.json — The forensic spine. Must include: engine version and schema version; system dimension and basis; H₀ and {H_k} definitions; noise {L_j, γ_j(t)}; frame and units; integrator; Δt_int, Δt_out, T; tolerances (ε_tr, ε_H, ε_+), backoff policy; seed and substream policy; outputs enabled (F, QFI, etc.); thresholds/windows; provenance; and SHA-256 for the CSV and plots.

• sha256.txt — SHA-256 digests for the CSV, plots, and manifest.json (or a top-level JSON block of hashes inside the manifest). Optional HMAC.txt if keyed verification is required.

• plots/*.png — At minimum, a fidelity panel with threshold lines. Filenames should embed the CSV hash to bind visuals to data. (Plots are not used for inference, only for audit.)

• summary.json (optional, recommended) — Mean F, final F, T_hit per τ, integrated QFI summaries, step-halving deltas, renormalization/backoff counters, and failure flags.

All files must be local/offline and self-contained; no HTTP links, no external dependencies.

4.2 Physics-equivalence contract across arms (must-match fields)
Validator compares apples to apples only. For each seed, the following must be bitwise equal (or numerically equal where appropriate) between Arm-A and Arm-B. A single mismatch yields a hard refusal with a manifest diff.

• System description: Hilbert-space dimension d; basis; target ρ★; declared frame (lab vs rotating).
• Hamiltonian terms: H₀ and the full list {H_k}, including operator definitions and any numeric coefficients.
• Noise model: full list {L_j} and rates γ_j (or schedules γ_j(t) with the same grid); no negative rates; identical schedule definitions.
• Units and frame parameters: time, frequency, amplitude units; any frame frequencies/phases.
• Numerics: integrator type; Δt_int; Δt_out; horizon T; tolerances ε_tr, ε_H, ε_+; backoff/safety-cap settings; variable-step flag.
• Control constraints: amplitude caps, slew/jerk limits, clipping policy; if clipping is allowed, the rules must match (events may differ, rules may not).
• Output options: which observables were requested (e.g., QFI on parameter θ), which thresholds τ, and window definitions.
• Seed list: identical set of seeds assigned to both arms (pairwise).

Only the control law (how u_k(t) is produced) may differ. Everything else must match.

4.3 Units, frames, cadences (Δt_int, Δt_out) and alias guards
The Validator treats unit/frame clarity and sampling discipline as first-class assumptions.

• Units — The manifest must define a single unit table (time, frequency, amplitude). CSV times must be in the same time unit. Rates (γ_j) must use the declared frequency/time units consistently.

• Frames — Lab frame or named rotating frame must be declared. All operators (H₀, {H_k}, {L_j}) must be expressed in that frame. The Validator checks that both arms declare the same frame and parameters; any hidden frame shift is grounds for refusal.

• Cadences — Δt_int is the integrator step; Δt_out is the report cadence.
– Alias guard: Δt_out must be ≤ one-tenth of the fastest control timescale or an explicit waiver must be present. If violated, the Validator refuses with E_ALIASING.
– Alignment: For piecewise-constant controls, Δt_out must include segment boundaries; for policy-driven controls, emitted discontinuities must land on the output grid or be explicitly supersampled by the engine.

• Interpolation policy — If time-to-threshold events are reported, the CSV header must state the interpolation rule (linear) and the Validator will only use that rule; no extrapolation is permitted.

4.4 Primary metric and up to two secondaries (pre-registered)
Before ingestion, the analysis plan must fix one primary metric and at most two secondaries. The Validator enforces the plan exactly and will not analyze undeclared metrics.

Primary (choose one, exact definition required):
• Mean fidelity Ḟ over [0, T] at cadence Δt_out.
• Final-time fidelity F(T) at t = T.
• Time-to-threshold T_hit(τ*), for a pre-registered τ*.

Secondaries (examples, pick ≤2):
• The other two among Ḟ, F(T), T_hit(τ).
• Integrated QFI over pre-registered windows.
• Hold-time above τ (duration with F(t) ≥ τ).
• Energy/effort proxies (∫ |u_k| dt or ∫ u_k² dt), reported as covariates—not gates.

Each metric must include units, any normalization, and any windowing rule in the plan. No metric shopping after the fact.

4.5 Thresholds (τ) and window specifications (if any)
Thresholds and windows are common p-hacking levers; the Validator freezes them up front.

• Thresholds — Declare the exact set {τ} (e.g., {0.95, 0.99}). T_hit is the earliest time t_i with F(t_i) ≥ τ; if crossed between samples, the Validator uses linear interpolation between bracketing points; if never crossed, reports NaN and counts it in hit-rate analysis. No post hoc addition or removal of τ values.

• Windows — For integrated QFI or windowed summaries, declare window sizes, overlaps, and any multi-resolution structure (e.g., dyadic scales). The Validator will compute exactly those, and only those, windows. No sliding of windows after seeing outcomes.

• Event variants — If a “hold” criterion is used (e.g., must stay ≥ τ for ΔT_hold), specify ΔT_hold and the evaluation rule in advance.

4.6 Acceptance gates inherited from QME (convergence, invariants)
Validator only adjudicates engine-clean inputs. For each seed and arm, the following gates must pass, with evidence recorded in the manifest or summary:

• Step-halving convergence — Two engine runs at Δt_int and Δt_int/2 with identical physics. Acceptance: |ΔḞ| ≤ 1e−4 and |ΔF(T)| ≤ 1e−4 (or tighter if pre-registered). Hashes of both runs must be present. If missing or failed: No Verdict (E_CONVERGENCE).

• Invariant preservation — Trace drift within tolerance (e.g., max |Tr ρ − 1| ≤ 1e−8 with few renormalizations), Hermiticity maintained (accumulated ‖ρ − ρ†‖_F ≤ 1e−9), no hard positivity breaches (λ_min < −10 ε_+). Excessive renormalizations or repeated backoffs trigger refusal or a warning per the pre-registered limits.

• Alias guard — Δt_out relative to control bandwidth must satisfy the declared ratio; otherwise E_ALIASING.

• Integrity checks — SHA-256 (or HMAC) must match; any mismatch is a hard stop (E_HASH_MISMATCH).

• Completeness — Required files present; CSV columns complete and consistent with manifest; interpolation rule stated if T_hit is used. Missing or corrupted files trigger E_INCOMPLETE.

If any gate fails for any seed in either arm, that seed is excluded pairwise with the reason logged. If exclusions reduce the analyzable seed count below the pre-registered minimum, the Validator returns No Verdict for the study.

Assumption the Validator relies on, explicitly: the engine (or hardware logger) is responsible for physics correctness and artifact generation; the Validator will not “fix” trajectories, resample, or infer missing context.


  5. Equivalence and Sanity Gates (Hard Refusals)

5.1 Manifest equality checks (H₀, {H_k}, {L_j, γ_j(t)}, T, steps, target)
The Validator refuses to compare unless Arm-A and Arm-B are physics-identical for each seed. “Close enough” is not accepted. For every seed, the following must match exactly (bitwise or numerically within stated tolerances):

• System & target: Hilbert-space dimension d; basis; declared frame (lab/rotating with parameters); target ρ★.
• Hamiltonian: H₀ and the full list {H_k} with operator definitions and numeric coefficients.
• Noise: list {L_j}; rate values γ_j or schedules γ_j(t) with the same grid and interpolation rule; no negative rates.
• Timing & steps: horizon T; integrator step Δt_int; output cadence Δt_out; integrator type (rk4/strang/expm); tolerance triplet (ε_tr, ε_H, ε_+) and backoff policy; variable-step flag and safety cap if enabled.
• Units: time, frequency, amplitude units; any frame frequencies/phases.
• Control constraints: amplitude caps, slew/jerk limits, clipping policy (rules must match; the occurrence of clips may differ but is logged).
• Outputs: enabled observables (F, QFI_θ, purity/leakage), thresholds {τ}, window definitions.
• Seed set: identical seed list across arms.

Mismatch → hard refusal with a manifest diff (field, Arm-A value, Arm-B value). Error code: E_MANIFEST_MISMATCH.
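A minimal sketch of such a field-by-field diff is shown below, assuming both manifests have been parsed from manifest.json into nested dictionaries; the function names are illustrative, and in practice the control-law block (the only field allowed to differ) would be excluded before diffing.

    # Sketch: recursive manifest comparison producing (key path, Arm-A value, Arm-B value) diffs.
    # Assumes manifest.json has been parsed into nested dicts; key names are illustrative.
    def manifest_diff(man_a, man_b, prefix=""):
        """Collect mismatched fields between two manifests."""
        diffs = []
        for key in sorted(set(man_a) | set(man_b)):
            path = prefix + str(key)
            va, vb = man_a.get(key), man_b.get(key)
            if isinstance(va, dict) and isinstance(vb, dict):
                diffs += manifest_diff(va, vb, prefix=path + ".")
            elif va != vb:
                diffs.append((path, va, vb))
        return diffs

    def check_equivalence(man_a, man_b):
        # The control-law definition would be stripped from both dicts before this call.
        diffs = manifest_diff(man_a, man_b)
        if diffs:
            for path, va, vb in diffs:
                print(f"E_MANIFEST_MISMATCH {path}: Arm-A={va!r} Arm-B={vb!r}")
            raise ValueError("E_MANIFEST_MISMATCH")  # hard refusal: no comparison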

5.2 Convergence gate (Δt vs Δt/2 deltas; required to proceed)
You don’t get statistics on numerically unstable inputs. For each arm (once per configuration or per seed—per the engine’s policy), the engine must supply step-halving evidence:

• Two runs: Δt_int and Δt_int/2, identical physics otherwise.
• Acceptance: |ΔḞ| ≤ 1e−4 and |ΔF(T)| ≤ 1e−4 (tighter if pre-registered).
• Evidence: both run hashes present; deltas reported in summary.json or manifest.

Failure or missing evidence → No Verdict. Error code: E_CONVERGENCE. The Validator will not average junk into “significance.”
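A minimal sketch of this acceptance check is shown below, assuming the engine’s summary (e.g., summary.json) reports the mean and final fidelity at both step sizes; the argument names are illustrative.

    # Sketch: step-halving acceptance on engine-reported summaries for one arm/configuration.
    def convergence_gate(mean_F_dt, mean_F_dt_half, final_F_dt, final_F_dt_half, tol=1e-4):
        ok = (abs(mean_F_dt - mean_F_dt_half) <= tol and
              abs(final_F_dt - final_F_dt_half) <= tol)
        if not ok:
            raise ValueError("E_CONVERGENCE")  # No Verdict: engine unclean
        return True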

5.3 Invariant counters (trace renorms, positivity backoffs) and limits
If the engine had to constantly “save” your run, your numbers are suspect. The Validator reads counters from the manifest and enforces caps:

• Trace control: max |Tr ρ − 1| ≤ 1e−8; renormalizations ≤ 3 per 10⁴ steps (per seed). Over-cap → refuse or require smaller Δt_int (per pre-registered policy).
• Hermiticity: accumulated ‖ρ − ρ†‖_F ≤ 1e−9 (run total). Breach → refuse.
• Positivity: no hard breaches (λ_min < −10 ε_+). Soft dips trigger backoff; backoffs are allowed but must be rare (e.g., ≤ 1 per 10³ steps). Exceed cap → refuse.
• Negative rates: any γ_j(t) < 0 → immediate refusal.

Violations → E_INVARIANT_FAIL with the offending counter and timestamps. You fix the engine setup and re-run; Validator won’t paper it over.

5.4 Aliasing and sampling guardrails (refuse if violated)
Under-sampling manufactures “wins.” The Validator enforces cadence sanity:

• Alias guard: Δt_out must be ≤ (1/10) of the fastest control timescale (or stricter, as pre-registered). If controls are piecewise-constant, all segment boundaries must lie on the output grid.
• Interpolation rule: for T_hit(τ), only linear interpolation between bracketing samples is allowed; no extrapolation beyond t_N.
• Policy-driven discontinuities: if the policy emits discontinuities off-grid, the engine must supersample; otherwise refuse.

Breach → E_ALIASING. No “but the plot looks fine.” Fix sampling or don’t claim T_hit/Ḟ gains.

5.5 Failure handling and exclusion policy (engine-unclean runs)
This is a paired design. If one seed is dirty, you drop the pair—you don’t salvage half.

• Per-seed exclusion: if either arm fails any gate (5.1–5.4) or has corrupted/missing files, that seed is excluded pairwise. The Validator records the SeedID, arm, and error code (e.g., Seed=42, Arm-B, E_HASH_MISMATCH).
• Minimum N: if exclusions drop analyzable seeds below the pre-registered minimum, the Validator returns No Verdict (E_INSUFFICIENT_N).
• Hard-stop errors: hash mismatch (E_HASH_MISMATCH), manifest mismatch (E_MANIFEST_MISMATCH), negative rates (E_NEGATIVE_RATE), convergence fail (E_CONVERGENCE), invariant fail (E_INVARIANT_FAIL), aliasing (E_ALIASING), incomplete bundle (E_INCOMPLETE).
• No silent “fixes”: the Validator does not resample, smooth, fill NaNs, or relax thresholds. It refuses with a precise reason.
• Audit trail: the final report includes an exclusion table and a manifest diff appendix so a reviewer can reproduce the refusals verbatim.

Blunt rule: if the physics isn’t the same, if the numerics aren’t clean, or if the sampling is bogus, there is no comparison. The Validator would rather say “No Verdict” than bless a pretty artifact.

  6. Blinding and Randomization

6.1 Arm anonymization and side assignment
The Validator ingests two arms (A, B) and immediately replaces their identities with anonymized labels Arm-L and Arm-R. Assignment uses a deterministic RNG seeded by the Validator session seed (recorded in the report), not by the physics seeds. Properties:
• Deterministic: given the same session seed and arm names, the L/R mapping is reproducible.
• Stable within session: the mapping is fixed for all seeds and tables in the session.
• Hidden: the L/R→A/B mapping is stored in a sealed block of the report and withheld until unblinding conditions are met (see 6.3).
Rationale: this prevents preferential treatment of a “baseline” arm during QC and analysis.
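A minimal sketch of the deterministic assignment is shown below; the arm names and seed value are illustrative.

    # Sketch: session-seeded, reproducible anonymization of arm identities.
    import random

    def blind_arms(arm_a_name, arm_b_name, session_seed):
        rng = random.Random(session_seed)      # session seed, not a physics seed
        if rng.random() < 0.5:
            mapping = {"Arm-L": arm_a_name, "Arm-R": arm_b_name}
        else:
            mapping = {"Arm-L": arm_b_name, "Arm-R": arm_a_name}
        return mapping                         # sealed; analysis sees only L/R labels

    sealed_mapping = blind_arms("baseline_policy", "candidate_policy", session_seed=20240917)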

6.2 Seed pairing and ordering
Seed pairing is strict: for each SeedID s, the Validator expects a valid bundle from both arms. Ordering rules:
• Canonical order: seeds are processed in ascending SeedID (or manifest-declared order) to ensure replayability.
• Pairwise inclusion: if Arm-L(s) or Arm-R(s) fails any gate (manifest mismatch, convergence fail, aliasing, invariant fail, hash mismatch), both entries for seed s are excluded from analysis.
• Minimum N: if pairwise exclusions drop the analyzable seed count below the pre-registered minimum, the session ends with No Verdict (E_INSUFFICIENT_N).
Rationale: paired design neutralizes between-seed variation and blocks “seed shopping.”

6.3 Unblinding policy after statistics are locked
Unblinding is a one-way, auditable step:
• Preconditions: (i) all equivalence/sanity gates passed; (ii) analysis plan hash matches the pre-registered plan; (iii) the Validator has written the frozen results (primary verdict, effect sizes, CIs, p-values, diagnostics) to the archive.
• Action: the sealed mapping (Arm-L/Arm-R → A/B) is revealed and appended as an unblinding addendum with a timestamp.
• Post-unblinding immutability: after unblinding, the Validator disallows any re-ingestion or plan changes; any new analysis must start a new session with a new session seed and a version-bumped plan.
Rationale: protects against outcome-driven relabeling and other hindsight edits.

6.4 Human-factors protections (no peeking rules)
To limit cognitive bias:
• All QC plots (race panels, fidelity traces) are rendered with L/R labels only; no policy names or colors tied to an arm.
• The UI/CLI refuses to display arm names or reveal which side is which before unblinding.
• The report includes an interaction log: time-stamped entries for ingestion, gate checks, statistic runs, and unblinding; any preview of data is logged.
• Optional “strict mode”: disables even anonymized plots until all tests and statistics are completed.
Rationale: inference should not be influenced by recognizing a “baseline” curve.

  7. Data Ingestion and Integrity

7.1 SHA-256/HMAC verification and bundle validation
Every artifact is verified before analysis:
• Hash check: for each seed/arm, compute SHA-256 of timeseries.csv, manifest.json, and plots/*.png; compare to declared hashes. Mismatch → E_HASH_MISMATCH (hard stop for that seed).
• HMAC (optional): if HMAC is provided, verify keyed digests; failure → E_HMAC_FAIL (hard stop).
• Schema and field validation: parse manifest.json against the schema (versioned). Required fields (H₀, {H_k}, {L_j, γ_j(t)}, units/frame, T, Δt_int/out, tolerances, seed policy, outputs, thresholds/windows) must be present and well-typed; missing/invalid → E_SCHEMA_INVALID.
• CSV schema: required columns (t, F) must exist; optional columns (QFI_θ, purity, leakage, energy, control snapshots) must match header descriptors and units; malformed → E_CSV_INVALID.
• Integrity cross-checks: CSV header unit table must match manifest units; interpolation rule (for T_hit) must be stated and consistent; plots should embed the CSV hash in filename (warn if absent).
• Physics equivalence: run the manifest equality checks (Section 5.1). Any mismatch → E_MANIFEST_MISMATCH with a diff appendix.
Outcome: only bundles that pass all checks proceed to gate testing and statistics.
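A minimal hash-verification sketch is shown below, assuming sha256.txt lists one “<hex digest>  <relative path>” pair per line; that layout is an assumption for illustration, not a fixed format.

    # Sketch: verify every declared digest in a per-seed bundle before any analysis.
    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk=1 << 20):
        h = hashlib.sha256()
        with Path(path).open("rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def verify_bundle(bundle_dir):
        bundle_dir = Path(bundle_dir)
        for line in (bundle_dir / "sha256.txt").read_text().splitlines():
            digest, name = line.split(maxsplit=1)
            if sha256_of(bundle_dir / name.strip()) != digest:
                raise ValueError(f"E_HASH_MISMATCH: {name.strip()}")  # hard stop for this seed/arm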

7.2 Optional replay check against the engine for spot audits
To deter “forged” CSVs that nonetheless carry the right hashes, the Validator can perform spot audits:
• Replay mode: select a subset of seeds (audited fraction pre-registered, e.g., 10%) and re-run the engine (or a reference runner) from manifest.json to regenerate timeseries.csv; compare byte-for-byte (same platform) or within numeric tolerance (cross-platform).
• Criteria: mismatch in any audited seed → flag E_REPLAY_MISMATCH and abort the session or quarantine the affected seeds per the pre-registered policy.
• Provenance: record engine version, platform, and tolerance used in the audit section of the report.
Rationale: raises the cost of tampering and verifies determinism claims in practice.

7.3 Missing/corrupt runs, retries, and audit log
Real pipelines drop files. The Validator handles this explicitly:
• Missing files: absent timeseries.csv or manifest.json → E_INCOMPLETE for that seed/arm; the pair is excluded.
• Corrupt files: unreadable or malformed → E_CSV_INVALID / E_SCHEMA_INVALID; exclude the pair.
• Retry policy: one automatic retry per seed is allowed if the failure was due to transient I/O; deterministic RNG ensures re-analysis yields identical statistics. Persistent failure → exclusion.
• Exclusion ledger: the final report contains a table: SeedID, Arm (L/R), Error code, Reason, Timestamp.
• Session audit log: append-only, time-stamped entries for every ingest, hash check, gate evaluation, statistic computation, and unblinding event; includes the Validator session seed and software hashes.
• Minimal N enforcement: if exclusions reduce analyzable seeds below the pre-registered N_min, the session returns No Verdict (E_INSUFFICIENT_N) with a summary of which gates failed most often.

Bottom line: blinding and pairing remove bias; strict hash/schema checks and optional engine replay kill tampering; pairwise exclusion preserves the estimand; and an immutable audit trail makes the whole process replayable. If inputs are clean and physics-equivalent, you get a defensible verdict. If not, you get a precise refusal and a path to fix it.

  8. Metrics and Estimands

8.1 Primary metric definition (mean F, final-time F, or T_hit(τ))
The Validator requires one pre-registered primary estimand—a scalar defined seed-by-seed and comparable across arms. Supported primaries:

• Mean fidelity Ḟ over [0, T]. Defined as the arithmetic mean of F(t_i) sampled at the declared output cadence Δt_out across the closed interval [0, T]. Because Ḟ depends on sampling, Δt_out must be identical across arms and pass alias guards. The seed-level estimand is Ḟ_s; the study estimand is the average paired difference E[Ḟ_B − Ḟ_A].

• Final-time fidelity F(T). The fidelity at the last sample t_N = T. This captures endpoint preparation or hold behavior. Seed-level estimand is F_B,s(T) − F_A,s(T).

• Time-to-threshold T_hit(τ*). The earliest time t at which F(t) ≥ τ*, with linear interpolation between the first bracketing samples and no extrapolation beyond T. If the threshold is never reached, the estimand is NaN for that seed. The study estimand is defined on the subset of seeds where both arms reach τ* (paired comparison), with hit-rate reported separately.

Exactly one of the above is the primary. Its definition (including units, cadence, and interpolation rule for T_hit) is frozen in the analysis plan.
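A minimal sketch of the three primaries is shown below, assuming t and F are NumPy arrays sampled on the declared Δt_out grid; it implements the first-crossing and linear-interpolation rules stated above.

    # Sketch: seed-level primaries from one exported trajectory (t, F on the dt_out grid).
    import numpy as np

    def mean_fidelity(F):
        return float(np.mean(F))                 # arithmetic mean at the declared cadence

    def final_fidelity(F):
        return float(F[-1])                      # F at t_N = T

    def t_hit(t, F, tau):
        above = F >= tau
        if not above.any():
            return float("nan")                  # never crossed: reported as NaN, never imputed
        i = int(np.argmax(above))                # first sample at or above tau
        if i == 0 or F[i] == tau:
            return float(t[i])
        frac = (tau - F[i - 1]) / (F[i] - F[i - 1])   # linear interpolation, no extrapolation
        return float(t[i - 1] + frac * (t[i] - t[i - 1]))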

8.2 Secondary metrics and guardrails (integrated QFI windows, etc.)
Up to two secondaries are permitted; they never influence the primary verdict. Common choices:

• The remaining fidelity metrics among Ḟ, F(T), T_hit(τ) (possibly at a second τ).
• Windowed/integrated QFI: integrate QFI_θ(t) over pre-registered windows (fixed length or dyadic scales). Window sizes, overlaps, and θ must be fixed; the Validator will compute exactly those windows, report paired deltas and CIs, and forbid window shopping.
• Hold-time above τ: total duration with F(t) ≥ τ within [0, T] using the declared sampling grid.
• Stability metrics: standard deviation of F(t) on [0, T] or over a trailing window (pre-registered).
• Leakage or purity (if present in CSV): reported descriptively or as covariates; not used for gating unless pre-registered.

Guardrails: each secondary must (i) be specified in advance, (ii) use the same cadence and interpolation conventions as the primary, and (iii) avoid derived resampling. If a secondary relies on windows/thresholds, those must be declared up front.

8.3 Event handling (NaN T_hit, multiple crossings, hold-time variants)
Events are defined deterministically:

• NaN policy for T_hit. If an arm never reaches τ by time T, T_hit = NaN. Paired delta for that seed is NaN and excluded from the paired estimate for T_hit. The Validator also reports hit-rates (fraction of seeds that reached τ) and their difference with a CI (e.g., Newcombe/Wilson) to expose asymmetries hidden by NaN exclusion.

• Multiple crossings. T_hit is defined as the first crossing. If a hold criterion is pre-registered (must remain ≥ τ for at least ΔT_hold), the event is the earliest time such that the following ΔT_hold window satisfies F ≥ τ throughout; otherwise no hold requirement is applied.

• Interpolation. Only linear interpolation between the first bracketing samples is allowed for event timing. No higher-order interpolation and no extrapolation beyond the last sample.

• Censoring. If a threshold is crossed exactly at T, it is treated as a hit. If never crossed, no imputation is performed.
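For the hit-rate difference mentioned in the NaN policy above, a minimal Wilson/Newcombe sketch is shown below. It uses the independent-proportions construction; because hit indicators here come from paired seeds, a paired variant with a correlation correction could be pre-registered instead.

    # Sketch: Wilson score interval per arm and a Newcombe-style interval for the
    # hit-rate difference (independent-proportions form; a paired correction is possible).
    from math import sqrt

    def wilson(hits, n, z=1.96):
        p = hits / n
        center = (p + z * z / (2 * n)) / (1 + z * z / n)
        half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
        return center - half, center + half

    def hit_rate_difference_ci(hits_b, hits_a, n, z=1.96):
        pb, pa = hits_b / n, hits_a / n
        lb, ub = wilson(hits_b, n, z)
        la, ua = wilson(hits_a, n, z)
        d = pb - pa
        lower = d - sqrt((pb - lb) ** 2 + (ua - pa) ** 2)
        upper = d + sqrt((ub - pb) ** 2 + (pa - la) ** 2)
        return lower, upper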

8.4 Energy/effort covariates (optional) and how they’re reported
Effort covariates characterise control cost without altering the primary decision:

• Examples: ∫₀^T |u_k(t)| dt, ∫₀^T u_k(t)² dt, peak |u_k|, number of clips to amplitude/slew bounds.
• Reporting: per seed and arm, then paired deltas (B−A) with mean and CI.
• Interpretation: covariates are not gates by default. If a cost-constrained comparison is desired (e.g., non-inferiority at equal or lower effort), that must be pre-registered and analyzed via an equivalence/non-inferiority framework (see 9.5).
• Provenance: if effort metrics are exported, the engine/manifest must confirm identical bounds and clipping policies across arms.

  9. Statistical Methods (Paired, Label-Neutral)

9.1 Per-seed deltas and robust aggregation
The Validator computes a seed-level paired delta for the chosen metric: Δ_s = M_B,s − M_A,s, where M is the primary (or secondary) metric for seed s. Aggregation:

• Point estimate: mean(Δ_s) over the set of valid paired seeds S (after exclusions for NaNs or gate failures).
• Robustness: report both the mean and the median of Δ_s; the median is diagnostic and does not replace the mean unless pre-registered.
• Variability: standard error via (i) classical SE = sd(Δ_s)/√|S| for parametric CIs, or (ii) bootstrap SE when using bootstrap CIs.

The estimand is the average within-seed improvement of B over A under identical physics and sampling.

9.2 Normality checks; paired t-test vs Wilcoxon signed-rank
Testing is paired and label-neutral:

• Normality gate: apply Shapiro–Wilk (or D’Agostino–Pearson) to Δ_s. If p ≥ 0.05 and |S| ≥ 20, treat Δ as approximately normal.
• Parametric route (normal): two-sided paired t-test on Δ_s; report t, df = |S|−1, and p; provide Cohen’s d_z (see 9.3).
• Nonparametric route (non-normal or small N): Wilcoxon signed-rank on Δ_s; report V (or W) and exact/Monte Carlo p.
• Outliers: do not delete outliers post hoc. Provide a sensitivity analysis (e.g., winsorized mean at 5%) in the supplement; primary decision still follows the pre-registered test.

Assumption diagnostics (normality p, qq-plot summary) appear in the report.
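A minimal sketch of the gate-and-test routine is shown below, assuming SciPy is available and deltas is a NumPy array of per-seed paired differences Δ_s.

    # Sketch: normality gate on paired deltas, then paired t-test or Wilcoxon signed-rank.
    import numpy as np
    from scipy import stats

    def paired_test(deltas):
        n = len(deltas)
        _, normality_p = stats.shapiro(deltas)              # normality gate on the deltas
        if normality_p >= 0.05 and n >= 20:
            stat, p = stats.ttest_1samp(deltas, 0.0)        # paired t == one-sample t on deltas
            route = "paired_t"
        else:
            stat, p = stats.wilcoxon(deltas)                # signed-rank on deltas
            route = "wilcoxon_signed_rank"
        return {"route": route, "statistic": float(stat), "p": float(p),
                "normality_p": float(normality_p), "n": n}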

9.3 Bootstrap CIs (bias-corrected) and effect sizes (Cohen’s d_z)
Confidence intervals accompany all point estimates:

• Bootstrap CIs: bias-corrected and accelerated (BCa) bootstrap on Δ_s with B = 10,000 resamples (deterministic RNG seeded by the Validator session seed). Report 95% CI unless pre-registered otherwise. Use percentile if |S| is very small (e.g., < 10).

• Parametric CIs: if the normality gate passes, also report the t-based 95% CI on mean Δ for cross-check.

• Effect size: Cohen’s d_z for paired designs, defined as mean(Δ_s)/sd(Δ_s). Report d_z with a 95% CI via bootstrap on Δ_s. For Wilcoxon, also report the matched-pairs rank-biserial correlation as an interpretable effect size.

• Practical equivalence band (optional): if a region-of-practical-equivalence (ROPE) was pre-registered (e.g., ±0.005 in mean fidelity), report whether the entire CI lies within/outside the ROPE.
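A minimal sketch of the interval and effect-size computations is shown below, assuming SciPy ≥ 1.7 (for scipy.stats.bootstrap with the BCa method) and a fixed session seed for determinism.

    # Sketch: deterministic BCa bootstrap CI on mean(Delta) and Cohen's d_z for paired designs.
    import numpy as np
    from scipy import stats

    def bootstrap_mean_ci(deltas, session_seed, B=10_000, level=0.95):
        rng = np.random.default_rng(session_seed)           # recorded session seed
        res = stats.bootstrap((deltas,), np.mean, n_resamples=B,
                              confidence_level=level, method="BCa", random_state=rng)
        return float(res.confidence_interval.low), float(res.confidence_interval.high)

    def cohens_dz(deltas):
        return float(np.mean(deltas) / np.std(deltas, ddof=1))   # mean(Delta) / sd(Delta)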

9.4 Multiple-testing control (Bonferroni / BH-FDR)
Multiple metrics inflate the false-positive rate. Policy:

• Primary metric: no multiplicity correction (there is only one pre-registered primary).
• Secondaries: if two secondaries are analyzed, control multiplicity with either Bonferroni (test each at α/2, controlling the family-wise error rate) or Benjamini–Hochberg at a pre-registered FDR level q (e.g., q = 0.10). Report both unadjusted and adjusted p-values, clearly labeled.
• Families: do not redefine “families” post hoc. If multiple τ values or windows are treated as separate secondaries, they belong to the same family and are corrected together.

9.5 Equivalence/non-inferiority options (TOST) when pre-registered
Sometimes the goal is to show no worse than baseline within a tolerance ε, or genuine equivalence within ±ε.

• Non-inferiority (one-sided): test H0: mean(Δ) ≤ −ε versus H1: mean(Δ) > −ε. Choose ε in natural units (e.g., ε = 0.005 mean fidelity). If the lower bound of the 95% CI for mean(Δ) is above −ε, PASS non-inferiority. Use a paired t-test framework (if normal) or a bootstrap CI plus a one-sided Wilcoxon rationale (if non-normal), pre-registered.

• Equivalence (two one-sided tests, TOST): test H0a: mean(Δ) ≤ −ε and H0b: mean(Δ) ≥ +ε. Reject both using paired t (or bootstrap/Wilcoxon strategy) to declare equivalence. Report ε, both one-sided p-values, and whether both are < α.

• Cost-constrained claims: if effort covariates are in play, pre-register a joint criterion (e.g., non-inferiority on Ḟ with mean effort not higher than A by more than δ). The Validator will compute both tests and require both to PASS.

• Interpretation: equivalence/non-inferiority decisions are distinct from superiority. The Validator will not reinterpret a failed superiority test as evidence of equivalence unless TOST was pre-registered with ε.
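A minimal t-based sketch of the non-inferiority and TOST decisions is shown below, assuming approximately normal deltas and a pre-registered margin ε in the metric’s natural units; the bootstrap/Wilcoxon variants would test the same hypotheses.

    # Sketch: one-sided non-inferiority and t-based TOST equivalence on paired deltas.
    import numpy as np
    from scipy import stats

    def noninferiority(deltas, eps, alpha=0.05):
        n = len(deltas)
        m, se = np.mean(deltas), np.std(deltas, ddof=1) / np.sqrt(n)
        p = stats.t.sf((m + eps) / se, df=n - 1)         # H0: mean(Delta) <= -eps
        return {"p": float(p), "pass": bool(p < alpha)}

    def tost_equivalence(deltas, eps, alpha=0.05):
        n = len(deltas)
        m, se = np.mean(deltas), np.std(deltas, ddof=1) / np.sqrt(n)
        p_lower = stats.t.sf((m + eps) / se, df=n - 1)   # H0a: mean(Delta) <= -eps
        p_upper = stats.t.cdf((m - eps) / se, df=n - 1)  # H0b: mean(Delta) >= +eps
        return {"p_lower": float(p_lower), "p_upper": float(p_upper),
                "equivalent": bool(p_lower < alpha and p_upper < alpha)}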

Bottom line for Sections 8–9. The Validator fixes what is being estimated (paired, physics-identical deltas) and how it is decided (pre-registered tests, CIs, multiplicity control), with deterministic computation and transparent event handling. No resampling, no window shopping, no peeking. The output is a verdict you can defend—and anyone can replay.

  10. Decision Rules and Reporting

10.1 PASS/FAIL logic for the primary metric
The Validator issues exactly one binary decision on the pre-registered primary metric (superiority, non-inferiority, or equivalence). Logic is deterministic and depends on the registered objective:

• Superiority (default):
– Input: paired deltas Δ_s = M_B,s − M_A,s over valid seeds S.
– Compute: mean(Δ), 95% CI (parametric and/or BCa bootstrap), and the paired test p-value (t or Wilcoxon per 9.2).
– Decision: PASS if (i) the entire 95% CI > 0 and (ii) p < α (α = 0.05 unless pre-registered otherwise). Otherwise FAIL.
– Tie handling: if CI straddles 0 but p < α due to asymmetry, decision = FAIL (CI dominance rule prevents fragile wins).

• Non-inferiority (one-sided margin ε):
– Hypothesis: H0: mean(Δ) ≤ −ε vs H1: mean(Δ) > −ε.
– Decision: PASS if the lower bound of the 95% CI for mean(Δ) > −ε (and the one-sided p < α if a test is also pre-registered). Else FAIL.

• Equivalence (TOST with ±ε):
– Hypotheses: H0a: mean(Δ) ≤ −ε and H0b: mean(Δ) ≥ +ε.
– Decision: PASS if both one-sided tests reject at α (equivalently, the entire 90% CI lies within (−ε, +ε) for t-based TOST; bootstrap variant allowed if pre-registered). Else FAIL.

Additional guards (always enforced):
• Minimum analyzable seeds |S| ≥ N_min (pre-registered).
• No primary decision is computed if any hard gate (Sections 5–7) failed for the session → “No Verdict”.
• If Δ units or scaling were mis-declared in the manifest, refuse with E_SCHEMA_INVALID (no decision).
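A minimal sketch of the superiority rule is shown below, assuming the CI and p-value come from the pre-registered routines above; the CI-dominance condition is what blocks fragile wins.

    # Sketch: deterministic PASS/FAIL/No Verdict for the superiority objective.
    def superiority_verdict(ci_low, ci_high, p, n_valid, n_min, alpha=0.05, gates_clean=True):
        if not gates_clean or n_valid < n_min:
            return "NO_VERDICT"                      # hard gates failed or E_INSUFFICIENT_N
        ci_dominates = ci_low > 0.0                  # entire 95% CI above zero
        return "PASS" if (ci_dominates and p < alpha) else "FAIL"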

10.2 How secondaries are reported (no gatekeeping)
Secondaries never affect the primary verdict. They are computed and displayed transparently:

• For each secondary metric: report mean(Δ), 95% CI (bootstrap by default), paired test p-value, and an appropriate effect size (d_z or rank-biserial).
• Multiple-testing control: apply the pre-registered adjustment (Bonferroni or BH-FDR) and display both unadjusted and adjusted p-values, clearly labeled.
• NaN handling: thresholds producing NaN T_hit are analyzed on the subset where both arms hit; also report hit-rates and their difference with a CI.
• Presentation: secondaries live in their own tables/figures; no PASS/FAIL badges. The narrative must not promote a secondary to “the real win” after a primary FAIL.

10.3 Assumption diagnostics and sensitivity analyses
The report includes diagnostics that justify the chosen test and show robustness:

• Normality check: Shapiro–Wilk p-value on Δ_s; if < 0.05, nonparametric route is used; QQ-plot summary (slope/intercept) is recorded.
• Influence: leave-one-out (jackknife) on Δ_s → range of mean(Δ) and of the test statistic; flag if any single seed flips the primary decision.
• Cadence sensitivity: optional re-analysis with Δt_out halved (if available from engine) to demonstrate sampling robustness; numbers reported side-by-side.
• Threshold sensitivity: when T_hit is the primary, report outcomes at all pre-registered τ values; primary decision remains tied to τ*.
• Bootstrap stability: report CI width vs B (e.g., B=5k vs 10k) to document convergence (deterministic RNG).
• Cross-platform check: if pre-registered, replicate statistics on a second platform; numeric deltas must be within the stated tolerance.

None of these alter the primary verdict post hoc; they are evidence that the chosen analysis is not brittle.

10.4 What triggers “No Verdict” (insufficient seeds, failed gates)
The Validator returns No Verdict—with an explicit error code and audit table—when conditions for a defensible comparison are not met. Triggers:

• E_INSUFFICIENT_N — Fewer than N_min valid paired seeds after exclusions (missing/corrupt files, manifest mismatches, alias violations, invariant breaches, convergence failures).
• E_MANIFEST_MISMATCH — Any physics-equivalence field differs between arms for any included seed (Section 5.1).
• E_CONVERGENCE — Missing or failed step-halving evidence (|ΔḞ| or |ΔF(T)| exceeds the pre-registered band).
• E_INVARIANT_FAIL — Excessive trace renormalizations, positivity hard breaches, or Hermiticity drift beyond limits.
• E_ALIASING — Δt_out too coarse relative to control bandwidth or control discontinuities off the output grid without supersampling.
• E_HASH_MISMATCH / E_HMAC_FAIL — Artifact hashes do not verify.
• E_CSV_INVALID / E_SCHEMA_INVALID — Malformed CSV or manifest; missing required columns/fields; unit/frame inconsistencies.
• E_REPLAY_MISMATCH (if spot audits enabled) — Engine replay fails to reproduce CSV within tolerance.
• E_PLAN_MISMATCH — Analysis plan hash does not match the pre-registered plan (attempted post hoc change).

Reporting when “No Verdict” occurs:
• The report’s front page states No Verdict and lists all triggered error codes with counts.
• An Exclusion Ledger enumerates SeedID, Arm (L/R), error code, reason, and timestamp.
• A Manifest Diff Appendix shows field-level mismatches (A vs B) for reproducibility.
• A Fix Path section suggests minimum changes required (e.g., “reduce Δt_int and rerun step-halving,” “align Δt_out to segment boundaries,” “correct frame declaration”).

Bottom line. PASS requires clean physics, stable numerics, and a CI-backed superiority/non-inferiority/equivalence outcome under a frozen plan. FAIL means the analysis was valid but the effect wasn’t there. No Verdict means inputs or assumptions were unfit for adjudication—fix them and try again.

  11. Robustness and Sensitivity

11.1 Seed-subsampling stability and leave-one-out analyses
Goal: show the verdict isn’t riding on a handful of lucky seeds.

• Leave-one-out (LOO): recompute mean(Δ) and the primary decision after removing each seed s ∈ S in turn. Report the range [min, max] of mean(Δ), the fraction of LOO runs that keep the original decision, and flag any decision-flipping seed. If one seed flips the verdict, say so—your claim is brittle.

• Subsampling curves: for k = 5, 6, …, |S| (or a pre-registered grid), draw M deterministic subsamples per k (seeded RNG), compute mean(Δ) and CI, and plot the stabilization curve. Convergence criterion: CI width shrinks monotonically and the sign of mean(Δ) stabilizes by k ≥ k*. If it doesn’t, you don’t have enough seeds.

• Robust estimator cross-check: report both mean and median(Δ); large divergence signals skew/outliers. Keep the primary based on the pre-registered estimator; the other is diagnostic, not a backdoor.
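A minimal leave-one-out sketch over the paired deltas is shown below; it reports the range of mean(Δ) and flags any seed whose removal flips the sign, which is the brittleness signal described above.

    # Sketch: jackknife (leave-one-out) stability of mean(Delta).
    import numpy as np

    def loo_stability(deltas):
        full_mean = float(np.mean(deltas))
        loo_means = [float(np.mean(np.delete(deltas, i))) for i in range(len(deltas))]
        sign_flips = [i for i, m in enumerate(loo_means) if np.sign(m) != np.sign(full_mean)]
        return {"mean": full_mean,
                "loo_range": (min(loo_means), max(loo_means)),
                "sign_flipping_seeds": sign_flips}    # nonempty => the claim is brittle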

11.2 Cadence sensitivity (Δt_out halved)
Goal: prove your result isn’t an aliasing artifact.

• Procedure: when available from the engine, re-ingest runs exported with Δt_out/2, same Δt_int and physics. Recompute all metrics on the denser grid.

• Acceptance: primary mean(Δ) shifts by ≤ ε_cad (pre-register; default 1e-4 for fidelity metrics). T_hit deltas should change by ≤ one original sampling interval, except when the crossing sat exactly between two coarse samples (flag those).

• Outcome: report the side-by-side table (original vs half-cadence) and a “PASS/FAIL” on cadence robustness (this does not override the primary decision; it qualifies it).

11.3 Threshold sensitivity (pre-registered τ set)
Goal: stop threshold shopping and show behavior across the declared τ values.

• Fixed set: analyze the pre-registered τ ∈ {τ1, τ2, …}. For each τ, report: hit-rate per arm, hit-rate difference with Wilson/Newcombe CI, paired ΔT_hit on the both-hit subset with 95% CI, and the fraction of NaNs by arm.

• Consistency check: if superiority at τ* is the primary, show whether the sign of mean ΔT_hit is consistent across the other τ’s. If it flips wildly, call it out—your “win” may be a narrow sweet spot.

• Hold-time variant (if registered): replicate the above for T_hold(τ, ΔT_hold). No post hoc ΔT_hold tweaks.

11.4 Cross-platform reruns and numeric tolerance windows
Goal: prove the verdict isn’t tied to one machine or BLAS quirk.

• Platforms: rerun the full Validator (same session seed, same inputs) on a second platform (different OS/BLAS/CPU) if pre-registered.

• Tolerance windows (pre-register):
– Fidelity metrics (Ḟ, F(T)): absolute delta ≤ 1e-12.
– T_hit: ≤ one fine sampling tick if Δt_out/2 runs are used; otherwise ≤ one original tick.
– QFI-derived secondaries: absolute delta ≤ 1e-10 (derivatives amplify noise).
– p-values / CIs: differences arising from bootstrap RNG must vanish (deterministic seed) or sit within 1e-3 on p and within CI rounding precision.

• Reporting: include a table of (metric, platform A value, platform B value, |Δ|, tolerance, PASS/FAIL). Any FAIL triggers a platform variance warning; if it touches the primary’s PASS/FAIL boundary, mark the main verdict as fragile.

Blunt bottom line: if your claim collapses when (i) one seed is removed, (ii) cadence is halved, (iii) τ is nudged within the registered set, or (iv) the analysis runs on a second machine, it wasn’t robust. The Validator won’t hide that.

  12. Reproducibility, Forensics, and Audit Trail

12.1 Provenance records, run logs, and manifest diffs
Every Validator session emits an immutable provenance package so a hostile reviewer can reconstruct what happened, when, and why.

• Session header: Validator version/hash, session seed, OS/CPU/BLAS, timestamp, analysis-plan hash, pre-registered primary/secondaries, α (and FDR q if used).
• Input inventory: for each seed × arm, the filenames and SHA-256 of timeseries.csv, manifest.json, plots/*.png; optional HMACs; file sizes and modification times.
• Gate ledger: per seed × arm, pass/fail for each hard gate (manifest equality, convergence, invariants, aliasing, hash/schema), with error codes and short reasons.
• Statistics ledger: the final set S of analyzable seed pairs; per-seed deltas; test chosen (t/Wilcoxon), p-values, CI method (parametric/BCa) and B for bootstrap.
• Interaction log: time-stamped events (ingest start/end, hash checks, gate checks, statistics run, unblinding); includes user-invoked flags (strict mode, replay audits).
• Manifest diff appendix: when a physics-equivalence check fails, a field-by-field diff (key path, Arm-A value, Arm-B value).
• Exclusion table: SeedID, Arm (L/R), error code (e.g., E_CONVERGENCE), reason string, and whether the pair was dropped; cumulative counts per error.

12.2 How to independently rerun and verify hashes
The archive includes a minimal, deterministic rerun recipe.

• Hash verification: recompute SHA-256 for all inputs/outputs and compare against sha256.txt; any mismatch voids the session.
• Replay (optional): for K audited seeds (pre-registered fraction), run the physics engine in replay mode on the included manifest.json and confirm byte-identical CSV (same platform) or numeric equivalence within the stated tolerances (cross-platform).
• Recompute statistics: invoke the Validator with the included session seed and analysis plan; confirm that the verdict, CIs, p-values, and effect sizes match exactly.
• Expected outputs: the README lists exact CLI calls (or API snippet), expected hashes, and tolerance windows. If any step fails, the README directs to the failing ledger row and fix path (e.g., “align Δt_out to segment edges; rerun step-halving”).

12.3 Public bundle structure and minimal README
A clean bundle removes all doubt. Recommended layout:

/validator_session_<id>/
  README.md
  analysis_plan.json          # frozen, hashed
  report_primary.pdf          # human-readable summary
  report_machine.json         # full numbers, tables, seeds, diagnostics
  exclusions.csv              # SeedID, Arm(L/R), ErrorCode, Reason
  manifest_diffs/             # one JSON per failed seed
  inputs/
    ArmA/
      seed_<s>/{manifest.json, timeseries.csv, sha256.txt, plots/*.png}
      ...
    ArmB/
      seed_<s>/{manifest.json, timeseries.csv, sha256.txt, plots/*.png}
      ...
  hashes/
    session_hashes.txt        # SHA-256 for all files produced by the Validator
  audits/
    replay_logs.txt           # if spot-replay enabled
    platform_check.json       # optional cross-platform numeric deltas

README.md (minimum):
• What this is (one paragraph), Validator version/hash, session seed.
• How to verify (commands to check SHA-256; expected counts).
• How to rerun (exact CLI/API, including the plan file and seed).
• Tolerances for numeric equality (fidelity/T_hit/QFI).
• Where to look if a check fails (exclusion ledger, diff appendix).

  13. Limitations and Threats to Validity

13.1 Dependence on engine correctness and model fidelity
The Validator assumes the physics artifacts are correct for the stated model and numerically clean. If the engine misdeclares frames/units, clips positivity silently, or models the wrong noise (e.g., Markovian when the device is not), the Validator will still produce a statistically valid judgment—about the wrong world. Mitigation: enforce QME gates (convergence, invariants, alias guards), perform optional replay audits, and keep engine and hardware claims clearly separated.

13.2 Seed representativeness and external validity
Paired seeds reduce variance but do not guarantee representativeness. If seeds reflect an easy subset of initializations or conditions, a PASS may not generalize. Likewise, simulator-seeded results may not carry to hardware. Mitigation: pre-register seed selection (range, distribution), report subsampling/LOO stability, and, where possible, add a hardware section with synchronized acquisition and calibration manifests.

13.3 Metric selection risk (if poorly pre-registered)
Even with clean physics, you can “win” by choosing a metric that flatters one arm (e.g., mean F when the action is late-time F(T), or a cherry τ for T_hit). The Validator blocks post hoc shopping but cannot rescue a poorly chosen pre-registered metric. Mitigation: justify the primary in the plan (task-aligned), cap secondaries at two with multiplicity control, and present threshold sensitivity across the pre-registered τ set.

13.4 Human-in-the-loop hazards (leakage, cherry-picking)
Bias creeps in when analysts peek at identities, drop “inconvenient” seeds, or tweak Δt_out after seeing plots. The Validator’s blinding, pairwise exclusion, immutability of the plan, and audit logs limit, but do not eliminate, human error. Mitigation: keep strict mode on (no plots before verdict), publish the exclusion ledger, and forbid manual edits to input CSVs (hash checks enforce this).

Blunt summary. The Validator makes statistical cheating hard and replay easy, but it does not manufacture truth from bad physics, biased seeds, or ill-chosen objectives. Keep the separation of concerns sharp: QME (or hardware) for honest trajectories; Validator for honest judgments; ARLIT for honest scope.

  14. Case Studies

14.1 Synthetic paired study (sim-only)
Objective. Demonstrate end-to-end Validator operation on clean, reproducible simulator outputs (no hardware ambiguity).

Setup. Single-qubit Lindblad model (as in the QME paper). Drift (H_0=\frac{\Delta}{2}\sigma_z), controls (H_x=\frac{1}{2}\sigma_x), (H_y=\frac{1}{2}\sigma_y). Noise: amplitude damping (\gamma_\downarrow) and dephasing (\gamma_\phi). Horizon (T=120); steps (\Delta t_{\text{int}}=10^{-4}), (\Delta t_{\text{out}}=10^{-3}). Target (\rho_\star=|0\rangle\langle 0|).
Arms:
• Arm-A (baseline): resonant square Rabi pulses with amplitude cap (U_{\max}).
• Arm-B (candidate): piecewise-spline pulse from a fixed policy; identical amplitude/slew bounds.
Seeds: 50 seeds, paired; identical manifests except control law.

Pre-registration. Primary = mean fidelity (\bar F) on ([0,T]). Secondaries = (F(T)) and (T_{\text{hit}}(0.99)). Alias guard: (\Delta t_{\text{out}}\le 0.1) of the fastest control timescale. Convergence gate: (|\Delta\bar F|), (|\Delta F(T)|\le 10^{-4}) under step-halving. Minimum analyzable seeds (N_{\min}=40).
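
For concreteness, the sketch below freezes and hashes a plan like the one just described. The JSON field names are hypothetical (the Validator’s real schema may differ); the values mirror the registration above where stated, and α = 0.05 is shown only as an illustrative entry.

```python
import hashlib
import json
from pathlib import Path

# Illustrative pre-registration record for the synthetic study.
# Field names are hypothetical; alpha is an assumed placeholder.
analysis_plan = {
    "primary": {"metric": "mean_F", "window": [0.0, 120.0]},
    "secondaries": [{"metric": "F_T"}, {"metric": "T_hit", "threshold": 0.99}],
    "alias_guard": {"dt_out_max_fraction_of_fastest_timescale": 0.1},
    "convergence_gate": {"abs_delta_mean_F": 1e-4, "abs_delta_F_T": 1e-4},
    "min_analyzable_seeds": 40,
    "alpha": 0.05,
    "fdr_q": 0.1,
}

# Freeze the plan: canonical serialization, write once, hash it.
plan_bytes = json.dumps(analysis_plan, sort_keys=True, indent=2).encode()
Path("analysis_plan.json").write_bytes(plan_bytes)

# The plan hash goes into the session header; any later edit changes the hash
# and is therefore visible in the audit trail.
print("analysis_plan sha256:", hashlib.sha256(plan_bytes).hexdigest())
```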

Gates. Both arms passed: manifest equality; convergence (median (|\Delta\bar F|=4\times10^{-6})); invariants within tolerance; no alias warnings.

Results (illustrative).
• Primary: mean(Δ(\bar F)) = +0.0063; 95% BCa CI [0.0032, 0.0095]; paired t-test p = 0.0002 → PASS (superiority).
• Secondary (F(T)): mean Δ = +0.0041; 95% CI [0.0007, 0.0074]; adjusted p (BH-FDR, q=0.1) = 0.006.
• Secondary (T_{\text{hit}}(0.99)): both-hit seeds = 44; mean ΔT = −1.8 (same units as (t)); 95% CI [−2.9, −0.8]; hit-rate diff = +0.10 (Wilson 95% CI [0.01, 0.20]).
Robustness: LOO keeps the verdict in 50/50 removals; cadence halving shifts mean Δ by (5\times10^{-5}) (below (\epsilon_{\text{cad}}=10^{-4})); the τ-grid (0.95, 0.99) is consistent in sign. A sketch of this paired analysis appears below.
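
The numbers above are produced by the Validator itself; purely as an illustration of the paired pipeline (seed-paired deltas → assumption check → paired t or Wilcoxon → BCa bootstrap CI), a SciPy-based sketch might look like the following. It is not the Validator’s code, and the normality rule and resample count are assumptions.

```python
import numpy as np
from scipy import stats

def paired_analysis(arm_a: np.ndarray, arm_b: np.ndarray, session_seed: int = 0) -> dict:
    """Seed-paired deltas, paired test, and a BCa bootstrap CI on the mean delta.

    arm_a, arm_b: per-seed values of the pre-registered metric, aligned by seed.
    """
    deltas = arm_b - arm_a  # candidate minus baseline, per seed

    # Normality check on the deltas decides t-test vs Wilcoxon (illustrative rule).
    _, shapiro_p = stats.shapiro(deltas)
    if shapiro_p > 0.05:
        test_name, p_value = "paired t", stats.ttest_rel(arm_b, arm_a).pvalue
    else:
        test_name, p_value = "Wilcoxon", stats.wilcoxon(arm_b, arm_a).pvalue

    # BCa bootstrap CI on the mean paired delta, with a deterministic seed.
    rng = np.random.default_rng(session_seed)
    ci = stats.bootstrap((deltas,), np.mean, confidence_level=0.95,
                         n_resamples=10_000, method="BCa", random_state=rng)
    low, high = ci.confidence_interval
    return {"mean_delta": float(deltas.mean()), "test": test_name,
            "p_value": float(p_value), "ci95": (float(low), float(high))}
```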

Interpretation. Candidate outperforms baseline modestly but cleanly; effect is not a fluke of sampling or a handful of seeds. Everything replays from manifests. That is what a pass looks like.

14.2 Hardware replay with anonymized arms (if available)
Objective. Show Validator behavior on device logs where physics parity is harder.

Setup. Two control stacks run on the same qubit in time-adjacent blocks with synchronized calibration manifests (T1/T2 estimates, drive/measure delays, readout mapping). Arms differ only in control law; acquisition cadence, thresholds, and readout model are held fixed. Seeds correspond to randomized initializations and randomized (declared) phase offsets; both arms reuse the same seed list.

Gates (a minimal gate-check sketch follows this list).
• Manifest equality: require identical calibration context fields across arms (dates, temperatures, bias points). If any differ, refuse with a diff.
• Invariants: not applicable (no simulator); instead, require a “data quality” QC (dropped shots < pre-reg cap; readout saturation absent).
• Aliasing: acquisition cadence must satisfy the same guard; if not, E_ALIASING.
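
A sketch of how these hardware-mode gates could be encoded is shown below. The calibration field names, QC keys, the dropped-shot cap, and the E_QC_* codes are illustrative placeholders, not the Validator’s schema.

```python
# Sketch of the hardware-mode gates: calibration-context equality, data-quality QC,
# and the aliasing guard. Field names and caps are illustrative.
CALIBRATION_FIELDS = ["T1_estimate", "T2_estimate", "drive_delay", "measure_delay",
                      "readout_map", "bias_point", "temperature", "calibration_date"]

def gate_hardware_pair(manifest_a: dict, manifest_b: dict, qc_a: dict, qc_b: dict,
                       dt_acq: float, fastest_control_timescale: float,
                       dropped_shot_cap: float = 0.02) -> list[str]:
    """Return a list of error codes; an empty list means the seed pair is analyzable."""
    errors = []

    # Calibration context must match field-by-field across arms.
    for field in CALIBRATION_FIELDS:
        if manifest_a.get(field) != manifest_b.get(field):
            errors.append(f"E_MANIFEST_MISMATCH:{field}")

    # Data-quality QC replaces the simulator invariants (QC codes are illustrative).
    for arm, qc in (("L", qc_a), ("R", qc_b)):
        if qc["dropped_shot_fraction"] > dropped_shot_cap:
            errors.append(f"E_QC_DROPPED_SHOTS:{arm}")
        if qc["readout_saturated"]:
            errors.append(f"E_QC_SATURATION:{arm}")

    # Acquisition cadence must satisfy the same aliasing guard as simulation.
    if dt_acq > 0.1 * fastest_control_timescale:
        errors.append("E_ALIASING")

    return errors
```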

Results (illustrative).
• Primary (F(T)): mean Δ = +0.0020; 95% bootstrap CI [−0.0003, +0.0042]; Wilcoxon p = 0.085 → FAIL (superiority not shown).
• Secondaries: (T_{\text{hit}}(0.95)) both-hit rate high; ΔT modest and CI straddles 0 after FDR.
Robustness: LOO flips the verdict for 2/60 seeds → flagged “fragile”. Cross-platform rerun is identical (deterministic seed).

Interpretation. On hardware, the candidate looks slightly better, but not convincingly so. The correct output is a clean FAIL with diagnostics—no heroic massaging. If you care, pre-register non-inferiority and rerun.

14.3 Common failure archetypes and how Validator flags them
• Apples-to-oranges manifests. Arm-B used different (\Delta t_{\text{out}}). Flag: E_MANIFEST_MISMATCH with diff (“Δt_out: 1e-3 vs 5e-4”). Outcome: No Verdict.
• No convergence evidence. Missing step-halving hashes. Flag: E_CONVERGENCE. Outcome: No Verdict.
• Positivity patched in secret. Simulator clipped eigenvalues without logs → elevated backoffs/renorms or manifest shows no tolerances. Flag: E_INVARIANT_FAIL or E_SCHEMA_INVALID.
• Aliasing “wins.” Coarse sampling boosts (\bar F). Flag: E_ALIASING (fastest control timescale vs Δt_out check).
• Seed leakage. Different seed sets across arms. Flag: E_MANIFEST_MISMATCH (seeds).
• Tampered CSV. Hash mismatch. Flag: E_HASH_MISMATCH.
The Validator does not “help.” It refuses with an actionable reason; one possible encoding of these refusal codes is sketched below. Fix the cause or stop claiming results.
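
As an illustration only, the refusal behavior could be encoded along these lines. The error codes are the ones named above; the reason strings and the ledger-row format are assumptions.

```python
from enum import Enum

class RefusalCode(Enum):
    """Error codes from the archetypes above; the reason strings are illustrative."""
    E_MANIFEST_MISMATCH = "Physics-equivalence check failed; see the field-by-field diff."
    E_CONVERGENCE = "No step-halving evidence within tolerance; rerun the engine gate."
    E_INVARIANT_FAIL = "Trace/Hermiticity/positivity counters outside declared tolerances."
    E_SCHEMA_INVALID = "Manifest missing required fields (e.g., declared tolerances)."
    E_ALIASING = "Output cadence too coarse for the fastest control timescale."
    E_HASH_MISMATCH = "Recomputed SHA-256 does not match the declared hash; input voided."

def refuse(code: RefusalCode, seed_id: int, arm: str) -> dict:
    """Emit a No Verdict row for the exclusion ledger instead of 'helping'."""
    return {"seed": seed_id, "arm": arm, "code": code.name,
            "reason": code.value, "verdict": "No Verdict"}

# Example: refuse(RefusalCode.E_ALIASING, seed_id=17, arm="R")
```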

  15. Conclusion

DREAMi-Validator (V2) makes comparative control claims defensible or impossible—nothing in between. It enforces physics parity, numerical hygiene, pairing, blinding, pre-registration, and replayable outputs. The decision logic is transparent: PASS requires a CI-backed advantage (or non-inferiority/equivalence, if pre-registered) on clean inputs; FAIL means the effect isn’t there; No Verdict means the inputs were unfit. Combined with DREAMi-QME (for honest trajectories) and ARLIT (for scale-law checks), V2 turns “promising” into “provable” and makes hype expensive to sustain. If it can’t be reproduced from manifests and seeds, it doesn’t count.

Acknowledgments

We thank colleagues who stress-tested the manifest schema, reviewed the equivalence gates, and challenged early drafts with adversarial seeds. Any remaining stubbornness is deliberate.

Author Contributions

Conceptualization: J. Morgan-Griffiths
Methodology (equivalence gates, pre-registration, blinding): J. Morgan-Griffiths
Software (validator implementation, hashing, replay audits): J. Morgan-Griffiths
Formal analysis (paired tests, bootstrap CIs, multiplicity): J. Morgan-Griffiths
Investigation (case study design, robustness analyses): J. Morgan-Griffiths
Data curation (session archives, schemas, README templates): J. Morgan-Griffiths
Visualization (seed-level deltas, robustness curves): J. Morgan-Griffiths
Writing—original draft & editing: J. Morgan-Griffiths




------------------------------------------------------------------------------------------------------------------------

Disclaimer: This summary presents findings from a numerical study. The specific threshold values are in the units of the described model and are expected to scale with the parameters of physical systems. The phenomena's universality is a core subject of ongoing investigation.


------------------------------------------------------------------------------------------------------------------------


[Disclaimer: This was written with AI by Jordon Morgan-Griffiths | Dakari Morgan-Griffiths] 

This paper was written with AI assistance, using notes and work from Jordon Morgan-Griffiths. If anything comes across as wrong, I ask that you blame OpenAI rather than me; I am not a PhD scientist. You are welcome to contact me directly, and to take the formulae, the simulation, and so on.

I hope to make more positive contributions ahead, whether right or wrong.



© 2025 Jordon Morgan-Griffiths UISH. All rights reserved. First published 24/10/2025.



