
ACADEMIC 45 min read (full paper)

Edit-Trace Oversight: Scalable Alignment Signal from Agentic Workflows



Abstract

Edit-traces from production agentic workflows produce alignment signal that is denser, more outcome-predictive, and distributionally unlike conventional RLHF preference data. Three experiments on a single-practitioner case study (30,510 edit pairs, 2,892 sessions, 1,579 attributed outcomes): (1) 80.7% of edits are substantive rewrites; (2) process-level behavioral features are significant but redundant with artifact features; (3) binary rejection correlates with 78% positive outcomes — the strongest oversight signal.

The methodology rests on a simple empirical claim: a single practitioner working recursively with an LLM, under product accountability, completes long-horizon work that neither party completes alone. Edit-traces in this regime are dense, outcome-validated, and impossible to obtain through annotation in isolation.

Keywords: RLHF, preference data, scalable oversight, agentic workflows, edit-trace, domain constitution, legal AI

Introduction: The Oversight Gap in Agentic Systems

Motivation: Empirical Observation of Recursive Human–LLM Composition

Scalable oversight research typically frames the problem defensively: as models become more capable, mechanisms must compensate for the limits of human attention—debate (irving2018ai), recursive reward modeling (leike2018scalable), AI judges, constitutional methods (christiano2017deep, bai2022constitutional). This framing treats the bandwidth of human oversight as a fixed constant and asks how to route around it.

The case study documented in this paper presents an empirical observation that complicates this framing. A single practitioner shipped 1,547 PRs across 7 production systems in 105 days using an LLM agent (Claude Code) as the primary engineering counterpart. Neither party would have reached this output independently: the practitioner's throughput without the agent is bounded by typing and cognitive load; the agent's autonomous reliability at consequential scale remains insufficient for production deployment without human oversight.

In this regime, the practitioner applies corrections at each step, and each correction shapes the context for subsequent agent actions. The resulting edit-traces are generated under two constraints that standard annotation lacks: (1) production accountability—the corrected output ships to real users, creating a natural incentive for concentrated attention at each decision point; and (2) sequential dependence—corrections accumulate along the trajectory, meaning each edit is informed by the consequences of prior corrections.

In this regime we observe a distribution of corrections that differs qualitatively from what we would expect of detached annotation. Whether this distributional difference translates into better training signal is an empirical question addressed by Experiment 4.

The Structural Problem

As LLM agents take on longer-horizon, multi-step work—composing tool calls, accumulating context across hundreds of turns, and shipping outputs with attributable real-world stakes—the gap between how we collect alignment signal and how agents actually fail has become structural. Every existing source of RLHF preference data shares one property: the annotator operates outside the agentic workflow they are meant to govern (stiennon2020learning, ouyang2022training, bai2022training). A Mechanical Turk worker rates isolated model outputs without a codebase, a deployment pipeline, or a customer on the other end. An expert annotator evaluates in a controlled environment, not mid-trajectory in a compositional system. An RLAIF model (lee2023rlaif) applies principles supplied by its creators, without feedback from the downstream consequences of the agent's actions. They all produce ratings detached from the granularity at which agentic systems actually fail: the individual edit within a multi-step trajectory under domain constraints and outcome accountability.

Edit-Trace as Oversight Signal

We propose an alternative: edit-trace oversight—alignment signal captured natively when a practitioner works agentically with an LLM over consequential, multi-step workflows.

When a practitioner runs Claude Code agentically—composing tool calls, reviewing architectural proposals against domain constraints, accepting or rejecting suggested changes based on information not available to the model—every human edit on a model output is a localized correction relative to a domain constitution and an outcome trajectory. This is not preference annotation. It is in-the-loop oversight, captured at the granularity where agentic systems actually fail.

Two properties distinguish edit-trace oversight from expert annotation:

Outcome-validated corrections. The practitioner makes binding decisions with real consequences. Accepted agent output ships and passes or fails in production. Each edit-trace is a correction grounded in revealed preference and ground truth, not abstract judgment.

Compositional trajectory awareness. The practitioner builds compositional pipelines (Query Planner → Semantic Sectionizer → Hallucination Guard → Citation Validator), where every oversight correction affects the rest of the trajectory. Each edit encodes not just local quality judgment but awareness of how the correction propagates through downstream components. This is qualitatively more informative than isolated rating of individual model outputs.

Behavioral Context of Oversight Actions

Even rich edit-trace capture records only the artifact-level correction (what the practitioner changed). The cognitive and behavioral context behind the correction—time invested, external research consulted, voice calls made, window switches indicating cross-referencing—is lost. We capture this dimension through synchronized OS-level activity tracking, providing the behavioral context of each oversight action. This enables the question: does how a practitioner performs oversight contain signal beyond what they corrected?

Research Questions

Related Work

RLHF preference collection. The dominant paradigm for aligning language models relies on human preference judgments collected in controlled settings. christiano2017deep introduced pairwise comparisons over trajectory segments; stiennon2020learning and ouyang2022training scaled this to natural language tasks using crowd workers and contractors. bai2022training compared the signal quality of crowd annotators versus researchers. In all cases, annotators operate outside the systems they evaluate—rating isolated outputs without access to the deployment context, downstream consequences, or the compositional trajectory that produced the output. Edit-trace oversight departs from this paradigm: the signal source is the practitioner who ships the output, not a detached evaluator.

Scalable oversight. As model capabilities grow, the cost and reliability of human oversight become central concerns. irving2018ai proposed AI Safety via Debate, where models argue for and against answers to aid human judgment. leike2018scalable formulated recursive reward modeling, decomposing hard oversight tasks into easier subtasks. bowman2022measuring provided benchmarks for measuring oversight progress, and burns2023weak demonstrated weak-to-strong generalization, where weaker models supervise stronger ones with partial success. These approaches treat human oversight bandwidth as a bottleneck to be routed around. The regime documented here suggests an alternative: when a practitioner works agentically with an LLM, oversight bandwidth may scale with agent capability rather than against it, as the human's corrections become more targeted while the agent handles routine execution.

AI feedback, constitutional methods, and formal control structures. bai2022constitutional introduced Constitutional AI, replacing human annotators with AI self-evaluation against researcher-authored principles. lee2023rlaif extended this with RLAIF, showing that AI-generated feedback can approximate human preferences at lower cost. Both eliminate the annotation bottleneck but operate without production grounding—the principles and feedback are applied in abstract evaluation contexts, not during consequential deployment. A parallel tradition in knowledge engineering uses formal structures to directly control system behavior: ontology-controlled architectures (palagin2006architecture) govern information systems through domain ontologies, and recent work applies this principle to LLMs—OntoChatGPT (palagin2023ontochatgpt) uses formal ontologies to structure ChatGPT's output via meta-learning prompts, while palagin2024neural demonstrate that integrating neural network and ontolinguistic paradigms yields stronger results than either alone. The domain constitution proposed here draws on both traditions: like Constitutional AI, it defines formal conditions for evaluating model output; like ontology-controlled architectures, it uses formal structure to govern system behavior—but applied to the oversight process rather than the generation process.

Direct preference optimization. rafailov2023direct introduced DPO, which optimizes a language model directly on preference pairs without training a separate reward model. DPO's reliance on paired preferences (chosen vs. rejected completions) makes it a natural fit for edit-trace data, where each human correction provides an implicit preference pair: the practitioner's corrected output (chosen) versus the agent's original output (rejected). Experiment 4 uses DPO to test whether the distinctive distribution of edit-trace preferences translates into improved domain-specific model performance.
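The mapping from a correction to a preference pair can be sketched as follows. This is a minimal illustration, not the paper's extraction pipeline: the `EditTrace` fields and the `to_preference_pair` helper are hypothetical names, and skipping unedited outputs is an assumption about how accept-as-is cases might be handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditTrace:
    """One human correction of an agent output (field names are hypothetical)."""
    prompt: str        # context the agent was responding to
    agent_output: str  # what the agent originally produced
    final_output: str  # what the practitioner kept after editing


def to_preference_pair(edit: EditTrace) -> Optional[dict]:
    """Map an edit-trace to a DPO-style pair, or None if it carries no contrast."""
    if edit.agent_output == edit.final_output:
        # Assumption: an unedited acceptance has no chosen-vs-rejected contrast.
        return None
    return {
        "prompt": edit.prompt,
        "chosen": edit.final_output,    # practitioner's corrected output
        "rejected": edit.agent_output,  # agent's original output
    }


edits = [
    EditTrace("Summarize the ruling.", "The court ruled X.", "The court ruled Y."),
    EditTrace("List the parties.", "A and B.", "A and B."),
]
pairs = [p for p in map(to_preference_pair, edits) if p is not None]
print(len(pairs), "preference pair(s) from", len(edits), "edits")
```

The prompt/chosen/rejected layout is the shape most DPO tooling expects, which is why it is used here; how rejections without a replacement output would be encoded is left open.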

Defining Valid Oversight: The Domain Constitution

As discussed under Related Work, existing approaches to preference collection and formal AI control operate at the level of model output. We define a domain constitution: a set of formal conditions under which human corrections on agentic output constitute valid oversight signal. Where Constitutional AI asks "does this output satisfy these principles?" and ontology-controlled systems govern what the system produces, the domain constitution governs when corrections on system output constitute valid training signal, shifting formal control from the generation process to the oversight process.

Not all human–agent interaction produces oversight signal. A user who copies an LLM snippet into a one-off script provides no oversight. A crowd annotator who rates two completions provides weak oversight, ungrounded in real consequences. The domain constitution specifies the boundary conditions that separate noise from signal.

Two-Axis Oversight Signal

Artifact-level: what was corrected between agentic output and final artifact—edit distance, semantic change class, structural changes. This axis captures the content of oversight: which agentic behaviors the human deemed unacceptable, and how they were remediated.

Process-level: how the correction was made—keystroke timing patterns, idle gaps, app-switching trajectory, voice context. Captured only with OS-level instrumentation running in parallel. This axis captures the cognitive cost of oversight: how much effort the correction required, what external information was consulted, and whether the human deliberated or corrected reflexively.

Five Conditions of the Domain Constitution

The domain constitution specifies five conditions that must hold simultaneously for human edits on agentic output to constitute valid oversight. Each condition addresses a specific failure mode that would render the edit-trace uninformative or misleading as a training signal.

Why necessary: Without persistent shared state, the human's corrections are context-free—they reflect preferences over isolated outputs rather than oversight over an evolving system. Persistent state ensures that each correction is informed by the full history of prior agent behavior and its cumulative consequences.

Why necessary: Single-turn corrections cannot capture oversight over compositional failure modes—cases where each individual agent output appears adequate but the composition fails. When a practitioner corrects an architectural decision because it conflicts with a decision made three weeks earlier, the resulting edit-trace encodes long-range dependency information that no single-turn annotation scheme can capture.

Why necessary: Oversight that rests on subjective preference alone is indistinguishable from taste. When corrections are grounded in observable system behavior—a deployment that failed, a latency spike, an error rate increase—the edit-trace encodes causal information about what works and what does not.

Why necessary: Oversight is meaningful precisely because the overseer holds information the overseen system lacks. If the human's corrections reflect only information already available to the agent, the edit-trace is redundant with the agent's own uncertainty.

Why necessary: Oversight signal must connect to real consequences to avoid the same detachment that afflicts crowd annotation. When corrected artifacts ship and succeed or fail in production, the edit-trace acquires outcome labels that close the loop between correction and consequence.

Instantiation by the Case Study

This domain constitution is instantiated by the author's production work: 1,547 merged PRs across 7 interconnected projects over 105 days using Claude Code as primary agentic counterpart. The core platform (Legal.org.ua, 1,393 PRs) produces a deployed legal AI platform with a 380M+ records pipeline and 70+ MCP tools. Satellite projects (154 PRs) cover due diligence intelligence (SneakyPiper, 73 PRs), LinkedIn lead automation (aipromo, 39 PRs), meeting scheduling (Calendary, 27 PRs), OSINT aggregation (Panoptic, 10 PRs), and OS-level activity tracking (XSISTANT, 5 PRs). Measurable downstream outcomes include selection by Google for Startups, introduction to Deloitte via GFS, and acceptance into the NVIDIA Inception Program.

Each of these acceptances was achieved through written applications without prior voice conversations, in-person meetings, or warm introductions from accelerator mentor networks. The applications themselves were drafted using the same recursive workflow that produced the underlying product, demonstrating that the workflow generalizes from code production to high-stakes written communication with measurable institutional gatekeepers.

What Fails to Constitute Valid Oversight

The domain constitution also defines its negation—interaction patterns that fail one or more conditions and therefore do not produce valid oversight signal:

Edit Taxonomy

Six semantic change classes: cosmetic, reorganization, factual_correction, tone_adjustment, substantive_rewrite, rejection. Classification is two-phase: rule-based boundaries (edit_distance_norm < 0.05 = cosmetic, ≥ 0.80 = substantive_rewrite), then Claude Sonnet 4.6 via AWS Bedrock for the ambiguous middle range. Coverage: 99.96%.
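The rule-based first phase can be sketched as below. The two thresholds come from the text. The definition of `edit_distance_norm` as character-level Levenshtein distance divided by the longer string's length is an assumption, since the paper does not state its normalization. Returning `None` stands in for deferring to the LLM classifier.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (each operation costs 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_norm(original: str, edited: str) -> float:
    """Assumed normalization: distance divided by the longer length."""
    longest = max(len(original), len(edited))
    return 0.0 if longest == 0 else levenshtein(original, edited) / longest


def rule_based_class(original: str, edited: str) -> Optional[str]:
    """Phase 1 only: label the clear cases, return None for the middle range."""
    distance = edit_distance_norm(original, edited)
    if distance < 0.05:
        return "cosmetic"
    if distance >= 0.80:
        return "substantive_rewrite"
    return None  # ambiguous: would be passed to the LLM classifier


print(rule_based_class("return total", "return total"))
print(rule_based_class("aaaa", "bbbb"))
print(rule_based_class("kitten", "sitting"))
```

Resolving the cheap extremes with rules means the LLM is called only for edits whose distance alone does not determine the class.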

Data Collection Architecture

Workflow-Level Capture

Three retrospective extractors feed the rlhf-signals module:

Schema: workflow_sessions → workflow_artifacts → workflow_edits → workflow_outcomes.

GitHub PR velocity (core platform): 1,393 merged PRs over 105 days (87 active). Peak: March 790 PRs (25.5/day). Median time-to-merge: 30 seconds (77.8% under 5 min)—solo-practitioner auto-merge pattern. PR timestamps do not reflect editing time; real duration is reconstructed from OS-level activity.

One Shipping Operation, Multiple Technical Surfaces

These are not separate projects in different business domains. They are components of one shipping operation—making Legal.org.ua succeed—spanning different technical surfaces. All 1,547 PRs serve one outcome: the platform works, has paying customers, and wins institutional validation.

Technical surfaces within the core platform (1,393 PRs): frontend (React 19, Vite, TailwindCSS), backend (Express, MCP protocol, 70+ tool handlers), data engineering (court decision harvesting, 380M+ records), database (PostgreSQL migrations, Qdrant vector indexing, Redis caching), DevOps (Docker, nginx, CI/CD, blue-green deployment), content (blog, SSG, SEO), and shared TypeScript packages.

OS-Level Activity Instrumentation

Parallel to workflow tracking, an OS-level activity tracker records 5-second activity buckets:

Storage: ~38 MB for 21 continuous days. Both databases store timestamptz in UTC—cross-source alignment verified to <3 seconds.

Cross-Source Linking

For each edit with time window [T_1, T_2]: query activity in [T_1 - 30s, T_2 + 30s], aggregate process features, classify window sessions by category (code_editing / research / communication / documentation / unrelated).
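A minimal sketch of this linking step follows, assuming activity is available as in-memory 5-second buckets. The `ActivityBucket` fields, the state labels, and the "dominant category" summary are hypothetical; the 30-second padding and the bucket size are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

PADDING = timedelta(seconds=30)  # margin around the edit window, from the text
BUCKET_SECONDS = 5               # tracker bucket size, from the text


@dataclass(frozen=True)
class ActivityBucket:
    start: datetime  # UTC start of the bucket
    state: str       # "active", "passive" or "idle" (hypothetical labels)
    category: str    # e.g. "code_editing", "research"


def link_edit(t1: datetime, t2: datetime, buckets: list) -> dict:
    """Aggregate activity inside [t1 - 30s, t2 + 30s] for one edit."""
    window_start, window_end = t1 - PADDING, t2 + PADDING
    state_seconds, category_seconds = Counter(), Counter()
    for bucket in buckets:
        if window_start <= bucket.start <= window_end:
            state_seconds[bucket.state] += BUCKET_SECONDS
            category_seconds[bucket.category] += BUCKET_SECONDS
    top = category_seconds.most_common(1)
    return {
        "state_seconds": dict(state_seconds),
        "category_seconds": dict(category_seconds),
        "dominant_category": top[0][0] if top else None,
    }


def utc(hour: int, minute: int, second: int) -> datetime:
    return datetime(2026, 4, 1, hour, minute, second, tzinfo=timezone.utc)


buckets = [
    ActivityBucket(utc(7, 59, 25), "active", "research"),      # before the window
    ActivityBucket(utc(7, 59, 30), "active", "research"),      # on the lower bound
    ActivityBucket(utc(8, 0, 30), "active", "code_editing"),
    ActivityBucket(utc(8, 0, 35), "idle", "code_editing"),
    ActivityBucket(utc(8, 1, 35), "idle", "unrelated"),        # after the window
]
features = link_edit(utc(8, 0, 0), utc(8, 1, 0), buckets)
print(features)
```

Using timezone-aware UTC datetimes on both sides mirrors the paper's note that both databases store timestamptz in UTC, which is what makes a plain interval comparison safe.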

Practitioner Disambiguation Sessions

Automated activity tracking captures what app or window was active, but cannot determine why. A YouTube video about Ukrainian court procedure, a YouTube video about astrophysics, and a YouTube video about cooking all look identical in the activity data: same wm_class, similar engagement metrics. Yet their relationship to the subsequent editing session is fundamentally different.

We address this through periodic practitioner disambiguation sessions—structured interviews where the practitioner reviews ambiguous activity windows from the preceding period and provides ground-truth labels. Ambiguous activity categories requiring practitioner input:

These labels feed back into the workflow_edit_engagement table as a disambiguation_label field, enabling more accurate window category computation and a novel process feature: cross_topic_inspiration_ratio.

Outcome Attribution

Automated attribution with confidence levels: strong (temporally proximate, causally linkable—PR merged, no revert in 30d), medium (present but confounded), weak (causally tenuous).

Verified Pilot Dataset

All numbers verified from production databases as of May 8, 2026.

Workflow Data

Edit Distribution (Oversight Signal)

Edit distance (normalized): mean = 0.807, median = 0.839, P25 = 0.743, P75 = 0.927, P95 = 0.987. The practitioner's default mode is near-total rewrite.

Process-Level Data

Bimodal work pattern: 07–11 UTC primary peak (1,376 active windows), 19–21 UTC secondary peak. Approximately 13% real engagement time (6.7% active, 6.3% passive, 87% idle).

Overlap Window

The main PR burst (Feb–Mar, 1,156 PRs at 25.5/day) occurred before XSISTANT launched. Process-level enrichment covers steady-state work (4.4 PRs/day), not peak sprint.

Experiments

Four experiments with progressively higher compute requirements. Experiments 1–3 require no GPU.

Experiment 1: Oversight vs. Annotation—Distributional Difference (RQ1)

Status: Phase A complete.

Sample N=200 LLM outputs (stratified by semantic class, min 10 per class), send to crowd annotation platform. Compare in-the-loop oversight corrections vs. detached crowd annotation: edit distance distributions (KS test), semantic class breakdown, inter-annotator agreement (Krippendorff's α). The central question is whether corrections applied during live agentic workflows differ in kind from labels applied after the fact.
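For the edit-distance comparison, the two-sample Kolmogorov–Smirnov statistic is the largest gap between the two empirical CDFs. The sketch below computes only that statistic in pure Python; the sample values are invented for illustration and are not study data. A p-value would normally come from a statistics library such as SciPy's `ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    for x in a + b:
        # bisect_right counts values <= x, giving the right-continuous ECDF.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical normalized edit distances, for illustration only.
oversight_edits = [0.74, 0.81, 0.84, 0.93, 0.99]
crowd_edits = [0.02, 0.05, 0.10, 0.30, 0.85]
print(ks_statistic(oversight_edits, crowd_edits))
```

Because both ECDFs are step functions, checking the gap at the observed points is enough to find the supremum.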

Phase A results (sampling completed May 8, 2026): 19,455 eligible samples after PII filtering (from 21,461). Stratified allocation: substantive=144, cosmetic=15, reorganization=11, rejection=10, factual=10, tone=10. Two JSONL exports: full metadata + platform-ready (no oversight edits shown to annotators). Deterministic seeded PRNG for reproducibility.
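One way to obtain a stratified, reproducible sample with a per-class floor is sketched below. The allocation rule (floor first, then proportional split of the remainder with largest-remainder rounding) is an assumption: the paper reports the resulting allocation but not the rule that produced it. The class sizes and the seed are hypothetical.

```python
import random


def allocate(class_counts: dict, total: int, min_per_class: int) -> dict:
    """Give each class its floor, then split the rest proportionally."""
    allocation = {c: min(min_per_class, n) for c, n in class_counts.items()}
    remaining = total - sum(allocation.values())
    spare = {c: n - allocation[c] for c, n in class_counts.items()}
    spare_total = sum(spare.values())
    if remaining < 0 or remaining > spare_total:
        raise ValueError("total is incompatible with the floors or pool size")
    if remaining == 0:
        return allocation
    quotas = {c: remaining * s / spare_total for c, s in spare.items()}
    for c, quota in quotas.items():
        allocation[c] += int(quota)
    # Largest-remainder rounding: hand out the seats lost to truncation.
    leftover = total - sum(allocation.values())
    by_fraction = sorted(quotas, key=lambda c: quotas[c] - int(quotas[c]),
                         reverse=True)
    for c in by_fraction[:leftover]:
        allocation[c] += 1
    return allocation


def stratified_sample(ids_by_class: dict, total: int, min_per_class: int,
                      seed: int) -> dict:
    """Draw the allocated number of ids per class with a seeded PRNG."""
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    counts = {c: len(ids) for c, ids in ids_by_class.items()}
    allocation = allocate(counts, total, min_per_class)
    # Sorted iteration keeps the PRNG call order stable across runs.
    return {c: rng.sample(sorted(ids_by_class[c]), allocation[c])
            for c in sorted(ids_by_class)}


# Hypothetical class sizes, not the study's counts.
pool = {
    "substantive_rewrite": list(range(1000)),
    "cosmetic": list(range(1000, 1100)),
    "rejection": list(range(1100, 1112)),
}
sample = stratified_sample(pool, total=50, min_per_class=10, seed=42)
print({c: len(ids) for c, ids in sample.items()})
```

Sorting both the class names and the ids before sampling matters for reproducibility: a seeded generator only gives the same draw if it is consumed in the same order.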

Expected: Oversight corrections show heavier tail (80.7% substantive_rewrite already—crowd annotators, operating without production context, are unlikely to match this intensity).

Experiment 2: Behavioral Context of Oversight Actions (RQ2)

Two predictive models on the 498-session overlap subset: Model A (artifact-only) vs. Model B (artifact + behavioral-context features). Compare AUC, SHAP feature importance, permutation test. With only 64 outcomes in overlap, we use edit-class proxy labels for statistical power.

Cross-source linking. Joined XSISTANT OS-level activity data (52,272 activity scores, 16,122 window sessions) with workflow edits via artifact timestamps. Alignment verified to <3s. Result: 10,846 edits processed; 6,753 (62.3%) with process data.

Process features computed per edit: active/passive/idle seconds, keystroke counts, mouse distance, idle gap analysis, app switching count, research switches, voice context, window dwell entropy, window category seconds.

Model comparison. Target variable: binary—substantive_rewrite (1) vs. cosmetic (0). N=6,152 edits (5,740 substantive, 412 cosmetic). 5-fold stratified cross-validation.

Permutation test (1,000 iterations, behavioral-context features shuffled): p < 0.001—behavioral-context features carry statistically significant, non-random signal.
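The logic of a permutation test can be sketched on a single feature. This is a simplification of the procedure described above: instead of retraining a model with shuffled behavioral-context features, it scores one feature directly by rank-based AUC. The data, the seed, and the function names are hypothetical.

```python
import random


def auc(scores, labels) -> float:
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def permutation_p_value(scores, labels, n_permutations=1000, seed=0) -> float:
    """Share of label-independent shuffles that match or beat the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(scores)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between feature and label
        if auc(shuffled, labels) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed ordering as one of the permutations.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical feature that separates the two classes perfectly.
labels = [0] * 10 + [1] * 10
scores = list(range(20))
p_value = permutation_p_value(scores, labels)
print(auc(scores, labels), p_value)
```

With 1,000 shuffles the smallest attainable value under this estimator is 1/1001, so a reported p < 0.001 corresponds to no shuffle matching the observed score.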

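Paired t-test (RF, 5 folds): p = 0.003. The AUC difference between Model A and Model B is statistically significant, but for RF it runs in the negative direction: adding behavioral-context features lowers AUC (0.903 → 0.874).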

Interpretation. Behavioral-context signal is real and non-random (permutation p < 0.001), but does not improve Random Forest prediction of edit class. This is a nuanced result: (1) The proxy target is already well-predicted by artifact features alone (AUC 0.903), leaving little room for improvement. The 14:1 class imbalance further limits discriminative contribution. (2) Behavioral-context features help linear models (+0.065 AUC), suggesting the signal exists but is captured non-linearly by artifact features in RF. (3) The real test requires real outcomes—only 64 outcomes exist in the overlap window, insufficient for outcome prediction.

Behavioral-context signal exists (permutation proof) but is largely redundant with artifact signal at this scale and target definition. This suggests that artifact-level capture of oversight corrections is sufficient for most preference learning, and behavioral context adds value primarily for edge cases or different prediction targets.

Experiment 3: Oversight Corrections and Downstream Outcomes (RQ3)

Level 1: Full Dataset (Artifact-Only)

30,499 edits joined with 1,579 outcomes.

Key finding: Rejection (completely halting the agentic trajectory) has the highest positive outcome rate. This suggests the most valuable oversight signal is the binary accept/halt decision, not granular edit depth. The overseer's willingness to say "no, start over" is more predictive of good outcomes than careful correction. This has direct implications for scalable oversight: the single highest-value data point is whether a human stopped the agent.

Negative correlation of edit distance with outcomes: smaller corrections correlate with better outcomes (r = -0.116). When the agent's output is already close to what the overseer needs, the final product tends to succeed. Heavy rewrites may indicate the agent was on the wrong trajectory.
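The two summaries used at this level, positive-outcome rate per edit class and the correlation between edit distance and a binary outcome, can be computed as sketched below. The rows are invented for illustration and do not reproduce the study's figures. Pearson's r against a 0/1 outcome is the point-biserial correlation.

```python
from collections import defaultdict
from math import sqrt


def positive_rate_by_class(rows) -> dict:
    """rows: (edit_class, edit_distance_norm, outcome) with outcome in {0, 1}."""
    totals, positives = defaultdict(int), defaultdict(int)
    for edit_class, _, outcome in rows:
        totals[edit_class] += 1
        positives[edit_class] += outcome
    return {c: positives[c] / totals[c] for c in totals}


def pearson_r(xs, ys) -> float:
    """Pearson correlation; with a 0/1 variable this is point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical rows, for illustration only.
rows = [
    ("cosmetic", 0.02, 1),
    ("cosmetic", 0.04, 0),
    ("substantive_rewrite", 0.85, 0),
    ("substantive_rewrite", 0.90, 1),
    ("substantive_rewrite", 0.95, 0),
    ("substantive_rewrite", 0.99, 0),
]
rates = positive_rate_by_class(rows)
r = pearson_r([d for _, d, _ in rows], [o for _, _, o in rows])
print(rates, round(r, 3))
```

A per-class rate and a single correlation coefficient answer different questions, which is why the text can report both a high rate for rejections and a weak negative r overall.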

Level 2: Overlap Subset (Artifact + Behavioral Context)

807 edits with both behavioral-context data and outcomes (720 with binary positive/negative label). Engagement quartile analysis shows Q4 (highest engagement) has visible positive outcome enrichment compared to Q1–Q3, but the effect is modest. The small sample size limits statistical power for definitive engagement–outcome claims.

Confound Controls

Hour of day, day of week, and session source all affect outcome rates independently of edit patterns. The bimodal work schedule (07–11 UTC peak, 19–21 UTC secondary) introduces temporal confounds that must be controlled in any outcome prediction model.

Experiment 4: Training on Oversight-Trace vs. Annotation Preferences (RQ4)

Redesigned based on the findings of Experiments 1–3.

The flagship experiment requiring GPU compute. Simplified from 5 to 4 core conditions after Experiments 2–3 showed behavioral-context weighting is unlikely to improve over uniform weighting.

Four training conditions on Llama 3.1 8B or Qwen 2.5 7B (open-weight):

Method: DPO (rafailov2023direct). Evaluation: win-rate (GPT-4 judge + human N=100), domain accuracy, AlpacaEval 2.0, length-controlled win rate.

Primary metrics—three comparisons: A vs. D (edit-trace improves stock model), A vs. C (human oversight vs. AI self-correction), A vs. E (domain-specific vs. general RLHF).

Estimated cost: $310–380 (see the Appendix).

Cross-Experiment Synthesis

The Three Findings

Finding 1 (Exp 1): Oversight corrections appear qualitatively different from annotation. 80.7% of all edits are substantive rewrites, and the median normalized edit distance is 0.84: the overseer's default mode is near-total rewrite of LLM output. We expect this distribution to differ from that of crowd annotators, who operate without production context or domain stakes and are likely to favor safe, cosmetic edits. Only Phase A (sampling) of Experiment 1 is complete, so the crowd comparison itself is still pending.

Finding 2 (Exp 2): Behavioral context is methodologically important but computationally redundant. Permutation testing confirms behavioral-context features carry statistically significant signal (p < 0.001). However, they do not improve Random Forest prediction of edit class beyond what token counts alone achieve (AUC 0.903 → 0.874, actually worse). Behavioral-context features help linear models (+0.065 AUC) but are captured non-linearly by artifact features in tree-based models. The behavioral-context axis is a contribution to methodology, not to prediction performance.

Finding 3 (Exp 3): The most valuable oversight action is halt/reject. Completely rejecting LLM output—halting the agentic trajectory—correlates with 78% positive outcomes, far higher than substantive rewrites (48.7%) or cosmetic edits (52.7%). Edit distance negatively correlates with outcomes (r = -0.116): the less the overseer changes, the better the result. The most informative oversight signal is binary (accept vs. halt), not continuous (edit distance).

Implications for DPO Training

The original Experiment 4 design had 5 training conditions, with the primary hypothesis being that behavioral-context-weighted preferences (Condition B) would outperform uniform-weighted (Condition A). The data now challenges this:

Behavioral-context weighting is unlikely to help. Experiment 2 showed behavioral-context features do not improve prediction. Experiment 3 showed engagement quartiles barely differentiate outcomes. The α-weighted DPO formula (weight = 1 + α · engagement_score) would scale pairs by a signal that is statistically real but practically redundant with artifact features.
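The weighting scheme under discussion can be made concrete with a small numeric sketch. The per-pair loss is the standard DPO objective from rafailov2023direct; the weight follows the formula in the text. Normalizing by the sum of weights, the `beta` value, and the toy log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pair_weight(engagement_score: float, alpha: float) -> float:
    """Weight from the text: 1 + alpha * engagement_score."""
    return 1.0 + alpha * engagement_score


def weighted_dpo_loss(pairs, alpha: float, beta: float = 0.1) -> float:
    """Weighted mean of per-pair losses (normalization is an assumption)."""
    weights = [pair_weight(p["engagement_score"], alpha) for p in pairs]
    losses = [dpo_loss(*p["logps"], beta=beta) for p in pairs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy log-probabilities: (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
pairs = [
    {"logps": (-1.0, -2.0, -1.5, -1.5), "engagement_score": 0.0},
    {"logps": (-2.0, -1.0, -1.5, -1.5), "engagement_score": 1.0},
]
uniform = weighted_dpo_loss(pairs, alpha=0.0)
weighted = weighted_dpo_loss(pairs, alpha=1.0)
print(round(uniform, 4), round(weighted, 4))
```

Setting `alpha = 0` recovers uniform weighting, so the uniform and weighted conditions differ only in this one parameter.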

The real value is in the distribution, not the weighting. The overseer's 80.7% substantive rewrite rate and 3.6% rejection rate create a fundamentally different preference distribution than either AI self-correction or general-purpose RLHF data. Training on this distribution (even with uniform weights) should produce different model behavior than training on RLAIF or public preference data. Replacing crowd annotation with RLAIF (Condition C) and public RLHF data (Condition E) enables matched-volume comparison (24,495 pairs per condition) and eliminates the annotation bottleneck.

Summary

The synthesis of Experiments 1–3 yields a clear primary contribution: oversight-trace—the edit signal captured during production agentic workflows under a domain constitution—constitutes a fundamentally different preference distribution than detached annotation. The distributional difference (80.7% substantive rewrites, 78% halt-positive rate) is substantial. Whether this distributional difference translates into improved domain-specific model performance is the subject of Experiment 4.

The behavioral-context null result is itself a contribution: it shows that capturing what was corrected is more informative than how the correction was performed, simplifying the instrumentation requirements for future deployments of this methodology.

Threats to Validity

The central methodological challenge of this work is that the study subject, the sole annotator, and the author are the same person. We explicitly enumerate the resulting threats and the mitigations available within a single-practitioner protocol.

Construct Validity: Oversight Signal vs. Practitioner Skill

The domain constitution claims to define conditions under which edit-traces constitute valid oversight signal. However, the observed edit distribution (80.7% substantive rewrites, median edit distance 0.84) could alternatively reflect practitioner-specific editing style rather than a general property of in-the-loop oversight. That is, a different practitioner meeting all five constitutional conditions might produce a fundamentally different edit distribution.

Mitigation: The multi-practitioner cohort is designed specifically to disentangle practitioner-specific style from constitution-induced oversight properties. Within the current single-subject study, we note that the edit distribution is consistent across 7 technically diverse repositories and multiple technical surfaces (frontend, backend, data engineering, DevOps, content), suggesting the pattern reflects the workflow regime rather than a single domain skill. However, this remains a conjecture until replicated with additional subjects.

Internal Validity: Outcome Metrics

The outcomes cited (Google for Startups acceptance, NVIDIA Inception, paying customers) validate the product, not the methodology. They demonstrate that the recursive workflow produces shippable software, but do not directly establish that the extracted edit-traces are superior training signal compared to crowd annotation. This distinction is critical: positive product outcomes are a necessary condition for the edit-traces to carry meaningful signal (per Condition C5), but are not sufficient evidence that training on those traces improves downstream model performance.

Mitigation: Experiment 4 is designed to test the methodological claim directly: does DPO training on edit-trace preferences (Condition A) outperform training on matched-volume AI-feedback preferences (Condition C) and public RLHF data (Condition E) on domain-specific evaluation? Until Experiment 4 completes, the product outcomes serve only as evidence that Condition C5 (consequential grounding) is satisfied, not as evidence of the edit-trace's superiority as training signal.

Selection Bias: Survivorship in the Dataset

The dataset contains only edit-traces from trajectories that culminated in merged PRs and working features. Agent outputs that the practitioner accepted without correction but that later caused production failures are absent—there is no record of oversight that should have occurred but did not. Similarly, abandoned trajectories (work started but not completed) are systematically excluded by the retrospective extraction pipeline, which keys on merged PRs and resolved issues.

Mitigation: The survivorship bias is partially addressed by Experiment 3's finding that rejection (halting the agent) correlates with 78% positive outcomes. This demonstrates that the dataset does capture instances where the practitioner stopped an unproductive trajectory, which is a form of negative signal. However, the absence of false-negative oversight (accepted outputs that later failed) remains a structural limitation. Future work on failed oversight trajectories is designed to address this gap directly.

External Validity: Generalizability Beyond N=1

This study establishes a protocol and demonstrates its feasibility with one practitioner in one domain (legal AI). No population-level claims are made or implied. The domain constitution is domain-specific by design (Section ), and we expect different practitioners to require different constitutional instantiations. Cross-domain generalizability is an explicit non-goal of the current work; the contribution is the methodology for capturing and validating edit-trace oversight, not its universal applicability.

Confound: Temporal Autocorrelation

Edit-traces within the same session are temporally autocorrelated: corrections early in a session shape the context for later corrections. Treating individual edit pairs as independent samples (as in Experiments 1–3) may overstate statistical significance. We report this as a known confound; future work should explore session-level modeling or hierarchical approaches that respect the nested structure (edits within sessions within projects).
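One simple way to respect the nesting, short of a full hierarchical model, is to resample whole sessions rather than individual edit pairs when estimating uncertainty. The sketch below is illustrative only and is not the analysis used in Experiments 1–3: the function name `session_bootstrap_ci`, the `(session_id, outcome)` record layout, and the choice of a percentile cluster bootstrap are all assumptions.

```python
import random
from collections import defaultdict


def session_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for a positive-outcome rate, resampling whole sessions.

    records: iterable of (session_id, outcome) pairs, outcome in {0, 1}.
    The record layout is hypothetical, chosen for this illustration.
    Returns (point_estimate, ci_lower, ci_upper).
    """
    # Group outcomes by session so that a session is the resampling unit.
    by_session = defaultdict(list)
    for session_id, outcome in records:
        by_session[session_id].append(outcome)
    sessions = list(by_session.values())
    if not sessions:
        raise ValueError("no records")

    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Draw sessions with replacement; edits inside a session stay together.
        sample = [rng.choice(sessions) for _ in sessions]
        total = sum(len(s) for s in sample)
        rates.append(sum(sum(s) for s in sample) / total)
    rates.sort()

    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = sum(sum(s) for s in sessions) / sum(len(s) for s in sessions)
    return point, lower, upper
```

If edits within a session are positively correlated, an interval computed this way will typically be wider than one that treats each edit pair as independent, which is the direction of the bias described above.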

Limitations

Future Work

Discussion

Composition as an Alternative Frame for Scalable Oversight

The empirical pattern documented here—one practitioner achieving output that neither human nor agent would reach independently—suggests a framing for scalable oversight that differs from the standard capability-gap model. In the standard model, oversight is a problem that grows harder as agents grow more capable (bowman2022measuring, burns2023weak). In the regime observed here, the agent's capability is not a threat to be managed but a throughput multiplier whose output the practitioner corrects at each step.

We note—speculatively—that this may represent a distinct equilibrium: rather than moving toward full autonomy as agents improve, the practitioner-agent composition may deepen, with the human's corrections becoming more targeted and architecturally informed as the agent handles more routine execution. Whether this equilibrium is stable under further capability scaling is an open empirical question that this single-subject study cannot resolve.

The Edit-Trace as Minimal Viable Oversight Signal

Experiment 3's finding—that binary rejection (78% positive outcome rate) is more informative than continuous edit distance—has practical implications. If the highest-value oversight signal is whether a human stopped the agent, then scalable oversight instrumentation may be simpler than expected: a binary accept/reject log per trajectory step, with outcome tracking, may capture the majority of useful preference signal. This hypothesis is testable in Experiment 4 by comparing DPO training on full edit-traces vs. binary accept/reject pairs.
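A minimal sketch of what such a binary log could look like is given below. The record fields (`trajectory_id`, `step`, `accepted`, `outcome_positive`) and the helper `positive_rate_by_action` are hypothetical names introduced for illustration; the paper's actual instrumentation schema is not described in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepRecord:
    """One accept/reject decision on one trajectory step (hypothetical schema)."""
    trajectory_id: str
    step: int
    accepted: bool                           # True = accept, False = reject (halt)
    outcome_positive: Optional[bool] = None  # attributed later; None = not yet known


def positive_rate_by_action(log):
    """Positive-outcome rate per action, skipping steps without an outcome."""
    counts = {"accept": [0, 0], "reject": [0, 0]}  # [positive, total]
    for rec in log:
        if rec.outcome_positive is None:
            continue  # outcome not yet attributed
        key = "accept" if rec.accepted else "reject"
        counts[key][1] += 1
        counts[key][0] += int(rec.outcome_positive)
    # None marks an action with no attributed outcomes, avoiding division by zero.
    return {k: (pos / tot if tot else None) for k, (pos, tot) in counts.items()}
```

Keeping the outcome field optional reflects the fact that outcomes are attributed after the decision is logged, so a step can exist in the log before its downstream result is known.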

Scope of Claims

This paper makes a methodological contribution (how to capture and validate edit-trace oversight) and reports empirical findings from a single case study (Experiments 1–3). It does not claim that edit-trace oversight is universally superior to RLAIF self-correction or general-purpose RLHF data—that claim requires Experiment 4's completion (where Condition C tests against AI self-correction rather than crowd annotation) and replication across multiple practitioners. The observed distributional difference between edit-trace corrections and expected crowd annotation patterns is a descriptive finding; its downstream utility for RLHF training remains to be demonstrated.

Conclusion

We present edit-trace oversight as a methodology for capturing alignment signal natively from production agentic workflows. The key empirical findings from a single-practitioner case study are: (1) the edit distribution under production accountability is extreme (80.7% substantive rewrites, median edit distance 0.84), and is expected to differ significantly from crowd annotation (Experiment 1, Phase B pending); (2) behavioral-context features (keystroke timing, idle gaps, app switching) carry statistically significant signal (p < 0.001) but are largely redundant with artifact-level features for predictive tasks (Experiment 2); (3) binary rejection of agent output is the single most informative oversight action, correlating with 78% positive downstream outcomes (Experiment 3).

The proposed domain constitution—five formal conditions under which edit-traces constitute valid oversight—provides a framework for extending this methodology to multi-practitioner cohorts. Whether the observed distributional difference between edit-trace oversight and crowd annotation translates into improved model performance via DPO training (Experiment 4) remains an open empirical question.

The dataset (30,510 edit pairs, 2,892 sessions, 1,579 attributed outcomes) and instrumentation code will be released publicly upon completion of Experiment 4 and PII review.

Compute Budget and Project Status

Experiment 4 requires GPU compute for DPO training. Estimated budget:

As of May 2026: Experiments 1–3 complete. Experiment 4 (DPO training with 4 conditions) is pending compute allocation.

