Does AI-Assisted Task Completion Affect Estimation Accuracy in Small Software Teams? A Crossover Study Design

K. Brady Davis
CloudSurf Software LLC, Las Vegas, NV, USA
brady@cloudsurf.com

Abstract

AI coding assistants such as GitHub Copilot and Claude Code are now used daily by over half of professional developers, yet their effect on task estimation accuracy---a core planning activity in software teams---remains unstudied. Existing controlled experiments measure AI's effect on task completion time, finding speedups of 20--55% for isolated tasks but slowdowns of up to 19% for experienced developers on familiar codebases. None of these studies examine estimation: whether developers can predict how long AI-assisted work will take, and whether estimation errors compound into planning failures at the team level. This paper proposes a within-subjects crossover study targeting small remote software teams (3--6 engineers). Teams estimate and complete tasks under two conditions: AI-assisted (developers freely use AI coding tools) and unassisted (AI tools disabled). The primary outcome is estimation accuracy, measured as the ratio of actual to estimated task duration. Secondary outcomes include estimation bias direction, task completion rate, and developer confidence calibration. We present the study design, hypotheses, metrics, analysis plan, and threats to validity. No empirical results are reported; data collection is planned for Q3 2026.

Keywords: AI coding assistants, estimation accuracy, software process, developer productivity, human-AI collaboration, empirical study design

1. Introduction

Software effort estimation is difficult. Decades of research document persistent overruns: projects routinely exceed estimates by 30--50%, a pattern that has not improved substantially despite new estimation methods [1, 2]. Estimation accuracy matters because it directly determines schedule reliability, resource allocation, and stakeholder trust. For small teams---where a single mis-estimated task can consume a large fraction of weekly capacity---the cost of estimation error is disproportionately high.

AI coding assistants have rapidly entered the developer workflow. As of 2025, over half of professional developers use AI tools daily [3]. Controlled experiments show that AI assistance can reduce task completion time by 20--55% for well-scoped tasks [3, 4], but a recent randomized controlled trial found that experienced developers working on familiar, complex codebases were 19% slower with AI tools, despite predicting a 24% speedup [5]. Developers' perception of AI's benefit diverges from measured reality: in the METR study, participants still believed AI helped after the experiment showed it did not [5].

This perception gap has direct implications for estimation. If developers believe AI will make a task faster but it does not---or if AI makes a task faster in ways the developer cannot predict in advance---then estimation accuracy should degrade when AI is introduced. Conversely, if AI's effect on task duration is consistent and learnable, developers may calibrate their estimates over time. Neither possibility has been tested empirically.

Research question. Does the use of AI coding assistants change estimation accuracy for software tasks in small teams, and if so, in what direction?

This paper makes three contributions:

A focused research question connecting two well-studied areas---software estimation and AI-assisted development---that have not been studied together.
A within-subjects crossover study design that controls for between-team variability and is feasible with 4--6 small teams.
An analysis plan with pre-specified hypotheses, metrics, and statistical procedures suitable for pre-registration.

The study does not evaluate whether AI tools are "good" or "bad" for productivity. It asks a narrower question: can developers accurately predict how long AI-assisted tasks will take? This question matters regardless of whether AI speeds up or slows down development, because planning reliability depends on estimation accuracy, not on raw speed.

2. Related Work

2.1 Software Estimation

Software effort estimation has been studied for decades. Jorgensen's review of expert estimation studies finds that human judgment dominates practice and that estimation accuracy is sensitive to anchoring, task decomposition, and feedback loops [2]. Surveys of project outcomes consistently report that 60--80% of software projects exceed their estimates [1]. More recent work compares relative estimation (story points) to absolute estimation (hours), finding that neither method dominates and that accuracy depends on context, team calibration, and task granularity [6, 7].

A common finding across estimation research is that estimation improves with feedback: developers who regularly compare estimates to actuals become better estimators [2]. This implies that estimation accuracy depends on the stability of the task execution environment. When that environment changes---as it does when AI tools are introduced---the calibration built from past feedback may no longer apply.

2.2 AI Coding Assistants: Productivity Effects

Three controlled experiments provide the empirical foundation for understanding AI's effect on development speed:

Peng et al. (2023) [3]: 95 developers implemented an HTTP server in JavaScript. The Copilot-assisted group completed the task 55.8% faster (95% CI: 21--89%). Participants were recruited online and had varying experience levels. Less experienced developers benefited more.
Cui et al. (2024) [4]: 96 Google engineers completed a realistic C++ coding task. The AI-assisted group finished 21% faster on average. Senior developers saw slightly larger gains. Code quality was not evaluated.
Becker et al. (2025) [5]: 16 experienced open-source developers completed 246 tasks on repositories they maintained (average 5 years of experience, 1M+ lines of code). AI-assisted tasks took 19% longer. Developers predicted a 24% speedup beforehand and still believed AI helped afterward.

The inconsistency across these studies---55% speedup for novice-friendly isolated tasks, 21% speedup in enterprise settings, 19% slowdown for experts on familiar codebases---suggests that AI's effect on task duration is context-dependent and not yet predictable from the developer's perspective. This is precisely the condition under which estimation should suffer.

2.3 Perception-Reality Gap

Vaithilingam et al. [8] found that developers perceived Copilot as useful even when it did not improve their task completion time or success rate, because it "provided a useful starting point." Barke et al. [9] identified two interaction modes---acceleration (developer knows what to do and uses AI to get there faster) and exploration (developer is unsure and uses AI to explore options)---with different productivity profiles. The METR study [5] quantified this gap: developers' post-study estimate of AI's benefit (+20%) was directionally opposite to the measured effect (-19%).

This perception-reality gap is the mechanism through which AI may degrade estimation accuracy. If developers systematically overestimate AI's contribution, they will underestimate task duration when AI is available. If AI's contribution varies unpredictably across tasks, estimation variance should increase regardless of bias direction.

2.4 Situational Awareness and Planning

Endsley's model of situation awareness (SA) defines three levels: perception of elements, comprehension of their meaning, and projection of future states [10]. Estimation is fundamentally a Level 3 SA activity: projecting how long a task will take based on comprehension of its complexity and the execution environment. Parasuraman et al. [11] show that automation can degrade SA when it operates at intermediate levels of autonomy---the operator may not fully understand what the automation is doing, leading to poor projections. AI coding assistants operate at exactly this intermediate level: they generate code that developers must review, accept, modify, or reject, but the developer cannot predict in advance which of these outcomes will occur.

2.5 Gap

No published study examines how AI tool use affects estimation accuracy as distinct from task duration. The three RCTs cited above measure whether AI makes tasks faster or slower; none ask whether developers can predict the answer. This paper addresses that gap.

3. Study Design

3.1 Research Questions and Hypotheses

RQ1: Does AI-assisted task completion change estimation accuracy compared to unassisted task completion?

RQ2: Does AI-assisted task completion change estimation bias direction (systematic over- or under-estimation)?

RQ3: Does estimation accuracy for AI-assisted tasks improve over the study period (i.e., do developers learn to estimate AI-assisted work)?

Hypotheses:

H1 (estimation variance): Estimation error, measured as the mean absolute ratio error |A_i - E_i| / E_i per task, is higher in the AI-assisted condition than in the unassisted condition (i.e., estimation accuracy is worse). Rationale: AI introduces a source of duration variance that developers have not yet learned to predict [5, 8].
H2 (bias direction): In the AI-assisted condition, developers systematically underestimate task duration (actual > estimated), producing a positive bias. Rationale: the perception-reality gap documented by Becker et al. [5] suggests developers overestimate AI's speedup contribution.
H3 (learning): Estimation accuracy in the AI-assisted condition improves from the first week to the second week of the AI-assisted phase. Rationale: estimation calibration improves with feedback [2]; if developers receive rapid actual-vs-estimated signals, they may recalibrate. This is a secondary, exploratory hypothesis.

3.2 Design: Within-Subjects Crossover

The study uses a within-subjects crossover design [12]. Each participating team works under two conditions, with the order counterbalanced across teams:

Condition U (Unassisted): Developers complete tasks without AI coding assistants. Standard tools (IDE, documentation, Stack Overflow, search) are permitted; AI code generation, completion, and chat tools are disabled.
Condition A (AI-assisted): Developers complete tasks with unrestricted access to AI coding assistants of their choice (e.g., Copilot, Cursor, Claude Code).

Protocol (6 weeks total):

Week 0: Onboarding. Teams receive study materials, install tracking instrumentation, practice the estimation-and-logging workflow on 3--5 calibration tasks. Data from Week 0 is not analyzed.
Weeks 1--2: Phase 1. Group X works under Condition U; Group Y works under Condition A.
Week 3: Washout. Teams work normally (no data collection). This reduces carryover effects.
Weeks 4--5: Phase 2. Conditions swap. Group X works under Condition A; Group Y works under Condition U.

Each team serves as its own control, eliminating between-team confounds (codebase complexity, team skill level, domain difficulty). Counterbalancing controls for period effects (teams improving simply because they practiced estimation for longer).
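
The counterbalanced schedule above can be written down compactly. The following is a minimal sketch, not the study's instrumentation: the names `SCHEDULE`, `assign_groups`, and `condition_for` are hypothetical, and the seeded random split is an assumption of this sketch, since the protocol only states that condition order is counterbalanced across teams.

```python
import random

# Condition per week for each order group, following the 6-week protocol:
# week 0 onboarding, weeks 1-2 phase 1, week 3 washout, weeks 4-5 phase 2.
SCHEDULE = {
    "X": ["onboarding", "U", "U", "washout", "A", "A"],
    "Y": ["onboarding", "A", "A", "washout", "U", "U"],
}


def assign_groups(team_ids, seed=0):
    """Split teams into order groups X (U then A) and Y (A then U).

    Random assignment with a fixed seed is an assumption of this sketch.
    With an odd number of teams, group X receives the extra team.
    """
    teams = sorted(team_ids)
    random.Random(seed).shuffle(teams)
    half = (len(teams) + 1) // 2
    return {team: ("X" if i < half else "Y") for i, team in enumerate(teams)}


def condition_for(group, week):
    """Return the condition label for an order group in a given week (0-5)."""
    return SCHEDULE[group][week]
```

With four teams this yields two teams per order group, so each condition is observed equally often in each phase.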

3.3 Participants and Recruitment

Target: 4--6 teams of 3--6 developers each (12--36 individual participants).

Eligibility:

Distributed software team (at least 2 members in different locations).
Active codebase with iterative delivery cadence (at least 5 tasks per developer per week).
Willingness to disable AI tools during the unassisted phase.
Willingness to log estimates before starting each task.

Recruitment channels:

Professional network of the author (direct outreach to engineering leads at startups and small companies).
Developer community forums and Slack/Discord groups focused on remote work and developer tools.
Open-source project teams with active contributors.

Incentives: Aggregate study results shared with all participants; individual team reports comparing their estimation patterns across conditions; acknowledgment in the published paper.

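Sample size rationale. For a paired t-test at alpha = 0.05 (two-sided) and power = 0.80, detecting a medium effect (Cohen's d = 0.5) requires n = 34 paired observations. With 4--6 teams averaging 4 developers each (16--24 developers), and each developer completing approximately 10 tasks per phase, the study generates 160--240 task-level observations per condition. These observations are matched by developer across phases rather than task-to-task, and tasks from the same developer are not independent, so the effective sample size is smaller than the raw task count; the mixed-effects model in Section 3.6 accounts for this nesting. The raw count exceeds the 34-observation minimum with margin for attrition and exclusions.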
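
The n = 34 figure can be checked with a short calculation. This sketch uses a common normal approximation with a small-sample correction term rather than an exact noncentral-t computation, so dedicated power software may differ by about one pair; the function name `paired_t_sample_size` is ours.

```python
from math import ceil
from statistics import NormalDist


def paired_t_sample_size(d, alpha=0.05, power=0.80):
    """Approximate number of pairs for a two-sided paired t-test.

    n ~= ((z_{1-alpha/2} + z_{power}) / d)^2 + z_{1-alpha/2}^2 / 2
    The second term is a small-sample correction to the normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_beta) / d) ** 2 + z_alpha ** 2 / 2
    return ceil(n)


print(paired_t_sample_size(0.5))  # 34
```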

3.4 Task Workflow

For each task during data collection phases, developers follow this workflow:

Estimate. Before starting, record a time estimate in hours (permitted values: 0.5, 1, 2, 4, 8). This uses the timebox granularity from Remote Flow [13].
Record confidence. Rate confidence in the estimate on a 1--5 scale (1 = "very uncertain," 5 = "very confident").
Execute. Complete the task under the assigned condition (AI-assisted or unassisted). A background timer logs elapsed active time.
Log completion. Record the task as done. Actual duration is computed from the timer.
Brief reflection (optional, encouraged). One-sentence note on whether AI helped, hindered, or was irrelevant. Collected for qualitative analysis only.

Tasks are drawn from each team's real backlog---bug fixes, feature implementations, documentation, refactoring. The study does not assign artificial tasks. This increases ecological validity but introduces variability in task type and difficulty.
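
The per-task data captured by this workflow can be illustrated as a small validated record. This is a sketch only: the class and field names are hypothetical and do not describe the study's actual logging schema; the permitted estimate values and the 1--5 confidence scale are taken from the workflow above.

```python
from dataclasses import dataclass

# Permitted estimate values in hours, as listed in the Estimate step.
PERMITTED_ESTIMATES_H = (0.5, 1, 2, 4, 8)


@dataclass(frozen=True)
class TaskRecord:
    """One logged task. Field names are hypothetical, not the study's schema."""

    developer_id: str
    condition: str        # "A" = AI-assisted, "U" = unassisted
    estimate_h: float     # recorded before the task starts
    confidence: int       # 1 = very uncertain ... 5 = very confident
    actual_h: float       # active time reported by the background timer
    reflection: str = ""  # optional one-sentence note, qualitative only

    def __post_init__(self):
        if self.condition not in ("A", "U"):
            raise ValueError(f"unknown condition: {self.condition!r}")
        if self.estimate_h not in PERMITTED_ESTIMATES_H:
            raise ValueError(f"estimate must be one of {PERMITTED_ESTIMATES_H}")
        if self.confidence not in range(1, 6):
            raise ValueError("confidence must be an integer from 1 to 5")
        if self.actual_h <= 0:
            raise ValueError("actual duration must be positive")
```

Rejecting out-of-range values at logging time keeps the estimate granularity uniform across developers, which simplifies the ratio-based metrics in Section 3.5.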

3.5 Metrics

ID	Metric	Definition
M1	Absolute ratio error	|A_i - E_i| / E_i
M2	Signed ratio error	(A_i - E_i) / E_i
M3	Task completion rate	Tasks done / tasks started
M4	Confidence calibration	Correlation of confidence rating with |ratio error|
M5	Learning slope	Linear trend of M1 across tasks within the AI phase

A_i = actual duration (hours); E_i = estimated duration (hours) for task i.

M1 (absolute ratio error) is the primary outcome. A value of 0 means perfect estimation; 0.5 means the actual duration deviated from the estimate by 50%. M2 (signed ratio error) captures bias direction: positive values indicate underestimation (task took longer than expected); negative values indicate overestimation.
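
As a worked example of M1 and M2, the sketch below computes both metrics for three tasks. The function names and the task data are hypothetical and chosen only to show how over- and underestimates combine.

```python
from statistics import mean


def signed_ratio_error(actual_h, estimate_h):
    """M2: positive when the task took longer than estimated."""
    return (actual_h - estimate_h) / estimate_h


def absolute_ratio_error(actual_h, estimate_h):
    """M1: magnitude of the relative deviation from the estimate."""
    return abs(actual_h - estimate_h) / estimate_h


# Hypothetical (actual, estimate) pairs in hours.
tasks = [(3.0, 2.0), (1.5, 2.0), (4.0, 4.0)]

m1 = mean(absolute_ratio_error(a, e) for a, e in tasks)
m2 = mean(signed_ratio_error(a, e) for a, e in tasks)

print(m1)            # 0.25
print(round(m2, 3))  # 0.083
```

The underestimate (+0.5) and the overestimate (-0.25) partly cancel in M2 but not in M1, which is why M1 serves as the accuracy measure and M2 as the bias measure.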

3.6 Analysis Plan

Primary analysis (H1). Compare mean M1 between Condition A and Condition U using a linear mixed-effects model with fixed effects for condition and phase order, and random effects for developer and team. This accounts for the crossover structure and the nested data (tasks within developers within teams). If residuals are non-normal, we use a Wilcoxon signed-rank test on developer-level means as a robustness check.
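
One way to write this model down is shown below. The notation is introduced here for illustration and is not part of the protocol: i indexes tasks, j developers, and k teams; Cond is 1 for Condition A and 0 for Condition U; Order is 1 for teams that receive Condition A first.

```latex
\mathrm{M1}_{ijk} = \beta_0 + \beta_1\,\mathrm{Cond}_{ijk} + \beta_2\,\mathrm{Order}_{k}
                  + u_k + v_{jk} + \varepsilon_{ijk},
\qquad
u_k \sim \mathcal{N}(0, \sigma_u^2),\quad
v_{jk} \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Under this parameterization, H1 corresponds to \beta_1 > 0. The developer-rated difficulty covariate described in Section 4.2 would enter as an additional fixed-effect term.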

H2 analysis. Compare mean M2 between conditions using the same model. A significantly positive M2 in Condition A (but not Condition U) supports H2.

H3 analysis (exploratory). Within the AI-assisted phase only, fit a linear mixed-effects model with task sequence number as a fixed effect and developer as a random effect. A negative coefficient for sequence number would indicate improving accuracy over time.

Pre-registration. The study protocol, hypotheses, metrics, and analysis code will be pre-registered on the Open Science Framework (OSF) before data collection begins.

4. Threats to Validity

4.1 Construct Validity

Estimation as a proxy for planning accuracy. We measure task-level estimation accuracy as a proxy for team-level planning reliability. A team could have high task-level estimation error but still deliver on schedule if errors cancel out. We report both individual-task metrics (M1, M2) and developer-level aggregates, but do not directly measure sprint or iteration-level planning outcomes.

Timer accuracy. Elapsed time is measured by a background timer that tracks active work sessions. Developers may forget to start or stop the timer, or may interleave study tasks with other work. We mitigate this with clear instructions and by flagging outlier durations (>3x estimate or <0.1x estimate) for manual review before analysis.
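
The outlier rule can be expressed as a single predicate. This sketch assumes the thresholds are strict inequalities, as the ">" and "<" in the text suggest; the function name is hypothetical.

```python
def needs_manual_review(actual_h, estimate_h, upper=3.0, lower=0.1):
    """Flag a task whose actual/estimate ratio is above 3x or below 0.1x."""
    ratio = actual_h / estimate_h
    return ratio > upper or ratio < lower


print(needs_manual_review(7.0, 2.0))  # True  (ratio 3.5)
print(needs_manual_review(0.1, 2.0))  # True  (ratio 0.05)
print(needs_manual_review(3.0, 2.0))  # False (ratio 1.5)
```

Flagged tasks are reviewed rather than dropped automatically, because a genuine 3x overrun is exactly the kind of estimation failure the study is designed to observe.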

Confidence rating. The 1--5 confidence scale is a simple ordinal measure. It may not capture the full structure of estimation uncertainty. We treat it as a secondary metric and do not draw strong conclusions from it alone.

4.2 Internal Validity

Compliance. Developers may use AI tools during the unassisted phase (intentionally or through IDE features they forget to disable). We ask developers to confirm compliance at the end of each day and to screenshot their IDE configuration at the start of each phase. We cannot fully prevent non-compliance but will report compliance rates.

Carryover effects. Skills or habits developed during the AI-assisted phase may carry over to the unassisted phase (or vice versa). The one-week washout and counterbalanced design partially address this, but one week may be too short for carryover to fully dissipate.

Task heterogeneity. Using real backlog tasks increases ecological validity but means the AI-assisted and unassisted phases may differ in average task difficulty. The within-subjects design controls for developer skill, but not for task difficulty. We will collect a developer-rated difficulty score (1--5) per task and include it as a covariate in the mixed-effects model.

Hawthorne effect. Developers who know they are in a study may estimate more carefully than usual, improving accuracy in both conditions. This would reduce the observed effect size but would not introduce a differential bias between conditions (both are equally affected).

4.3 External Validity

Generalizability. The sample is limited to small distributed teams willing to participate in a research study. Teams with strong existing estimation practices are more likely to volunteer, introducing self-selection bias toward teams that already value estimation---and may therefore be more resilient to disruption. Findings may not generalize to large teams, co-located teams, or teams that do not currently practice task-level estimation.

AI tool heterogeneity. Developers in the AI-assisted condition choose their own tools (Copilot, Cursor, Claude Code, etc.). This reflects real-world usage but means the study tests "AI coding assistants as a category" rather than any specific tool. Variation in tool quality could increase noise. We record which tools each developer uses and will report usage patterns descriptively.

Temporal scope. AI tools improve rapidly. Results from Q3 2026 may not apply to tools available in 2027. We will report the specific tool versions used and note the temporal boundary of our findings.

4.4 Reliability

Reproducibility. The study protocol, analysis scripts, and anonymized data will be published alongside the results paper. The mixed-effects model specification will be pre-registered. Qualitative reflection notes will be coded by two independent raters with inter-rater agreement reported via Cohen's kappa.
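
For reference, Cohen's kappa for two raters can be computed as follows. This is a minimal sketch: the function name is ours, and the category labels in the example (helped, hindered, irrelevant) mirror the reflection prompt in Section 3.4 but are not a finalized coding scheme.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to four reflection notes.
a = ["helped", "helped", "hindered", "irrelevant"]
b = ["helped", "hindered", "hindered", "irrelevant"]
print(round(cohens_kappa(a, b), 3))  # 0.636
```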

4.5 Ethics

Informed consent. All participants will provide informed consent before enrollment. The consent form will describe the study's purpose, data collected, risks, and the right to withdraw at any time.

IRB review. The study will be submitted to an independent IRB (e.g., WCG IRB or equivalent commercial IRB) for ethics review prior to data collection. The author's organization does not have an internal IRB; commercial IRB services are available for independent researchers.

Data privacy. Task descriptions may contain proprietary information. Only timing data, estimates, confidence ratings, and anonymized task metadata (type, difficulty rating) will be retained for analysis. No source code is collected.

5. Discussion

5.1 Why Estimation, Not Speed

The AI productivity literature focuses on a single question: does AI make developers faster? This paper argues that the more operationally relevant question for software teams is: can developers predict how long AI-assisted work will take?

A 55% speedup [3] is valuable only if the team can plan around it. If AI turns a 4-hour task into a 2-hour task but the developer estimated 1 hour (expecting even more AI help), the team's schedule is broken despite an objective speedup. Conversely, if AI consistently adds 15 minutes of review overhead but developers learn to account for it, planning remains reliable despite the overhead.

The METR study [5] makes this concrete: experienced developers predicted AI would help by 24%, but it actually hurt by 19%. That is a 43-percentage-point prediction error. If this gap persists in team estimation settings, AI adoption could worsen schedule reliability even if aggregate productivity improves.

5.2 Implications for Process Frameworks

Software process frameworks assume that estimation accuracy is a learnable skill that improves with practice and feedback [2]. The introduction of AI tools may reset this learning curve. If H1 is supported (AI degrades estimation accuracy), then teams adopting AI tools should expect a temporary period of planning disruption and should build in larger schedule buffers during the transition---a recommendation consistent with the margin-of-safety principle in iterative process models [14, 15].

If H3 is supported (developers learn to estimate AI-assisted work over time), then the disruption is transient and the practical recommendation is patience: maintain estimation discipline and let calibration catch up.

The Remote Flow model [13] uses time-realistic planning with hour-level timeboxes and automatic comparison of planned vs. actual duration. This rapid feedback loop is exactly the mechanism that should accelerate estimation recalibration for AI-assisted work (if recalibration is possible). This study's results would directly inform whether Remote Flow's feedback loop is sufficient to maintain planning reliability when AI tools are introduced.

5.3 What This Study Cannot Answer

This study does not measure code quality, team coordination, or developer well-being---all of which AI tools may affect and all of which matter for team outcomes. It does not test whether AI tools are net-positive for productivity. It does not compare specific AI tools to each other. It addresses one narrow question---estimation accuracy---and we believe this narrowness is a strength, not a limitation.

6. Conclusion

This paper proposed a crossover study to test whether AI coding assistants affect estimation accuracy in small software teams. The motivation is straightforward: existing RCTs show that AI's effect on task duration is inconsistent across contexts (speedups of 20--55%, slowdowns of 19%), and developers' perception of AI's benefit diverges from reality. If developers cannot predict how long AI-assisted tasks will take, estimation accuracy should degrade, with downstream effects on team planning and schedule reliability.

The study design is intentionally simple: within-subjects crossover with two conditions (AI-assisted, unassisted), real backlog tasks, and a primary outcome (absolute ratio error) that can be computed from two numbers per task (estimate and actual duration). The target sample of 4--6 small teams is realistic for a practitioner-researcher without institutional recruitment infrastructure.

Data collection is planned for Q3 2026. The protocol will be pre-registered on the Open Science Framework before recruitment begins. Anonymized data, analysis scripts, and the full results will be published openly regardless of whether the hypotheses are supported.

Acknowledgment

The author thanks Dr. Supannika Koolmanojwong Mobasser of The Aerospace Corporation for discussions on study design and software process research methodology.

References

[1] K. Molokken and M. Jorgensen, "A Review of Software Surveys on Software Effort Estimation," in Proc. Int. Symp. Empirical Software Engineering (ISESE), 2003, pp. 223--230.

[2] M. Jorgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, vol. 70, no. 1--2, pp. 37--60, 2004.

[3] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot," arXiv:2302.06590, 2023.

[4] K. Cui, M. Demirer, S. Jaffe, L. Musolff, S. Peng, and T. Vetter, "How Much Does AI Impact Development Speed? An Enterprise-Based Randomized Controlled Trial," arXiv:2410.12944, 2024.

[5] J. Becker, N. Rush, B. Barnes, and D. Rein, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," arXiv:2507.09089, 2025.

[6] M. Jorgensen and E. Escott, "Relative Estimates of Software Development Effort: Are They More Accurate or Less Time-Consuming to Produce Than Absolute Estimates?," Information and Software Technology, vol. 143, 2022.

[7] V. Tawosi, R. Moussa, and F. Sarro, "On the Relationship Between Story Points and Development Effort in Agile Open-Source Software," in Proc. 16th ACM/IEEE Int. Symp. Empirical Software Engineering and Measurement (ESEM), 2022, pp. 183--194.

[8] P. Vaithilingam, T. Zhang, and E. L. Glassman, "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models," in Extended Abstracts of the 2022 CHI Conf. Human Factors in Computing Systems (CHI EA), 2022.

[9] S. Barke, M. B. James, and N. Polikarpova, "Grounded Copilot: How Programmers Interact with Code-Generating Models," Proc. ACM Program. Lang., vol. 7, no. OOPSLA1, art. 78, 2023.

[10] M. R. Endsley, "Toward a Theory of Situation Awareness in Dynamic Systems," Human Factors, vol. 37, no. 1, pp. 32--64, 1995.

[11] R. Parasuraman, T. B. Sheridan, and C. D. Wickens, "A Model for Types and Levels of Human Interaction with Automation," IEEE Trans. Systems, Man, and Cybernetics---Part A, vol. 30, no. 3, pp. 286--297, 2000.

[12] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in Software Engineering. Springer, 2012.

[13] K. B. Davis, "Remote Flow: Async-First Coordination and Time-Realistic Planning for Small Remote Software Teams," CloudSurf Software LLC, 2026.

[14] B. W. Boehm, "A Spiral Model of Software Development and Enhancement," Computer, vol. 21, no. 5, pp. 61--72, 1988.

[15] B. Boehm, J. A. Lane, S. Koolmanojwong, and R. Turner, The Incremental Commitment Spiral Model: Principles and Practices for Successful Systems and Software. Addison-Wesley Professional, 2014.