"Early results: AI Users Code Better by Hand – But It's Very Likely Not Why You Think"

2026-05-24 · Andy T Woods · Analysis log ↗

One aim of this project is to test whether heavy use of AI coding tools affects our ability to code. We don't have enough data to answer that just yet. But we can eyeball our data, and the first-glance results are surprising: participants who reported higher AI usage performed better on the coding challenges (r = +0.39 with accuracy) – but were also slower (r = +0.23 with active time per challenge).

This is almost certainly not evidence that AI use improves unaided coding skill!

With such a tiny sample (19 people) we can't draw any conclusions. But we can still ponder what might be going on if the pattern we observe here holds with more participants. One explanation is confounding: for example, early AI adopters in this sample may be stronger coders in ways not captured cleanly by years of experience. We can partly check this by asking whether experience tracks AI adoption directly: the correlation between programming years and AI usage is r = −0.21, and between Python-specific years and AI usage r = +0.25. Both are weak and point in opposite directions, so experience doesn't cleanly explain the pattern. Another is that our sample is unusually experienced (about 14 years of coding on average), so we may need to recruit participants with a broader range of coding abilities before we can properly test our hypotheses.

We don't know what's going on yet – it's far too early, and we have too few people, to draw any conclusions.

Can you help? If you write Python, please sign up! And please spread the word :)

All numbers in this post are reproducible from the analysis script.

Key terms used in this post

Term	Definition
Challenge	A single Python coding problem. Participants write code in the browser; automated tests check whether it works.
Session	One sitting in which a participant completes a batch of up to 12 challenges. Sessions are separated by at least 28 days.
Participant	One person enrolled in the study. A participant may have one or more sessions over time.
AI usage %	Self-reported answer to: "In the past month, roughly what percentage of the code you wrote was AI-generated?" (0–100).

What we found

Important caveat before the charts: this is a cross-sectional snapshot of 19 sessions, each a single observation from a different person. The correlations tell us about the relationship between AI usage and current performance – they say nothing about how performance changes over time as AI usage changes. That question requires longitudinal data, which we don't have yet.

Correlations with AI usage

Pearson correlations between reported AI usage percentage and each outcome:

Outcome	r	n	Rough interpretation
Accuracy	+0.39	19	Weak-to-moderate positive
Completion rate	+0.13	19	Negligible
Active time per challenge	+0.23	19	Weak positive
Efficiency ratio	−0.10	19	Negligible

We have not reported p-values because, at this sample size, they would invite over-interpretation.
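For readers who want to see what sits behind each r in the table, here is a minimal sketch of a Pearson correlation in plain Python. The function name `pearson_r` and the six data points are made up for illustration; they are not the study's analysis script or its data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (NOT study data).
ai_usage_pct = [0, 10, 25, 50, 80, 95]
accuracy_pct = [60, 55, 70, 65, 85, 80]

r = pearson_r(ai_usage_pct, accuracy_pct)
print(f"r = {r:+.2f}")
```

The result only says how tightly the points cluster around a straight line across people; it carries no information about direction of cause.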

Three of the four correlations are positive: higher AI use goes with better accuracy, slightly more completions, and, oddly, more time taken. The fourth, efficiency ratio, is slightly negative.

It is worth flagging that Pearson r is the wrong tool for the actual research question. It describes a between-person relationship – do high-AI users score higher than low-AI users? – but what we care about is within-person change: does your performance shift as your AI usage shifts? Those are completely different questions, and the between-person version is hopelessly confounded by differences between people, such as skill and experience.

The planned analysis uses mixed-effects models with person-mean centring, so each participant acts as their own control. That is a much more powerful approach, but it requires multiple sessions per person.
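Person-mean centring can be sketched in a few lines. This is an illustration of the general idea only, not the study's model code: the function name, the record layout, and the example sessions are all hypothetical, and the mixed-effects model that would consume these values is not shown.

```python
from collections import defaultdict

def person_mean_centre(records):
    """Split each AI-usage value into a between- and a within-person part.

    records: list of (participant_id, ai_usage_pct), one tuple per session.
    Returns a list of (participant_id, person_mean, within_deviation).
    """
    by_person = defaultdict(list)
    for pid, usage in records:
        by_person[pid].append(usage)
    means = {pid: sum(vals) / len(vals) for pid, vals in by_person.items()}
    return [(pid, means[pid], usage - means[pid]) for pid, usage in records]

# Hypothetical sessions: participant "a" has three, participant "b" has one.
sessions = [("a", 20), ("a", 40), ("a", 60), ("b", 90)]
centred = person_mean_centre(sessions)
for row in centred:
    print(row)
```

Participant "b" gets a within-person deviation of exactly zero, because a single session cannot differ from its own mean. That is the concrete reason the within-person question cannot be asked until people return for further sessions.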

AI usage vs accuracy, per session

Each bubble below is one session. Bubble size is proportional to the number of challenges attempted.

One session where the participant passed zero challenges in their only attempt has been excluded.

Performance by AI usage group

Splitting sessions into three bands by reported AI usage:

This grouping is purely illustrative. With only three sessions in the low-AI band, any difference between bands should be read as a descriptive split on a very small sample, not as an effect.
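A banded summary of this kind might be computed roughly as follows. The high band's >75% threshold is the one quoted in this post; the low threshold (`low_cut=25`) and the example sessions are assumptions made for this sketch and may not match the analysis script.

```python
def usage_band(ai_usage_pct, low_cut=25, high_cut=75):
    """Assign a session to a band. low_cut is a hypothetical threshold."""
    if not 0 <= ai_usage_pct <= 100:
        raise ValueError("AI usage must be between 0 and 100")
    if ai_usage_pct > high_cut:
        return "high"
    if ai_usage_pct < low_cut:
        return "low"
    return "medium"

def mean_by_band(sessions):
    """sessions: list of (ai_usage_pct, outcome). Returns {band: (n, mean)}."""
    grouped = {}
    for usage, outcome in sessions:
        grouped.setdefault(usage_band(usage), []).append(outcome)
    return {band: (len(vals), sum(vals) / len(vals))
            for band, vals in grouped.items()}

# Made-up sessions for illustration only (NOT study data).
summary = mean_by_band([(0, 60), (10, 70), (50, 75), (90, 80)])
print(summary)
```

Returning the count alongside each mean keeps the small band sizes visible, which matters when a band holds only a handful of sessions.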

The speed paradox

Here's an interesting finding. If AI dependency were reducing coding ability, you might expect high-AI users to take longer. In this early snapshot, they do. But they are also more accurate. So what's going on?

High-AI users (>75%) average 169 seconds of active time per challenge versus 120 seconds for low-AI users. They are slower and more accurate. Three possible explanations are worth tracking as data accumulates:

Challenge difficulty confound. Challenges are randomly assigned and vary in difficulty; sessions are not matched. If high-AI users attempt more challenges per session – including harder, more time-consuming ones that less experienced participants might skip or abandon early – their average time goes up because of what they chose to attempt.
Persistence. Heavy AI users may stick with harder challenges instead of giving up quickly. We can check this against challenge difficulty tier data once we have more sessions.
Manual verification. People used to AI feedback may spend more time checking their solutions by hand, even though they ultimately get them right.

Code efficiency: a null result

We also track how fast participants' code actually runs. The efficiency ratio is execution time relative to a reference solution: 1.0 means matched; above 1 means slower. Coverage is good: 94 of 95 attempts have a value.

How it works: when you submit a solution, both your code and the reference answer are executed in the browser via Pyodide. We take a median over 15 timing samples for each, divide participant time by reference time, and send only that ratio to the server. No submitted code is ever executed on our infrastructure. Reference solutions come from the original published benchmarks (MBPP, HumanEval) where available. Full methodology is in the analysis log.
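The ratio described above can be sketched in ordinary Python. This is not the study's browser code: the real measurement runs inside Pyodide, and the function names, the use of `time.perf_counter`, and the two toy solutions are assumptions made for illustration.

```python
import statistics
import time

def median_runtime(func, samples=15):
    """Median wall-clock time of func() over a number of timing samples."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def efficiency_ratio(participant_func, reference_func, samples=15):
    """Participant median runtime divided by reference median runtime."""
    reference_time = median_runtime(reference_func, samples)
    if reference_time == 0:
        raise ValueError("reference runtime too small to measure")
    return median_runtime(participant_func, samples) / reference_time

# Two hypothetical solutions to the same toy task (sum of 0..19999).
def reference_solution():
    return sum(range(20_000))

def participant_solution():
    total = 0
    for i in range(20_000):
        total += i
    return total

ratio = efficiency_ratio(participant_solution, reference_solution)
print(f"efficiency ratio: {ratio:.2f}")  # above 1 means slower than reference
```

Taking the median rather than the mean makes the figure less sensitive to a single slow sample, for instance one caused by a garbage-collection pause or a busy browser tab.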

The correlation is r = −0.10. There is no visible evidence that high-AI users write slower-running code than low-AI users.

The red point is one person who solved 11 challenges correctly but whose code ran about twice as slow as the reference. They reported 0% AI usage. Remove them and the correlation shifts to +0.33, because this low-AI, high-efficiency-ratio session was pulling the correlation negative. The point is flagged for transparency, not excluded, because the pre-registered rules only remove sessions on timing grounds. This is one to revisit if they return for a second session.
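To see how a single session can flip the sign of a correlation in a small sample, here is a leave-one-out sketch. The six data points are invented so that the first one (0% AI usage, ratio 2.0) behaves like the flagged session; they are not the study data, and the helper names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def leave_one_out(xs, ys):
    """Correlation recomputed with each point dropped in turn."""
    return [pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]

# Made-up values for illustration only (NOT study data).
ai_usage_pct = [0, 20, 40, 60, 80, 100]
efficiency = [2.0, 0.9, 1.0, 1.0, 1.1, 1.1]

r_all = pearson_r(ai_usage_pct, efficiency)
r_without_first = leave_one_out(ai_usage_pct, efficiency)[0]
print(f"all points: {r_all:+.2f}, without first: {r_without_first:+.2f}")
```

With these invented numbers the full-sample correlation is negative and turns positive once the first point is dropped, which is the same kind of sensitivity described above.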

More details

As of May 2026 we have 19 completed sessions from 19 participants. Each session involves solving Python challenges without AI assistance, followed by a short post-session survey. That survey includes the key question:

In the past month, roughly what percentage of the code you wrote was AI-generated?

Who took part

All participants are adults aged 25–54. Across the 19 sessions, nine are from participants in the 25–34 age bracket, seven in 35–44, and three in 45–54.

This is an experienced group. Average programming experience is around 14 years (range 4–30), with roughly 9 years using Python specifically. Self-rated proficiency skews intermediate to advanced: 9 sessions from intermediate-rated participants, 8 advanced, 2 somewhat beginner, and 1 expert. Exactly half have a CS or related degree; half don't. Most are based in the UK (14 of 19 unique participants), with the remainder spread across Germany, Switzerland, and Sweden. Note that we excluded one person from all analyses as they completed no challenges.

On average, participants attempted 5.0 challenges per session (range 1–12; the maximum per session is 12).

The four outcome variables

For each session we collect or compute:

Variable	Definition
Accuracy	Average `tests_passed / tests_total` across challenges, expressed as a %
Completion rate	% of attempted challenges where all test cases passed
Speed	Average active time (seconds of keystroke activity) per challenge
Efficiency ratio	Participant solution runtime ÷ reference solution runtime. 1.0 = matched canonical; >1 = slower-running code
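As a worked example of the first two definitions, the sketch below computes accuracy and completion rate for one session. It assumes accuracy is the mean of the per-challenge pass ratios (rather than a ratio pooled over all tests); the function name and the example attempts are hypothetical.

```python
def session_outcomes(attempts):
    """Return (accuracy %, completion rate %) for one session.

    attempts: list of (tests_passed, tests_total), one per attempted challenge.
    """
    if not attempts:
        raise ValueError("session has no attempts")
    for passed, total in attempts:
        if total <= 0 or not 0 <= passed <= total:
            raise ValueError(f"invalid attempt: {passed}/{total}")
    n = len(attempts)
    # Accuracy: average of tests_passed / tests_total across challenges.
    accuracy = 100 * sum(passed / total for passed, total in attempts) / n
    # Completion: share of challenges where every test case passed.
    completion = 100 * sum(1 for passed, total in attempts if passed == total) / n
    return accuracy, completion

# Made-up session: four challenges attempted, two fully solved.
accuracy, completion = session_outcomes([(5, 5), (3, 4), (0, 2), (4, 4)])
print(f"accuracy {accuracy:.2f}%, completion {completion:.1f}%")
# prints "accuracy 68.75%, completion 50.0%"
```

Partial credit counts towards accuracy but not towards completion, which is why the two figures can diverge within the same session.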

The first three map directly to the primary outcomes in the study design: accuracy, speed, and completion as a proxy for overall success. Efficiency ratio is a secondary measure of code quality – whether participants write solutions that run snappily.

A few caveats

The positive correlations almost certainly reflect selection bias rather than anything causal. The experience confound is less clear than it first appears; programming years and AI usage correlate at r = −0.21, Python years at r = +0.25. So experience alone doesn't account for the pattern, but we can't rule out other effects. The longitudinal design is how we get past this.

A few other things are worth bearing in mind:

Self-report noise. AI usage % is a rough estimate, not a measured one. People's sense of how much AI they use varies a lot by task and by how far back they're trying to recall.
Habits vs. the session. The survey asks about the past month, not the session itself. Someone who normally uses AI 90% of the time is describing their habits – they weren't using AI during the challenge, because we ask them not to.
Challenge difficulty varies. Challenges are assigned randomly, so a session with harder problems will naturally score lower. We're not matching difficulty across participants at this stage.
Tiny n. 19 sessions. r = 0.39. Too early to draw conclusions.
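To put a rough number on "too early", one standard back-of-envelope check is a Fisher z confidence interval for a correlation. This is not part of the study's analysis script; it assumes independent observations and approximately bivariate-normal data, neither of which has been checked here, and the function name is hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                # Fisher transformation of r
    se = 1 / math.sqrt(n - 3)        # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = fisher_ci(0.39, 19)
print(f"95% CI: [{low:+.2f}, {high:+.2f}]")
```

Under those assumptions the interval for r = 0.39 with n = 19 runs from roughly −0.08 to +0.72, so it is compatible with anything from no relationship to a strong one.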

What's next

We won't be able to get a more robust, accurate picture until far more people have completed multiple sessions.

If you want to help answer the question, please take part :). The more participants who complete multiple sessions, the faster the longitudinal signal emerges.

