"Does It Train on My Code?" — The Question That Stopped Enterprise AI Dead
Simon Willison's work on local LLMs and data privacy explains why a single question became the enterprise dealbreaker for AI coding tools.
The Question That Kills the Demo
In 2023 and 2024, I sat in a lot of enterprise architecture reviews — at clients across Dubai, Bangalore, and occasionally Singapore — and I noticed that AI coding tool pilots would often stall at a very specific moment. Not during the technical evaluation. Not during the ROI discussion. They’d stall the moment someone from legal or security asked: “So — does it train on my code?”
That one question would freeze the room. The engineering lead would look at the vendor. The vendor would start explaining tiers and opt-out settings. The legal person would start writing notes. And the pilot would enter a procurement review that lasted four months.
I’ve been thinking about this dynamic a lot, partly because Simon Willison has been the clearest voice I’ve encountered on the underlying tradeoffs.
What Simon Willison Actually Argues
Simon Willison is one of the more rigorous technical writers working in public on LLMs. He’s been writing about AI tools since before it was fashionable, and his blog at simonwillison.net is genuinely worth reading for its specificity — he doesn’t do vibes, he does experiments and specific observations.
His core argument around privacy and local LLMs is grounded in something very practical: he cares intensely about knowing exactly what enters the context of any AI system he uses. He’s written that he prefers working directly with cloud AI interfaces because, as he characterizes it, doing so makes it easier to understand exactly what is going into the context, and because he’s built a mental model of what those systems do with the information. But he also builds the llm command-line tool, an open-source project on GitHub that gives developers one interface for running and chaining language models, including local models that make no external calls.
The privacy argument isn’t about paranoia. It’s about control and auditability. When every prompt goes to a cloud API, you depend on a system you can’t fully inspect. When you run a model locally, nothing leaves your machine during inference, so the inference step itself has nothing to leak.
Now, Willison is careful not to overclaim. He’s not a “local models only” absolutist. His actual position is more nuanced: use the right tool for the context, understand the privacy model of each tool you’re using, and never let the tool’s interface obscure what’s actually happening.
That nuance is exactly what enterprises fail to apply.
The Actual Training Data Policy Situation
Let me make this concrete, because the marketing language around AI training policies is consistently unclear.
GitHub Copilot’s data policy — as of early 2025 — is tiered. For Enterprise and Business plan customers, GitHub’s agreements explicitly prohibit using Copilot interaction data (your prompts, your code snippets, the suggestions you accept or reject) for model training. That’s a contractual protection, not just a configuration option.
For individual plans (Free, Pro, Pro+), the situation is different. Starting April 2025, GitHub confirmed that code from private repositories on these plans could be used for model training by default. Individual users can opt out through account settings, but opting out presupposes you knew you had been opted in by default, and most developers don’t read data policy updates.
This gap is precisely where the enterprise panic comes from. An organization deploying Copilot Business or Enterprise has contractual protections they can point to in their security audit. An organization where developers are using personal GitHub accounts for Copilot access has… nothing, unless those developers proactively opted out of training data collection.
I’ve seen exactly this scenario in two different clients’ environments. Developers were using free-tier Copilot access on machines that also held internal API keys, proprietary model weights, and customer data schema definitions. Nobody had malicious intent. Nobody thought about the training data policy, because nobody reads those policies.
The Local Model Shift
The response to this problem — at least for the more technically sophisticated organizations — has been a turn toward local model deployment. Tools like Ollama, LM Studio, and the various quantized model distributions (Llama, Mistral, Phi, and their derivatives) now let you run capable models entirely on-device. No external calls. No data leaving your network perimeter. No training data policy to worry about.
This is where Willison’s llm CLI tool becomes interesting as an infrastructure piece. It’s designed around the principle that you should be able to run models locally and chain them in scripts, with a consistent interface regardless of whether the backend is a local Ollama instance or a cloud API. The abstraction is deliberate: it puts the developer back in control of what model is being used and where the computation is happening.
For most enterprises, the practical deployment isn’t a single developer running a quantized model on their laptop. It’s deploying something like an internal inference server — a self-hosted model with controlled access — so teams get AI coding assistance without the data leaving the building. That involves real infrastructure cost and model management overhead, but for companies with genuine IP concerns, it’s often the only acceptable path.
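To make “no external calls” concrete, here is a minimal sketch of what talking to a local or internal inference server can look like. It assumes an Ollama-style server on its default local port with a non-streaming generate endpoint; the function names, the allow-list check, and the example host are my own illustrations, not part of Ollama or of Willison’s llm tool.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Assumption: an Ollama server on its default local port.
# For a shared internal inference server, pass that host as base_url instead.
DEFAULT_BASE_URL = "http://localhost:11434"


def build_generate_request(
    prompt: str, model: str, base_url: str = DEFAULT_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stays_on_allowed_host(
    request: urllib.request.Request, allowed_hosts: set[str]
) -> bool:
    """Return True only if the request targets a host you explicitly allowed."""
    return urlparse(request.full_url).hostname in allowed_hosts


def generate(prompt: str, model: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, model, base_url)
    if not stays_on_allowed_host(request, {"localhost", "127.0.0.1"}):
        raise ValueError("refusing to send a prompt outside the allowed hosts")
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

The point of splitting the request builder from the sender is auditability: you can inspect exactly what would be sent, and where, before anything leaves the process. For an internal server you would widen the allow-list inside generate to include that host.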
Why the Question Matters More Than the Answer
Here’s what I’ve started to believe after watching this play out across multiple organizations: the real value of “does it train on my code?” isn’t the answer. It’s the question itself, because asking it forces a clarity that most AI tool deployments are missing.
When you have to answer that question rigorously, you have to map out: which AI tools are in use in your engineering org? Which accounts are they tied to? Which plan tiers? What are the default data settings? What do developers think is happening versus what actually is? Have you reviewed the data processing addendum that governs your enterprise contract?
Almost nobody has done this work. I’ve audited a few engineering organizations on this and the answer is consistently: multiple tools, multiple account types, default settings assumed to be private when they aren’t, zero documentation of which code has potentially been exposed.
That’s not necessarily catastrophic — sending prompts to a cloud API isn’t the same as publishing source code — but for organizations working on unreleased products, proprietary algorithms, or regulated data, it’s at minimum a compliance gap.
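The mapping exercise described above can be captured in a very small data structure. This is a sketch under my own assumptions: the field names, the two gap rules, and the sample entries are hypothetical, and a real audit would record more (who confirmed each setting, when, and against which contract version).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolUsage:
    tool: str                      # IDE plugin, browser extension, notebook, etc.
    account_type: str              # "personal" or "organization"
    plan: str                      # plan tier as shown on the account
    contract_bars_training: bool   # confirmed in the data processing addendum
    opted_out_of_training: bool    # confirmed in the account settings


def find_gaps(usage: ToolUsage) -> list[str]:
    """Return the compliance gaps for one tool/account combination."""
    gaps = []
    if usage.account_type == "personal":
        gaps.append("personal account used for work")
    if not usage.contract_bars_training and not usage.opted_out_of_training:
        gaps.append("no contractual protection and no confirmed opt-out")
    return gaps


def audit(inventory: list[ToolUsage]) -> dict[str, list[str]]:
    """Map each problematic entry to its list of gaps; clean entries are omitted."""
    report = {}
    for usage in inventory:
        gaps = find_gaps(usage)
        if gaps:
            report[f"{usage.tool} ({usage.plan}, {usage.account_type})"] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    inventory = [
        ToolUsage("assistant-plugin", "personal", "free", False, False),
        ToolUsage("assistant-plugin", "organization", "enterprise", True, False),
    ]
    for entry, gaps in audit(inventory).items():
        print(entry, "->", "; ".join(gaps))
```

Even a spreadsheet with these five columns would do. What matters is that each row is filled in from what developers are actually running and from the contract text, not from assumption.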
The Right Framework
What I now recommend to any engineering team adopting AI tools:
First, inventory what you’re actually using. Not what the IT policy says you’re allowed to use. What developers are actually running. This includes browser extensions, IDE plugins, API-connected notebooks, and personal accounts being used for work purposes.
Second, understand the data tier you’re on. For every tool in that inventory, explicitly check whether your plan tier has contractual training data protections. Don’t assume. The pricing page usually doesn’t clarify this; you often have to read the data processing addendum.
Third, evaluate local alternatives for the highest-sensitivity contexts. You probably don’t need a local model for writing unit test boilerplate. You might need one for the part of your codebase that contains your actual competitive advantage.
Fourth, be honest about the tradeoffs. Local models are typically slower, require infrastructure, and are less capable than frontier cloud models at most tasks. That’s a real cost. Some organizations will decide the capability gap is too large and accept the cloud privacy tradeoffs with appropriate contractual protections. That’s a legitimate choice. Just make it explicitly, not by default.
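The third step, reserving local models for the highest-sensitivity code, can be written down as an explicit routing rule rather than left to individual judgment. The sketch below assumes sensitivity can be approximated by file path; the patterns and the two backend labels are hypothetical placeholders, and how each backend is actually invoked is left out.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: replace with the paths that hold your own
# competitive advantage, credentials, or regulated data.
SENSITIVE_PATTERNS = ("src/pricing/*", "*/secrets/*", "*.pem")


def choose_backend(path: str, patterns: tuple[str, ...] = SENSITIVE_PATTERNS) -> str:
    """Return "local" for files matching a sensitive pattern, else "cloud"."""
    normalized = path.replace("\\", "/")  # treat Windows separators the same way
    if any(fnmatchcase(normalized, pattern) for pattern in patterns):
        return "local"
    return "cloud"


if __name__ == "__main__":
    for candidate in ("src/pricing/engine/core.py", "tests/test_utils.py"):
        print(candidate, "->", choose_backend(candidate))
```

This version defaults to the cloud for anything unlisted, which matches the tradeoff described above. A more conservative team could invert it: list the paths known to be safe and send everything else to the local model.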
Willison’s broader point, that you should understand exactly what enters the context of any AI system you use, is the principle underneath all of this. The tool doesn’t decide what’s sensitive. You do. But you can only exercise that judgment if you’ve thought about the question before the code has been uploaded.
The “does it train on my code?” question is good precisely because it’s uncomfortable. The discomfort is the point.