Your AI Tools Are Leaking Your Data

TL;DR

Free and consumer-tier AI tools train on your inputs by default. Enterprise tiers generally do not, but most teams do not check.
OWASP published a Top 10 specifically for LLM applications. Prompt injection and sensitive information disclosure are #1 and #2.

It Started with a Slack Message from a Friend

A few months ago, a friend of mine who works as a project manager at a mid-sized e-commerce company sent me a message. Someone on their team had taken a dataset of customer orders and transaction records, pasted the whole thing into ChatGPT, and asked it to summarize patterns and flag anomalies.

It worked. The analysis was solid. The problem was that the data, with real customer names, real order values, real transaction IDs, was now sitting on OpenAI's servers. At the time, consumer ChatGPT used inputs for model training by default. That data was gone. Not "deleted after 30 days" gone. Gone into the training pipeline.

Their clients are the kind of companies that do not allow meeting recordings. Some of them do not even allow transcripts. They operate in a world where a screenshot of internal data showing up somewhere unexpected can end a contract. And here was their data, voluntarily handed to a third party, by an engineer who was just trying to do their job faster.

That message changed how I think about every AI tool I use at work.

Where the Leaks Actually Happen

After my friend's story, I started paying attention to all the ways data bleeds out through AI tools. It is not just one vector. There are at least six, and most teams are exposed to all of them.

Secrets and API keys in prompts

Engineers paste code into ChatGPT or Claude to debug it. That code has hardcoded API keys, database connection strings, AWS credentials, and tokens. GitGuardian researchers ran a targeted experiment and extracted 2,702 real, hard-coded credentials from GitHub Copilot by constructing 900 prompts from public code snippets. If Copilot can regurgitate secrets from its training data, what happens to the secrets you paste into a chat window?

Proprietary code shared with AI

Source code, business logic, internal algorithms. Every time someone pastes a function into an AI tool to refactor it, that code leaves the organization. On free tiers, it may end up in training data. Even on paid tiers, it sits on someone else's servers, subject to their retention policies and potential breach exposure.

PII and customer data in prompts

Log files, database query results, error messages with stack traces containing customer names, emails, IP addresses, financial data. Engineers paste these all the time to debug production issues. That is a GDPR violation, a CCPA violation, and potentially a HIPAA violation depending on the data.

Meeting transcripts and recordings

AI meeting assistants like Otter.ai and Fireflies.ai auto-join calls, record everything, store transcripts, and create voiceprints. Otter.ai is currently facing multiple lawsuits for recording meetings without all-party consent and storing biometric data. In one incident in 2024, Otter captured a private investor discussion after a Zoom call "ended" and sent the transcript to a participant who should not have had it.

AI tools training on your data

This is the one most people miss. Free-tier and consumer-tier AI tools use your input data for model training by default. OpenAI's consumer ChatGPT has this toggled ON unless you manually disable it. That means every prompt, every code snippet, every dataset you paste is fair game for improving the model, and potentially surfacing in another user's session.

Prompt injection extracting context

OWASP ranked prompt injection as the #1 LLM security risk. Attackers can manipulate prompts to extract system instructions, conversation history, or data from connected tools. A Stanford student used "ignore previous instructions" to extract Microsoft Bing Chat's entire system prompt, including internal codenames and operational limitations. If your AI tool has access to internal data through plugins or integrations, prompt injection can pull that data out.

What I Was Doing (And How Little It Covered)

I will be honest about my own setup before this wake-up call. Here is what I was doing:

Not exposing .env files to AI tools. I kept environment variables out of any AI context. No pasting .env contents, no sharing files that contained secrets. This is basic, but at least it was something.

Restricting file and folder access. When using AI coding tools, I was careful about which files and directories the tool could see. Sensitive configs, credentials, and client-specific data stayed out of the context window.

That was it. Two practices. No tooling. No enforcement. No way to catch a teammate who did not follow the same rules. No audit trail. Looking at this list now, it covered maybe 10% of the actual attack surface.

Here is what I learned should be in place.

What Engineering Teams Should Actually Be Doing

Use enterprise tiers and disable training toggles

This is the absolute minimum. Every AI tool you use at work should be on a business or enterprise tier where your data is not used for model training.

OpenAI (ChatGPT):

Consumer/Free/Plus: training on your data is ON by default
To disable: Profile > Settings > Data Controls > "Improve the model for everyone" > toggle OFF
API: data is NOT used for training since March 2023
ChatGPT Business/Enterprise/Team: data is NOT used for training

Anthropic (Claude):

Consumer (Free, Pro, Max): you can opt out of data use for model improvement
API: not used for training by default
Claude for Business/Enterprise: data not used for training, configurable retention policies

The OWASP Top 10 for LLM Applications

OWASP published a dedicated Top 10 for LLM applications in 2025. The three entries most relevant to data leaks:

#1 Prompt Injection: Malicious input manipulates LLM behavior to extract data, bypass restrictions, or execute unintended actions. This is both the most common and the hardest to fully prevent.

#2 Sensitive Information Disclosure: LLMs expose PII, proprietary data, or business secrets through their responses. This includes training data memorization (the Copilot credential leak), context window leakage, and insufficient output filtering.

#7 System Prompt Leakage: Attackers extract system prompts that contain sensitive configuration, internal instructions, or access patterns. If your system prompt includes database schemas or API endpoints, those can be extracted.

The full list also covers supply chain vulnerabilities (#3), data poisoning (#4), improper output handling (#5), excessive agency (#6), vector and embedding weaknesses (#8), misinformation (#9), and unbounded consumption (#10).

If your team builds anything with LLMs, this list should be your security review checklist.

The Configuration Checklist

Here is what you can do today, right now, in under an hour:

Switch every AI tool to a business or enterprise tier. Free tiers train on your data. No exceptions.
Disable "Improve the model" in ChatGPT settings. Profile > Settings > Data Controls > toggle OFF.
Disable "Suggestions matching public code" in Copilot org settings. Prevents public code from leaking into your suggestions.
Block AI meeting bots from auto-joining calls. Zoom: Settings > In-Meeting > disable "Allow AI Companion." Do this org-wide, not per-user.
Audit which AI tools your team actually uses. Shadow AI is real. The data going into these tools is sensitive. You need to know what tools are in play before you can secure them.

What I Got Wrong

Before my friend's message, I treated AI tool security the way most engineers do: as a personal discipline problem. Do not paste secrets. Do not share sensitive files. Be careful.

That does not scale. It does not survive a new hire's first week. It does not survive a deadline where someone is trying to debug a production issue at 2 AM and pastes a log file full of customer data into ChatGPT because it is faster than grepping through it manually.

The fix is not "be more careful." The fix is tooling and policies that make leaking data harder than not leaking it. Enterprise tiers with training disabled. Restricted file access for AI tools. Explicit rules about what data never goes into a prompt.

The .env discipline and file access control were a start, but they were not enough. If your team uses AI tools daily and your security posture is "we told people to be careful," you are one tired engineer away from your own data leak moment.

The data is too valuable and the tools are too easy to use for "be careful" to be the whole strategy.