The Hidden Privacy Risks in Large Language Models

As AI chatbots become standard workplace tools and consumers increasingly rely on them for everything from writing emails to financial advice, a mounting body of evidence reveals that these systems pose unprecedented privacy risks. Major companies like Samsung have already suffered data breaches through employee AI use, while investigators have discovered millions of personal photos and documents---including children's school records and medical information---embedded in the datasets powering today's most popular AI models.

Large language models (LLMs) feel almost magical because they can discuss nearly any topic, but that range rests on an uncomfortable fact: these models are trained on indiscriminately scraped corpora of the internet, which often include private information that was never meant to be public. Additionally, depending on the provider's terms of service and the user's settings, conversations may be used separately for fine-tuning or safety training, a process governed by its own consent and opt-out rules and distinct from base-model training.

With AI adoption accelerating across industries and new models launching monthly, these privacy vulnerabilities affect millions of users who may be unknowingly exposing sensitive data or finding their personal information reproduced by AI systems. Each stage of the pipeline, from collection through training to daily use, creates privacy issues that current regulations struggle to address.

Data Ingestion Exposes Personal Information

Developers lean on large web corpora such as Common Crawl, The Pile, RefinedWeb, and C4, along with Wikipedia, Reddit, and BookCorpus, for text training data. The problem is that the public web is littered with content from people's private lives.

Recent large-scale audits have provided concrete evidence of this problem. A 2024 privacy audit of DataComp CommonPool, a dataset of 12.8 billion image-text pairs released in April 2023, found extensive personally identifiable information despite sanitization efforts. The researchers estimate that at least 142,000 images across the full 12.8 billion samples depict resumes linked to people with an identifiable online presence. The audit also found images of credit cards, driver's licenses, Social Security numbers, passports, birth certificates, and children's personal documents, including health information drawn from news articles and blogs.

Image datasets such as LAION-5B present additional risks: a 2024 Human Rights Watch investigation revealed that LAION-5B contained photos of 362 Australian children, drawn from every state and territory, and 358 Brazilian children, complete with names and school uniforms. Much of this information was easily traceable, including when and where photos were taken, and some images came with children's full names and school locations.

Training Data Becomes Embedded and Difficult to Remove

During training, LLMs can memorize portions of their training data within their parameters, particularly when data appears multiple times or has unusual patterns. Unlike traditional databases where sensitive information can be located and deleted, this memorized data becomes embedded in the model's neural weights, making it costly to remove and often requiring complete retraining or emerging machine unlearning techniques.

A 2021 USENIX Security paper, "Extracting Training Data from Large Language Models," demonstrated this vulnerability: the authors prompted GPT-2 into reproducing verbatim email addresses and phone numbers from its training data. Research has consistently shown that even datasets with privacy protections contain substantial amounts of unfiltered personal information that can be memorized during training.
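
To make the attack concrete, here is a minimal sketch in the spirit of that extraction work, assuming the Hugging Face transformers package and the public GPT-2 checkpoint. The prompts and regular expressions are illustrative placeholders, not the paper's actual pipeline (which also ranked candidate outputs with perplexity-based metrics).

    # Illustrative extraction probe: sample continuations from GPT-2 and flag
    # outputs that look like contact details. Prompts and regexes are
    # placeholders for demonstration only.
    import re
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

    prompts = ["My email address is", "You can call me at"]

    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Sampling surfaces a wider range of memorized continuations than
        # greedy decoding does across repeated runs.
        outputs = model.generate(
            **inputs, do_sample=True, top_k=40, max_new_tokens=64,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        hits = EMAIL.findall(text) + PHONE.findall(text)
        if hits:
            print(f"Candidate leak after '{prompt}':", hits)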

As open-weight models get cloned and fine-tuned by various organizations, privacy vulnerabilities can spread across the ecosystem faster than coordinated fixes can be implemented.

Enterprise Usage Creates New Attack Vectors

Workplace adoption of AI tools is accelerating faster than security protocols can keep up. When employees use LLMs, sensitive data typed into prompts may be transmitted to and stored on the provider's servers, creating vulnerabilities that many IT departments have not yet addressed. Data handling practices also vary significantly between providers and service tiers, a complexity that often leaves organizations exposed.

For example, ChatGPT Enterprise, Team, and Edu explicitly do not use business data for training by default, encrypt data with AES-256, maintain SOC 2 compliance, and offer Data Processing Agreements; qualifying organizations can also obtain zero-data-retention options and data residency controls across multiple regions. The free consumer version, by contrast, retains chat history (even deleted conversations are kept for up to 30 days) and may use conversations to improve the models unless the user opts out, a distinction many employees using personal accounts for work tasks do not appreciate.

The consequences can be severe. In 2023, Samsung engineers used ChatGPT to debug semiconductor source code and to summarize internal meeting notes, uploading proprietary information in the process. Because that data was transmitted to and retained on OpenAI's servers under the then-applicable terms of service, Samsung subsequently imposed a comprehensive ban on external AI tools, concerned that the proprietary information could not be retrieved or deleted from the system.

Regulatory Response Intensifies

Governments worldwide are scrambling to address AI privacy risks, but the technology is evolving faster than regulators can respond. The Italian data protection authority temporarily banned ChatGPT in March 2023 over GDPR concerns and lifted the ban on April 28, 2023 after OpenAI made changes, but that episode was only the beginning of a global regulatory reckoning.

Spain's data protection agency has launched investigations into OpenAI's data practices, while California's CPRA has tightened opt-out and "sensitive data" rules that apply even to data scraped from public pages. The regulatory momentum is building, with states like Illinois introducing comprehensive privacy acts, chatbot disclosure requirements, and employment AI legislation.

Multiple U.S. states have enacted comprehensive consumer data privacy laws, and several are developing AI-specific regulations for government and commercial use. For businesses, this creates a compliance nightmare: different rules in different jurisdictions, with penalties that under the GDPR can reach 20 million euros or 4% of global annual turnover, whichever is higher.

The legal reality is stark: using large-scale web-scraped datasets creates significant compliance challenges under prevailing privacy laws. Personal data in datasets triggers obligations under GDPR, CCPA, and other privacy frameworks, meaning entities cannot simply ignore these legal requirements and hope for the best.

Current Privacy Frameworks Are Fundamentally Inadequate

Research has revealed that current privacy frameworks are fundamentally inadequate for web-scale data practices. Two patterns emerge repeatedly when LLM projects encounter privacy challenges:

  • The collapse of individual control: The model of "privacy self-management" breaks down when individuals cannot meaningfully understand or control how their data is used across web-scale systems. Data that was once "public" may later be made private, but this change has no bearing on datasets already scraped.

  • Incomplete anonymization: While some developers attempt to mitigate risk through de-identification, audits demonstrate the limits of those efforts. At web scale, even small failure rates translate into millions of potential privacy harms: a 0.1% failure rate in anonymization applied to a dataset of 12.8 billion samples still leaves roughly 12.8 million samples exposed (see the back-of-the-envelope calculation below).
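
The arithmetic is simple but worth making explicit; the failure rates below are illustrative assumptions, not measured values.

    # Back-of-the-envelope check: expected leaked records are simply the
    # dataset size multiplied by the anonymization failure rate.
    dataset_size = 12_800_000_000  # 12.8 billion samples (CommonPool scale)

    for failure_rate in (0.001, 0.0001, 0.00001):  # 0.1%, 0.01%, 0.001%
        expected_leaks = dataset_size * failure_rate
        print(f"{failure_rate:.3%} failure rate -> {expected_leaks:,.0f} leaked samples")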

The "Publicly Available" Data Defense Falls Short

Legal analysis shows that relying on "publicly available" data as a defense is increasingly problematic. Laws such as the GDPR, CCPA, and OCPA make clear that the mere fact that data is online does not make it free for commercial use. The legal definition of "publicly available" turns on more than mere accessibility:

  • Both CCPA and OCPA require that controllers have a "reasonable basis" to believe data was lawfully made available by the consumer

  • Indiscriminate web-scraping fails this standard because automated systems cannot discern consumer intent or consent

  • Data disclosed to limited audiences or with restricted permissions doesn't qualify as "publicly available"

  • Biometric data collected without the consumer's knowledge is explicitly excluded from public data exemptions

A Proposed Six-Layer Mitigation Framework

Layer 1: Ethical Data Sourcing. License curated corpora rather than scraping indiscriminately. Respect robots.txt files and honor takedown requests. Prioritize datasets with clear consent and licensing terms.
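
As a minimal illustration of honoring robots.txt at collection time, Python's standard-library robotparser can gate each fetch; the crawler name and URLs below are hypothetical placeholders.

    # Check robots.txt before fetching a page for a training corpus.
    # Standard library only; URL and user-agent are hypothetical.
    from urllib import robotparser

    USER_AGENT = "example-research-crawler"   # hypothetical crawler name
    target = "https://example.com/profiles/jane-doe"

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # downloads and parses the site's robots.txt

    if rp.can_fetch(USER_AGENT, target):
        print("robots.txt permits fetching", target)
    else:
        # Respecting the disallow rule is the whole point of Layer 1.
        print("Skipping", target, "- disallowed by robots.txt")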

Layer 2: Pre-Processing Controls. Automated PII detectors combined with human spot-checks can help identify personal information before training, though studies show these methods are imperfect and can create what researchers call a "false sense of privacy." Additional steps include stripping technical metadata from documents and images, such as creation timestamps, author information, file paths, camera details, and geographic coordinates.
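
A minimal sketch of both controls follows, assuming the Pillow imaging library for metadata stripping; the regex patterns are deliberately naive and illustrate why such filters offer only partial protection.

    # Sketch of two Layer-2 controls: a naive regex PII pre-filter for text
    # and EXIF stripping for images. Patterns are illustrative and will miss
    # many real-world cases (the "false sense of privacy" caveat above).
    import re
    from PIL import Image

    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def flag_pii(text: str) -> dict:
        """Return matches per category so a human reviewer can spot-check."""
        return {name: pat.findall(text)
                for name, pat in PII_PATTERNS.items() if pat.findall(text)}

    def strip_image_metadata(src_path: str, dst_path: str) -> None:
        """Re-save pixel data only, dropping EXIF (timestamps, GPS, camera)."""
        with Image.open(src_path) as img:
            clean = Image.new(img.mode, img.size)
            clean.putdata(list(img.getdata()))
            clean.save(dst_path)

    print(flag_pii("Reach me at jane.doe@example.com or 555-867-5309."))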

Layer 3: Architecture Choices. Implement Retrieval-Augmented Generation (RAG) to keep sensitive information in separate, access-controlled databases rather than in model parameters. This enables granular deletion and access controls while maintaining functionality.
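
A toy sketch of the pattern, with an in-memory dictionary standing in for a real vector store and a hypothetical call_llm stub in place of an actual model client:

    # Toy RAG setup: sensitive records stay in an access-controlled store;
    # the model only ever sees what the caller is authorized to read.
    from dataclasses import dataclass

    @dataclass
    class Record:
        text: str
        allowed_roles: set

    STORE = {
        "doc1": Record("Q3 revenue draft: ...", {"finance"}),
        "doc2": Record("Published press release: ...", {"finance", "support"}),
    }

    def retrieve(query: str, user_role: str) -> list[str]:
        # A real system would rank by embedding similarity to `query`;
        # the access check is the part that matters here.
        return [r.text for r in STORE.values() if user_role in r.allowed_roles]

    def delete_record(doc_id: str) -> None:
        # Granular deletion: removing a record requires no model retraining.
        STORE.pop(doc_id, None)

    def call_llm(prompt: str) -> str:
        return "[model response]"      # placeholder for a real model client

    def answer(query: str, user_role: str) -> str:
        context = "\n".join(retrieve(query, user_role))
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        return call_llm(prompt)

    print(answer("What did we announce this quarter?", user_role="support"))

Because the sensitive text never enters the model's weights, honoring a deletion request becomes a store operation rather than a retraining project.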

Layer 4: Privacy Engineering. Apply differential privacy techniques during fine-tuning, adding calibrated noise to gradients so that individual data points cannot be recovered from the model. Explore emerging machine unlearning techniques to support right-to-be-forgotten requests, though these methods are still developing and may require significant computational resources.
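
The core mechanism is easiest to see on a toy model rather than an LLM: clip each example's gradient to a fixed norm, add Gaussian noise scaled to that norm, then average and update. The sketch below uses a linear model with illustrative hyperparameters; a production system would rely on a vetted library such as Opacus rather than hand-rolled code.

    # Toy DP-SGD step: clip per-example gradients, add calibrated noise,
    # then average and update. Hyperparameters are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 4))              # toy batch of 32 examples
    y = X @ np.array([1.0, -2.0, 0.5, 0.0])   # synthetic targets
    w = np.zeros(4)                           # model weights

    CLIP_NORM = 1.0          # per-example gradient bound C
    NOISE_MULTIPLIER = 1.1   # sigma: larger means stronger privacy, more noise
    LR = 0.1

    # Per-example gradients of squared error for the linear model.
    grads = 2 * (X @ w - y)[:, None] * X      # shape (32, 4)

    # 1. Clip each example's gradient so its norm is at most CLIP_NORM.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / CLIP_NORM)

    # 2. Sum, add noise calibrated to the clipping bound, and average.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=NOISE_MULTIPLIER * CLIP_NORM, size=w.shape)
    w -= LR * noisy_sum / len(X)

    print("Weights after one DP-SGD step:", w)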

Layer 5: Inference Controls. Implement monitoring systems that detect and prevent systematic data extraction attempts through rate-limiting and pattern recognition. Establish output auditing to flag potential data leaks or exposure of privacy-related content.
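
A simplified sketch of both controls, with illustrative thresholds and patterns (a real deployment would redact or block far more carefully):

    # Layer-5 sketch: a per-user sliding-window rate limiter to slow bulk
    # extraction, plus an output audit that redacts PII-like patterns.
    import re
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 30
    _requests = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow_request(user_id: str) -> bool:
        now = time.monotonic()
        window = _requests[user_id]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()          # drop timestamps outside the window
        if len(window) >= MAX_REQUESTS:
            return False              # likely scripted or bulk extraction
        window.append(now)
        return True

    LEAK_PATTERNS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email addresses
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US SSN format
    ]

    def audit_output(response: str) -> str:
        """Redact responses that look like they expose personal data."""
        for pattern in LEAK_PATTERNS:
            response = pattern.sub("[REDACTED]", response)
        return response

    if allow_request("user-42"):
        print(audit_output("Sure, her email is jane.doe@example.com"))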

Layer 6: LLM-Specific Governance. Conduct Data Protection Impact Assessments (DPIAs) that specifically address AI training data risks. Any entity subject to the GDPR, CCPA, or OCPA that uses large-scale web-scraped datasets must fulfill complex legal duties, including transparency requirements, establishing a lawful basis for processing, and honoring data subject rights. Develop incident-response playbooks for privacy breaches, including procedures for model retraining or retirement, and create audit procedures for third-party AI services.

The Path Forward

The stakes are rising rapidly. As AI becomes embedded in critical systems---from healthcare to finance to education---the privacy vulnerabilities documented here could affect billions of users. Companies deploying AI tools face potential regulatory fines, lawsuits, and reputational damage, while individuals risk having their most sensitive information leaked or misused in ways they never consented to.

The current moment represents a critical window. With the EU's AI Act obligations phasing in and multiple U.S. states considering new privacy laws, organizations that act now to implement robust privacy protections will be better positioned for the regulatory landscape ahead. Those that don't may find themselves scrambling to retrofit privacy measures into systems that were fundamentally designed without them.

For consumers, the message is clear: assume that anything you share with an AI system could potentially be stored, learned from, or even reproduced in responses to other users. Until stronger protections are in place, the convenience of AI tools comes with significant privacy trade-offs that most users don't fully understand.
