The Internet Is Eating Itself: Why AI Data Saturation Is the Privacy Crisis No One Is Ready For
On this page
- Where This Started: The Internet Was Never Built for Privacy
- The Saturation Thesis: What Happens When Public Data Runs Out
- What Computer Science Actually Teaches About Data Protection
- The Metadata Problem: What Encryption Was Never Protecting
- What the Courts Have Actually Decided
- The Architectural Answers That Actually Exist
- The Dark Pattern Layer: Why Good Options Are Hidden
- What I Actually Did About It: Aldform and ibbe
- What the Public Can Do: A Layered Approach
- The Real Answer: Privacy Is an Architecture, Not a Policy
- Shift in Communication Privacy
- Platform Policy Evolution
- Legal Precedents and Copyright in the AI Era
- Surveillance and Pattern Recognition
- Psychological Manipulation and UX Design
- Emerging Privacy-Preserving Technologies
- Statistical Privacy Models
- Decentralized Protocols and Sovereignty
- Practical Protection Frameworks
- Future Directions in Data Sovereignty
I wrote about this once before, from a place of personal frustration. Every time I opened a photo sharing app, sent a message, or scrolled through a feed, there was a background noise in my head asking whether any of it was actually private. The more I dug in, the louder it got. I eventually published that piece and stepped back from most mainstream platforms. But that essay was personal. This one is structural. Because what I have since learned from studying computer science, cybersecurity, cryptographic architecture, and the actual court records is that the discomfort I was feeling was not paranoia. It was pattern recognition. And the pattern is now playing out in real time, at a scale most people still have not fully understood.
Where This Started: The Internet Was Never Built for Privacy
The internet was not designed with privacy as a foundation. When Tim Berners-Lee published his proposal for the World Wide Web in 1989, the goal was open information sharing between academic institutions. The HTTP protocol, which still carries most of the web today, was stateless and transparent by design. There was no concept of data ownership, no concept of identity, and no anticipation that the request headers exchanged between a browser and a server would one day become a surveillance instrument. The Referer header, for instance, was built to help servers understand where traffic was coming from. Today it tells platforms exactly what search terms you typed before arriving on their page, including sensitive health queries, financial searches, and deeply personal information. Modern web standards do let developers set a Referrer-Policy, via an HTTP response header or an HTML meta element, that instructs the browser to send only the domain origin rather than the full URL, stripping the sensitive detail from that header. Most platforms have chosen not to implement it.
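To make the Referer leak concrete, here is a minimal sketch of what a browser transmits under two policy values. The function name, the example search URL, and the simplified two-policy logic are all illustrative assumptions; real browsers also distinguish same-origin navigations and HTTPS-to-HTTP downgrades.

```python
from urllib.parse import urlsplit

def referrer_sent(referring_url: str, policy: str) -> str:
    """Illustrative sketch (not browser-accurate) of what lands in the
    Referer header under different Referrer-Policy values."""
    parts = urlsplit(referring_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if policy == "no-referrer":
        return ""                      # header omitted entirely
    if policy == "strict-origin-when-cross-origin":
        return origin                  # cross-origin case: origin only
    return referring_url               # legacy behavior: full URL leaks

# Hypothetical sensitive search, then a click through to another site:
search = "https://example-search.com/results?q=early+symptoms+of+depression"
print(referrer_sent(search, "strict-origin-when-cross-origin"))
# The query string never leaves the browser; only the origin is sent.
```

The fix costs one header per response, which is what makes the near-universal failure to ship it a choice rather than a constraint.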
The commercial web of the late 1990s and 2000s did not correct this architectural gap. It exploited it. The dominant business model that emerged was not selling products to users. It was selling users to advertisers, packaged as behavioral profiles assembled from every click, scroll stop, dwell time, and search query. Google made this into a science. Facebook made it into a social graph. The data collection was always the product, not the byproduct. What changed in the 2020s was the demand side. Training large language models requires data volumes that dwarf what advertising profiling ever required. And that changed everything about how platforms think about the user data sitting inside their closed systems.
The Saturation Thesis: What Happens When Public Data Runs Out
Right now, AI companies are hitting what researchers are beginning to call a data wall. The open web, which provided the bulk of training material for models like GPT, Gemini, and Claude, is closing. Publishers are suing. Courts are drawing lines. Websites are enforcing robots.txt at the server level and actively updating those files to block AI crawlers. The consequence of this is not that AI companies stop training. It is that they turn inward. They train on what their users already gave them, inside closed systems, under terms of service that most people accepted without reading.
Meta's removal of end-to-end encryption from Instagram direct messages, effective May 8, 2026, is the clearest public signal of this shift. The official explanation given was low feature usage. The structural explanation is data access. Without end-to-end encryption, Meta can read those messages. With a corpus of billions of human conversations and a sufficiently capable model, those messages become training material. Vercel, the developer infrastructure platform, updated its terms of service in March 2026 to explicitly allow using code deployments, agent interactions, and platform telemetry to improve its AI products, with an opt-out window that closed on March 31, 2026. These are not isolated decisions by two unrelated companies. They are the same capital allocation logic expressing itself across different product categories. When user data becomes your competitive moat, you stop protecting users from accessing it and start protecting your access to it.
What Computer Science Actually Teaches About Data Protection
To understand how badly the current situation diverges from what is technically correct, you need to understand what proper data protection looks like at the infrastructure level. When a web server stores your password, the correct approach, established in computer science education and industry best practices for decades, is to never store the password itself. Instead, the server passes it through a hashing function: a one-way mathematical transformation that produces a fixed-length output string from any input. SHA-2 and SHA-3 are the current industry standards for this operation. The transformation is irreversible by design. The server cannot recover your password from the hash because the mathematics does not work in that direction.
Hashing alone is not sufficient, because two users with the same password would produce identical hashes, making precomputed lookup tables, called rainbow tables, a viable attack. The correct countermeasure is salting: adding a unique random value to each password before hashing, so that identical passwords produce completely different stored values. A server that implements hashing and salting correctly cannot tell you your own password. It can only verify that what you typed, when salted and hashed, matches what it stored. That is what correctly implemented protection looks like: the system is architecturally incapable of a betrayal it might otherwise be pressured into.
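The hash-and-salt scheme described above can be sketched in a few lines. This is a teaching illustration of the principle, not a production recipe: function names are mine, and real systems layer a deliberately slow key-derivation function such as PBKDF2, bcrypt, or Argon2 on top of the same idea.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Salted SHA-256 sketch: store (salt, digest), never the password."""
    salt = salt or os.urandom(16)      # unique random salt per user
    digest = hashlib.sha256(salt + password.encode()).digest()
    return salt, digest

def verify_password(password, salt, stored):
    """Re-derive and compare in constant time; plaintext is never stored."""
    candidate = hashlib.sha256(salt + password.encode()).digest()
    return hmac.compare_digest(candidate, stored)

salt, stored = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, stored)
assert not verify_password("wrong guess", salt, stored)

# The salt defeats rainbow tables: identical passwords, different salts,
# completely different stored values.
assert hash_password("hunter2")[1] != hash_password("hunter2")[1]
```

Note what the server can and cannot do here: it can answer "does this guess match?" but is architecturally incapable of answering "what is the password?"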
The same principle applies to data deletion. When you drag a file to the recycle bin and empty it, the operating system marks that region of disk space as available for reuse. The data is still physically present on the storage medium until something else overwrites it. Forensic recovery tools can reconstruct deleted files from that space with high fidelity. True secure deletion requires overwriting the disk sectors with random data, which is why specialized tools exist for exactly this purpose. The everyday user who assumes emptying the trash deleted their data is operating under a misunderstanding that the operating system interface actively encourages.
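The overwrite-before-unlink behavior that specialized tools implement looks roughly like this sketch. The function name is mine, and an important caveat applies: on SSDs, wear-leveling and copy-on-write filesystems can leave stale copies the OS cannot reach, which is why full-disk encryption is the more reliable modern answer.

```python
import os
import tempfile

def secure_delete(path, passes=1):
    """Sketch of secure deletion: overwrite the file's bytes with
    random data, flush to the device, and only then unlink."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # replace contents with noise
            f.flush()
            os.fsync(f.fileno())        # force the write to the device
    os.remove(path)                     # finally drop the directory entry

# Demo: create a temporary file, then securely delete it.
fd, path = tempfile.mkstemp()
os.write(fd, b"sensitive contents")
os.close(fd)
secure_delete(path)
assert not os.path.exists(path)
```

Compare this with emptying the trash, which performs only the last step and leaves every byte recoverable.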
End-to-end encryption extends this same principle to communications. The platform holds encrypted ciphertext. The decryption keys exist only on the communicating devices. Mathematically, the platform cannot read the message because it does not have the key required to transform ciphertext into plaintext. When Instagram removed this option from its messaging feature, it did not change a privacy setting. It changed the cryptographic architecture so that it now joins the pool of platforms that are structurally capable of reading every message on their system. That is a significant and deliberate shift.
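The structural point, that a platform holding only ciphertext mathematically cannot read the message, can be shown with the simplest cipher that has a security proof: a one-time pad. Real messengers use far more sophisticated schemes (the Signal protocol, for instance), but the property illustrated is the same: the key exists only at the endpoints.

```python
import secrets

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """One-time-pad sketch: XOR with a random key as long as the
    message. Illustration of the end-to-end principle only."""
    assert len(key) == len(plaintext)
    return bytes(p ^ k for p, k in zip(plaintext, key))

decrypt = encrypt  # XOR is its own inverse

message = b"meet at the usual place"
key = secrets.token_bytes(len(message))   # generated on the sender's device

ciphertext = encrypt(message, key)  # this is all the platform ever holds
assert ciphertext != message
assert decrypt(ciphertext, key) == message  # only the endpoints can do this
```

Removing E2EE does not change a toggle in this picture; it moves the key onto the platform's side of the line.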
The Metadata Problem: What Encryption Was Never Protecting
Most people hear "end-to-end encrypted" and assume their communications are safe. This is where the conversation gets technically important. Encryption protects the content of a message. It has never protected metadata: who you communicated with, how often, at what time of day, from which location, for how long, and with what regularity of pattern. A leaked NSA document from the Snowden archive explicitly described metadata as the agency's most useful tool, noting that their collection systems were ingesting 125 million metadata records per day even in 2004. Former NSA Director Michael Hayden stated publicly that the US government makes lethal targeting decisions based on metadata. The content of the messages was not required.
This matters for the AI data question because platforms that offer end-to-end encryption almost universally still collect full metadata, and metadata reconstructs a person's life with high accuracy. Communication pattern analysis alone can infer whether each relationship in your network is romantic or professional, whether you are managing a health crisis, what your political orientation is, and what your physical movement patterns look like across a day, without ever accessing a single word you wrote. And the consent to collect that metadata is rarely a free choice, because the interfaces through which users could refuse it are engineered to go unused. The design of the opt-out process is the real privacy policy. The existence of the opt-out is the legal shield.
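How little metadata is needed for this kind of inference is easy to demonstrate. The toy log below contains only contact names and timestamps, no content, and the contacts, dates, and labeling thresholds are all invented for illustration; real traffic-analysis systems are vastly more sophisticated, but the principle is identical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical metadata records: (contact, timestamp). No content at all.
log = [
    ("alex",   datetime(2026, 3, 2, 23, 41)),
    ("alex",   datetime(2026, 3, 3, 23, 55)),
    ("alex",   datetime(2026, 3, 4, 0, 12)),
    ("clinic", datetime(2026, 3, 3, 9, 15)),
    ("clinic", datetime(2026, 3, 10, 9, 20)),
    ("boss",   datetime(2026, 3, 3, 8, 58)),
]

def profile(log):
    """Toy pattern analysis: message frequency plus time-of-day is
    enough to label relationships without reading a single message."""
    out = {}
    for contact in Counter(c for c, _ in log):
        hours = [t.hour for c, t in log if c == contact]
        late_night = sum(h >= 22 or h < 6 for h in hours) / len(hours)
        out[contact] = {
            "messages": len(hours),
            "guess": "personal/intimate" if late_night > 0.5
                     else "business/appointment",
        }
    return out

print(profile(log))
```

Six timestamps, and the log already distinguishes a late-night relationship from a recurring weekday-morning appointment. Scale that to years of records and the reconstruction is nearly total.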
What the Courts Have Actually Decided
The legal landscape as of early 2026 gives us a working map of where the lines are being drawn, and understanding these cases is important because they are the direct cause of the inward turn toward platform-held data.
| Case | Year | Outcome | What It Means |
|---|---|---|---|
| Thomson Reuters v. Ross Intelligence | 2025 | Plaintiff won | Scraping to build a competing product in the same market is infringement, not fair use |
| Bartz v. Anthropic | June 2025 | Mixed: training won, data hoarding lost | Training itself is transformative; building a pirated library to do it is not |
| Kadrey v. Meta | June 2025 | Meta won on fair use | Torrenting books for model training passed the transformative use test |
| New York Times v. OpenAI | Ongoing | 20M conversation logs ordered produced | Verbatim reproduction of copyrighted content in outputs is a separate legal theory from training |
| Reddit v. Perplexity AI | Ongoing | Unresolved | Bypassing rate limits and CAPTCHA systems may violate DMCA Section 1201, independent of copyright |
| In re Clearview AI (BIPA) | 2025 | $51.8M settlement | Biometric scraping without consent violates state biometric privacy law |
What these cases collectively establish is that courts are increasingly protecting the training process itself while restricting the data acquisition method. The practical implication is that the piracy route to training data is closing legally. But the internal platform data route, where a company uses what its own users generated inside its own system under its own terms of service, faces essentially no legal obstacle. This is the gap that the inward turn is designed to exploit. No lawsuit currently being litigated closes it.
The Architectural Answers That Actually Exist
The question I keep returning to is not how to make data harder to find. It is how to make data structurally useless to anyone except its owner, even while it is being processed. That is a different and more demanding engineering problem, and there are real answers to it.
Homomorphic encryption is the most radical solution. It allows a server to perform computations on encrypted data without ever decrypting it. The server operates entirely in the encrypted domain and returns an encrypted result. It learns nothing about the input. In 2025, over 250 million financial transactions were processed through fully homomorphic encryption systems, and healthcare analytics firms ran aggregate insights across 110 million patient records without ever exposing the underlying content to their own infrastructure. The current limitation is computational cost: FHE operations run orders of magnitude slower than plaintext computation, making real-time consumer applications uneconomic at present. That gap is narrowing as hardware catches up to the mathematical requirements of the approach.
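The compute-on-ciphertext property sounds like magic until you see it in arithmetic. Below is a toy version of the Paillier cryptosystem, an additively homomorphic (though not fully homomorphic) scheme, using deliberately tiny primes so the numbers stay readable. Production systems use keys thousands of bits long and hardened implementations; this sketch only demonstrates the property itself.

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes (illustration only).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)       # simplification valid because g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n * mu) % n

a, b = encrypt(20), encrypt(22)
# The "server" adds the plaintexts by multiplying the ciphertexts,
# without ever decrypting either one:
assert decrypt((a * b) % n2) == 42
```

The server in this picture performed a meaningful computation while learning nothing about the inputs. Fully homomorphic schemes extend this from addition to arbitrary circuits, which is where the orders-of-magnitude cost currently lives.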
Differential privacy takes a fundamentally different approach. Rather than hiding the data from computation, it mathematically degrades the individual signal while preserving aggregate patterns. Before any data is used for analysis or model training, a calibrated amount of statistical noise is injected. The mathematics provides a formal, provable bound on how much information about any specific individual can ever be recovered from the output, regardless of what an adversary already knows. Apple uses this for keyboard and emoji usage analytics on iOS. Research published in late 2025 demonstrated dynamic versions of differential privacy that are scalable enough for real-time deployment. This approach does not stop data collection. It makes what is collected provably useless for individual targeting while still being useful in aggregate, which is a technically honest trade-off.
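The noise-injection step has a concrete standard form: the Laplace mechanism. This sketch (function names mine, stdlib only) releases a count with noise scaled to sensitivity divided by epsilon, which is the calibration that yields the formal epsilon-DP guarantee.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via inverse-CDF, stdlib only."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace mechanism sketch: one person joining or leaving the
    dataset changes a count by at most `sensitivity`, so noise with
    scale sensitivity/epsilon gives an epsilon-DP release."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Any single release is noisy, but the aggregate signal survives:
releases = [private_count(1000, epsilon=0.5) for _ in range(20000)]
print(sum(releases) / len(releases))   # close to the true count of 1000
```

Smaller epsilon means more noise and stronger privacy; the analyst tunes that dial explicitly, which is exactly the "technically honest trade-off" described above.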
Federated learning extends this principle to model training itself. Instead of sending raw data to a central server, the model weights are sent to the user's device, which trains locally on the user's data, and only the resulting gradient update, a mathematical delta representing what the model learned, is transmitted back. The raw data never leaves the device. Google Keyboard already operates on this basis. The limitation is that gradient inversion attacks can sometimes reconstruct approximate training samples from gradient updates, which is why federated learning is most robust when combined with differential privacy applied to the gradient before transmission.
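A stripped-down federated averaging loop makes the data-locality point visible in code. Everything here is a toy of my own construction: two hypothetical "devices" each fit the one-parameter model y = w·x on private data drawn from y = 2x, and only gradients cross the wire.

```python
# Federated averaging sketch: each "device" computes a gradient on its
# own data and ships back only that gradient, never the raw samples.
def local_gradient(w, data):
    # d/dw of mean squared error for the model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

# Hypothetical on-device datasets (both consistent with y = 2x):
clients = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0)],
]

w, lr = 0.0, 0.05
for _ in range(100):
    grads = [local_gradient(w, data) for data in clients]  # computed locally
    w -= lr * sum(grads) / len(grads)  # server sees averaged gradients only

print(round(w, 3))   # converges toward the true slope of 2.0
```

The gradient-inversion caveat in the paragraph above is exactly why real deployments add differential-privacy noise to `grads` before transmission rather than trusting the delta to be unrevealing on its own.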
Tim Berners-Lee's Solid project represents the most architecturally complete rethinking of where data should live by default. Under Solid, all personal data lives in a Personal Online Data Pod that the user controls, hosted wherever they choose. Applications request permission to read specific pieces of data from that pod. The user grants or revokes access at any time. The application cannot access data the user did not explicitly permit, and there is no terms of service update any company can push that changes that. The Open Data Institute formally adopted Solid into its portfolio in October 2024. This is categorically different from a privacy policy promise. It is structural. The enforcement mechanism is mathematical and architectural, not contractual.
Zero-knowledge proofs add another layer to this architecture. A ZKP allows one party to prove to another that a statement is true without revealing any of the underlying information used to prove it. You can prove you are over 18 without revealing your birthdate. You can prove you have sufficient funds without revealing your account balance. In the context of platform verification and identity, this makes it possible to satisfy compliance requirements with provable guarantees while releasing zero personal data. ZKP is already deployed in several blockchain-based financial systems and is making its way into identity verification workflows for regulated industries.
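The prove-without-revealing property can be demonstrated with a toy Schnorr identification protocol, one of the classical interactive ZKPs. The group parameters below are tiny teaching values I chose so the arithmetic is legible; real systems use elliptic-curve groups with roughly 256-bit order.

```python
import random

# Toy Schnorr protocol in the subgroup of order q = 11 generated by
# g = 2 modulo p = 23 (illustration-sized parameters only).
p, q, g = 23, 11, 2

x = 7                    # the prover's secret
y = pow(g, x, p)         # public value; recovering x from y is the
                         # discrete logarithm problem

def prove_round() -> bool:
    k = random.randrange(q)      # prover's fresh commitment nonce
    t = pow(g, k, p)             # commitment, sent to the verifier
    c = random.randrange(q)      # verifier's random challenge
    s = (k + c * x) % q          # response: reveals nothing about x alone
    # Verifier checks the algebra without ever learning x:
    return pow(g, s, p) == (t * pow(y, c, p)) % p

# Each round convinces the verifier with soundness error 1/q;
# repeating rounds drives a cheater's success probability toward zero.
assert all(prove_round() for _ in range(32))
```

The verifier ends up certain the prover knows x, yet every value it saw (t, c, s) is statistically simulatable without x, which is the formal meaning of "zero knowledge."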
The Dark Pattern Layer: Why Good Options Are Hidden
Understanding the technical solutions also requires understanding why they are not being offered. Research published in the Journal of Advertising in late 2025 found through three controlled experimental studies that dark patterns in data consent interfaces, including confusing language, limited choice architecture, and difficult-to-locate opt-outs, directly suppress users' perception of control over their own data and are architecturally designed to bypass active resistance. This is not an accident of design. It is the output of a conversion optimization process applied to privacy controls. The opt-out exists to satisfy regulatory requirements. The friction exists to ensure almost nobody completes it. The gap between those two facts is where most user data is captured.
The photo sharing platform I discussed in my earlier piece hid its encrypted messaging option inside a conversation's contact menu, behind a single low-contrast text line with no visual indicator that it was interactive. Your existing conversation did not upgrade. A new separate thread opened. A company with thousands of engineers did not do this because upgrading an existing conversation was technically infeasible. It took that design team far longer to ship that specific friction than it would have taken to build a one-tap upgrade prompt. That is a product decision that favors data access over user protection, dressed as a design limitation. And regulators accepted the explanation because the option technically existed.
What I Actually Did About It: Aldform and ibbe
I want to be concrete about something, because I think this conversation is too often theoretical. I run two companies. At Aldform and at ibbe, we have implemented end-to-end encryption by default across every piece of user data we handle: email addresses, phone numbers, names, and every other identifying data field. Not as a premium tier. Not as an opt-in setting. By architectural default, for every user, from the moment their data enters our systems.
I want to explain what that decision actually costs, because the privacy conversation usually skips this. Building E2EE as a default across your entire data model means your own infrastructure cannot query plaintext user data. You cannot run naive database searches across personal information fields. You cannot build behavioral recommendation systems on that data without significant additional cryptographic complexity. Your debugging workflows change. Your customer support workflows change. Your analytics pipeline changes. These are real engineering and operational costs that we absorbed deliberately. The reason is not altruism alone. As AI makes user data more economically valuable every year, being a platform that is structurally incapable of betraying its users is not just an ethical position. It is a product position and a long-term trust position. We decided early that we would rather build on that foundation than on one where user trust is a setting buried four menus deep.
What the Public Can Do: A Layered Approach
The systemic fix requires regulatory and architectural change at a scale individuals cannot force alone. But the individual response is not helpless, and the most impactful moves are about changing which architectural category of tool you depend on, not which terms of service you accept.
At the account layer, the CS50 cybersecurity curriculum is direct about this: brute force can crack a four-digit PIN in milliseconds and a four-letter password in seconds. The correct response is a passphrase of 20 or more characters, unique per service, stored in a password manager like Bitwarden, which is open-source and independently audited. NIST guidelines now recommend long passphrases over short complex passwords specifically because the latter are hard to remember and lead to reuse, which is far more dangerous than the complexity deficit. Two-factor authentication should use a hardware key or an authenticator app rather than SMS, because SIM-swap attacks can intercept SMS codes without requiring physical access to your device.
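The arithmetic behind that advice is worth doing once explicitly. The attack rate below is my assumption (one billion guesses per second, a modest figure for offline attacks against fast hashes); the conclusion is insensitive to the exact number.

```python
# Back-of-envelope brute-force arithmetic behind the passphrase advice.
GUESSES_PER_SECOND = 1e9   # assumed offline attack rate

def seconds_to_crack(alphabet_size: int, length: int) -> float:
    """Worst-case time to exhaust the full keyspace."""
    return alphabet_size ** length / GUESSES_PER_SECOND

print(seconds_to_crack(10, 4))    # 4-digit PIN: about 10 microseconds
print(seconds_to_crack(26, 4))    # 4 lowercase letters: well under a second
years = seconds_to_crack(26, 20) / (3600 * 24 * 365)
print(f"{years:.1e} years")       # 20 lowercase letters: ~6e11 years
```

Length beats complexity because the keyspace grows exponentially in length and only polynomially in alphabet size, which is the reasoning behind the NIST passphrase guidance.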
At the communication layer, the switch from Instagram DMs to Signal is the single highest-impact move for most users. Signal is a non-profit. It stores almost no metadata because it was architecturally designed not to. It cannot disclose who you talked to or when because it genuinely does not have that information. Proton Mail applies the same logic to email: zero-knowledge encrypted, meaning Proton's own servers cannot read your messages. Ente Photos applies it to photo storage: client-side encrypted before upload, fully open-source.
At the network layer, your ISP logs every domain you query by default through its DNS resolver. Switching to Quad9 (9.9.9.9) or NextDNS routes those queries through privacy-respecting resolvers and blocks tracker domains at the DNS level, before requests even reach websites. In independent Electronic Frontier Foundation testing, Brave is the only major browser that randomizes your browser fingerprint, making cross-site tracking technically infeasible rather than just policy-restricted. The Global Privacy Control header, supported by Brave and Firefox, signals automatically to every website that you do not consent to data sale, and some jurisdictions legally require websites to honor it.
At the deepest layer, the shift that matters most is moving from platform-hosted to locally-hosted or protocol-based tools. Obsidian and Logseq store notes locally by default. Nextcloud gives you your own file storage. Miniflux gives you your own RSS reader. The Fediverse, built on the ActivityPub protocol and running through platforms like Mastodon and Pixelfed, means there is no central company to update its terms of service, because the protocol belongs to no one. These tools require slightly more setup. They require zero trust in a corporation's current intentions or future board decisions.
The Real Answer: Privacy Is an Architecture, Not a Policy
Every module in CS50's cybersecurity curriculum, whether it covers password hashing, memory safety, buffer overflow prevention, XSS injection, or HTTP header leakage, is teaching the same underlying lesson: security and privacy are properties of a system's design, not of its documentation. A system that hashes and salts passwords cannot accidentally expose them. A system with proper memory bounds cannot be exploited through a buffer overflow. A browser configured to strip the Referer header cannot leak your search history. The protection is structural. It does not depend on good intentions holding under commercial pressure.
The internet's original design mistake was separating data from identity. Your data lives somewhere else, owned by someone else, under their rules, with their ability to update those rules at any time. Every genuine architectural fix being developed right now, Solid pods, homomorphic encryption, differential privacy, federated learning, zero-knowledge proofs, local-first software, decentralized protocols, is some version of correcting that original error. The technology to do this correctly has existed in various forms for years. What has been missing is the economic and political will to build on it rather than around it. The courts are now closing some of the old data acquisition routes. Regulations like India's DPDP Act and the EU AI Act are beginning to impose structural requirements rather than just disclosure requirements. And a small number of builders have decided that user trust, real structural trust, is worth more than the data asymmetry that betrays it. I am one of them. The question is whether enough of the industry follows before the architecture of extraction becomes too entrenched to replace.
Reference Material & Expanded Research
Shift in Communication Privacy
- Instagram E2EE Sunset (Firstpost)
- Implications of Meta's Encryption Policy (Times of India)
- Instagram E2EE Shutdown Coverage (The Hacker News)
- Meta Killing E2EE in DMs (Engadget)
- Analysis of Instagram's Privacy Pivot (Mashable)
- What Users Need to Know (NDTV)
- Meta's New Data Strategy (Moneycontrol)
Platform Policy Evolution
Legal Precedents and Copyright in the AI Era
- Drawing Lines Around Training Piracy (Reuters)
- AI Copyright Lawsuit Tracker (Copyright Alliance)
- Fair Use Tests for Training Data (IPWatchdog)
- Insights on Fair Use and AI (Skadden)
- AI Litigation Updates (McKool Smith)
- Copyright Litigation Trends for 2026 (MoFo)
- Major Legal Cases of 2025-26 (Internet Lawyer Blog)
- AI Copyright Class Action Tracker (BakerHostetler)
- Key Lessons from Recent AI Cases (LinkedIn/Dr. Deepak)
- Top Privacy and Data Protection Cases (Inforrm)
- Legal Compliance in Web Scraping (Tendem.ai)
- Scraping Landscape in 2026 (Apify)
- Web Scraping and AI Litigation (ZwillGen)
Surveillance and Pattern Recognition
- NSA Metadata Collection Insights (Business Insider)
- The Importance of Metadata Surveillance (LinkedIn/Nick Evans)
Psychological Manipulation and UX Design
- Suppressing User Control via Interface (Journal of Advertising)
- Architecture of Choice Presentation (ACM Digital Library)
- Persuasion Awareness in Privacy Interfaces (ScienceDirect)
- Checklist for Avoiding Dark Patterns (Secure Privacy)
Emerging Privacy-Preserving Technologies
- Securing AI with Homomorphic Encryption (Dialzara)
- Privacy-Preserving Model Inference (Gopher Security)
- FHE vs. Confidential Computing (Cloud Security Alliance)
- Homomorphic Encryption Advances (ScienceDirect)
Statistical Privacy Models
- Theory and Practice of Differential Privacy (DP.org)
- Dynamic Differential Privacy Scalability (Wiley Online Library)
Decentralized Protocols and Sovereignty
- The Solid Project (Wikipedia)
- Tim Berners-Lee's Vision for Data Control (TechTarget)
- ODI and Solid Integration (The ODI)
- Dreams of Decentralization (Reddit/Ethereum)
Practical Protection Frameworks
- PrivacyTools.io Directory
- Privacy Guides: Recommended Tools
- Review of Best Private Browsers (PCMag)
- Security-First Browser Comparison (CloudSek)
- Top Privacy Tools 2026 (CamoCopy)
- Self-Hosted Privacy Stack (LightningDev123)
- In-Depth Guide to Digital Sovereignty (YouTube)
Future Directions in Data Sovereignty
- Privacy Trends for the Coming Year (Secure Privacy)
- Data Protection Strategies for 2026 (Hyperproof)
- Venture Perspectives on Privacy Trends (Insights4VC)
- Latest PET Implementations (StarAgile)
- Data Privacy by Design (Vofox)
- Data Privacy Week Insights (SparxIT)
- International Data Protection Trends (Chambers)