The Internet Is Eating Itself: Why AI Data Saturation Is the Privacy Crisis No One Is Ready For
On this page
- Where This Started: The Internet Was Never Built for Privacy
- The Saturation Thesis: What Happens When Public Data Runs Out
- What Computer Science Actually Teaches About Data Protection
- The Metadata Problem: What Encryption Was Never Protecting
- What the Courts Have Actually Decided
- The Architectural Answers That Actually Exist
- The Dark Pattern Layer: Why Good Options Are Hidden
- What I Actually Did About It: Aldform and ibbe
- What the Public Can Do: A Layered Approach
- The Real Answer: Privacy Is an Architecture, Not a Policy
- Shift in Communication Privacy
- Platform Policy Evolution
- Legal Precedents and Copyright in the AI Era
- Surveillance and Pattern Recognition
- Psychological Manipulation and UX Design
- Emerging Privacy-Preserving Technologies
- Statistical Privacy Models
- Decentralized Protocols and Sovereignty
- Practical Protection Frameworks
- Future Directions in Data Sovereignty
I wrote about this once before, from a place of personal frustration. Every time I opened a photo sharing app, sent a message, or scrolled through a feed, there was a background noise in my head asking whether any of it was actually private. The more I dug in, the louder it got. I eventually published that piece and stepped back from most mainstream platforms. But that essay was personal. This one is structural. Because what I have since learned from studying computer science, cybersecurity, cryptographic architecture, and the actual court records is that the discomfort I was feeling was not paranoia. It was pattern recognition. And the pattern is now playing out in real time, at a scale most people still have not fully understood.
Where This Started: The Internet Was Never Built for Privacy
The internet was not designed with privacy as a foundation. When Tim Berners-Lee published his proposal for the World Wide Web in 1989, the goal was open information sharing between academic institutions. The HTTP protocol, which still carries most of the web today, was stateless and transparent by design. There was no concept of data ownership, no concept of identity, and no anticipation that the request headers exchanged between a browser and a server would one day become a surveillance instrument. The Referer header, for instance, was built to help servers understand where traffic was coming from. Today it tells platforms exactly what search terms you typed before arriving on their page, including sensitive health queries, financial searches, and deeply personal information. Modern web standards do let developers set a Referrer-Policy, via an HTTP response header or an HTML meta element, that instructs the browser to send only the domain origin rather than the full URL, stripping the sensitive detail from that header. Most platforms have chosen not to implement it.
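To make the Referer leak concrete, here is a minimal sketch of what a browser transmits under two policy values. The function name, the example search URL, and the simplified two-policy logic are all illustrative assumptions; real browsers also distinguish same-origin navigations and HTTPS-to-HTTP downgrades.

```python
from urllib.parse import urlsplit

def referrer_sent(referring_url: str, policy: str) -> str:
    """Illustrative sketch (not browser-accurate) of what lands in the
    Referer header under different Referrer-Policy values."""
    parts = urlsplit(referring_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if policy == "no-referrer":
        return ""                      # header omitted entirely
    if policy == "strict-origin-when-cross-origin":
        return origin                  # cross-origin case: origin only
    return referring_url               # legacy behavior: full URL leaks

# Hypothetical sensitive search, then a click through to another site:
search = "https://example-search.com/results?q=early+symptoms+of+depression"
print(referrer_sent(search, "strict-origin-when-cross-origin"))
# The query string never leaves the browser; only the origin is sent.
```

The fix costs one header per response, which is what makes the near-universal failure to ship it a choice rather than a constraint.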
The commercial web of the late 1990s and 2000s did not correct this architectural gap. It exploited it. The dominant business model that emerged was not selling products to users. It was selling users to advertisers, packaged as behavioral profiles assembled from every click, scroll stop, dwell time, and search query. Google made this into a science. Facebook made it into a social graph. The data collection was always the product, not the byproduct. What changed in the 2020s was the demand side. Training large language models requires data volumes that dwarf what advertising profiling ever required. And that changed everything about how platforms think about the user data sitting inside their closed systems.
The Saturation Thesis: What Happens When Public Data Runs Out
Right now, AI companies are hitting what researchers are beginning to call a data wall. The open web, which provided the bulk of training material for models like GPT, Gemini, and Claude, is closing. Publishers are suing. Courts are drawing lines. Websites are enforcing robots.txt at the server level and actively updating those files to block AI crawlers. The consequence of this is not that AI companies stop training. It is that they turn inward. They train on what their users already gave them, inside closed systems, under terms of service that most people accepted without reading.
Meta's removal of end-to-end encryption from Instagram direct messages, effective May 8, 2026, is the clearest public signal of this shift. The official explanation given was low feature usage. The structural explanation is data access. Without end-to-end encryption, Meta can read those messages. With a corpus of billions of human conversations and a sufficiently capable model, those messages become training material. Vercel, the developer infrastructure platform, updated its terms of service in March 2026 to explicitly allow using code deployments, agent interactions, and platform telemetry to improve its AI products, with an opt-out window that closed on March 31, 2026. These are not isolated decisions by two unrelated companies. They are the same capital allocation logic expressing itself across different product categories. When user data becomes your competitive moat, you stop protecting users from accessing it and start protecting your access to it.
What Computer Science Actually Teaches About Data Protection
To understand how badly the current situation diverges from what is technically correct, you need to understand what proper data protection looks like at the infrastructure level. When a web server stores your password, the correct approach, established in computer science education and industry best practices for decades, is to never store the password itself. Instead, the server passes it through a hashing function: a one-way mathematical transformation that produces a fixed-length output string from any input. SHA-2 and SHA-3 are the current industry standards for this operation. The transformation is irreversible by design. The server cannot recover your password from the hash because the mathematics does not work in that direction.
Hashing alone is not sufficient, because two users with the same password would produce identical hashes, making precomputed lookup tables, called rainbow tables, a viable attack. The correct countermeasure is salting: adding a unique random value to each password before hashing, so that identical passwords produce completely different stored values. A server that implements hashing and salting correctly cannot tell you your own password. It can only verify that what you typed, when salted and hashed, matches what it stored. That is what correctly implemented protection looks like: the system is architecturally incapable of a betrayal it might otherwise be pressured into.
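The hash-and-salt scheme described above can be sketched in a few lines. This is a teaching illustration of the principle, not a production recipe: function names are mine, and real systems layer a deliberately slow key-derivation function such as PBKDF2, bcrypt, or Argon2 on top of the same idea.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Salted SHA-256 sketch: store (salt, digest), never the password."""
    salt = salt or os.urandom(16)      # unique random salt per user
    digest = hashlib.sha256(salt + password.encode()).digest()
    return salt, digest

def verify_password(password, salt, stored):
    """Re-derive and compare in constant time; plaintext is never stored."""
    candidate = hashlib.sha256(salt + password.encode()).digest()
    return hmac.compare_digest(candidate, stored)

salt, stored = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, stored)
assert not verify_password("wrong guess", salt, stored)

# The salt defeats rainbow tables: identical passwords, different salts,
# completely different stored values.
assert hash_password("hunter2")[1] != hash_password("hunter2")[1]
```

Note what the server can and cannot do here: it can answer "does this guess match?" but is architecturally incapable of answering "what is the password?"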
The same principle applies to data deletion. When you drag a file to the recycle bin and empty it, the operating system marks that region of disk space as available for reuse. The data is still physically present on the storage medium until something else overwrites it. Forensic recovery tools can reconstruct deleted files from that space with high fidelity. True secure deletion requires overwriting the disk sectors with random data, which is why specialized tools exist for exactly this purpose. The everyday user who assumes emptying the trash deleted their data is operating under a misunderstanding that the operating system interface actively encourages.
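The overwrite-before-unlink behavior that specialized tools implement looks roughly like this sketch. The function name is mine, and an important caveat applies: on SSDs, wear-leveling and copy-on-write filesystems can leave stale copies the OS cannot reach, which is why full-disk encryption is the more reliable modern answer.

```python
import os
import tempfile

def secure_delete(path, passes=1):
    """Sketch of secure deletion: overwrite the file's bytes with
    random data, flush to the device, and only then unlink."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # replace contents with noise
            f.flush()
            os.fsync(f.fileno())        # force the write to the device
    os.remove(path)                     # finally drop the directory entry

# Demo: create a temporary file, then securely delete it.
fd, path = tempfile.mkstemp()
os.write(fd, b"sensitive contents")
os.close(fd)
secure_delete(path)
assert not os.path.exists(path)
```

Compare this with emptying the trash, which performs only the last step and leaves every byte recoverable.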
End-to-end encryption extends this same principle to communications. The platform holds encrypted ciphertext. The decryption keys exist only on the communicating devices. Mathematically, the platform cannot read the message because it does not have the key required to transform ciphertext into plaintext. When Instagram removed this option from its messaging feature, it did not change a privacy setting. It changed the cryptographic architecture so that it now joins the pool of platforms that are structurally capable of reading every message on their system. That is a significant and deliberate shift.
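The structural point, that a platform holding only ciphertext mathematically cannot read the message, can be shown with the simplest cipher that has a security proof: a one-time pad. Real messengers use far more sophisticated schemes (the Signal protocol, for instance), but the property illustrated is the same: the key exists only at the endpoints.

```python
import secrets

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """One-time-pad sketch: XOR with a random key as long as the
    message. Illustration of the end-to-end principle only."""
    assert len(key) == len(plaintext)
    return bytes(p ^ k for p, k in zip(plaintext, key))

decrypt = encrypt  # XOR is its own inverse

message = b"meet at the usual place"
key = secrets.token_bytes(len(message))   # generated on the sender's device

ciphertext = encrypt(message, key)  # this is all the platform ever holds
assert ciphertext != message
assert decrypt(ciphertext, key) == message  # only the endpoints can do this
```

Removing E2EE does not change a toggle in this picture; it moves the key onto the platform's side of the line.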
The Metadata Problem: What Encryption Was Never Protecting
Most people hear "end-to-end encrypted" and assume their communications are safe. This is where the conversation gets technically important. Encryption protects the content of a message. It has never protected metadata: who you communicated with, how often, at what time of day, from which location, for how long, and with what regularity of pattern. A leaked NSA document from the Snowden archive explicitly described metadata as the agency's most useful tool, noting that their collection systems were ingesting 125 million metadata records per day even in 2004. Former NSA Director Michael Hayden stated publicly that the US government makes lethal targeting decisions based on metadata. The content of the messages was not required.
This matters for the AI data question because platforms that offer end-to-end encryption almost universally still collect full metadata, and metadata reconstructs a person's life with high accuracy. Communication pattern analysis alone can infer whether each relationship in your network is romantic or professional, whether you are managing a health crisis, what your political orientation is, and what your physical movement patterns look like across a day, without ever accessing a single word you wrote. And the consent to collect that metadata is rarely a free choice, because the interfaces through which users could refuse it are engineered to go unused. The design of the opt-out process is the real privacy policy. The existence of the opt-out is the legal shield.
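How little metadata is needed for this kind of inference is easy to demonstrate. The toy log below contains only contact names and timestamps, no content, and the contacts, dates, and labeling thresholds are all invented for illustration; real traffic-analysis systems are vastly more sophisticated, but the principle is identical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical metadata records: (contact, timestamp). No content at all.
log = [
    ("alex",   datetime(2026, 3, 2, 23, 41)),
    ("alex",   datetime(2026, 3, 3, 23, 55)),
    ("alex",   datetime(2026, 3, 4, 0, 12)),
    ("clinic", datetime(2026, 3, 3, 9, 15)),
    ("clinic", datetime(2026, 3, 10, 9, 20)),
    ("boss",   datetime(2026, 3, 3, 8, 58)),
]

def profile(log):
    """Toy pattern analysis: message frequency plus time-of-day is
    enough to label relationships without reading a single message."""
    out = {}
    for contact in Counter(c for c, _ in log):
        hours = [t.hour for c, t in log if c == contact]
        late_night = sum(h >= 22 or h < 6 for h in hours) / len(hours)
        out[contact] = {
            "messages": len(hours),
            "guess": "personal/intimate" if late_night > 0.5
                     else "business/appointment",
        }
    return out

print(profile(log))
```

Six timestamps, and the log already distinguishes a late-night relationship from a recurring weekday-morning appointment. Scale that to years of records and the reconstruction is nearly total.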
What the Courts Have Actually Decided
The legal landscape as of early 2026 gives us a working map of where the lines are being drawn, and understanding these cases is important because they are the direct cause of the inward turn toward platform-held data.
| Case | Year | Outcome | What It Means |
|---|---|---|---|
| Thomson Reuters v. Ross Intelligence | 2025 | Plaintiff won | Scraping to build a competing product in the same market is infringement, not fair use |
| Bartz v. Anthropic | June 2025 | Mixed: training won, data hoarding lost | Training itself is transformative; building a pirated library to do it is not |
| Kadrey v. Meta | June 2025 | Meta won on fair use | Torrenting books for model training passed the transformative use test |
| New York Times v. OpenAI | Ongoing | 20M conversation logs ordered produced | Verbatim reproduction of copyrighted content in outputs is a separate legal theory from training |
| Reddit v. Perplexity AI | Ongoing | Unresolved | Bypassing rate limits and CAPTCHA systems may violate DMCA Section 1201, independent of copyright |
| In re Clearview AI (BIPA) | 2025 | $51.8M settlement | Biometric scraping without consent violates state biometric privacy law |
What these cases collectively establish is that courts are increasingly protecting the training process itself while restricting the data acquisition method. The practical implication is that the piracy route to training data is closing legally. But the internal platform data route, where a company uses what its own users generated inside its own system under its own terms of service, faces essentially no legal obstacle. This is the gap that the inward turn is designed to exploit. No lawsuit currently being litigated closes it.
The Architectural Answers That Actually Exist
The question I keep returning to is not how to make data harder to find. It is how to make data structurally useless to anyone except its owner, even while it is being processed. That is a different and more demanding engineering problem, and there are real answers to it.
Homomorphic encryption is the most radical solution. It allows a server to perform computations on encrypted data without ever decrypting it. The server operates entirely in the encrypted domain and returns an encrypted result. It learns nothing about the input. In 2025, over 250 million financial transactions were processed through fully homomorphic encryption systems, and healthcare analytics firms ran aggregate insights across 110 million patient records without ever exposing the underlying content to their own infrastructure. The current limitation is computational cost: FHE operations run orders of magnitude slower than plaintext computation, making real-time consumer applications uneconomic at present. That gap is narrowing as hardware catches up to the mathematical requirements of the approach.
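The compute-on-ciphertext property sounds like magic until you see it in arithmetic. Below is a toy version of the Paillier cryptosystem, an additively homomorphic (though not fully homomorphic) scheme, using deliberately tiny primes so the numbers stay readable. Production systems use keys thousands of bits long and hardened implementations; this sketch only demonstrates the property itself.

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes (illustration only).
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)       # simplification valid because g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n * mu) % n

a, b = encrypt(20), encrypt(22)
# The "server" adds the plaintexts by multiplying the ciphertexts,
# without ever decrypting either one:
assert decrypt((a * b) % n2) == 42
```

The server in this picture performed a meaningful computation while learning nothing about the inputs. Fully homomorphic schemes extend this from addition to arbitrary circuits, which is where the orders-of-magnitude cost currently lives.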
Differential privacy takes a fundamentally different approach. Rather than hiding the data from computation, it mathematically degrades the individual signal while preserving aggregate patterns. Before any data is used for analysis or model training, a calibrated amount of statistical noise is injected. The mathematics provides a formal, provable bound on how much information about any specific individual can ever be recovered from the output, regardless of what an adversary already knows. Apple uses this for keyboard and emoji usage analytics on iOS. Research published in late 2025 demonstrated dynamic versions of differential privacy that are scalable enough for real-time deployment. This approach does not stop data collection. It makes what is collected provably useless for individual targeting while still being useful in aggregate, which is a technically honest trade-off.
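The noise-injection step has a concrete standard form: the Laplace mechanism. This sketch (function names mine, stdlib only) releases a count with noise scaled to sensitivity divided by epsilon, which is the calibration that yields the formal epsilon-DP guarantee.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via inverse-CDF, stdlib only."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Laplace mechanism sketch: one person joining or leaving the
    dataset changes a count by at most `sensitivity`, so noise with
    scale sensitivity/epsilon gives an epsilon-DP release."""
    return true_count + laplace_noise(sensitivity / epsilon)

# Any single release is noisy, but the aggregate signal survives:
releases = [private_count(1000, epsilon=0.5) for _ in range(20000)]
print(sum(releases) / len(releases))   # close to the true count of 1000
```

Smaller epsilon means more noise and stronger privacy; the analyst tunes that dial explicitly, which is exactly the "technically honest trade-off" described above.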
Federated learning extends this principle to model training itself. Instead of sending raw data to a central server, the model weights are sent to the user's device, which trains locally on the user's data, and only the resulting gradient update, a mathematical delta representing what the model learned, is transmitted back. The raw data never leaves the device. Google Keyboard already operates on this basis. The limitation is that gradient inversion attacks can sometimes reconstruct approximate training samples from gradient updates, which is why federated learning is most robust when combined with differential privacy applied to the gradient before transmission.
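A stripped-down federated averaging loop makes the data-locality point visible in code. Everything here is a toy of my own construction: two hypothetical "devices" each fit the one-parameter model y = w·x on private data drawn from y = 2x, and only gradients cross the wire.

```python
# Federated averaging sketch: each "device" computes a gradient on its
# own data and ships back only that gradient, never the raw samples.
def local_gradient(w, data):
    # d/dw of mean squared error for the model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

# Hypothetical on-device datasets (both consistent with y = 2x):
clients = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0)],
]

w, lr = 0.0, 0.05
for _ in range(100):
    grads = [local_gradient(w, data) for data in clients]  # computed locally
    w -= lr * sum(grads) / len(grads)  # server sees averaged gradients only

print(round(w, 3))   # converges toward the true slope of 2.0
```

The gradient-inversion caveat in the paragraph above is exactly why real deployments add differential-privacy noise to `grads` before transmission rather than trusting the delta to be unrevealing on its own.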
Tim Berners-Lee's Solid project represents the most architecturally complete rethinking of where data should live by default. Under Solid, all personal data lives in a Personal Online Data Pod that the user controls, hosted wherever they choose. Applications request permission to read specific pieces of data from that pod. The user grants or revokes access at any time. The application cannot access data the user did not explicitly permit, and there is no terms of service update any company can push that changes that. The Open Data Institute formally adopted Solid into its portfolio in October 2024. This is categorically different from a privacy policy promise. It is structural. The enforcement mechanism is mathematical and architectural, not contractual.
Zero-knowledge proofs add another layer to this architecture. A ZKP allows one party to prove to another that a statement is true without revealing any of the underlying information used to prove it. You can prove you are over 18 without revealing your birthdate. You can prove you have sufficient funds without revealing your account balance. In the context of platform verification and identity, this makes it possible to satisfy compliance requirements with provable guarantees while releasing zero personal data. ZKP is already deployed in several blockchain-based financial systems and is making its way into identity verification workflows for regulated industries.
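The prove-without-revealing property can be demonstrated with a toy Schnorr identification protocol, one of the classical interactive ZKPs. The group parameters below are tiny teaching values I chose so the arithmetic is legible; real systems use elliptic-curve groups with roughly 256-bit order.

```python
import random

# Toy Schnorr protocol in the subgroup of order q = 11 generated by
# g = 2 modulo p = 23 (illustration-sized parameters only).
p, q, g = 23, 11, 2

x = 7                    # the prover's secret
y = pow(g, x, p)         # public value; recovering x from y is the
                         # discrete logarithm problem

def prove_round() -> bool:
    k = random.randrange(q)      # prover's fresh commitment nonce
    t = pow(g, k, p)             # commitment, sent to the verifier
    c = random.randrange(q)      # verifier's random challenge
    s = (k + c * x) % q          # response: reveals nothing about x alone
    # Verifier checks the algebra without ever learning x:
    return pow(g, s, p) == (t * pow(y, c, p)) % p

# Each round convinces the verifier with soundness error 1/q;
# repeating rounds drives a cheater's success probability toward zero.
assert all(prove_round() for _ in range(32))
```

The verifier ends up certain the prover knows x, yet every value it saw (t, c, s) is statistically simulatable without x, which is the formal meaning of "zero knowledge."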
The Dark Pattern Layer: Why Good Options Are Hidden
Understanding the technical solutions also requires understanding why they are not being offered. Research published in the Journal of Advertising in late 2025 found through three controlled experimental studies that dark patterns in data consent interfaces, including confusing language, limited choice architecture, and difficult-to-locate opt-outs, directly suppress users' perception of control over their own data and are architecturally designed to bypass active resistance. This is not an accident of design. It is the output of a conversion optimization process applied to privacy controls. The opt-out exists to satisfy regulatory requirements. The friction exists to ensure almost nobody completes it. The gap between those two facts is where most user data is captured.
The photo sharing platform I discussed in my earlier piece hid its encrypted messaging option inside a conversation's contact menu, behind a single low-contrast text line with no visual indicator that it was interactive. Your existing conversation did not upgrade. A new separate thread opened. A company with thousands of engineers did not do this because upgrading an existing conversation was technically infeasible. It took that design team far longer to ship that specific friction than it would have taken to build a one-tap upgrade prompt. That is a product decision that favors data access over user protection, dressed as a design limitation. And regulators accepted the explanation because the option technically existed.
What I Actually Did About It: Aldform and ibbe
I want to be concrete about something, because I think this conversation is too often theoretical. I run two companies. At Aldform and at ibbe, we have implemented end-to-end encryption by default across every piece of user data we handle: email addresses, phone numbers, names, and every other identifying data field. Not as a premium tier. Not as an opt-in setting. By architectural default, for every user, from the moment their data enters our systems.
I want to explain what that decision actually costs, because the privacy conversation usually skips this. Building E2EE as a default across your entire data model means your own infrastructure cannot query plaintext user data. You cannot run naive database searches across personal information fields. You cannot build behavioral recommendation systems on that data without significant additional cryptographic complexity. Your debugging workflows change. Your customer support workflows change. Your analytics pipeline changes. These are real engineering and operational costs that we absorbed deliberately. The reason is not altruism alone. As AI makes user data more economically valuable every year, being a platform that is structurally incapable of betraying its users is not just an ethical position. It is a product position and a long-term trust position. We decided early that we would rather build on that foundation than on one where user trust is a setting buried four menus deep.
What the Public Can Do: A Layered Approach
The systemic fix requires regulatory and architectural change at a scale individuals cannot force alone. But the individual response is not helpless, and the most impactful moves are about changing which architectural category of tool you depend on, not which terms of service you accept.
At the account layer, the CS50 cybersecurity curriculum is direct about this: brute force can crack a four-digit PIN in milliseconds and a four-letter password in seconds. The correct response is a passphrase of 20 or more characters, unique per service, stored in a password manager like Bitwarden, which is open-source and independently audited. NIST guidelines now recommend long passphrases over short complex passwords specifically because the latter are hard to remember and lead to reuse, which is far more dangerous than the complexity deficit. Two-factor authentication should use a hardware key or an authenticator app rather than SMS, because SIM-swap attacks can intercept SMS codes without requiring physical access to your device.
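The arithmetic behind that advice is worth doing once explicitly. The attack rate below is my assumption (one billion guesses per second, a modest figure for offline attacks against fast hashes); the conclusion is insensitive to the exact number.

```python
# Back-of-envelope brute-force arithmetic behind the passphrase advice.
GUESSES_PER_SECOND = 1e9   # assumed offline attack rate

def seconds_to_crack(alphabet_size: int, length: int) -> float:
    """Worst-case time to exhaust the full keyspace."""
    return alphabet_size ** length / GUESSES_PER_SECOND

print(seconds_to_crack(10, 4))    # 4-digit PIN: about 10 microseconds
print(seconds_to_crack(26, 4))    # 4 lowercase letters: well under a second
years = seconds_to_crack(26, 20) / (3600 * 24 * 365)
print(f"{years:.1e} years")       # 20 lowercase letters: ~6e11 years
```

Length beats complexity because the keyspace grows exponentially in length and only polynomially in alphabet size, which is the reasoning behind the NIST passphrase guidance.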
At the communication layer, the switch from Instagram DMs to Signal is the single highest-impact move for most users. Signal is a non-profit. It stores almost no metadata because it was architecturally designed not to. It cannot disclose who you talked to or when because it genuinely does not have that information. Proton Mail applies the same logic to email: zero-knowledge encrypted, meaning Proton's own servers cannot read your messages. Ente Photos applies it to photo storage: client-side encrypted before upload, fully open-source.
At the network layer, your ISP logs every domain you query by default through its DNS resolver. Switching to Quad9 (9.9.9.9) or NextDNS routes those queries through privacy-respecting resolvers and blocks tracker domains at the DNS level, before requests even reach websites. In independent Electronic Frontier Foundation testing, Brave is the only major browser that randomizes your browser fingerprint, making cross-site tracking technically infeasible rather than just policy-restricted. The Global Privacy Control header, supported by Brave and Firefox, signals automatically to every website that you do not consent to data sale, and some jurisdictions legally require websites to honor it.
At the deepest layer, the shift that matters most is moving from platform-hosted to locally-hosted or protocol-based tools. Obsidian and Logseq store notes locally by default. Nextcloud gives you your own file storage. Miniflux gives you your own RSS reader. The Fediverse, built on the ActivityPub protocol and running through platforms like Mastodon and Pixelfed, means there is no central company to update its terms of service, because the protocol belongs to no one. These tools require slightly more setup. They require zero trust in a corporation's current intentions or future board decisions.
The Real Answer: Privacy Is an Architecture, Not a Policy
Every module in CS50's cybersecurity curriculum, whether it covers password hashing, memory safety, buffer overflow prevention, XSS injection, or HTTP header leakage, is teaching the same underlying lesson: security and privacy are properties of a system's design, not of its documentation. A system that hashes and salts passwords cannot accidentally expose them. A system with proper memory bounds cannot be exploited through a buffer overflow. A browser configured to strip the Referer header cannot leak your search history. The protection is structural. It does not depend on good intentions holding under commercial pressure.
The internet's original design mistake was separating data from identity. Your data lives somewhere else, owned by someone else, under their rules, with their ability to update those rules at any time. Every genuine architectural fix being developed right now, Solid pods, homomorphic encryption, differential privacy, federated learning, zero-knowledge proofs, local-first software, decentralized protocols, is some version of correcting that original error. The technology to do this correctly has existed in various forms for years. What has been missing is the economic and political will to build on it rather than around it. The courts are now closing some of the old data acquisition routes. Regulations like India's DPDP Act and the EU AI Act are beginning to impose structural requirements rather than just disclosure requirements. And a small number of builders have decided that user trust, real structural trust, is worth more than the data asymmetry that betrays it. I am one of them. The question is whether enough of the industry follows before the architecture of extraction becomes too entrenched to replace.
Reference Material & Expanded Research
Shift in Communication Privacy
- Instagram E2EE Sunset (Firstpost)
- Implications of Meta's Encryption Policy (Times of India)
- Instagram E2EE Shutdown Coverage (The Hacker News)
- Meta Killing E2EE in DMs (Engadget)
- Analysis of Instagram's Privacy Pivot (Mashable)
- What Users Need to Know (NDTV)
- Meta's New Data Strategy (Moneycontrol)
Platform Policy Evolution
Legal Precedents and Copyright in the AI Era
- Drawing Lines Around Training Piracy (Reuters)
- AI Copyright Lawsuit Tracker (Copyright Alliance)
- Fair Use Tests for Training Data (IPWatchdog)
- Insights on Fair Use and AI (Skadden)
- AI Litigation Updates (McKool Smith)
- Copyright Litigation Trends for 2026 (MoFo)
- Major Legal Cases of 2025-26 (Internet Lawyer Blog)
- AI Copyright Class Action Tracker (BakerHostetler)
- Key Lessons from Recent AI Cases (LinkedIn/Dr. Deepak)
- Top Privacy and Data Protection Cases (Inforrm)
- Legal Compliance in Web Scraping (Tendem.ai)
- Scraping Landscape in 2026 (Apify)
- Web Scraping and AI Litigation (ZwillGen)
Surveillance and Pattern Recognition
- NSA Metadata Collection Insights (Business Insider)
- The Importance of Metadata Surveillance (LinkedIn/Nick Evans)
Psychological Manipulation and UX Design
- Suppressing User Control via Interface (Journal of Advertising)
- Architecture of Choice Presentation (ACM Digital Library)
- Persuasion Awareness in Privacy Interfaces (ScienceDirect)
- Checklist for Avoiding Dark Patterns (Secure Privacy)
Emerging Privacy-Preserving Technologies
- Securing AI with Homomorphic Encryption (Dialzara)
- Privacy-Preserving Model Inference (Gopher Security)
- FHE vs. Confidential Computing (Cloud Security Alliance)
- Homomorphic Encryption Advances (ScienceDirect)
Statistical Privacy Models
- Theory and Practice of Differential Privacy (DP.org)
- Dynamic Differential Privacy Scalability (Wiley Online Library)
Decentralized Protocols and Sovereignty
- The Solid Project (Wikipedia)
- Tim Berners-Lee's Vision for Data Control (TechTarget)
- ODI and Solid Integration (The ODI)
- Dreams of Decentralization (Reddit/Ethereum)
Practical Protection Frameworks
- PrivacyTools.io Directory
- Privacy Guides: Recommended Tools
- Review of Best Private Browsers (PCMag)
- Security-First Browser Comparison (CloudSek)
- Top Privacy Tools 2026 (CamoCopy)
- Self-Hosted Privacy Stack (LightningDev123)
- In-Depth Guide to Digital Sovereignty (YouTube)
Future Directions in Data Sovereignty
- Privacy Trends for the Coming Year (Secure Privacy)
- Data Protection Strategies for 2026 (Hyperproof)
- Venture Perspectives on Privacy Trends (Insights4VC)
- Latest PET Implementations (StarAgile)
- Data Privacy by Design (Vofox)
- Data Privacy Week Insights (SparxIT)
- International Data Protection Trends (Chambers)