How Can We Maintain Privacy During the AI Arms Race?
What ethical limits exist on how public data is used in AI models? Should companies limit what public data they include in their training data sets?
Our New Series on Ethical Data Principles
Welcome to our new series, where we examine each of the 5 core Ethical Data Principles we stand for at The Ethical Tech Project and see how they apply to current events in the rapidly changing data+AI landscape.
We don’t pretend to have all the answers. But we believe today’s debates are too important to ignore. Old norms and ethical rules are being tested and broken by emerging AI systems, and there are hard ethical questions that need answering. With that, let’s dive in!
Google’s Updated Terms - What Public Data Is Off-Limits?
The first of our 5 principles we’re highlighting is privacy. Our guiding principle is that sensitive and personal data should be collected conscientiously and used with care, with the consent of the user.
Privacy: Sensitive and personal data must be collected conscientiously – and used and protected with care. If personal data is no longer relevant to the purpose for which it was collected, it should be erased.
Privacy has been at the heart of debates around technology for decades, but it faces new challenges from AI systems. Case in point: Google recently made headlines after it changed its privacy policy to confirm it is using all publicly available data to train its AI models like Bard and Cloud AI. This includes data it has previously scraped from the web, essentially meaning all data ever published online is fair game for AI training purposes.
Google’s approach here is no different from that of other big tech companies training large AI models. In fact, nearly all of them, Google included, are facing class action lawsuits alleging some form of violation in how they gathered or used their training data.
So what’s the ethical dilemma?
On one hand, the companies argue that if data is public, it is fair to use; that has typically been the norm in the past. New AI models can create unique forms of value that benefit society, so as long as companies like Google are transparent about what they’re doing, what’s the harm?
The problem for some people is that the statement “we reserve the right to use all public data to train all our AI models” is so broad that it doesn’t really offer individuals any protections. It also cuts against many established privacy norms and existing laws. For example:
Just because something is public doesn’t mean it’s free to use for any purpose. That’s why copyright laws exist and why there’s a class action lawsuit against OpenAI (led by comedian Sarah Silverman) for the use of copyrighted content.
It’s unclear whether anyone can opt out or exercise a right to be forgotten, both of which are central to privacy laws like GDPR and CCPA. Can any individual reasonably be excluded from those models if companies use any data ever scraped from the web? Google proposed to Australian regulators that content publishers could opt out of its web scraping, but the proposal lacked any detail on how opting out would work or whether it would apply to old data that had already been scraped or used to train models (we sketch one possible mechanism after this list). In this new world, what constitutes a meaningful opt-out?
Public information can still be sensitive. In our last post we referenced how researchers had scraped 70,000 OkCupid profiles to create a public database including users’ age, gender, and sexual orientation. Technically that information is public, but that doesn’t mean those users consented to its use in any future model or product. Permission matters.
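To make the opt-out question concrete, here is a minimal sketch of what a crawler-side consent check could look like, assuming publishers signal AI-training permissions through robots.txt and a dedicated user-agent token. The token name AI-Training-Bot is hypothetical, not an existing standard.

```python
# Minimal sketch: check a hypothetical robots.txt-based AI-training opt-out.
# Assumption: publishers opt out by disallowing a dedicated user-agent token;
# "AI-Training-Bot" is illustrative, not an existing standard.
from urllib import robotparser
from urllib.parse import urlsplit

AI_TRAINING_TOKEN = "AI-Training-Bot"  # hypothetical crawler token

def may_use_for_training(page_url: str) -> bool:
    """Return True only if the site's robots.txt allows the AI-training token to fetch this page."""
    parts = urlsplit(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    parser = robotparser.RobotFileParser(robots_url)
    try:
        parser.read()  # fetch and parse the site's robots.txt
    except OSError:
        # robots.txt unreachable: a conservative policy treats silence as no consent
        return False
    return parser.can_fetch(AI_TRAINING_TOKEN, page_url)

# A publisher opting out would add to its robots.txt:
#   User-agent: AI-Training-Bot
#   Disallow: /
print(may_use_for_training("https://example.com/profiles/123"))
```

Even under a scheme like this, a check at crawl time does nothing for data already sitting in a training corpus, which is exactly why a meaningful right to be forgotten is so hard to retrofit onto models that have already been trained.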
The irony in all of this? No business would accept its IP finding its way into AI models without consent and control. Yet that’s exactly the data leakage risk that many businesses face in deploying AI applications. This is just one way that Data Dignity at the individual level and Data Stewardship at the firm level are closely linked.
Tell us your thoughts - What limits should exist on how public data is used for training models?
What We’re Reading This Week On Ethical Tech: