In 2009, Netflix was sued for releasing movie ratings data from half a million subscribers who were identified only by unique ID numbers. The company divulged this “anonymized” information to the public as part of its Netflix Prize contest, in which participants were asked to use the data to develop a better content recommendation algorithm. But researchers from the University of Texas showed that as few as six movie ratings could be used to identify users. A closeted lesbian sued Netflix, saying her anonymity had been compromised. The lawsuit was settled in 2010.

The Netflix case reveals a problem the public is only beginning to learn about, but one that data analysts and computer scientists have understood for years. In anonymized datasets, where distinguishing characteristics of a person such as name and address have been deleted, even a handful of seemingly innocuous data points can lead to identification. When this data is used to serve ads or personalize product recommendations, re-identification may be largely harmless. The danger is that the data can also be used, and sometimes is, to make assumptions about future behavior or inferences about someone’s private life, leading to rejection for a loan, a job or worse.

A research paper published in Nature Communications last month showed how easy re-identification can be: a computer algorithm could identify 99.98% of Americans by knowing as few as 15 attributes per person, not including names or other unique data. Even earlier, a 2012 study showed that just by analyzing people’s Facebook ‘Likes,’ researchers could determine whether someone was Caucasian or African-American with 95% accuracy, male or female (93%), or gay (88%); whether they drank alcohol (70%); or whether they used drugs (65%).

This is not news to people in the industry — but it is to the public. “Most people don’t realize that even if personal information is stripped away or is not collected directly, it’s often possible to link certain information with a person’s identity by correlating the information with other datasets,” says Kevin Werbach, Wharton legal studies and business ethics professor and author of the book, The Blockchain and the New Architecture of Trust. “It’s a challenging issue because there are so many different kinds of uses data could be put to.” Werbach is a faculty affiliate of the Warren Center for Network and Data Sciences, a research center of Penn faculty who study innovation in interconnected social, economic and technological systems.

For example, telecom companies routinely sell phone location information to data aggregators, which in turn sell it to just about anyone, according to a January 2019 article in Vice. These buyers could include landlords screening potential renters, debt collectors tracking deadbeats or a jealous boyfriend stalking a former flame. One data aggregator was able to find an individual’s full name and address as well as continuously track the phone’s location. This case, the article says, shows “just how exposed mobile networks and the data they generate are, leaving them open to surveillance by ordinary citizens, stalkers, and criminals.”

That’s because the data you generate — whether from online activities or information held by your employer, doctor, bank and others — is usually stored, sold and shared. “That data is often packaged and sold to third parties or ad exchange networks,” says Michael Kearns, computer and information science professor at Penn Engineering. A founding director of the Warren Center, he is also co-author of the book, The Ethical Algorithm. “You are leaving data trails all over the place in your daily life, whether by where you move physically in the world or what you do online. All this is being tracked and stored.”

“[Even] if personal information is stripped away or is not collected directly, it’s often possible to link certain information with a person’s identity by correlating the information with other datasets.” –Kevin Werbach

Companies and other entities do try to keep datasets anonymous — the common practice is to strip out unique information like the name and birthday. “It would seem like an effective approach and it used to work reasonably well,” says Kartik Hosanagar, Wharton professor of operations, information and decisions, and author of the book, A Human’s Guide to Machine Intelligence: How Algorithms Are Shaping Our Lives and How We Can Stay in Control. But increasingly people have begun to recognize that this approach fails to offer protection, especially if marketers cross-reference different datasets — say, social networking surveys with census data. “If one has enough information about individuals … and [applies] sophisticated machine learning algorithms, then it is possible to re-identify people,” he notes.

Mathematically speaking, it is not that hard to re-identify people by using non-private information, Kearns explains. Let’s say you drive a red car. A data analyst who knows that only 10% of the population have a red car can disregard 90% of the people. Further assume that for red-car drivers, half use Macs and the rest PCs. You use a Mac, so another 50% of the group can be removed, and so on until you’re identified. “Each attribute slices away a big chunk of the remaining candidates, so you can quickly get down to … a very small handful of people,” he says.
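To make that arithmetic concrete, here is a minimal Python sketch of the winnowing Kearns describes, run on an invented population of one million people. The attribute list and frequencies (the 10% red-car figure and the Mac/PC split from his example, plus a hypothetical ZIP code and birth year) are illustrative assumptions, not real data.

```python
# Toy illustration of the winnowing Kearns describes: each known attribute
# removes a large share of the remaining candidates. The population,
# attributes and frequencies are invented for the example.
import random

random.seed(0)

population = [
    {
        "id": i,
        "car_color": random.choices(["red", "other"], weights=[0.1, 0.9])[0],
        "computer": random.choice(["mac", "pc"]),
        "zip_code": random.randrange(100),          # 100 possible areas
        "birth_year": random.randrange(1950, 2000),
    }
    for i in range(1_000_000)
]

# Suppose an analyst knows these four seemingly innocuous facts about the target.
known = {"car_color": "red", "computer": "mac", "zip_code": 42, "birth_year": 1984}

candidates = population
for attribute, value in known.items():
    candidates = [p for p in candidates if p[attribute] == value]
    print(f"after matching {attribute}: {len(candidates):,} candidates remain")
```

Four broad facts are enough here to shrink a million-person pool to a handful of candidates, which is exactly the slicing effect Kearns describes.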

That is precisely what happened with the Netflix data. Researchers were able to identify subscribers by looking at what ratings they gave movies and when, then cross-referencing that data with ratings on the IMDb website, where people post under their own names. Netflix thought it had done enough to keep identities private, but it hadn’t. “Relatively small amounts of idiosyncratic information are enough to uniquely identify you,” says Aaron Roth, computer and information science professor at Penn Engineering and co-author with Kearns of The Ethical Algorithm. “That’s the fundamental problem with this approach of anonymizing datasets. Basically, it doesn’t work.”
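The cross-referencing step itself can be as simple as a database join. Below is a hedged sketch of that kind of linkage attack in Python with pandas; the table layouts, column names and records are hypothetical stand-ins, not the actual Netflix or IMDb data.

```python
# Sketch of a linkage attack: join an "anonymized" ratings table to a
# public review site on quasi-identifiers. All records are made up.
import pandas as pd

anonymized = pd.DataFrame({
    "subscriber_id": [101, 101, 102],
    "movie":  ["Movie A", "Movie B", "Movie A"],
    "rating": [5, 2, 3],
    "date":   ["2005-03-01", "2005-03-02", "2005-04-10"],
})

public_reviews = pd.DataFrame({   # reviews posted under real names
    "reviewer": ["Jane Doe", "Jane Doe"],
    "movie":  ["Movie A", "Movie B"],
    "rating": [5, 2],
    "date":   ["2005-03-01", "2005-03-02"],
})

# Matching on (movie, rating, date) ties subscriber 101 to "Jane Doe".
linked = anonymized.merge(public_reviews, on=["movie", "rating", "date"])
print(linked[["subscriber_id", "reviewer"]].drop_duplicates())
```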

Millions of Data Points

People generate millions of points of information about themselves all the time. You create data every time you push the ‘Like’ button on Facebook, search for something on Google, make a purchase on Amazon, watch a show on Netflix, send a text on your mobile phone or make a transaction with your bank, insurer, retail store, credit card company or hotel. There’s also medical data, property records, financial and tax information, criminal histories, and the like. “You’re generating data all the time when you’re using the internet, and this can very quickly add up to a lot of features about you,” Roth says.

Such a dataset looks like an Excel spreadsheet, where rows and columns correspond to different points of information, Roth notes. For example, the rows in the Netflix dataset represented 500,000 subscribers, while the columns corresponded to roughly 18,000 movies, with each cell holding a rating. That might seem like a lot of information, but to data scientists, “that’s not considered a large dataset,” he says. Facebook and Google, for example, have datasets with millions of people, each of them having millions of attributes.

Roth explains that when organizations compile this information, they’re typically not interested in individual people’s records. “Netflix wasn’t interested in what movies Suzy watched. They were interested in statistical properties of the dataset … to predict what movies someone would like.” The way to make predictions is to let a machine learning algorithm go through the data and learn from it. After this training, the algorithm can use what it learned to predict things about a person, such as what she would like to watch, eat, buy or play. This data is invaluable to marketers, who use the information to try to serve relevant digital ads to consumers.
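As a rough illustration of what learning the statistical properties of a ratings table can look like, here is a minimal sketch that fits a low-rank approximation to a tiny invented matrix and uses it to fill in a missing rating. It is a toy stand-in for a recommendation model, not Netflix’s actual algorithm.

```python
# Learn broad taste patterns from observed ratings and predict a missing one.
# The 4x4 matrix is invented; 0 marks a movie the user has not rated.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Fill unrated cells with each movie's average rating, then keep only the
# two strongest patterns (a rank-2 approximation of the matrix).
col_means = ratings.sum(axis=0) / (ratings != 0).sum(axis=0)
filled = np.where(ratings == 0, col_means, ratings)

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
approx = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Predicted score for user 0 on the movie they have not seen (column 2).
print(round(float(approx[0, 2]), 2))
```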

Much of the data that is collected from people online is used for advertising, Kearns says. “A great deal of the internet is monetized by advertising. Facebook and Google are entirely free services that make money by advertising. All the data that they collect and ingest is largely in service of improving their advertising predictions, because the better they can target ads to you, the more money they make from their advertisers,” he adds. “This is the vast majority of their revenue.”

“That’s the fundamental problem with this approach of anonymizing datasets. Basically, it doesn’t work.” –Aaron Roth

For a company like Amazon, advertising is not the prime money maker, although it is starting to focus on it, Kearns says. “They want all the data they can get about you to personalize product recommendations. So a lot of this data is being used for a sensible business use that is not necessarily at odds with consumers’ interests. We all prefer to have better recommendations than worse recommendations. The problem arises when there’s mission creep.” That’s when data meant to be used for one thing is also used for another. “The data that’s useful for targeting advertising to you is also useful when I want to know whether to give you a loan,” he says.

Another danger arises when this data is used to profile you. “It’s not just that I know 15 facts about you and that I uniquely identified you. It lets me predict a whole bunch of things about you other than those 15 features,” Kearns says. You might say, “‘I don’t really care if Facebook knows this about me, or Google knows that about me or Amazon knows this about me,’” he says. “Actually, those innocuous facts may serve to identify you uniquely and they might also let unwanted inferences be made about you.” In the 2012 study, an average of 170 Facebook ‘Likes’ per person was enough to predict someone’s sexual orientation with high accuracy.
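The kind of inference behind those numbers can be sketched in a few lines: train a classifier to predict a hidden trait from ‘Like’ patterns. The data below is randomly generated so that a few pages weakly signal the trait; nothing here comes from the actual study.

```python
# Predict a hidden trait from "Like" patterns, using synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_people, n_pages = 1000, 50

likes = rng.integers(0, 2, size=(n_people, n_pages))    # 1 = liked the page
# Pretend the hidden trait is loosely correlated with the first five pages.
signal = likes[:, :5].sum(axis=1)
trait = (signal + rng.normal(0, 1, n_people) > 2.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(likes[:800], trait[:800])
print("held-out accuracy:", model.score(likes[800:], trait[800:]))
```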

Like their human counterparts, algorithms are not immune to mistakes. But when machines make mistakes, they make them at scale. “In the modern, data-driven era, when statistical modeling and machine learning are used everywhere for everything, whether we realize it or not, mistakes are being made all the time,” Kearns says. It’s one thing when the mistake is serving up a useless ad; it’s another when it leads to “systemic harms,” he adds.

For example, if an algorithm stereotypes incarcerated people from certain ethnic groups as more likely to commit new crimes, it could recommend that a parole board not release a prisoner from one of those groups. “That mistake has much bigger life implications than my showing you the wrong ad on Facebook,” Kearns says. “Society is waking up to this. … Scientists are waking up to it also and working on designing better models and algorithms that won’t be perfect but will try to reduce the systemic harm.”

The challenge is to create equity mathematically. One approach is to design algorithms that put different weights on various factors. In the example of predicting which convicts are most likely to commit new crimes if paroled, one way to remove unfairness toward a targeted ethnic group is to accept more error for other groups. “Typically, if I want the false positive rates to be equal among populations, I would have to probably increase it on one of those [other ethnic groups],” Roth says. But that means the overall error rate will get worse, he adds. So it comes down to tradeoffs: “How much do I value equity compared to raw accuracy?”
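A minimal sketch of that tradeoff, on synthetic risk scores: a single decision threshold gives two groups very different false positive rates, and forcing the rates to match by moving one group’s threshold changes the error profile. The groups, score distributions and thresholds below are all invented.

```python
# Equalize false positive rates across two groups by adjusting one group's
# decision threshold. All scores and labels are synthetic.
import numpy as np

rng = np.random.default_rng(1)

def make_group(n, base):
    """Synthetic risk scores; label 1 means the person actually reoffends."""
    labels = rng.integers(0, 2, n)
    scores = np.clip(base + 0.3 * labels + rng.normal(0, 0.2, n), 0, 1)
    return scores, labels

def false_positive_rate(scores, labels, threshold):
    negatives = labels == 0            # people who would not reoffend
    return float(np.mean(scores[negatives] >= threshold))

scores_a, labels_a = make_group(5000, base=0.3)   # group A scored lower on average
scores_b, labels_b = make_group(5000, base=0.5)   # group B scored higher on average

# One shared threshold flags far more non-reoffenders in group B.
t = 0.6
print(false_positive_rate(scores_a, labels_a, t),
      false_positive_rate(scores_b, labels_b, t))

# Move group B's threshold until its false positive rate matches group A's;
# parity improves, but more actual reoffenders in group B now go unflagged,
# so the overall error profile shifts, which is the tradeoff Roth describes.
t_b = t
while false_positive_rate(scores_b, labels_b, t_b) > false_positive_rate(scores_a, labels_a, t):
    t_b += 0.01
print("adjusted threshold for group B:", round(t_b, 2))
```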

“In the modern, data-driven era, when statistical modeling and machine learning are used everywhere for everything … mistakes are being made all the time.” –Michael Kearns

Differential Privacy

One promising technical solution is called “differential privacy,” according to Roth. This technique adds ‘noise’ to the dataset to make accurate re-identification a lot harder. Let’s say an employer wants to conduct a poll in Philadelphia to see what fraction of the population has ever used drugs. Using a typical anonymizing technique, people’s answers are recorded but their names, ages, addresses and other unique information are hidden. But if a company really wanted to spot drug users, it could cross-reference this data with other datasets to find out who they are.

With differential privacy, people would still be asked the same questions, but their responses would be “randomized,” Roth says. It’s like asking them to flip a coin before they answer. If the coin comes up heads, they have to tell the truth. If it’s tails, they answer randomly, which here means flipping the coin again to decide whether to tell the truth or to lie. The result of the coin flips is hidden from researchers. “If you have used drugs, half the time you tell the truth when the coin flip comes up heads, half the time [it’s a] random answer” because a coin toss is 50%-50%. And among just the random answers, “half of that time [people are telling] the truth,” Roth says.

That means 75% of the time the answer is truthful and 25% of the time it’s a lie. But are you among the 75% or 25%? The researcher doesn’t know because the coin toss was a secret. “Now you have strong plausible deniability,” Roth says. “If someone gets a hold of my spreadsheet and says you use drugs … you have pretty strong deniability because [a lie could] have happened one-fourth of the time.” In algorithmic terms, researchers add “a little bit of randomness but [can] still very accurately compute population level statistics” while guaranteeing strong plausible deniability, he explains.

Going back to the drug survey example, “in the aggregate, I can still get a very accurate answer about what fraction of people in Philadelphia have used drugs because I know these numbers: three-fourths of the time people tell the truth and one-fourth of the time people lie,” Roth continues. “So in aggregate, I can subtract the noise and get a very accurate estimate of the average … without ever having collected information about individuals that implicated them [since] everyone has strong plausible deniability.”
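Putting Roth’s coin-flip scheme into code makes the arithmetic explicit. The sketch below is a simulation under an assumed ‘true’ drug-use rate of 20%, invented purely so the debiasing step has something to recover.

```python
# Randomized response as Roth describes it: heads -> tell the truth;
# tails -> flip again to decide whether to tell the truth or to lie.
# The 20% "true" rate is invented for the simulation.
import random

random.seed(7)

def randomized_response(truth: bool) -> bool:
    if random.random() < 0.5:                                # first flip: heads
        return truth
    return truth if random.random() < 0.5 else not truth     # second flip

true_rate = 0.20
answers = [randomized_response(random.random() < true_rate) for _ in range(100_000)]

# A drug user answers "yes" 75% of the time and a non-user 25% of the time,
# so the observed yes-rate is 0.5 * p + 0.25. Invert that to estimate p.
observed_yes = sum(answers) / len(answers)
estimate = 2 * (observed_yes - 0.25)
print(f"observed yes-rate: {observed_yes:.3f}, estimated true rate: {estimate:.3f}")
```

No individual row in the simulated data proves anything about the person behind it, yet the population-level estimate lands very close to the assumed 20% rate.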

In recent years, companies such as Google, Apple, Microsoft and LinkedIn have started to use this technique, Roth says. While the math has been around since 2006, “it’s only in the last few years that it has made it in the real world” because it takes time to shift from theoretical to practical, he says. In 2014, Google began using differential privacy in the collection of usage statistics on its Chrome web browser. Two years later, Apple did the same with personal data collection on iPhones. The U.S. government will deploy this method in the 2020 Census.

“You can have two solutions, neither of which is better than the other but one offers more privacy but at the cost of higher error. The other offers more accuracy, but at the cost of less privacy.” –Aaron Roth

But there are tradeoffs to this method as well. The main one is that plausible deniability is a matter of degree: the stronger the plausible deniability for individuals, the less accurate the results will be for the researcher. Roth likens it to a “knob you can turn … to set this privacy parameter.” Society as a whole has to figure out where the right balance lies between privacy and research results. “It depends on what you value. You can have two solutions, neither of which is better than the other but one offers more privacy but at the cost of higher error. The other offers more accuracy, but at the cost of less privacy. You have to decide which … you value more for your particular use case.”
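One way to picture that knob is to make the probability of answering truthfully a parameter and watch the estimates get noisier as it shrinks. The sketch below extends the coin-flip example; the parameter q and the uniformly random answer on ‘tails’ are illustrative choices, not Roth’s exact formulation.

```python
# The privacy "knob": answer truthfully with probability q, otherwise give a
# uniformly random answer. Smaller q means stronger deniability for each
# respondent but a noisier population-level estimate.
import random
import statistics

random.seed(3)

def survey(true_rate: float, q: float, n: int = 20_000) -> float:
    yes = 0
    for _ in range(n):
        truth = random.random() < true_rate
        answer = truth if random.random() < q else (random.random() < 0.5)
        yes += answer
    observed = yes / n
    # Observed yes-rate is q * p + (1 - q) * 0.5; invert it to estimate p.
    return (observed - (1 - q) / 2) / q

for q in (0.9, 0.5, 0.1):
    runs = [survey(true_rate=0.20, q=q) for _ in range(20)]
    print(f"q={q}: mean estimate {statistics.mean(runs):.3f}, "
          f"spread {statistics.stdev(runs):.3f}")
```

With q near 1 the estimate barely moves between runs but respondents have little cover; with q near 0 each answer is close to meaningless on its own and the estimate swings by several percentage points. That is the accuracy-for-privacy dial Roth describes.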

Legal Protection

Today, the U.S. has a hodgepodge of regulations on data privacy. “There is no comprehensive privacy law on the federal level,” Werbach says. California does have an extensive privacy regulation modeled after the European Union’s stringent General Data Protection Regulation (GDPR), but there isn’t a national law. Roth agrees. “It’s a patchwork of different laws in the U.S. There’s no overarching privacy regulation. There’s one regulation for health, there’s one for [student] records, there’s another for video rental records. [And] some areas are unregulated.”

And the privacy laws the U.S. does have need to be strengthened. Take the Health Insurance Portability and Accountability Act (HIPAA), which is designed to keep health records private. Its safe harbor provision requires that 18 types of personally identifiable information (name, address, age, Social Security number, email address and others) be removed before a dataset can be shared. “Under the safe harbor provision, you can do whatever you want as long as you can redact these unique identifiers. You can release the records,” Roth says. But “as we know, even from collections of attributes that don’t seem to be personally identifying, we can recover personal identity.”

Roth also cites the Family Educational Rights and Privacy Act (FERPA), which protects student records, and the Video Privacy Protection Act (VPPA), which keeps video rental records private. The VPPA dates back to the late 1980s, when a journalist dug up the video rental records of Supreme Court nominee Robert Bork, he says. Afterward, Congress passed the act, which assesses penalties of $2,500 for every user record revealed. (The plaintiff in the Netflix lawsuit alleged violations of the VPPA.)

“[More] and more, it’s seen that consent is not enough as a protection.” –Kevin Werbach

Various privacy bills have been introduced in Congress in the aftermath of Facebook’s Cambridge Analytica scandal. But Werbach points out that for protection to be robust, regulations must go beyond limiting specific kinds of collection to “thinking broadly about data protection and what sorts of rights people should have in data collection.”

So far, “the U.S. approach has been mostly focused on consent — the idea that companies should be transparent about what they’re doing and get the OK from people. [But] more and more, it’s seen that consent is not enough as a protection,” Werbach adds. This consent is usually buried in a company’s ‘Terms of Service’ agreement. However, this is “not something an ordinary person is comfortable reading,” he says.

Refusing to agree to a company’s ‘Terms of Service’ also is not realistic for most people, especially if they can get free use of a product or service such as Google search or Facebook. So what can consumers do to protect themselves? “The one obvious solution is one that nobody realistically will adopt — be extremely limited in your online activity. Don’t use services that are big and sprawling and are collecting all sorts of data from you and maybe using that data internally for things you don’t know about or giving that data to third parties you’re unaware of,” Kearns says.

Werbach adds that consumers should be “attentive to what options you do have to say no, to what choices you may have about how your information is collected and shared.” He says many companies will let you opt out. But “at the end of the day, this is not a problem that can be solved by end users. They don’t have the power to do this. We need a combination of legal and regulatory oversight — and companies recognizing that it’s ultimately in their interest to act responsibly,” Werbach says.

Until that happens, be afraid — be very afraid — of where your data breadcrumbs could lead.