Innovations in data science are finding uses beyond business settings, bringing effective solutions to pressing social problems. In one novel exercise, analysis of sex trafficking data yielded insights on directing preventive and remedial resources to the poorer areas where victims are recruited, rather than the richer, urban areas where they are sold, which had been the earlier focus. In another instance, machine learning tools helped the Greek government identify incoming COVID-19-infected travelers at nearly double the rate that conventional random testing would have achieved, thereby reducing the spread of the virus within the country. Analytics can also correct long-held misperceptions, such as those about the role of media in shaping public opinion: another study using analytics found that TV broadcasts are far more pernicious than social media in peddling biased reports.
Wharton Dean Erika James led a discussion on the opportunities, challenges, and solutions in using analytics for social good at a panel discussion on December 14 titled “Beyond Business: Using Data to Protect and Serve.” The discussion was part of Wharton’s Beyond Business series, which explores some of the most complex and pressing issues affecting organizations and individuals around the world. An expansion of Wharton’s Tarnopol Dean’s Lecture Series, Beyond Business is streamed live on Wharton’s LinkedIn page. (See video below.)
The panelists were three Wharton professors of operations, information and decisions – Hamsa Bastani, Duncan Watts, and Dean Knox. Knox is also co-founder of a project called Research on Policing Reform and Accountability. Watts is also director of the Computational Social Science Lab at the University of Pennsylvania, and he holds faculty positions at Penn’s School of Engineering and Applied Science and at the Annenberg School for Communication.
This year’s Beyond Business series “shines a light on how analytics, artificial intelligence, and machine learning are providing viable pathways for solutions in every domain,” said James. The first of this year’s series, held in November, focused on how analytics is influencing critical decision-making in the field of finance.
Delivering social good is fertile territory for innovation in analytics using machine learning and artificial intelligence (AI) tools. For instance, for the COVID screening platform they deployed in Greece, Bastani and her colleagues used a machine learning technique called reinforcement learning, which identified nearly twice as many infected travelers at the border as conventional random testing would have.
Innovating Around Resource Constraints
In deploying the Greek COVID platform, resource constraints hindered the ability to gather, clean, and label the large quantities of data typically necessary for modern machine learning algorithms. Bastani and her team innovated with novel algorithms that work in so-called “small data” settings. Another strategy they used was “adaptive learning,” which allowed the model to be refined as data was gradually gathered over time. “[That helped] because we were starting with effectively zero knowledge about COVID risk,” she said.
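The panel did not walk through the algorithm itself, but the adaptive idea (start with effectively zero knowledge, then shift scarce tests toward traveler groups that look riskier while still exploring uncertain ones) can be sketched as a simple Thompson-sampling simulation. The group names and infection rates below are invented for illustration; this is not the actual Greek system:

```python
import random

# Hypothetical traveler groups with unknown true infection rates.
# The real deployment worked with far richer groupings (origin, history, etc.).
TRUE_RATES = {"group_a": 0.02, "group_b": 0.08, "group_c": 0.01}

# Beta(1, 1) priors: the model starts with no knowledge of any group's risk.
posterior = {g: [1, 1] for g in TRUE_RATES}

def choose_group():
    """Thompson sampling: test the group whose sampled risk is highest."""
    draws = {g: random.betavariate(a, b) for g, (a, b) in posterior.items()}
    return max(draws, key=draws.get)

def run(tests=5000, seed=0):
    random.seed(seed)
    positives = 0
    for _ in range(tests):
        g = choose_group()
        infected = random.random() < TRUE_RATES[g]  # simulated test result
        positives += infected
        a, b = posterior[g]
        posterior[g] = [a + infected, b + (not infected)]  # refine belief
    return positives

# Over time the simulation concentrates tests on the riskiest group,
# catching more infected travelers than uniform random testing would.
```

The key property this illustrates is that the allocation improves as results come in, which is what made it workable in a setting with no historical data.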
Bastani and her team also innovated with “transfer learning,” a way to learn from different data sources, to address data-related challenges in an ongoing project to uncover sex trafficking supply chains. Here, they have been working with the TellFinder Alliance, a global network to combat human trafficking, to analyze deep web data, with support from Analytics at Wharton, using “deep learning” tools, specifically large language models. They are now designing new multimodal learning algorithms that combine multiple “biased data views” on human trafficking from sources including deep web ads, consumer reviews, tax records, and sex worker inputs to create a more holistic view of trafficking risk.
“Machine learning gave us unprecedented visibility into previously opaque problems, like human trafficking.”— Hamsa Bastani
In tracking human trafficking, Bastani’s challenge was to uncover patterns and extract insights from large data streams like satellite data or deep web data. “They’re highly complex, noisy, and high-dimensional, making them essentially impossible for humans to go through or comprehend manually,” she said. “Machine learning gave us unprecedented visibility into previously opaque problems, like human trafficking.” She and a doctoral student from Wharton used deep learning to process 14 million deep web ads on commercial sex sales websites to identify entities that were deceptively recruiting vulnerable populations for non-sex jobs like modeling or massage, and then selling sex to customers.
“Using an active learning approach, we identified over 20 types of deceptive recruiting that our partners at the TellFinder Alliance weren’t previously aware of,” she said. “We also found a couple of surprising insights: While sex sales are concentrated in large urban cities like New York City or Los Angeles, recruitment of victims actually tends to occur in smaller, less-resourced cities. This is important because law enforcement and social work efforts are largely focused on large, well-resourced cities because that’s where we see sex sales happening more prominently.”
Those findings can help to reallocate social work resources to recruitment hotspots to help prevent victims from being trafficked in the first place or “help law enforcement attack these supply chains from both the recruitment and the sales ends,” she added.
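The active learning approach Bastani describes, in which a model flags the ads it is least certain about, a human expert labels them, and the model is retrained, can be sketched in miniature. The two-dimensional “ads” and the nearest-centroid classifier below are stand-ins for illustration, not the team’s actual pipeline:

```python
import math
import random

def make_ads(n, rng):
    """Toy stand-in for ad features: 2-D points, label 1 = deceptive recruiting."""
    ads = []
    for _ in range(n):
        y = rng.random() < 0.5
        x = ((rng.gauss(2, 1.2), rng.gauss(1, 1.2)) if y
             else (rng.gauss(-2, 1.2), rng.gauss(-1, 1.2)))
        ads.append((x, int(y)))
    return ads

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def score(x, c_pos, c_neg):
    """Signed margin: positive means closer to the 'deceptive' centroid."""
    return math.dist(x, c_neg) - math.dist(x, c_pos)

def active_learning(budget=30, pool_size=400, seed=7):
    rng = random.Random(seed)
    pool = make_ads(pool_size, rng)
    # Tiny seed set the "expert" has already labeled: two ads per class.
    labeled = [a for a in pool if a[1] == 1][:2] + [a for a in pool if a[1] == 0][:2]
    unlabeled = [a for a in pool if a not in labeled]
    for _ in range(budget):
        c_pos = centroid([x for x, y in labeled if y == 1])
        c_neg = centroid([x for x, y in labeled if y == 0])
        # Uncertainty sampling: query the ad nearest the decision boundary.
        pick = min(unlabeled, key=lambda xy: abs(score(xy[0], c_pos, c_neg)))
        unlabeled.remove(pick)
        labeled.append(pick)  # the expert supplies the true label
    # Evaluate the refined model on the remaining pool.
    c_pos = centroid([x for x, y in labeled if y == 1])
    c_neg = centroid([x for x, y in labeled if y == 0])
    correct = sum((score(x, c_pos, c_neg) > 0) == bool(y) for x, y in unlabeled)
    return correct / len(unlabeled)
```

The design point is the query rule: expert labeling time is the scarce resource, so each label is spent where the model is most confused rather than on a random ad.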
Tracking Media Bias and Misinformation
At Penn’s Computational Social Science Lab, Watts aims to use “applied innovation” to solve problems of social relevance. One such endeavor is a project called PennMAP, or the Penn Media Accountability Project, where his team analyzes large data sets around media consumption and production to try and understand issues such as bias and misinformation and their effects on democracy.
Watts is deluged with data from across the media ecosystem, which includes tens of thousands of web publishers, hundreds of TV channels, cable channels, small local stations, radio, and other media. “You have an enormous population of very heterogeneous producers of content, everything from The New York Times and CNN down to some guy sitting in his basement, running a YouTube channel,” he said. “All of these different actors are producing content that could be important to [shaping] public opinion.”
The challenges in capturing all that data on media production are replicated on the consumption side, which includes billions of people around the world who are browsing the web on their mobile devices and tablets or watching television. “We have to solve this problem of tracking all of that holistically to get a real sense of what is being said and what is being consumed, and how that is translating into outcomes like public opinion and attitudes,” Watts said.
“The majority of the research community has been effectively ignoring television as a source of misinformation over the last several years and focusing very much on social media.”— Duncan Watts
In tracking fake news and misinformation, the PennMAP project discovered that “the vast majority of people consume very little news at all,” said Watts. Three-quarters of the U.S. population spends less than one minute per day consuming news online, he added. “Where they do get their information from is television, by a factor of 5 to 1 across the population. The majority of the research community has been effectively ignoring television as a source of misinformation over the last several years and focusing very much on social media.”
Watts has an issue also with so-called explanatory journalism. “There are so many examples in everyday journalism where you read some article or you watch some program and it’s leading you to believe that something has been explained,” he said. “It’s not right. It’s not wrong. It’s just a story. Other stories could have been told. And if those stories had been told, you would have had a very different impression of what was happening in the world.”
Watts is also suspicious of surveys. “They have very large biases in them because people over-report how much news they consume and they dramatically over-report how much time and how much news they get from social media,” he said. He has issues also with data on TV viewership, which may not capture settings where viewers don’t turn on their meter, or if two people are watching television at the same time. Echo chambers are another obstacle he faces, where people get a large portion of their news from ideologically homogeneous sources, which could produce “a skewed and biased view of reality.”
Tracking Law Enforcement for Discrimination
Knox began researching police organizations when he noticed that “a lot of academic research was being cited in various quarters, often to claim that there was no systemic bias.” Those findings were based on so-called “stop data” generated by police departments, which showed that in traffic stops, “the use of force against Black and white civilians was not all that different,” he said.
In Knox’s study of the oversight of police organizations, a major challenge is “extremely incomplete” data. With the insights gained from data analytics, he and his team aim to ensure that officers who commit misconduct are investigated and disciplined for their actions, and that police forces represent and reflect the communities they serve. Decisions to hire and deploy different kinds of officers lead to different kinds of enforcement behavior, said Knox.
The absence of “gold-standard data and fine-grained instrumentation” is a challenge Knox, too, faces in tracking police conduct. “Essentially, all of the work that we do in triangulating and pulling together unconventional data sources is in trying to get around that problem in clever ways,” he said.
“In America, we have 18,000 police agencies that have different law enforcement practices, different data collection practices, different problems, are faced with different challenges and police different communities.”— Dean Knox
Knox explained with an example. “If you think about discrimination and a police encounter, it’s a question of what chain of events and enforcement actions take place, from the moment an officer lays eyes on a civilian, that would not otherwise have taken place if that civilian had been white,” he said. “The problem is that we simply don’t have any record of the vast majority of those encounters. If an officer doesn’t take that initial step of detaining the civilian, it vanishes into the ether as far as data analysts are concerned.”
Knox and his team attempt to get around those challenges by gathering data from a variety of sources such as traffic sensors (which show who is speeding, for instance), and then comparing that to police records of those who were actually stopped. “And of course, that’s incomplete information because we don’t know the race of the drivers,” he added. “But every drop of data helps.”
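In its simplest form, the comparison Knox describes reduces to benchmarking stop rates against an estimate of the population eligible to be stopped. A toy sketch, with all counts invented for illustration:

```python
# Hypothetical counts: speeders detected by traffic sensors vs. drivers
# actually stopped, broken out by group. All numbers are made up.
speeders = {"group_1": 800, "group_2": 200}  # the "benchmark" population
stops    = {"group_1": 60,  "group_2": 40}   # recorded police stops

def stop_rate_ratio(speeders, stops, a, b):
    """Ratio of per-speeder stop rates between group b and group a.

    A ratio far from 1.0 suggests enforcement is not proportional to the
    benchmark behavior, though, as Knox notes, missing information (such as
    driver race in sensor records) limits what the comparison can show.
    """
    rate = lambda g: stops[g] / speeders[g]
    return rate(b) / rate(a)

# Here group_2 is stopped at 0.20 per speeder vs. 0.075 for group_1,
# a ratio of about 2.67.
```

Real analyses must also account for who is on the road, where sensors sit, and which stops go unrecorded, which is exactly the incompleteness Knox flags.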
Knox also uses mobile location data to find out who’s walking around a neighborhood at a given time, what the composition of that group is, where they’re coming from, and what their movement patterns are. They overlay that data with additional data sources, including police body cam videos to audit the records and the officer’s narrative of what took place, and verify that against a more objective video record. “And because this is high-dimensional, complex video data, we have to build computer vision and audio analysis tools in order to process that in a way that’s feasible given our limited human expert resources,” he said.
Knox and his team gather data from a variety of sources including “direct outreach to reform-minded police agencies … and filing a staggering number of freedom-of-information requests to agencies all around the U.S.” That has helped them secure “extremely granular information” on almost all of the 100 largest police organizations in the country. But the data challenges that persist are formidable.
“In America, we have 18,000 police agencies that have different law enforcement practices, different data collection practices, different problems, are faced with different challenges and police different communities,” Knox said. The outcome of those data challenges is “a complete lack of accountability in terms of the number of allegations of misconduct that actually get investigated and end in discipline,” he added. “The number of civilian allegations of misconduct that end in actual consequential discipline to the officer is well under one percent.”