Ritika Khandeparkar, a graduate student in computer and information sciences at the University of Pennsylvania, recalls that during the first lecture of the spring semester, Wharton operations and information management professor Shawndra Hill told her class that “being a data scientist is going to be the sexiest job of the 21st century.”
Describing the course, Data Mining for Business Intelligence, Hill explains why that might be the case. “Firms have had data for a very long time, and a lot of them have been mining it for a very long time,” she says. “What’s changing is that businesses are getting more and more data from consumers via people posting in social media and from other sources, and many firms are now collecting data on a larger scale.”
Thanks to new software programs and the increased availability of low-cost storage space in the cloud and elsewhere, it has also become easier and cheaper for firms to actually mine the data they collect, Hill adds.
The class is composed of about 100 students, some from Penn’s computer science and engineering programs and others from Wharton’s MBA program. In addition to attending weekly labs, the students formed groups and developed semester-long projects using data mining techniques to address a business-related problem.
In completing their projects, the students have had to confront the same challenges many firms face when trying to make sense of their data and analyze it in a way that adds value. “From the first day, I try to get them to start thinking about what novelty means in the space that they are working in,” Hill notes. “It can often be expensive to obtain data, and it may not be worth it to mine for an incremental benefit that doesn’t offer value beyond the cost of taking on the project.”
What the students also quickly find out is that the easy part is actually applying an algorithm to the data; the hard part is “cleaning the data [beforehand], making sure it is of high quality — and when it’s not, really understanding what is missing and why,” Hill says. “And then once you push the button and apply the algorithm, you have to understand what the results mean.”
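As a rough illustration of the “understanding what is missing” step Hill describes, a first pass over a data set might simply count the empty cells in each column before any algorithm is applied. The sketch below is an assumption-laden example, not the course's method: the set of markers treated as missing, the function names and the sample flight records are all hypothetical and exist only to exercise the code.

```python
from collections import Counter

# Strings treated as "missing" here are an assumption; real data sets vary.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "?"}


def is_missing(value):
    """Return True if a cell looks empty under the assumed markers."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def missing_report(rows):
    """Map each column to (missing count, missing share) over a list of dict records."""
    # Collect every column name seen, preserving first-seen order.
    columns = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)

    # A column absent from a record counts as missing for that record.
    counts = Counter()
    for row in rows:
        for col in columns:
            if is_missing(row.get(col)):
                counts[col] += 1

    total = len(rows)
    return {
        col: (counts[col], counts[col] / total if total else 0.0)
        for col in columns
    }


# Hypothetical records, made up purely to exercise the function.
sample = [
    {"flight": "A1", "departure_delay_min": "12", "wind_speed": "7"},
    {"flight": "A2", "departure_delay_min": "", "wind_speed": "N/A"},
    {"flight": "A3", "departure_delay_min": "0"},
]

report = missing_report(sample)
for column, (count, share) in report.items():
    print(f"{column}: {count} missing ({share:.0%})")
```

A count like this only shows where the gaps are. Working out why a value is missing, which is the harder question Hill points to, still requires knowing how the data was collected.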
Class discussions have also increasingly turned to the importance of protecting consumer privacy and whether it is ethical to take on a particular data mining project just because it is technologically possible. “I’m often challenging students to use their moral compass and think about if they would be happy if someone used their data for some of the things they suggest in class,” Hill notes.
Projects from this semester have focused on topics including predicting the success of microfinance loans, the probability of flight delays and the likelihood of a particular outcome in a basketball game. Students usually seek out large publicly available data sets, but some have also reached out to their personal and professional networks and obtained “proprietary data that would take faculty members months to get,” Hill says. “It’s exciting to see them click with a particular problem and also to go out and find data sets that I haven’t worked with before.”
For their project on predicting airline flight delays, Wharton MBA students Valeriy Rastorguev, Natalia Alikhashkina, Lee Horn, Irina Azu and Anu Verma found detailed weather and airline on-time performance data from the National Oceanic and Atmospheric Administration’s Aviation Weather Center. Focusing on the route from John F. Kennedy International Airport in New York to Los Angeles International Airport, the group is developing a model that would help travelers choose the flight with the lowest probability of cancellation, even if the trip is weeks away.
“People really hate airline delays, and if you remember, [earlier this month] more than 1,000 flights were canceled because of a snowstorm,” says Rastorguev, who is working on earning his pilot’s license. “We thought it was a really high-impact problem, and with some research, we found that airline delays cost more than $12 billion to the economy. If we were able to help people know in advance whether the flight is likely to be delayed or canceled so they can choose another option … we estimate that they would be able to save roughly from a half billion to $1 billion a year.”
This is the first year that Hill has encouraged students from different disciplines to partner on the semester projects. Khandeparkar is working on a model to predict microfinance loan outcomes with fellow computer and information sciences student Bhavesh Raheja, systems engineering student Carolina Cornejo Gutierrez, computer and information technology student Adina Amanbekkyzy and Wharton MBA students Carlos Vega and Elizabeth McCracken.
The team used a set of sociodemographic data from microfinance lending platform Kiva.org. “We’re looking at the data to identify predictors for defaulting,” Vega says. “The next stage is trying to speak with third parties in the mobile and financial services industries to try and relate some of our initial findings with patterns we find in those areas.”
Data available through Kiva pertains to a narrow group of people, Khandeparkar adds. “If we stick to what Kiva gives us, our model isn’t going to be universal enough. The more diverse your data is, the better your model gets.”
Among the factors the students are incorporating into the model are gender, marital status and whether someone owns a home or has children. They are also looking to analyze mobile phone usage data, such as the number of calls made, the time of day calls are made and whether calls are primarily made to the same or different groups of people. Vega says that in addition to providing a platform that allows microfinance institutions to digitally assess loan applicants, the project could also aid in setting up loan repayment plans.
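To make the kinds of inputs described above concrete, the sketch below shows one way an applicant record and a call log could be turned into a single row of numeric features for a model. It is a minimal illustration under stated assumptions, not the team's actual pipeline: the field names, the 8 a.m. to 8 p.m. definition of daytime, the use of the single most frequent contact as a crude proxy for calling “the same group of people,” and the sample applicant are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def summarize_calls(calls):
    """Summarize a call log into three illustrative features.

    `calls` is a list of (ISO timestamp, contact id) pairs; this format is assumed.
    """
    if not calls:
        return {"n_calls": 0, "share_daytime": 0.0, "share_top_contact": 0.0}

    hours = [datetime.fromisoformat(ts).hour for ts, _ in calls]
    contacts = Counter(contact for _, contact in calls)
    n = len(calls)

    # Assumed definition of "daytime": 08:00 up to, but not including, 20:00.
    daytime = sum(1 for h in hours if 8 <= h < 20)
    # Share of calls going to the single most frequent contact (crude proxy).
    top_count = contacts.most_common(1)[0][1]

    return {
        "n_calls": n,
        "share_daytime": daytime / n,
        "share_top_contact": top_count / n,
    }


def to_feature_row(applicant, calls):
    """Combine demographic flags and call-log features into one numeric row."""
    row = {
        "is_female": 1 if applicant["gender"] == "female" else 0,
        "is_married": 1 if applicant["marital_status"] == "married" else 0,
        "owns_home": int(applicant["owns_home"]),
        "has_children": int(applicant["has_children"]),
    }
    row.update(summarize_calls(calls))
    return row


# Hypothetical applicant and call log, made up purely for illustration.
applicant = {
    "gender": "female",
    "marital_status": "married",
    "owns_home": False,
    "has_children": True,
}
calls = [
    ("2013-03-01T09:15:00", "c1"),
    ("2013-03-01T21:40:00", "c1"),
    ("2013-03-02T13:05:00", "c2"),
    ("2013-03-03T10:30:00", "c1"),
]

features = to_feature_row(applicant, calls)
print(features)
```

Rows like this could then be fed to any standard classifier. Which features actually predict default is precisely the question the students are investigating, so nothing here should be read as a finding.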
Khandeparkar says her team was also helped by the fact that the six members hail from five different countries: India, Kazakhstan, Panama, Peru and the U.S. “Even though [the computer science and engineering students] knew what microfinance was, we didn’t know the details,” Khandeparkar notes. “The business students were able to tell us that, and we were able to better explain the technical aspects to them of how to convert a file or which algorithm would run better. Everyone’s ideas were completely different, and we worked really well together because of that.”