The field of statistics is about more than just crunching numbers. Wharton statistics professor Bhaswar Bhattacharya is researching the best ways to apply statistical methods to solve problems in a range of fields, from health care to marketing to languages. Bhattacharya spoke with Knowledge at Wharton about shedding new light on one of the oldest mathematical disciplines.
An edited transcript of the conversation follows.
Knowledge at Wharton: Could you give us a brief summary of your research and what kind of question you were trying to answer?
Bhaswar Bhattacharya: My research interests lie at the intersection of statistics, probability, and combinatorics. Recently, many interesting combinatorial and graph-theoretic problems have emerged in statistics, mainly because of the ubiquitous presence of network data and the increasing use of graph-based methods in modern-day analytics. As a consequence, many connections have emerged between modern statistical methods and classical concepts in geometry and probability, and you can use those connections to solve interesting problems in statistics.
Knowledge at Wharton: What are the key takeaways of your research?
Bhattacharya: The key takeaway of my research is the interplay between a method's computational complexity, which is the time it takes to run it, and its statistical performance, which is how close it comes to the mathematically best procedure. It turns out that many of these graph-based methods are computationally very efficient, so they can readily be applied to large data sets. We have also shown that, in many situations, these methods have near-optimal performance guarantees, which provide the theoretical justification required for using them.
Knowledge at Wharton: Have graph-based methods not been used as much in the past, or have they been looked at with more skepticism?
Bhattacharya: Graph-based methods have been used in practice for a long time, but I think what comes out of my research is the answer to the question of why they work. They were used, and they were working fine before, but here we provide some theory behind why they work, which gives the justification for using them in real problems.
Knowledge at Wharton: How can businesses apply this research?
Bhattacharya: One of the recent projects we’re looking at is what is known as the two-sample problem. Imagine a situation where I want to find out whether a set of genes regulates or affects the occurrence of a disease. For example, suppose I have 20 genes, and I have the gene expression-level data from 100 patients who have diabetes. I also have the same expression-level data for 100 patients who are healthy. The goal is to find out whether these 20 genes are expressed differentially. By that, I mean that their expression levels are significantly different between these two sets of patients. Our research aims to provide a theoretical understanding and comparison of the different methods that are deployed to answer such questions.
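Editor's note: To make the two-sample problem concrete, here is a minimal Python sketch of one simple graph-based test. It is an illustration only, not necessarily one of the methods Bhattacharya studies. It pools both groups, links every patient to their nearest neighbor, and asks whether neighbors share a group label more often than chance would allow, using a permutation test. The function names, the synthetic "expression" data, and the size of the shift are all assumptions made for this example.

```python
import random


def nearest_neighbor_indices(points):
    """For each point, return the index of its nearest other point (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        best_j, best_d = -1, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))
            if d < best_d:
                best_j, best_d = j, d
        neighbors.append(best_j)
    return neighbors


def same_label_fraction(neighbors, labels):
    """Fraction of points whose nearest neighbor carries the same group label."""
    matches = sum(labels[i] == labels[j] for i, j in enumerate(neighbors))
    return matches / len(neighbors)


def nn_two_sample_test(sample_x, sample_y, n_permutations=500, seed=0):
    """Permutation test based on the nearest-neighbor graph of the pooled data."""
    rng = random.Random(seed)
    pooled = list(sample_x) + list(sample_y)
    labels = [0] * len(sample_x) + [1] * len(sample_y)
    # The graph depends only on the pooled points, so build it once.
    neighbors = nearest_neighbor_indices(pooled)
    observed = same_label_fraction(neighbors, labels)
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if same_label_fraction(neighbors, shuffled) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic stand-in data (NOT real measurements): 100 patients per group,
# 20 "genes" each, with the second group's mean shifted by an assumed amount.
data_rng = random.Random(42)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(100)]
diabetic = [[data_rng.gauss(1.0, 1.0) for _ in range(20)] for _ in range(100)]

stat, p_value = nn_two_sample_test(healthy, diabetic)
print(f"same-label neighbor fraction = {stat:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests that the two groups differ in their expression levels. Because the nearest-neighbor graph is built only once, each permutation is cheap, which hints at why graph-based tests can scale to large data sets.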
Knowledge at Wharton: What are some other applications for this research?
Bhattacharya: Another interesting application of our work is in natural language processing, mainly in problems that try to understand similarity between words. So, imagine the word “color,” which can be spelled in two ways: with the letter u, or without the letter u. They are the same word. The words “wolf” and “fox” both describe animals, but they are very different words. In problems like this, no matter how much data we have, the support size, which is basically the collection of all words, is far larger than the data set itself. One of the methods that we are studying can be used to analyze such problems as well.
Knowledge at Wharton: I would think that would be pretty interesting to businesses because so many are using social media to gather data about customers. So, what’s next for this research?
Bhattacharya: Currently, I’m trying to understand methods for analyzing data in what is known as the high-dimensional setting, where you have, say, 10,000 genes and only a few hundred patients, and you want to find out something about how the genes affect the disease or the patients. For these cases, new techniques are required, and I’m trying to understand the theory behind these techniques and how it can be used to find new methods and new algorithms.