How data.world Wants to Unify the Data World

Most organizations today know that data has value, but they are unable to extract its full potential. Typically, data is buried deep in organizations, in silos, and accessible only to a few people. Brett Hurt and Matt Laessig, co-founders of data.world, want to change all that.

They believe that a unifying and collaborative platform could make data accessible to people within an organization, across organizations and around the globe. This democratization of data and a collaborative approach, they say, can not only help companies become more efficient and more competitive, it can also help solve big global problems such as climate change. In a recent conversation with Knowledge at Wharton, Hurt, CEO, and Laessig, COO, discussed their vision for data.world and why they believe it can change the world.

An edited transcript of the conversation follows.

Knowledge at Wharton: You announced the launch of data.world in July 2016. What was your original vision? Two years on, how has that changed?

Brett Hurt: Our vision was to create the most meaningful, the most abundant, and the most collaborative data resource on the planet. This is one of those “100-year missions.” It’s very ambitious. We want to change the world. [In the two years since we started,] we have become the world’s largest collaborative data community. We’ve launched a tremendous amount of enterprise functionality for our clients. We’ve grown faster than GitHub [a leading software development platform] grew at this stage, which is really exciting, especially given the fact that Microsoft recently announced that it will buy GitHub for $7.5 billion.

A lot of people have compared us to GitHub. They have called us the GitHub for data. It’s very flattering. It also shows that people are interested in data. They need data to power their internal projects at companies, at universities, and they need a platform to tie together all of their data tools into one cohesive whole so that they can create a data-driven culture across their organization. That is where we are getting the most traction from an enterprise standpoint. For instance, [news agency] Associated Press is using us to raise the game for data journalism.

Knowledge at Wharton: Why did you set up data.world as a public benefit corporation?

Hurt: We have been successful entrepreneurs before. This time we wanted to do things a bit differently, with a different type of operating system. A public benefit corporation doesn’t mean you’re a nonprofit. You are very much a for-profit organization. It’s just that you have a public benefit mission at the core of your company.

The thing that really excited us coming together as co-founders to launch this company was not just that it would make companies more data-driven — and therefore more competitive, efficient, and collaborative — but how it could change the world. For example, it could facilitate cancer researchers working together. People today work in silos, not because they necessarily want to do so, but because they don’t have a unifying collaborative platform. So we set out to build that.

I am very happy with the decision (of being a public benefit corporation) because it allows people to relate to our company at a different level. It’s changed the way universities and governments relate to us. It’s made us much more approachable and much more collaborative with our community. It has changed our relationship with our employees. It has made us a very attractive place to work.

I also want to emphasize that we intend to have top-tier financial performance. A lot of people think that B-corporations are just bleeding hearts that don’t care about business performance. That is absolutely not the case. Some of the top-performing companies in the world — like Patagonia or Ben & Jerry’s — are B-corporations.

Knowledge at Wharton: Could you give some examples of projects that are meaningful from the social mission standpoint, but which are also financially and fiscally profitable?

Hurt: I mentioned Associated Press earlier. One of the things that Associated Press is trying to do is raise the game on data journalism. They compile nationwide datasets on important topics like the opioid crisis, housing inventory in the U.S., climate change and so on. These datasets are nationwide and go down to individual locales: states, counties, cities and zip codes. Associated Press puts this data on the data.world platform in private. It sells the data to its clients and helps them become more data-driven when it comes to their journalism. Associated Press pays us for access to our platform and for all the different customers that use our platform via the Associated Press’ account. We make it easy for newspapers, radio news shows, TV news shows, etc. to collaborate with Associated Press around those datasets and write data-driven articles.

Knowledge at Wharton: What problems are people trying to solve with open data using the data.world platform? What kind of solutions have you seen emerge so far?

Matt Laessig: The open data community has a number of participants. There are publishers, several hundred cities and government agencies like the Department of Labor, the City of San Francisco, that have open data programs which have been established over the last 10 years. But they publish on their own .gov websites, which are often hard to navigate and discover. So we built integrations with them. We basically publish a mirror copy of their open data catalog on data.world, where it is more discoverable by a larger audience. It lives next to tools that allow you to immediately start working and exploring with the data.

Some of those capabilities are innate to data.world. We like to talk about our platform being data aware; it’s not just a storage location. A lot of capabilities on the platform let you explore the data, analyze it, and use some preliminary visualization. You also have integrations into a suite of tools that people who work in data love to use. Analysts love to use Excel, Tableau, Power BI and Google Data Studio. We have integrations into all those tools. Data scientists like to use R and Python. We have integrations in those tools. We want to be an agnostic hub from data sources, as well as into data tools that people like to use.

Organizations that publish open data, as constituents and stakeholders in the open data community, have found that it is hard for people to access and understand that data on a .edu or .gov data portal. So they publish on data.world. Then you have organizations like Data For Democracy. It’s a group of citizens and data scientists. They use data on the system that is already available and open and they bring in data from other sources. They use our platform for project management and collaboration to further a lot of their public-good, civic-minded data projects.

There are other groups, like Makeover Monday, which is a global organization of people who are passionate about data visualization as a form of analysis. They have a challenge every week. They post datasets on different topics (it could be the European soccer league or beer consumption in England and so on) and they challenge each other to do the coolest, most insightful data visualizations on it. They share it with each other on the platform and comment on it. Most of the people in this group are professional data analysts, business analysts, etc. Their idea is to inspire one another, sharpen their tools, and take those back to benefit their companies or businesses.

Hurt: There is a big data divide in the world. On the one side, you’ve got Amazon, Google, Facebook, Uber and others, which are phenomenally data-driven and are disrupting all types of industries. On the other side, you have companies like Ford and GM and American Express, which are worried about being disrupted.

These companies are aware of the data divide and they are trying to make their corporations more data-driven. But their data, their formal data, is buried in a lot of silos. It’s in a lot of different formats and different databases throughout the organization. There is also a lot of data in spreadsheets that is constantly making its way onto platforms like Slack. There is even data at some companies that is being shared in flat CSV format and on GitHub. It’s a mess. It’s all over the place. There is no centralized hub for them to collaborate around data within the organization, and then to be able to easily pull in information from outside the organization to supplement that data.

That is where we come in — on both fronts. We make it easy for them to have all their data inside in one standardized format. Our entire platform is built on top of the semantic web, and that allows linked data to be possible. They want an environment where it is connected to all the tools that they use. That is why we have so many integrations. They also want an environment that is very social, that makes it very easy to ask questions, and where everybody can see the answers and the data. This democratizes data access, and therefore, increases learning across the organization. This makes them more competitive, efficient, effective and collaborative.

Knowledge at Wharton: One thing I keep hearing is that data is the new gold, that data is the new oil. People are starting to realize that there is a lot of value in data. As this happens more and more, whom do you see as your competition? What will be your strategy to differentiate what you are doing from either existing or potential rivals?

Hurt: There are two points there. Point one is, is data the new oil or the new gold? Well, data is an unlimited resource, right? We’re not going to run out of it. If anything, it’s exponentially increasing. So, it’s not like gold or oil in that way. But it is like oil in the way that it is crude and it’s buried deep inside these corporations. A lot of organizations have not figured out how to properly account for their data and analyze it. We want to help clean that up.

Regarding competition, the competition in our space is enterprise software tools that take 12 to 24 months to install. They are top-down driven, hard to wield and only handle a fraction of the company’s data. They only handle core data coming out of a company’s transactional database. They don’t handle the data in spreadsheets and Slack and email, or the massive amount of data that they could be pulling in from outside to make the companies smarter and more competitive. Our platform takes no time to install. You sign up and you can drag and drop a dataset in a minute. We’ve also got a more modern pricing structure than the enterprise software companies.

Laessig: To build on one of Brett’s points, not only do traditional legacy enterprise solutions have extremely long implementation timelines, there are only a few elite folks in an organization that have their hands on the reins. And they are extremely technical. Very few people get to participate in that ecosystem. So part of what is going to make us data-driven is not just a technology solution, but it’s bringing the people into the process as well.

Knowledge at Wharton: I believe data.world is part of a group of organizations that have launched a non-profit. You have plans to launch the first-ever open artificial intelligence (AI) global marketplace. Could you explain this initiative? What role will data.world play and how does it fit into your mission?

Hurt: We believe that for AI to be possible, other than in extremely narrow applications, you have to clean up the world’s data. You can’t do that alone. It will take millions of smart people who are astute with data and subject matter experts. That’s the kind of people that we are bringing together.

You also have to have a platform that allows AI to interoperate on top of it. That is what we have built. That is why we started with the semantic web. All the data that gets ingested into data.world gets immediately converted into an RDF (Resource Description Framework) format in a graph database at massive scale. AI applications can be integrated on top through the use of our APIs (application programming interfaces).

We believe that the future is going to be powered by good data. So it made sense to get involved in AI Global.

The broader mission is a connected, collaborative platform where the data itself is linked so that AI can understand the context of the data. This is in contrast to today’s AI, which is very narrow and depends on skinny datasets. We’ve got centuries of data globally. So many datasets haven’t made it online yet. AI is not going to be able to understand things unless it looks at historical patterns. The big problems that AI can help to solve, like climate change and disease outbreak, for instance, will only happen with very well understood data, and a very broad set of data. We want to play a role in helping with that.
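
[Editor’s note: To make the RDF conversion Hurt mentions more concrete, here is a minimal sketch of how rows of a table could be turned into RDF triples in the N-Triples syntax. It is an illustration only and not data.world’s actual ingestion pipeline. The `https://example.org/` namespace, the function names, and the sample values are all made up for this example.]

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace, chosen only for this illustration.
BASE = "https://example.org/"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text: str, dataset: str, key_column: str) -> list[str]:
    """Turn each non-empty CSV cell into one subject-predicate-object triple.

    The row's key column identifies the subject, the column name becomes
    the predicate, and the cell value becomes a plain string literal.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{dataset}/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # skip the key itself and missing values
            predicate = f"<{BASE}{dataset}#{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample data; the second row has a missing measurement.
sample = "station,year,avg_temp_c\nA1,2016,14.2\nB7,2016,\n"
for triple in rows_to_ntriples(sample, "weather", "station"):
    print(triple)
```

[Because every value ends up as a statement about a named subject, two datasets that use the same subject identifier can be joined without a shared table layout, which is the property linked data relies on.]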

Knowledge at Wharton: People are also becoming increasingly aware of unfair bias in algorithms as a growing problem. How do you view that issue and what are you doing to deal with it?

Hurt: Last year, we brought together some of the world’s most famous data scientists to define the rules of the road for algorithms and data and created a manifesto for data practices. When people program things and create algorithms, they tend to do so from their own lens. This manifesto addresses biases in algorithms. It is not branded under data.world; it’s not our manifesto, although we have completely bought into it.

Laessig: I would like to highlight something that Brett said. He was talking in reference to biases in algorithms and as a direct response to your question. But it’s actually a universal principle that we hold very close to our heart, which is the power of reproducibility. The concept of transparency and reproducibility is core to all types of data work. You will see that in the manifesto for data practices.

Knowledge at Wharton: The environment around data privacy has changed in the aftermath of the Cambridge Analytica scandal, and also all of the trouble that Facebook has been facing. For example, Facebook was recently slammed with a maximum fine of 500,000 pounds for privacy violations by the U.K. Information Commissioner’s Office. What implications does this have for the way in which people use data on the data.world platform? And more broadly, what do you think should be done to foster more transparency and trust to deal with the nightmare of data privacy?

Hurt: On our platform, transparency reigns in every way. If it’s private data inside a company, there are lots of people across the company that now have access to data that they didn’t have earlier. It’s the same with open data. With regard to Facebook and Cambridge Analytica, we are not in that space. We are not a data-generating company. We are a platform that people use to analyze data with the sources clearly marked. If NASA uploads a dataset on data.world, or the U.S. Census uploads a dataset, you know it’s coming from NASA and from the U.S. Census. If it’s uploaded by an individual, then you need to think about the credentials of that person.

In the case of Facebook, they are advertising-based and they own all the data. The rules that they had at that time, that people had signed up for, said that they could use the data in [a particular] way. But Cambridge Analytica had developed psychographic models that allowed it to theoretically know you better than you know yourself, if you liked a certain amount of content on Facebook. I think the stat was that if you had done more than 250 likes, then the psychographic models would know you better than you know yourself in terms of what preferences you would have, and what messages to target you with. I am glad that that’s out there and there is transparency about that issue.

Knowledge at Wharton: What are the implications for the ways in which data is being used?

Hurt: For us, we allow people to flag datasets that may have a problem. Maybe the data is stolen. Maybe the data is inaccurate. Maybe the data has been tampered with. We monitor for all those types of things, and look for the social signal for those types of problems. To our knowledge, we have had very few bad actors on our platform. I think being a benefit corporation has helped us in this regard. It’s attracted the right type of people.

Knowledge at Wharton: So far, data.world has raised about $33.3 million in venture funds. What are the major business and financial milestones you are working towards hitting in the foreseeable future?

Hurt: Our plan is to build one of the most transformative enterprise software companies the world has ever seen and to unify the data world, pun intended. We want to unify all the data tools inside corporations, universities, out there at large, so that people can collaborate like never before. And, as I mentioned, we consider top-tier financial performance very important. We launched our commercial offerings in November 2017 and we’ve beaten our sales goals ever since. So we’re off to the races.

Knowledge at Wharton: What is your long-term dream for data.world? How will you know that you have achieved it?

Hurt: My 13-year-old daughter describes it better than I do. She says: Imagine that you have a cancer researcher working on the same type of lung cancer that my mom had. That cancer researcher is in Brazil, and she gets connected to someone in Philadelphia who is working on that exact same type. Their data becomes connected to each other and together they can solve that type of lung cancer. They would have never known each other had it not been for data.world.

The thing that excites me is that this is ultimately a platform that our children will be using in schools. There are already a lot of universities using us to help teach their data science and data analytics courses. This ultimately will be a platform that makes the world a lot smaller, one that doesn’t keep data as this oily, buried, siloed-type substance, but brings it to the forefront and allows humanity to solve big problems, whether it’s cancer or climate change, or poverty alleviation, or nutrition, and so on.

My dream is my daughter coming to me when she is in college and saying: “Dad, what was the world like before data.world? How did you find data?”

Laessig: I think Brett really captured it. Our kids today are asking that question about the Internet. Think of all of the things that have been unlocked because of the World Wide Web, the standards there, and the network effect — access to opportunity, to education. But that hasn’t been applied to data yet. What we’re trying to build at data.world is the world’s largest knowledge graph for data.