Ever since the world wide web exploded in the mid-1990s, attempts have been made to extend its basic presentation format to create a richer, more meaningful network of information. Internet users have envisioned a web that presents information that can not only be read by humans but also be understood by computers.
Why does that matter? The reason is that it could usher in entirely new ways of doing business. The web could evolve from a collection of loosely linked pages to an enormous database that could be searched and filtered and re-assembled in new ways.
For example, when someone views information on a web page about an upcoming concert, why can’t he instantly add it to his personal calendar? Or when a person’s contact information is displayed, why can’t it be added to a contact list or cell phone directory with a single click? Sites like LinkedIn and Friendster let their users explore social networks, but the users have to enter the information about the people they know at each web site. Those who have a personal web site or a web log (or “blog”) probably already link to many of the people they know. Why can’t a search tool automatically build a social network from those links?
The reason is that the HTML [hypertext markup language] tags used to display these items on the web don’t describe what they mean. If one web site links to another, the link doesn’t carry any information about why the sites are linked. But what if it did? And what if every event listed on a web page could also be read by software that could understand its date, time and location?
Adding this type of meaning to the vast universe of web pages has long been the dream of those working on developing the next generation of the Web. It has also been a major focus of the web’s original developer, Tim Berners-Lee, who for many years has been working with the web’s main standards body, the World Wide Web Consortium (W3C), to develop the “Semantic Web.” But so far, efforts to infuse the Web with meaning have gained little traction. These initiatives have been bogged down by complexity and over-ambitious goals, or have simply been too much trouble to implement at a large scale.
Now, though, a grassroots movement has emerged that seeks to attach intelligent data to web pages by using simple extensions of the standard tags currently used for web formatting: HTML (or XHTML, its more formally structured cousin). These so-called “microformats” may change the way the web works. Microformats were discussed during a workshop on the first day of Supernova 2005, held at Wharton West in San Francisco on June 20. Knowledge at Wharton met with Tantek Çelik, senior technologist at Technorati, following his presentation on microformats at Supernova. An edited version of that conversation follows:
Knowledge at Wharton: What problems are microformats trying to solve?
Çelik: We’ve taken a lot of the really tough problems in the past — [such as] how to publish structured information in a way that’s presentable to humans as well as something that machines can read — and we’ve said, “What’s the simplest possible way that a [web] publisher can do that today — on today’s web — with simple XHTML?”
That’s what microformats do. They are intended to focus on the human side. We ask, “How can we make this information adapt to people’s behaviors currently? And how can we keep things as simple as possible, so that the effort required by publishers is minimal?”
Now, what happens as a result of that? Our belief is that, by offering a solution which provides better exposure to publishers, better indexing in search engines, and better opportunities to get found, blogged and linked, they will take that small extra step of adding this little microformat markup to their content. This is in direct contrast to a lot of other solutions which say, “Oh, here’s a new language we want you to learn.” Or, “Here’s a new language we want you to learn and now you need to output these additional files on your server.” It’s a hassle. We’ve lowered the barrier to entry.
Knowledge at Wharton: Let’s step back a second. You said that this is meant to solve a lot of “tough problems.” What are the big problems that microformats are addressing?
Çelik: The problems are about as old as the web itself. People have been trying to share information about numerous kinds of things on the Web for as long as it has been around. And HTML actually provides a very good framework for sharing some types of information, such as documents with headers, paragraphs, code samples and abbreviations.
But when you talk about people, contact information, calendaring events, reviews and ratings, social network relationships — those are problems that HTML, by itself, currently doesn’t address. Many attempts have been made to solve these problems outside of HTML, but, as I said earlier, they have invariably involved extra work. With microformats, instead of asking people to do extra work, we say, let’s make it really simple. It’s important to not invent problems out of the air — the important thing is to look at what people are doing on web pages already.
Knowledge at Wharton: Other formats exist that address a similar set of problems. Why not use those? Why not use XML [extensible markup language] and RDF [resource description framework]; iCalendar and vCard?
Çelik: The answer is actually “yes” and “no”. The interesting thing about microformats is that they leverage XHTML. XHTML is essentially HTML, but cast in XML. Think of XHTML as XML plus some building blocks you get to start with. You don’t have to invent every new tag yourself. There are already lists in HTML — you don’t have to invent those. There are already definitions in HTML — definition lists and terms. So if you’re marking up something with those kinds of concepts, you can reuse those building blocks. The important thing about XHTML, though — the reason that it has been adopted — is that it works on today’s Web, and the data in XHTML is visible to humans. This is a really important point. This is something we learned through the meta-keywords debacle.
Knowledge at Wharton: How so?
Çelik: HTML has a tag called the META tag. One variant of it allows you to — invisibly — document the keywords about the page. People used to do this all the time and search engines used to index the keywords on a page. But they were completely invisible. What happened? They got out of sync. Documents got updated; their keywords didn’t. And no one noticed, because you couldn’t view the keywords when you looked at the document. The META tags also got abused. People put in long lists of hundreds and hundreds of keywords that had nothing to do with the content. So they became useless.
The problem is if your data isn’t visible in an open ecosystem like the web, then it quickly gets out of date. It gets spammed, corrupted, and your signal-to-noise ratio drops to nothing. That’s what happened with meta-keywords.
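As an illustrative aside (not part of the conversation), the drift between invisible keywords and visible content can be made concrete with a short Python sketch. Everything in it is an assumption made for illustration: the page in PAGE is invented, KeywordAudit is a hypothetical helper, and the check is a deliberately crude substring match that lists keywords a reader would never actually see on the page.

```python
from html.parser import HTMLParser

# Hypothetical page: the body was rewritten, but the invisible
# keywords in the META tag were never brought back in sync.
PAGE = """
<html>
  <head>
    <title>Concert listings</title>
    <meta name="keywords" content="concert, tickets, real estate, lottery">
  </head>
  <body>
    <p>Buy tickets for this weekend's concert.</p>
  </body>
</html>
"""


class KeywordAudit(HTMLParser):
    """Collects the META keywords and the text inside <body>."""

    def __init__(self):
        super().__init__()
        self.keywords = []       # lower-cased keywords from the META tag
        self.visible_text = []   # lower-cased text chunks found in <body>
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.visible_text.append(data.lower())


audit = KeywordAudit()
audit.feed(PAGE)
text = " ".join(audit.visible_text)

# Keywords that appear nowhere in the text a reader can see.
stale = [k for k in audit.keywords if k not in text]
print(stale)
```

On this invented page, the sketch flags “real estate” and “lottery”: nothing in the rendered page would ever reveal them to a reader, which is the missing feedback loop described above.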
With microformats we emphasize visible data, instead of invisible metadata. We’ve taken this lesson of meta-keywords and said, “All right, we’ve learned our lesson. Now, what works today?” Well, take a look at what Google does. Google indexes, searches, and prioritizes pages according to hyperlinks. Hyperlinks are visible on pages. Of course there are search engine optimizers and [others] trying to game the system, but they’re much less successful.
The difference between a visible data system like hyperlinks and an invisible data system like meta-keywords is that with the visible system, there’s this great feedback loop. If I’m an author and I make a mistake in my hyperlink, I’m going to see it. If I don’t, my readers are going to see it and they’re going to tell me. There’s a feedback loop — error correction is built in.
There’s also a penalty built in for people who abuse the system. If I put 2,000 links on a page, that’s going to be visible to the reader. The reader is going to know, “Ah, something fishy is going on here”. Because of that social pressure, because of that feedback loop, we get this much more accurate corpus of data that we can index and search, and prioritize and relevance rank — with visible data, rather than invisible metadata.
So, enter microformats. We’re asking, “What’s visible on the Web?” XHTML is visible on the Web, and it’s presentable with cascading style-sheets to make it look as beautiful as you want it to look. Therefore, we choose that as our foundation.
Now, let’s look at some of these other technologies you talked about, such as XML. XML can be visible on the Web, but in practice, it doesn’t really work well by itself in today’s browsers. It was supposed to solve that problem, but it didn’t happen. HTML is still completely dominant. And if you want to post visible information on the Web, you have to use HTML or XHTML.
RDF is also effectively invisible metadata on the Web. If you go to an RDF file and you actually try to view it, the browser will either tell you “I don’t understand this, let me download it for you”, or it’ll give you a bunch of gibberish.
Knowledge at Wharton: And what about [the data interchange standards] iCalendar and vCard?
Çelik: One of the basic principles of a microformat design is to reuse rather than re-invent. And when you can’t find a microformat to use for whatever purpose you want, you look at established standards — interoperably implemented standards. I’m sure you have a couple of devices that support vCard. I have as well. You probably have applications that support iCalendar. These are well-established IETF [Internet Engineering Task Force] standards. For the microformats for people and events, we said, “Let’s look at vCard and iCalendar. And let’s create a one-to-one correspondence in XHTML.”
We’re literally reusing vCard and iCalendar for two of the microformats. Now when all these event and venue sites publish their event information, people have already built open source software, plug-ins, that lets you subscribe to that data that’s in [the] hCalendar [microformat], automatically converts it to iCalendar, and loads it into your calendaring program. It’s like magic! Without even changing your calendaring program, it can all of a sudden access all this new content that’s out there. We’ve really unlocked the power behind some of these standards in a way that no one else has done before.
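As an illustrative aside, the kind of conversion described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the plug-ins Çelik refers to: the event markup is invented, HCalendarParser and to_icalendar are hypothetical names, only three properties are handled, nested markup inside a property is not supported, and the output skips text escaping and omits properties that the iCalendar specification expects in a complete file, such as PRODID.

```python
from html.parser import HTMLParser

# Hypothetical event markup. The class names mirror iCalendar property
# names (vevent, summary, dtstart, location), following the one-to-one
# correspondence described in the interview.
HCALENDAR_SNIPPET = """
<div class="vevent">
  <span class="summary">Example Concert</span>:
  <abbr class="dtstart" title="20050701T200000">July 1, 8pm</abbr>,
  at <span class="location">Example Hall</span>
</div>
"""


class HCalendarParser(HTMLParser):
    """Collects summary, dtstart and location for each vevent element."""

    FIELDS = ("summary", "dtstart", "location")

    def __init__(self):
        super().__init__()
        self.events = []     # one dict of properties per vevent found
        self._field = None   # property whose text is currently collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "vevent" in classes:
            self.events.append({})
        if not self.events:
            return
        for field in self.FIELDS:
            if field in classes:
                if tag == "abbr" and attrs.get("title"):
                    # Machine-readable value lives in the title attribute;
                    # the human-readable text stays visible on the page.
                    self.events[-1][field] = attrs["title"]
                else:
                    self._field = field
                    self.events[-1][field] = ""

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._field = None

    def handle_data(self, data):
        if self._field and self.events:
            self.events[-1][self._field] += data


def to_icalendar(events):
    """Serializes the collected events as simplified iCalendar text."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0"]
    for event in events:
        lines.append("BEGIN:VEVENT")
        for field in HCalendarParser.FIELDS:
            if field in event:
                lines.append(f"{field.upper()}:{event[field].strip()}")
        lines.append("END:VEVENT")
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)


parser = HCalendarParser()
parser.feed(HCALENDAR_SNIPPET)
ical = to_icalendar(parser.events)
print(ical)
```

The point of the sketch is that the same markup serves both audiences: a browser shows “July 1, 8pm” to the reader, while a program reads the start time from the title attribute of the same element.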
Knowledge at Wharton: Tim Berners-Lee has been working for several years on the “Semantic Web.” Microformats seem similar in concept but different in implementation.
Çelik: It’s partially similar in concept. And one of the benefits of microformats is all the work that came before it. Tim is a visionary. He’s done some amazing work. He had amazing successes with [the web standards] HTML, HTTP and URLs. And one of the aspects of those that we often forget is how “fast and loose” they were early on. And that sort of under-specification, you might say, was directly responsible for their success.
The basic concept of the Semantic Web that Tim talks about — about publishing more semantic information on the Web — is something anyone involved with microformats agrees with 100%. It’s our goal to put as much semantic information as possible on the Web.
In practice, what does that mean? For microformats this means keeping things visible and presentable to the user. That’s where microformats differ from the vision of the Semantic Web. The Semantic Web (capital “S” and capital “W”) focuses on trying to put semantic information on the web in a machine-readable format. It actually doesn’t care about being humanly readable. That’s not one of its design centers. With microformats we contrast that by saying, “We want to put semantic information on the web primarily so that it can be read by humans. It must be able to be read, edited, viewed and verified by humans, because of that positive feedback loop I talked about. And, secondarily, we want to also make sure it’s readable by machines.” That’s the big difference.
Knowledge at Wharton: You mentioned in your presentation at the conference that you and the people working on microformats are not a standards body, nor would you want to be. Why not? Why would you want to work outside the World Wide Web Consortium and the other standards bodies?
Çelik: Microformats are, in many ways, such a lightweight thing, that it’s not clear that they need months or years in a standards body to make them work. In part, all this is still something of an open experiment on the Web. Everyone interested in microformats said, “Hey, if we can just get together, make a few simple things, build a few simple examples — maybe we could make this work.” That’s literally what it is. That’s why we’ve built this community site — microformats.org.
Knowledge at Wharton: And you launched it here at the Supernova conference, right?
Çelik: Right. We planned it for the conference.
Knowledge at Wharton: What do you hope the site will do?
Çelik: The goal is to take a lot of individual efforts that have been going on around microformats — from companies such as Technorati, Six Apart, Yahoo!, Microsoft, AOL — and bring together a community. We wanted to build a place where interested parties could come and share ideas. If someone from a new company says, “I want to develop a microformat to solve a specific problem,” where do they go? We provide a community site. They can come and post on the mailing lists. They can say, “Hey, I want to develop this.” And someone else may say, “Hey, I want to develop something like that as well. Let’s figure it out.” They can create a Wiki page and start immediately developing it.
It’s a different kind of development process — very decentralized — than the traditional “OK, let’s send out a request for proposal; let’s establish a committee; let’s establish which companies should be on that committee and which representatives should be on that committee, and then let’s send it on its way for a couple of years to come up with something.”
Knowledge at Wharton: So what do you want to see happen next?
Çelik: The next thing that needs to happen is an increase in the diversity of the participants and in the kinds of microformats that are being developed, and an increase in adoption. We’ve gotten quite a bit of adoption so far. Companies like EVDB and Upcoming.org, which specialize in hosting event information and venue information, have already adopted hCard, the equivalent of vCard in XHTML.
Knowledge at Wharton: Anything else we should know?
Çelik: I would like to encourage your readers to please take a look at microformats.org. And if they’re looking at sharing structured information of any kind, come and participate.