On August 21, 2006, Time Warner’s America Online revealed that it had severed ties with its chief technology officer after the online service released three months of search queries from 658,000 subscribers which, although “anonymized” by removing user account details, still contained enough data to possibly identify some of the users. The privacy breach underscored the perils of supposedly “anonymous” Internet profiling and raised the hackles of privacy advocates such as the Electronic Frontier Foundation. The EFF, a week earlier, had urged the Federal Trade Commission to investigate AOL and force the company to change its privacy practices.

A different type of anonymous Internet profiling is highlighted by Wharton operations and information management professor Balaji Padmanabhan in a working paper — co-authored with Catherine Yang, a professor at the Graduate School of Management at the University of California, Davis — titled, “Clickprints on the Web: Are There Signatures in Web Browsing Data?” Although the authors don’t focus specifically on the AOL incident, the paper highlights how it is possible to identify unique users based merely on their browsing behavior.

Padmanabhan and Yang find that each individual may have a “clickprint” — a unique pattern of web surfing behavior based on actions such as the number of pages viewed per session, the number of minutes spent on each web page, the time or day of the week the page is visited, and so on. The authors conclude that by observing these patterns, an e-commerce company can distinguish between two individuals with nearly 100% accuracy, sometimes with as few as three Internet sessions, and potentially use that information to deter fraud. The number of sessions needed to identify an individual rises with the number of unique users a site has because there are more people to differentiate.
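To make the idea concrete, here is a minimal sketch of matching a new session to the known user it most resembles. It is not the authors' method: the `Session` fields, the averaged "profile," the Euclidean distance and all of the sample numbers are assumptions chosen for illustration, loosely based on the kinds of features the paper mentions.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class Session:
    pages_viewed: int        # pages viewed in the session
    minutes_per_page: float  # average minutes spent on each page
    start_hour: int          # hour of day the session began (0-23)


def features(s):
    """Turn a session into a numeric feature vector."""
    return (float(s.pages_viewed), s.minutes_per_page, float(s.start_hour))


def build_profile(sessions):
    """Average each feature over a user's known sessions (hypothetical profile)."""
    vectors = [features(s) for s in sessions]
    return tuple(mean(col) for col in zip(*vectors))


def distance(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_user(profiles, session):
    """Return the user whose profile is nearest to the new session."""
    v = features(session)
    return min(profiles, key=lambda user: distance(profiles[user], v))


# Made-up browsing histories for two users.
history = {
    "user_a": [Session(4, 1.5, 20), Session(5, 1.2, 20), Session(4, 1.8, 21)],
    "user_b": [Session(12, 0.4, 9), Session(15, 0.3, 10), Session(11, 0.5, 9)],
}
profiles = {user: build_profile(s) for user, s in history.items()}

# An anonymous evening session with few pages and long reads.
guess = closest_user(profiles, Session(5, 1.4, 20))
print(guess)
```

A real system would likely need to scale the features and treat the hour of day as circular (23:00 is close to 01:00); both are left out here to keep the sketch short.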

While Padmanabhan and Yang focus on whether individuals have clickprints, the number of sessions needed to identify a unique individual, and potential fraud prevention applications, the paper also shows how companies can track users just by watching behavior. “Our main finding is that even trivial features in an Internet session can distinguish users,” says Padmanabhan. “People do seem to have individual browsing behaviors.”

Padmanabhan and Yang found that it takes anywhere from 3 to 16 Internet sessions to identify the behavior of a unique individual. “The paper is really a proof of concept that behavior and minimal information can be used to ID users,” says Yang.

There is an important difference, though, between the AOL incident and the setting addressed in the research paper. The AOL incident highlights how data captured by search engine companies may be (inappropriately) used to (1) build profiles of users based on what they search for on the Internet or (2) even identify them by name.

In contrast, the setting addressed by Padmanabhan and Yang is one where an online retailer observes an anonymous user browsing on its web site. The authors suggest that — based on clickprints learned from web browsing data — the retailer may then be able to match this anonymous user to a previous user who visited its web site in some prior session. Also unlike the AOL incident, the user profiles implicit in the clickprints relate to basic browsing characteristics of users, such as how many pages they view or how long they spend on the site.

The real trick is finding the right balance between gathering customer data and providing benefits. According to Padmanabhan, that conversation is just beginning. His research focused only on gathering data for internal use by a single site so that a company can suggest products, prevent fraud and tailor information for individual customers. The type of data released by AOL oversteps privacy bounds “because a list of all searches can be used to paint a picture that may not be accurate,” says Padmanabhan.

Tracking a Clickprint

The theory that web surfers have unique clickprints is based on the idea that humans all have “signatures,” such as fingerprints or handwriting styles, that uniquely differentiate each person. Padmanabhan and Yang adapt those concepts to web surfing by observing characteristics that are behavioral (such as visiting the same four pages at 8:15 p.m.) rather than physiological (such as a person’s appearance).

The clickprint effort, the authors write, is really a first foray into the field of identifying web users’ behavior, and more research is required. Furthermore, each web site that attempts to use clickprint analysis would have slight variations in what data is collected and how it is used. For an e-commerce company, clickprints could be used to customize shopping recommendations and help prevent fraud. A large site like Yahoo could analyze different variables with a goal of customizing content.

Padmanabhan says the main goal was to find an efficient way to sort through aggregated data and to find the smallest number of sessions necessary to get a valid result. As the authors note in their paper, “A news event may prompt an individual to visit a web site and read an article and perhaps watch a related video. If the news event is of wide interest, there may be several, even millions, of sessions that ‘look similar.’ However over time — across many sessions — an individual may implicitly reveal more information that may then enable unique identification. In other words, we do not assume that every web session has information to uniquely identify individuals. In fact, the examples offered above suggest otherwise. What we do assume is that there is some level of aggregation [that contains] enough information to uniquely distinguish individuals.” The authors also show that accuracy improves as more sessions are aggregated. In one example, individuals could be identified with 86.7% accuracy when seven sessions were analyzed, and with 99.4% accuracy when 51 sessions were aggregated.
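The aggregation idea can be sketched in a few lines. The example below is an illustration only, not the paper's procedure: the `aggregate` helper, the choice of simple averaging and the session values are all hypothetical. It shows how single sessions that look erratic can yield a steadier picture once they are grouped.

```python
from statistics import mean


def aggregate(sessions, group_size):
    """Average consecutive groups of sessions into one feature vector each.

    Leftover sessions that do not fill a whole group are dropped.
    """
    groups = []
    for start in range(0, len(sessions) - group_size + 1, group_size):
        chunk = sessions[start:start + group_size]
        groups.append(tuple(mean(col) for col in zip(*chunk)))
    return groups


# Hypothetical (pages viewed, minutes per page) for one user's six sessions.
sessions = [(3, 2.0), (9, 0.5), (4, 1.5), (8, 1.0), (2, 2.5), (10, 0.5)]

single = aggregate(sessions, 1)  # each session on its own: 2 to 10 pages
pairs = aggregate(sessions, 2)   # pairs of sessions: 6 pages in every group
print(pairs)
```

In this made-up data, pages viewed swings between 2 and 10 from one session to the next, but every pair of sessions averages 6. Larger groups give a more stable signature at the cost of needing more sessions before a match can be attempted, which is the trade-off the authors set out to measure.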

Applications of Clickprints

The big question raised by all this data mining is how clickprints will be used. In their paper, Padmanabhan and Yang focus on the positive applications by noting how clickprints could mitigate identity theft and e-commerce fraud. The importance of clickprints “can be significant, given applications to electronic commerce in general and, in particular, online fraud detection, [which is] a major problem in electronic commerce costing the economy billions of dollars annually,” they write.

For instance, an e-commerce company could use clickprints to recognize that a person is using a stolen credit card based on differences in browsing behavior from the card’s true owner. “If who I think it is enters a different credit card, I can either ask for more information or investigate,” Padmanabhan says. Of course, there is a flip side, he adds. “If you ID anonymously, it raises privacy concerns…. Using clickstream information inappropriately can pose a danger to both users and a company’s reputation.” The authors, however, point out that their current work mainly shows that users have unique clickprints, and stops short of exactly addressing how this can be employed in fraud detection.

Clickprints can also help customer service, the authors add. “Having a method for detection could help online merchants customize content and recommendations much earlier in a user session than they might otherwise be able to do (since they will not have to require a sign-on before implementing strategies to better serve this customer). Implemented appropriately, such customized online storefronts have been shown to increase customer satisfaction.”

However, there are limitations and challenges to using clickprints to profile customers. Among the challenges outlined by Padmanabhan:

  • It is unclear whether clickprints can be applied on a massive scale such as 100 million unique users; effective fraud detection may require methods that work at such a scale.

  • Companies will have to discover the characteristics that are unique to their customers, and sometimes such distinguishing characteristics may not exist.

  • Online companies will have to experiment to see what specific browsing behavior(s) need to be tracked to build unique profiles.

The Privacy Conversation

While, in theory, clickprints may have promising applications, Padmanabhan acknowledges that the AOL incident could hamper further development. Internet companies and their customers will need to discuss the privacy implications and benefits of using clickprints and anonymous data to identify individual browsing patterns. Ultimately, communications, perception and expectations all play a role in the privacy debate, says Padmanabhan.

For instance, web users don’t expect to be identified if they are only searching for information, he notes, which explains why the AOL leak caused such outrage. “However, if Amazon or a credit card company that can track everything you do uses clickprints, the perception is different because you expect it,” he adds. “Perception is a big factor.”

As is the benefit to the customer. If clickprints are used as a way to prevent fraud on an individual site, it’s highly unlikely there will be an uproar over privacy, says Padmanabhan. “If you are identified in a way that will provide substantial benefits, it makes sense. If the benefit is not clear, it’s a recipe for a bad reaction.”

The major takeaway from Padmanabhan’s finding is that “this (privacy vs. benefit) doesn’t have to be a zero sum game.” In other words, some profiling is good if done carefully and used for a legitimate purpose with tangible benefits.

The evolution of clickprints is likely to be an ongoing issue for content companies that mine user data but fail to communicate privacy policies or demonstrate any tangible user benefits, Padmanabhan notes. While the value of the appropriate use of clickprints is clear for e-commerce companies, content firms are going to have to figure out how to handle customer tracking. Padmanabhan suggests that communicating expectations to users is critical and that companies should have the privacy conversation in the context of his clickprint findings.

Clickprints on the Web: Are there signatures in Web browsing data?