Categories
General

Guided Summarization

I’m still waiting for the ECIR organizers to post the slides from the Industry Day. I particularly liked Nick Craswell’s presentation on A Brief Tour of “Query Space”. Until his slides are up, I recommend this SIGIR ’07 paper to give you an idea of his approach.

Slides are here as a PowerPoint show for anyone interested.

Categories
General

List of Findability Solutions

Dan Keldsen has posted a list of findability-related solutions at BizTechTalk. The 80 or so solutions that he lists are certainly an attempt to err on the side of recall, by including search, taxonomies, interfaces, and visualization as aspects of findability. Definitely a useful resource for anyone interested in enterprise information access.

Categories
General

Privacy through Difficulty

I had lunch today with Harr Chen, a graduate student at MIT, and we were talking about the consequences of information efficiency for privacy.

A nice example is the company pages on LinkedIn. No company, to my knowledge, publishes statistics on:

  • the schools their employees attended.
  • the companies where their employees previously worked.
  • the companies where their ex-employees work next.

If a company maintains these statistics, it surely considers them to be sensitive and confidential. Nonetheless, by aggregating information from member profiles, LinkedIn computes best guesses at these statistics and makes them public.
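To illustrate how little machinery this kind of aggregation needs, here is a minimal Python sketch. The profile records, company names, and the `company_stats` helper are all invented for illustration; I have no knowledge of how LinkedIn actually computes its company pages.

```python
from collections import Counter

# Hypothetical member profiles: invented data, not real LinkedIn records.
# "employers" is assumed to be listed in chronological order.
profiles = [
    {"name": "A", "school": "MIT", "employers": ["Acme", "Initech"]},
    {"name": "B", "school": "MIT", "employers": ["Globex", "Acme", "Initech"]},
    {"name": "C", "school": "NYU", "employers": ["Initech", "Hooli"]},
    {"name": "D", "school": "NYU", "employers": ["Acme", "Hooli"]},
]

def company_stats(profiles, company):
    """Best-guess statistics for one company, built only from member profiles."""
    schools, came_from, went_to = Counter(), Counter(), Counter()
    for p in profiles:
        jobs = p["employers"]
        if company not in jobs:
            continue
        i = jobs.index(company)
        schools[p["school"]] += 1          # schools attended by (ex-)employees
        if i > 0:
            came_from[jobs[i - 1]] += 1    # employer immediately before
        if i + 1 < len(jobs):
            went_to[jobs[i + 1]] += 1      # employer immediately after
    return schools, came_from, went_to

schools, came_from, went_to = company_stats(profiles, "Acme")
print(schools.most_common(), came_from.most_common(), went_to.most_common())
```

None of the individual profiles is secret; the sensitive picture only emerges once they are counted together.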

Arguably, information like this was never truly private, but was simply so difficult to aggregate that nobody bothered. As Harr aptly put it, they practiced “privacy through difficulty”–a privacy analog to security through obscurity.

Some people are terrified by the increasing efficiency of the information market and look for legal remedies as a last-ditch attempt to protect their privacy. I am inclined towards the other extreme (see my previous post on privacy and information theory): let’s assume that information flow is efficient and confront the consequences honestly. Then we can have an informed conversation about information privacy.

Categories
General

Social Navigation

There has been a lot of recent buzz about social navigation, including some debate about what the phrase means. I dug into the archives and found a paper from the CHI ’94 Conference on Human Factors in Computing Systems entitled “Running Out of Space: Models of Information Navigation”. In it, Paul Dourish and Matthew Chalmers distinguish between semantic navigation and social navigation:

[semantic navigation offers] the ability to explore and choose perspectives of view based on knowledge of the semantically-structured information.

In social navigation, movement from one item to another is provoked as an artifact of the activity of another or a group of others.

Back in 1994, the Web was only starting to reach a broad audience. The authors cite two examples of social navigation: personal home pages, where people listed sites they found interesting, and collaborative filtering (specifically, the Information Tapestry project at Xerox PARC).

Today, a decade and a half later, the web has scaled by several orders of magnitude, search engines have largely obviated the listing of interesting sites on personal home pages, and collaborative filtering, while still going strong as a social influence on user experience, hardly feels like navigation. It does seem that the term “social navigation” deserves an update.

Following Dourish and Chalmers, let us define social navigation as the ability to explore and choose perspectives of view based on social information. Importantly, social navigation is user-controlled navigation just like semantic navigation–only that the user is navigating by changing the social lens on the information rather than specifying semantic constraints.

One example of social navigation is the ratings information at the Internet Movie Database (IMDB). For example, we can see from the ratings for Live Free or Die Hard that the movie appealed most to males under 18.

Fandango (an Endeca customer) takes this concept a step further, offering users faceted navigation of the space of movie reviews, where facets include age, gender, whether or not the reviewer has children, and whether the reviewer lives near the user.
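As a rough sketch of what navigating by a social lens could look like in code, the Python below treats reviewer attributes as facets that the user can refine on. The review records, facet names, and the `refine` and `facet_counts` helpers are hypothetical; this is not Fandango's or Endeca's implementation.

```python
from statistics import mean

# Hypothetical reviews: invented data and facet names, for illustration only.
reviews = [
    {"rating": 5, "age": "under 18", "gender": "M", "has_children": False},
    {"rating": 4, "age": "under 18", "gender": "M", "has_children": False},
    {"rating": 3, "age": "30-44", "gender": "F", "has_children": True},
    {"rating": 2, "age": "30-44", "gender": "M", "has_children": True},
    {"rating": 4, "age": "18-29", "gender": "F", "has_children": False},
]

def refine(reviews, **facets):
    """Keep only reviews whose reviewer matches every selected facet value."""
    return [r for r in reviews if all(r[k] == v for k, v in facets.items())]

def facet_counts(reviews, facet):
    """Count the remaining reviews per value of a facet, to guide the next refinement."""
    counts = {}
    for r in reviews:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

# The user changes the social lens: "show me what parents thought."
parents = refine(reviews, has_children=True)
parents_avg = mean(r["rating"] for r in parents)
print(parents_avg, facet_counts(parents, "gender"))
```

The point of the sketch is that the user, not the system, chooses which reviewers count; the counts per facet value show where a further refinement would lead.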

More sophisticated interfaces will intermingle semantic and social navigation. Here is a screen shot from a prototype some of my colleagues put together and demonstrated at HCIR ’07:

Social navigation, defined as above, offers users more than just the ability to be influenced by other people. It offers users transparency and control over the social lens. It allows us to think outside the black box.

Categories
General

Happy Rota Day!


Since this is a personal blog, I’d like to go a bit off-topic and recognize my late mentor Gian-Carlo Rota, whose birthday is today. While I and countless others recall Gian-Carlo most fondly as a mentor and teacher, his crowning achievement was to make combinatorics a respectable branch of modern mathematics. Indeed, combinatorics and probability theory have been instrumental to the progress of information retrieval and information science.

And this nugget of his advice about lecturing seems remarkably appropriate in the context of how information retrieval engines should work:

Every lecture should state one main point and repeat it over and over, like a theme with variations. An audience is like a herd of cows, moving slowly in the direction they are being driven towards. If we make one point, we have a good chance that the audience will take the right direction; if we make several points, then the cows will scatter all over the field. The audience will lose interest and everyone will go back to the thoughts they interrupted in order to come to our lecture.

Happy Birthday, Gian-Carlo.

Categories
Uncategorized

Workshop on Ranked XML Querying

Thanks to an excellent blog written by Panos Ipeirotis at the NYU Stern School, I learned about a workshop held last month in Dagstuhl on ranked XML querying. Most of the presentations are available online, including one entitled DB & IR from a DB Viewpoint by Gerhard Weikum at the Max Planck Institut für Informatik. I’m excited to see these efforts to unify the DB and IR perspectives. So much more productive than the infamous MapReduce debate!

Categories
General

Database Usability

Just as I was digesting Jeff Naughton’s presentation at DB/IR day, a colleague at Endeca emailed me the keynote that H. V. Jagadish (University of Michigan) presented at SIGMOD ’07 on making database systems usable. He enumerates the familiar pain points of today’s database systems: confusing schemas, too many choices to make, unexpected–and unexplained–system behavior, and too high a cost for initial creation. He proposes “systems that reflect the user’s model of the data, rather than forcing the data to fit a particular model.”

As with Jeff’s presentation, the main take-away here is a framework (though both he and Jeff have taken initial steps to address the problems they describe). As a practitioner, I’m most encouraged by the fact that database researchers, like information retrieval researchers, are increasingly recognizing the importance of users.

Categories
General

The Efficiency of Social Tagging

Credit to Kevin Duh by way of the natural language processing blog for highlighting recent work from PARC on understanding the efficiency of social tagging systems using information theory. The authors apply information theory to establish a framework for measuring the efficiency of social tagging systems, and then empirically observe that the efficiency of tagging on del.icio.us has been decreasing over time. They conclude by suggesting that current tagging interfaces may be at fault, through a positive feedback process of encouraging popular tags.
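For a feel of what an information-theoretic measure of tagging efficiency might look like, here is a small Python sketch that computes the conditional entropy H(D|T) of documents given tags: the fewer bits of uncertainty about the document remain once you know the tag, the more useful the tag is for finding things. This is one plausible quantity under my own assumptions, with invented bookmark data; I am not claiming it reproduces the exact framework in the PARC work.

```python
from collections import Counter
from math import log2

def conditional_entropy(pairs):
    """H(D|T) in bits, estimated from a list of (document, tag) bookmark pairs."""
    joint = Counter(pairs)                      # counts of each (document, tag)
    tag_totals = Counter(t for _, t in pairs)   # counts of each tag
    n = len(pairs)
    h = 0.0
    for (d, t), c in joint.items():
        # p(d, t) * log2 p(d | t), summed with a minus sign
        h -= (c / n) * log2(c / tag_totals[t])
    return h

# Invented examples: distinctive tags versus one popular tag applied to everything.
specific = [("d1", "python"), ("d2", "cooking"), ("d3", "jazz"), ("d4", "chess")]
popular = [("d1", "cool"), ("d2", "cool"), ("d3", "cool"), ("d4", "cool")]

print(conditional_entropy(specific))  # each tag pins down its document
print(conditional_entropy(popular))   # the tag tells us nothing about which of four documents
```

In the second collection everyone converges on the same popular tag, which is the kind of feedback effect the authors suggest current interfaces may encourage.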

After seeing this and the TagMaps work at Yahoo Research Berkeley, I feel that the IR and HCI communities should join forces to understand social tagging in general terms that relate information, knowledge representation, and human beings. These concerns are hardly specific to the web or to what is now called “social media”–after all, media is social by definition. Indeed, there is no reason to confine this approach to human-tagged collections–why not consider automated tagging systems on the same playing field?

Categories
General

Accessibility in Information Retrieval

The other day, I was talking with Leif Azzopardi at the University of Glasgow about accessibility in information retrieval. Accessibility is a concept borrowed from land use and transportation planning: it measures the cost that people are willing to incur to reach opportunities (e.g., shopping, restaurants), weighted by the desirability of those opportunities.

What does accessibility mean in the context of information retrieval?

Instead of an actual physical space, in IR, we are predominately concerned with accessing information within a collection of documents (i.e., information space), and instead of a transportation system, we have an Information Access System (i.e., a means by which we can access the information in the collection, like a query mechanism, a browsing mechanism, etc). The accessibility of a document is indicative of the likelihood or opportunity of it being retrieved by the user in this information space given such a mechanism.

It’s a very appealing way to measure the effectiveness with which an information retrieval system exposes a document collection–as well as the bias the system imposes. While the paper offers more questions than answers, I recommend it to anyone who is interested in thinking outside the box of the traditional IR performance measures.
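To make the idea of document accessibility a little more concrete, here is a minimal Python sketch under my own simplifying assumptions: a document's score is the number of queries for which the system ranks it within the top `cutoff` results, and the Gini coefficient over those scores summarizes how unevenly the system exposes the collection. The queries, rankings, and the choice of Gini as the bias summary are illustrative assumptions, not the paper's definitive formulation.

```python
def accessibility_scores(rankings, doc_ids, cutoff):
    """Count, per document, the queries that return it within the top `cutoff` ranks."""
    scores = {d: 0 for d in doc_ids}
    for ranked in rankings.values():
        for d in ranked[:cutoff]:
            scores[d] += 1
    return scores

def gini(values):
    """Gini coefficient: 0 means every document is equally reachable."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Hypothetical ranked results for three queries over a four-document collection.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3", "d2"],
    "q3": ["d2", "d1", "d4"],
}
docs = ["d1", "d2", "d3", "d4"]

scores = accessibility_scores(rankings, docs, cutoff=2)
bias = gini(scores.values())
print(scores, round(bias, 3))
```

With a cutoff of two, the document d4 is never reachable even though it is in the collection, which is exactly the kind of bias that traditional precision and recall figures do not reveal.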

Categories
General

North East DB / IR Day

Last Friday, I had the privilege to attend the Spring 2008 North East DB/IR Day, hosted by Columbia University:

The North East DB/IR Day brings together database and information retrieval researchers and students from both academic and research institutions in the Northeastern United States. The DB/IR Day is a semi-annual workshop that features an exciting technical program as well as informal discussion. The DB/IR Day provides a regular forum for presenting diverse viewpoints on database systems and information retrieval, addressing current topics as well as promoting information exchange among researchers.

The event lived up to its promise, and I was impressed with the quality of student posters. But my favorite part of the event was the keynote by Jeff Naughton entitled “Extracting Problems for Database and IR Researchers.”

Jeff characterized the traditional philosophy of the database community as guaranteeing perfect outputs if the inputs are perfect. He argued that what we need more of today are databases that expect imperfection and try to help.

To summarize his talk:

  • Provide support for “learn schema as you go.”
  • Develop techniques to explain inconsistency and let users reason about it.
  • Expect errors, provide tools for users to understand/debug them.
  • View task as helping user discover what they want in large space of potential queries.

It is encouraging to see such a prominent database researcher advocating this vision, especially since it aligns so well with the technology we are developing at Endeca.