
Micro vs. Macro Information Retrieval

The Probably Irrelevant blog has been quiet for a while, but I was happy to see a new post there by Miles Efron about “micro-IR”. He characterizes micro-IR, as distinct from macro or general IR, as follows:

  1. In ad hoc (text) IR a principal intellectual challenge lies in modeling ‘aboutness.’  In micro-IR settings, the creativity comes into play in posing a useful (and tractable) question to answer.  The engineering comes easily after that.
  2. The constrained nature of micro-IR applications leads to a lightweight articulation of information need.  There is a tight coupling here between task, query, and the unit of retrieval, a dynamic that I think is compelling.  Pushing this a bit farther, we might consider the simple act of choosing to use a particular application from those apps on a user’s palette as part of the information need expression.
  3. The tight coupling of task to data to ‘query’ enables a strong contextual element to inform the interaction.  Context constitutes the foreground of the micro-IR interaction.

He then asks: “is micro-IR something at all?  Is it actually related to IR?” Fernando Diaz answers that “the only difference between micro and macro IR is text.” Jinyoung Kim adds that in micro-IR “the context (searcher goal) is known, with domain-specific notion of relevance (goodness) and similarity measures.”

I hadn’t thought of making this particular distinction, but I like it. While I prefer to think about distinguishing the needs of information seekers–rather than the characteristics of search applications–I would be the first to argue that a well-designed search application caters to particular user needs. Indeed, I think the definition of a good micro-IR application implies that it addresses a highly constrained space of information needs. Just as importantly, micro-IR applications can often assume that their users are highly familiar with the information space the applications address, and thus that those users need less of the basic orienteering support that can be critical for success using macro-IR systems. That said, micro-IR users have (or should have) higher expectations of support for more sophisticated information seeking.

The other day, I speculated about why Google holds back on faceted search. I feel that the distinction between macro- and micro-IR is in the same vein: micro-IR settings (e.g., site search, enterprise search, vertical search) drive the need for richer interfaces and support for interaction, while macro-IR application developers (e.g., general web search) worry mostly about producing a reasonable answer for the query–and often lead users to micro-IR destinations that offer their own support for information seeking within their constrained domains.

In short, it’s a nice way to think about the IR application space, and it’s increasingly relevant (no pun intended!) as we see a proliferation of micro-IR applications. And it’s great to see activity on the Probably Irrelevant blog after all these months of radio silence!


Not All Google Critics Are Bigots

Jeff Jarvis wrote a post today entitled “Google bigotry”, in which he asserts that:

Google has an image problem – not a PR problem (that is, not with the public) but a press problem (with whining old media people).

He then goes on to launch a tirade against a Le Monde journalist whose offense was to say she was writing “an article about Google facing a rising tide of discontent concerning privacy and monopoly.” He proceeds to stereotype the French as suffering from a “national insanity” of Google bigotry. I’ll leave the analysis of irony as an exercise for the reader.

But the true irony is that Jarvis has a point. While I haven’t done a rigorous analysis, my impression is that there has been a sensationalist press overreaction against Google, singling out Google for behavior for which all other companies get a pass. As even one of the most vocal Google critics admits, “Google’s [privacy] policies are essentially no different than the policies of Microsoft, Yahoo, Alexa and Amazon.” Moreover, some of the newspapers criticizing Google as parasitic are the same ones that once turned–and still turn–to Google with open arms as a source of traffic, even though they could easily cut Google off by configuring robots.txt. Granted, the newspapers are now locked into a prisoner’s dilemma, but they should at least take some responsibility for putting themselves in that position.
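For what it’s worth, opting out really is a one-file change. A minimal robots.txt along the following lines (a sketch only; a real publisher would scope the rules more carefully) tells Google’s crawler to stay away while leaving everyone else alone:

    # Block Google's crawler from the entire site...
    User-agent: Googlebot
    Disallow: /

    # ...while leaving all other crawlers unrestricted.
    User-agent: *
    Disallow:

That few newspapers actually do this is, of course, the point: the traffic is worth more to them than the principle.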

That said, there are lots of legitimate reasons to criticize Google, specifically concerning privacy and monopoly. While Google may not have engaged in any illegal or unethical practices to get there, it now holds a position as the primary gatekeeper to the internet for a substantial majority of Americans, as well as much of the western world. On the content creation side, site owners don’t ask “What Would Google Do?“–rather they ask how Google will index their sites. Meanwhile, on the consumption side, the broadening scope of Google’s role in ordinary people’s lives is legitimate cause for concern about privacy. It’s not insane or bigoted to raise these issues.

Moreover, Google claims to hold itself to a higher standard than other companies, so it’s not that surprising that people actually do hold it to that standard and criticize it when it falls short. Still, that’s no excuse for exaggeration or outright hallucination.

As I commented on Jarvis’s blog, I don’t think he’s the most credible judge of Google’s critics. He responded in kind. Touché. I accept that exchanging personal attacks doesn’t advance the argument. Perhaps more detached voices can chime in.


HCIR: Better Than Magic!

I’m a big fan of using machine learning and automated information extraction to improve search performance and generally support information seeking. I’ve had some very good experiences with both supervised (e.g., classification) and unsupervised (e.g., terminology extraction) learning approaches, and I think that anyone today who is developing an application to help people access text documents should at least give serious consideration to both kinds of algorithmic approaches. Sometimes automatic techniques work like magic!

But sometimes they don’t. Netbase’s recent experience with HealthBase is, unfortunately, a case study in why you shouldn’t have too much faith in magic. As Jeff Dalton noted, the “semantic search” is hit-or-miss. The hits are great, but it’s the misses that generate headlines like this one in TechCrunch: “Netbase Thinks You Can Get Rid Of Jews With Alcohol And Salt”. Ouch.

It seems unfair to single out Netbase for a problem endemic to fully automated approaches, but they did invite the publicity. It would be easy to dig up a host of other purely automated approaches that are just as embarrassing, if less publicized.

Dave Kellogg put it well (if a bit melodramatically) when he characterized this experience as a “tragicomedy” that reveals the perils of magic. His argument, in a nutshell, is that you don’t want to be completely dependent on an approach for which 80% accuracy is considered good enough. As he says, the problem with magic is that it can fail in truly spectacular ways.

Granted, there’s a lot more nuance to using automated content enrichment approaches. Some techniques (or implementations of general techniques) optimize for precision (i.e., minimizing false positives), while others optimize for recall (i.e., minimizing false negatives). Supervised techniques are generally more conservative than unsupervised ones: you might incorrectly assert that a document is about disease, but that’s a less dramatic failure than adding the word “Jews” to an automatically extracted medical vocabulary. In general, the more human input into the process, the more opportunity there is to improve effectiveness and avoid embarrassing mistakes.
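To make the precision/recall trade-off concrete, here is a minimal Python sketch (the extracted and reference terms are invented for illustration) that evaluates a hypothetical automatically extracted vocabulary against a hand-curated reference list:

    # Evaluating a hypothetical automatically extracted medical vocabulary
    # against a hand-curated reference list. All terms are made up for illustration.
    extracted = {"diabetes", "insulin", "salt", "alcohol"}       # what the system produced
    reference = {"diabetes", "insulin", "glucose", "metformin"}  # what it should have produced

    true_positives = extracted & reference

    precision = len(true_positives) / len(extracted)  # penalized by false positives
    recall = len(true_positives) / len(reference)     # penalized by false negatives

    print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.50 and 0.50 here

A precision-oriented system would rather leave “salt” and “alcohol” out, even at the cost of missing legitimate terms; a recall-oriented one makes the opposite bet.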

Of course, the whole point of automation is to reduce the need for human input. Human labor is a lot more expensive than machine labor! But there’s a big difference between the mirage of eliminating human labor and the realistic aspiration to make its use more efficient and effective. That’s what human-computer information retrieval (HCIR) is all about, and all of the evidence I’ve encountered confirms that it’s the right way to crack this nut. Look for yourselves at the proceedings of HCIR ’07 and ’08. Having just read through all of the submissions to HCIR ’09, I can tell you that the state of the art keeps getting better.

Interestingly, even Google CEO Eric Schmidt may be getting around to drinking the kool-aid. In an interview published today in TechCrunch, he says: “We have to get from the sort of casual use of asking, querying…to ‘what did you mean?’.” Unfortunately, he then goes into science-fiction-AI land and seems to end up suggesting a natural language question-answering approach like Wolfram Alpha. Still, at least his heart is in the right place.

Anyway, as they say, experience is the best teacher. Hopefully Netbase can recover from what could generously be called a public relations hiccup. But, as the aphorism continues, only a fool will learn from no other. Let’s not be fools–and instead take away the moral of this story: instead of trying to automate everything, optimize the division of labor between human and machine. HCIR.


Finding, Locating, Discovering

Thanks to Tony Hollingsworth for alerting me to a post by Alex Campbell entitled “Stark realisation: I no longer depend on Google to find stuff”. The title is provocative link bait, but the take-away is very down to earth: Google is more useful for locating information than for discovering it.

Library scientists make a distinction between known-item and exploratory search. The former is about locating information: as an information seeker, you know the information exists, and you can even characterize it unambiguously; but the challenge is to convert that description into a location that allows you to retrieve the information. The latter is about discovery: you don’t know that the information you seek exists, you may not be sure how to characterize what you are looking for, and you may not even know exactly what you want until you’ve learned something about what is available.

These are extreme points on the information seeking spectrum, and most real-world tasks are in the middle, or combine subtasks of both types. For example, in physical libraries (yes, I’m that old!), I remember finding a book in the stacks and then browsing the nearby books in the hopes of serendipitous discovery. These days, I’d be more likely to scan its bibliography–or to look at the books and articles citing it. A known item can be an excellent entry point for exploration. Conversely, exploration can lead you to discover the existence of information that you then simply need to retrieve.

In common use, words like searching and finding cover this entire spectrum of information seeking activity. This breadth of meaning causes a lot of confusion. I’ve blogged about this before: “What is (Not) Search?”:

At the very least, I propose that we distinguish “search” as a problem from “search” as a solution. By the former, I mean the problem of information seeking, which is traditionally the domain of library and information scientists. By the latter, I mean the approach most commonly associated with information retrieval, in which a user enters a query into the system (typically as free text) and the system returns a set of objects that match the query, perhaps with different degrees of relevancy.

Back to Campbell’s article. His main points:

  • Social networks have dramatically expanded our network of contacts.
  • Search engine optimization (SEO) experts have killed their own game.
  • The flow of information has changed: information now comes to us, rather than us having to go out and find it.

I like the spirit of the post, but I think he overstates his case. SEO isn’t all bad–in fact, it’s probably a key factor in Google’s effectiveness. And while social networks enable social search in theory, and information does come to us, we are experiencing filter failure (Clay Shirky’s term) in a big way.

My conclusion: I agree with him about Google’s limitations–Google is primarily a locating tool, not a discovery tool. Unfortunately, I’m not persuaded that social networks and our theoretical ability to construct an ideal in-flow of information have actually delivered on the promise of more efficient information access. But I’m optimistic that we’ll eventually get there.


Free as in Freebase

It’s been a while since I’ve blogged about Freebase, the semantic web database maintained by Metaweb. But I recently had the chance to meet Freebasers Robert Cook and Jamie Taylor and hear them present to the New York Semantic Web Meetup on “Content, Identifiers and Freebase” (slides embedded above).

It was a fun and informative presentation. Perhaps the most surprising revelation about Freebase was that all of their data fits in RAM on a 32G box (yes, some of you caught me live-tweeting that during the presentation). Their biggest challenge is collecting good data that lends itself to the reconciliation needed to make Freebase useful as a data repository. Despite the lack of a near-term revenue model, the Freebasers are bullish about their approach: strong identifiers, strong semantics, open data. On the last point, almost all of Freebase is available under the Creative Commons Attribution License (CC-BY)–which, as far as I can tell, makes anyone free to develop a mirror of Freebase. Indeed, many people are using this data, including Google and Bing.
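To give a flavor of what “strong identifiers, strong semantics” means in practice, here is a sketch of an MQL query of the sort Freebase’s read service accepts. The identifier and property names are recalled from memory rather than quoted from the current schema, so treat them as illustrative:

    import json

    # An MQL query is itself structured data: ask for all albums by an artist,
    # referring to the artist by a strong Freebase identifier rather than a string.
    # (The id, type, and property names here are illustrative, from memory.)
    mql_query = {
        "id": "/en/the_police",   # strong identifier for the topic
        "type": "/music/artist",  # strong semantics: typed, reconciled properties
        "album": [],              # an empty list asks the service to fill it in
    }

    # This JSON would be sent to Freebase's mqlread service, which returns the
    # same structure with the blanks filled in.
    print(json.dumps({"query": mql_query}, indent=2))

The appeal is that the query language mirrors the data model: you write the shape of the answer you want and leave blanks for the parts you don’t know.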

You might wonder whether Freebase is a business or a non-profit foundation–and the question did come up. The answer is that Freebase eventually expects to make money by providing services, e.g., helping advertisers. They see their graph store as a competitive advantage–but they freely admit that this advantage will erode over time. Indeed, the surprisingly small size of their graph makes me wonder how much speed and scalability matter, compared to the challenge of data scarcity.

I’d like to see Freebase succeed. I’m particularly a fan of the work David Huynh has done there on interfaces for semantic web browsing. Clearly their investors are true believers–Metaweb has raised a total of $57M in funding. I don’t quite get it, but I’m happy we can all benefit from the results.


Social Networking: Theory and Practice

I’ve been a student of social network theory for years, enjoying the work of Duncan Watts, Albert-László Barabási, Jon Kleinberg, and a number of other researchers investigating this field. It should be no surprise that a topic that is so core to our humanity has attracted attention from some of our best and brightest.

And I’ve dabbled a bit on the theoretical side myself. The TunkRank measure (I’m indebted to Jason Adams for implementing it on a live site!) attempts to take the most basic assumption about our social behavior–the constraint that we have a finite attention budget–and explore its implications for influence over social networks. I have a few unexplored hypotheses queued up for when I can find the spare time to try to validate them empirically!
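For the curious, here is a minimal sketch of computing TunkRank by fixed-point iteration from its recursive definition: each follower contributes attention inversely proportional to the number of accounts that follower follows, damped by a retweet probability p. The tiny follower graph is invented purely for illustration.

    # TunkRank sketch: influence(X) = sum over followers F of X of
    #   (1 + p * influence(F)) / |following(F)|
    # computed by simple fixed-point iteration. The toy graph is made up.

    p = 0.05  # probability that a follower passes a tweet along

    # following[u] = accounts u follows, so u's attention is split among them
    following = {
        "alice": {"bob", "carol"},
        "bob":   {"carol"},
        "carol": {"alice"},
    }

    # invert the graph: followers[u] = accounts that follow u
    followers = {u: set() for u in following}
    for u, follows in following.items():
        for v in follows:
            followers[v].add(u)

    influence = {u: 0.0 for u in following}
    for _ in range(100):  # iterate to (approximate) convergence
        influence = {
            u: sum((1 + p * influence[f]) / len(following[f]) for f in followers[u])
            for u in following
        }

    print(influence)

Nothing deep is happening computationally; the interesting questions are empirical, which is exactly where those unexplored hypotheses come in.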

But why settle for theory? We live in an age where social networks compete with web search (and perhaps complement search) as the hottest online technologies. If we’re not reading about Google vs. Bing, we’re reading about Facebook vs. Twitter, with LinkedIn offering a third way that seems to co-exist with its more storied peers. In this post, I’d like to focus on LinkedIn.

LinkedIn, despite its feature creep, is still fairly old-school: its raison d’être is for users to build, maintain, and exploit their professional networks. In theory, connections on LinkedIn represent present or past working relationships that become the basis for referrals–whether the goal is employment, sales, or partnership. LinkedIn is not the only professionally oriented social network, but at this point it’s certainly the dominant one.

But I’ve found at least two additional ways to use LinkedIn that I’d like to share:

Intelligence gathering. For reasons I don’t yet claim to understand, people share far more information about themselves–and in a much cleaner, structured form–on LinkedIn than in perhaps any other online medium. Most people’s resumes are not available online, but their LinkedIn profiles are tantamount to resumes. Moreover, the structured format makes it possible for LinkedIn to assemble aggregate profiles of companies, revealing composite pictures that must drive some of those companies’ legal and HR departments batty! At a higher level, LinkedIn also works well as a discovery tool–much more so now that they’ve enabled faceted search. It’s still a bit tricky to explore people and companies by topic, but doing so is far more effective on LinkedIn than with any other tool I’m aware of.

Meeting new people. Cold-calling, spamming–pick your poison. In short, LinkedIn doesn’t have to be only about connecting with people you already know. But there’s an art to sending unsolicited messages: you have to pass the moral equivalent of a CAPTCHA by proving that your communication strategy isn’t indiscriminate. Let me use a personal example (that Maisha Walker was nice enough to write up in her Inc. magazine column). I decided that I wanted to find everyone on LinkedIn who might be interested in HCIR ’09. So I searched for everyone whose profiles indicated interests in both IR and HCI and sent out a targeted message (in fact, an invite with a personalized message–a feature I recently feared they’d killed). The results were overwhelmingly positive. I’m not sure how many of the people I contacted will attend, but I raised awareness without inflicting annoyance. Better yet, one of the people I contacted then discovered I was looking for volunteers to review the draft of my book–and I thus obtained hours of help from someone who, just a day before, had never heard of me!

What intrigues me about LinkedIn (and other social networks) is the extent to which I am exploiting attention market inefficiencies (as LinkedIn may be doing as well). For example, LinkedIn makes it easy to send unsolicited invitations to anyone. Granted, you can lose this privilege by having even a couple of people respond to invitations with “I don’t know this person”. There’s also the question of why people’s social norms around disclosure are so different on LinkedIn than anywhere else–people not only post the content of their resumes, but go through the effort of providing it to LinkedIn in a structured form! Meanwhile, LinkedIn keeps tightfisted control over the information it aggregates–understandably, they recognize that this content is their most valuable asset.

People are still getting used to the idea of social networks. It will be interesting to see how their use evolves, particularly in terms of information and attention market efficiency.


Payola? There’s An App For That!

Remember a few months ago when there was a scandal about a Belkin employee paying people $0.65 per review to post 5-star reviews to Amazon?

Well, that was child’s play compared to what PR firm Reverb Communications has allegedly been doing for its clients. According to Gagan Biyani at TechCrunch, Reverb hired interns to post positive reviews to Apple’s App Store on its clients’ behalf. Indeed, TechCrunch posted documentation obtained through an anonymous tipster, including the following:

Reverb employs a small team of interns who are focused on managing online message boards, writing influential game reviews, and keeping a gauge on the online communities. Reverb uses the interns as a sounding board to understand the new mediums where consumers are learning about products, hearing about hot new games and listen to the thoughts of our targeted audience. Reverb will use these interns on Developer Y products to post game reviews (written by Reverb staff members) ensuring the majority of the reviews will have the key messaging and talking points developed by the Reverb PR/marketing team.

What makes this story especially newsworthy is that Reverb’s client list includes some big names, such as Harmonix (i.e., Guitar Hero and Rock Band) and MTV Games.

Apparently the reviewer system isn’t entirely anonymous, so Biyani was able to look for patterns:

iTunes allows you to see other reviews posted by the same reviewer. So, we clicked on the reviewer “Vegas Bound” (iTunes link) and started to look at his reviews. He reviewed 7 applications, and gave each one of them 5 stars. Each review was short and sweet, and extremely positive. These reviews represented 6 different developers. A quick Google search revealed an infuriating truth: every single one of these developers was a client of one PR firm: Reverb Communications.
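The pattern Biyani describes is simple enough to express as code. Here is a rough sketch of flagging reviewers whose uniformly five-star reviews all trace back to a single PR firm’s client list; the data, names, and threshold are invented for illustration:

    # Illustrative sketch of the pattern described above: a reviewer whose reviews
    # are all five stars and whose reviewed apps all belong to clients of a single
    # PR firm looks suspicious. All data and thresholds here are invented.

    reviews = [  # (reviewer, app_developer, stars)
        ("Vegas Bound", "DeveloperA", 5), ("Vegas Bound", "DeveloperB", 5),
        ("Vegas Bound", "DeveloperC", 5), ("OrdinaryUser", "DeveloperA", 3),
    ]
    pr_clients = {"SomePRFirm": {"DeveloperA", "DeveloperB", "DeveloperC"}}

    def suspicious(reviewer, min_reviews=3):
        mine = [(dev, stars) for r, dev, stars in reviews if r == reviewer]
        if len(mine) < min_reviews or any(stars < 5 for _, stars in mine):
            return None
        devs = {dev for dev, _ in mine}
        # flag if every reviewed developer is a client of the same PR firm
        return next((firm for firm, clients in pr_clients.items() if devs <= clients), None)

    print(suspicious("Vegas Bound"))   # -> 'SomePRFirm'
    print(suspicious("OrdinaryUser"))  # -> None

Of course, real shills will learn to vary their stars and spread their reviews around, which is precisely why systemic fixes matter more than any one detector.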

I can only hope that scandals like these will cause people to be more skeptical of reviews (or opinions in general) that come from anonymous or obfuscated sources. While most reviews are probably sincere, it doesn’t take much to erode public trust. Moreover, a few shill reviews can attract attention to a product, thus leading legitimate reviews to follow afterward. Where’s the harm? Products without those shill reviews are starved of the attention they might deserve. Money substitutes for authentic endorsement.

Our brave new world of social media makes it possible to truly democratize the sharing of knowledge and opinions. But gaming the system like this erodes the trust that is essential for this process to work–and thus devalues all of the information available to us online. The key enabler of such gaming is anonymity. Fortunately the miscreants do get caught on occasion. Hopefully we will learn from this experience and build more robust systems that aren’t so easily gamed. Transparency or FAIL.


Google Search Appliance: Now Without HCIR!

In an earlier post, I speculated about why Google is holding back on faceted search. Of course, I was talking about their web search properties, not their enterprise offerings. I thought that they’d seen the light by now that faceted search–and HCIR in general–is especially important in the enterprise, where you can’t rely on PageRank, anchor text, and SEO–not to mention the large fraction of navigational and straight-to-Wikipedia queries.

But I was wrong. Don’t take it from me–watch the video below (or read this blog post) and listen to what Cyrus Mistry, the product manager for the Google Search Appliance, has to say. I might give him a pass on his dubious conflation of all features other than ranked retrieval with “advanced search”. But here’s a direct quote: “users care about one thing: the right result coming to the top”.

Sigh. I don’t dismiss the value of relevance ranking. Some search queries are easy and clearly point to single documents as answers–and any search engine should do well on them. But lots of queries in site search and enterprise search environments (more so than on the web) don’t have a single best answer. That’s why we have faceted search and interfaces that offer useful information scent to users.

I understand that Google is, on the whole, HCIR-averse. But I expect more from their enterprise division. To be clear, the “side by side” feature that Mistry touts is nice. It reminds me of Blind Search (built by a Microsoft employee in his spare time), and of a relevance ranking evaluator that Endeca customers have been using for years.

But there’s more to search results than ten blue links. Even the Google web folks seem to be slouching towards accepting the importance of interaction. Their enterprise team should be leading, not lagging.


Prediction Is Hard, Especially About The Future

That Niels Bohr certainly knew what he was talking about! But that hasn’t discouraged folks in any number of industries from trying to make predictions.

Google in particular has been researching the predictability of search trends (just to be fair and balanced, so have Bing and Yahoo). Yossi Matias, Niv Efron, and Yair Shimshoni at Google Labs Israel have made some fascinating observations based on Google Trends, including the following:

  • Over half of the most popular Google search queries are predictable in a 12 month ahead forecast, with a mean absolute prediction error of about 12%.
  • Nearly half of the most popular queries are not predictable (with respect to the model we have used).
  • Some categories have particularly high fraction of predictable queries; for instance, Health (74%), Food & Drink (67%) and Travel (65%).
  • Some categories have particularly low fraction of predictable queries; for instance, Entertainment (35%) and Social Networks & Online Communities (27%).
  • The trends of aggregated queries per categories are much more predictable: 88% of the aggregated category search trends of over 600 categories in Insights for Search are predictable, with a mean absolute prediction error of less than 6%.

You can read their full 32-page paper here.
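If you are wondering what a mean absolute prediction error of about 12% means operationally, here is a toy sketch of scoring a 12-month-ahead forecast that way. The seasonal-naive model and the numbers are mine, not Google’s:

    # Toy illustration of scoring a 12-month-ahead forecast by mean absolute
    # (percentage) prediction error, using a naive seasonal model: predict that
    # next year's monthly query volume simply repeats last year's.
    # All numbers are invented for illustration.

    last_year = [100, 90, 80, 85, 95, 110, 130, 125, 105, 95, 90, 120]
    this_year = [104, 88, 83, 90, 92, 118, 128, 131, 100, 97, 94, 113]

    forecast = last_year  # seasonal-naive 12-month-ahead forecast

    errors = [abs(f - a) / a for f, a in zip(forecast, this_year)]
    mape = sum(errors) / len(errors)

    print(f"mean absolute prediction error: {mape:.1%}")

Google’s models are presumably far more sophisticated, but the scoring idea is the same: the closer the forecast tracks the realized series, relative to its level, the lower the error.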

I’m not surprised at the predictability of human search behavior, especially for stable topics or even for unstable ones viewed as aggregates–one could argue the celebrities and scandals du jour are unpredictable but interchangeable. What I’m curious about is what we can do with this predictability.

In the SIGIR ’09 session on Interactive Search, Peter Bailey talked about “Predicting User Interests from Contextual Information”, analyzing the predictive performance of contextual information sources (interaction, task, collection, social, historic) for different temporal durations. Max Van Kleek wrote a nice summary of the talk at the Haystack blog. The paper doesn’t investigate seasonality (perhaps because they only looked at four months of data), but I’d imagine they would subsume it under the broader categories of historic and social context. But they do set a clear goal:

Postquery navigation and general browsing behaviors far outweigh direct search engine interaction as an information-gathering activity…Designers of Website suggestion systems can use our findings to provide improved support for post-query navigation and general browsing behaviors.

I hope Google is following a similar agenda. If you’re going to go through the trouble of predicting the future, then help make it a better one for users!


The Raging Debate Over The Link Economy

Arnon Mishkin wrote a post last Thursday on paidContent called “The Fallacy Of The Link Economy” that has been generating a lot of discussion, so I figured I’d join in the free-for-all. First, let me try to reduce each person’s argument to a direct quote that best sums up his position.

Arnon Mishkin:

The vast majority of the value gets captured by aggregators linking and scraping rather than by the news organizations that get linked and scraped.

Jeff Jarvis:

Links are worth what the recipient makes of them.

Mike Masnick:

It’s not the link alone that has value or the story alone that has value, but the overall process of building a community.

Erick Schonfeld:

If a news site or a blog can say enough interesting things enough times that news aggregators (or other sites) keep linking to them, then they can build up their brand and reader loyalty.

Sigh. I thought the health care debate was bad enough, but I suppose that almost all impassioned debates come down to opposing sides exchanging half-truths.

In Mishkin’s defense: news organizations are in a catch-22. Many have suggested that if a news organization doesn’t want its content showing up on aggregators’ sites, it simply has to modify robots.txt accordingly. But news organizations can only do so individually–which puts them in a prisoner’s dilemma. Anti-trust law prevents news organizations from collectively bargaining with those who aggregate their content. For all intents and purposes, they are forced to abide by the status quo.

In Jarvis’s defense (yes, I’m actually defending Jeff Jarvis!): there isn’t much point in producing content for which most of the value is captured in a teaser so small as to be covered under fair use rights. As he’s said elsewhere, newspapers are inefficient, and the industry will have to shrink a lot to be healthy.

In Masnick’s defense: I cite my own blog post (also inspired by one of his posts) about monetizing community because participation is inherently uncopiable. It’s hard for me to agree with him more strongly than that!

In Schonfeld’s defense: his argument sounds a lot like the “freemium” strategy, which has a respectable track record. In order to build a loyal customer base, you often need to give away free trials as teasers–and that’s effectively what happens when media sites make some of their content available through aggregators. And, as in the freemium model, the actual product has to be significantly more interesting than the free teaser to earn the consumer’s investment–whether that investment is in the form of money, attention, or loyalty.

So, do I agree with them all? Not exactly. Mishkin’s first prescription to news organizations should probably be to cut investment in undifferentiated content. Jarvis should acknowledge that the inability of news organizations to collectively bargain is unfair to them. Masnick–well, I basically do agree with him on the limited point he’s making. I suppose the strongest objection would be that not all media sites should be forced to become communities just because they’re hobbled in their ability to negotiate the monetization of the content they produce. And Schonfeld’s argument assumes the current link economy as a given–and one of the biggest points of contention is whether news organizations should be allowed to try to change that economy.

Sadly, I don’t see any of these guys giving the other an inch, which is why this discussion will probably continue unchanged for the foreseeable future. Hopefully the passion of the debate helps sell, um, papers.