
Qui, Quae, Quora

A friend of mine at Quora invited me into their private beta a couple of weeks ago, and by now I suspect that many of you are using it–especially since I’ve somehow managed to be the top hit for [quora invite]. Speaking of which, I appreciate that those of you with spare invites have continued sharing them with the stream of folks requesting them.

Anyway, if you haven’t heard about Quora yet, here’s a summary from the site:

Quora is a continually improving collection of questions and answers created, edited, and organized by everyone who uses it.

For those of you who studied Latin, the title of this post hopefully triggers at least a faint memory of relative pronouns and declensions. It’s been suggested that “quora” is a faux-Latin plural of quorum, which in turn is the genitive plural of qui. A less arcane possibility is that quora is intended to evoke the modern-day meaning of “quorum”: the minimum number of members an organization needs present to conduct business. Or perhaps “quora” is a contraction of “question or answer”, befitting a question-and-answer site.

How did I come up with all of these possibilities? Well, I did study Latin (semper ubi sub ubi!), but I found all of the above in the Quora entry entitled “What does Quora mean?” (membership required to view). Indeed, Quora is a great place to learn about Quora, as well as about Aardvark, Hunch, and other question-and-answer startups. Because it launched as a private beta and has been marketed virally among friends, the community–and thus its interests–skews heavily towards tech startups. Tellingly, people seem more inclined to compare it to the programmer-oriented Stack Overflow than to Yahoo! Answers, which speaks volumes about the current user base.

All that said, is Quora a useful site? It certainly offers useful information, but that’s a pretty low bar–after all, the open web already offers lots of useful information. The better question is what Quora offers that the open web does not.

In fact, the closed nature of the site puts it at a disadvantage relative to the open web: no inbound links, no search engine visibility, and so on. That said, I also haven’t seen spam or any of the other abuse endemic to the open web.

In any case, I don’t see Quora as a knowledge base of first resort–except possibly to learn more about software startups. Whether by design or by virtue of its early membership, the site has a very narrow scope.

The more interesting value proposition of Quora is the community it is creating. Quora facilitates conversation, much like a members-only blog where everyone uses their real names. It’s a well-designed social site, and I like that it revolves around substantive topics.

But I worry that Quora faces a catch-22. If the focus stays narrow, I can’t imagine it creating enough utility to justify its $86M valuation. But it’s not clear that Quora can scale up to a broader scope. Given what I’ve seen of question-and-answer sites, I’m skeptical.

What Quora does have going for it is an all-star team, and I’m sure they have big plans for the site. I’m very curious to see what those plans are, and how they play out.


Deadline to Register for HCIR Challenge

If you are interested in participating in the HCIR Challenge, please let me know as soon as possible–and in any case by April 30th. The New York Times and the LDC are graciously providing access to The New York Times Annotated Corpus for free (waiving the usual $300 fee), but we need to let the LDC know who will be participating. Hope to see lots of you presenting your systems at HCIR 2010!

Also, participants building their systems in Solr can take advantage of the scripts that Tommy Chheng has prepared. Of course, you are welcome to use your own system. Commercial software companies are especially encouraged to show off their HCIR wares!


Google Follow Finder

I know there’s lots of interesting stuff coming out at the Chirp Twitter developer conference this week, and I’m still catching up on it all. But I am happy to point folks to a Google Labs application that was announced this morning: Follow Finder.

It’s not the first application to suggest whom to follow on Twitter based on analysis of the social graph, but I’ve actually found its suggestions to be quite plausible. For example, it suggests @fredwilson, @cshirky, @mattcutts, @peteskomoroch, and @msftresearch as “tweeps” I should follow, and suggests that the following users have followers similar to mine: @endeca, @lemire, @yahooresearch, @googleresearch, and @mattcutts.

There’s a bit of an “everything sounds like Coldplay” effect (e.g., @fredwilson shows up in a lot of the searches I tried), but overall I’m impressed with the quality, especially compared to the other suggestion tools I’ve tried.
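For the curious, here is a minimal sketch of the simplest social-graph approach to such suggestions: rank candidate accounts by the overlap between their follower sets and mine. To be clear, this is purely illustrative–I have no insight into how Follow Finder actually computes its suggestions, and the accounts and follower sets below are made up.

```python
# Illustrative only: rank candidate accounts by the Jaccard overlap between
# their follower sets and mine. This is not necessarily how Follow Finder
# works; it is just the simplest social-graph similarity one might try.

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similar_accounts(my_followers, candidate_followers):
    """candidate_followers maps an account name to the set of its followers."""
    scores = {name: jaccard(my_followers, theirs)
              for name, theirs in candidate_followers.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data with made-up follower sets.
me = {"u1", "u2", "u3", "u4"}
candidates = {"@endeca": {"u1", "u2", "u5"},
              "@lemire": {"u2", "u3", "u4", "u6"}}
print(similar_accounts(me, candidates))
```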


HCIR 2010: An Update

I hope a number of you are planning to participate in the HCIR 2010 workshop! Here is a quick update:

More details–particularly about the Challenge–are forthcoming and will be posted on the HCIR Challenge page. Meanwhile, feel free to ask me questions, either publicly here or by email, and I’ll be happy to answer them. Looking forward to seeing many of you this August!


Fernanda Viégas and Martin Wattenberg Start a New Company: Flowing Media

This just in: Fernanda Viégas and Martin Wattenberg, two of the biggest rock stars in the world of data visualization (their long list of accomplishments includes Many Eyes and the Baby Name Voyager), have left IBM to form their own company, Flowing Media, headquartered in Cambridge, MA. As they wrote me, “if you know of anyone who has interesting data and would like help bringing it to life, spread the word.”

I am very excited for them, and am eagerly anticipating the work they will produce as free agents.


Go TunkRank!

I haven’t talked much about TunkRank in the past months, largely because Jason Adams, who stepped up to the TunkRank Implementation Challenge last year, has been leading the charge. Indeed, all I did, beyond lending my first syllable to its name, was to propose the measure and get it implemented “Tom Sawyer” style.

Since then:

And, most recently:

  • University of Oviedo professor Daniel Gayo-Avello published a research paper entitled “Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms”, based on a follower graph of 1.8M Twitter users, in which he reports:

    Lastly, there are one method clearly outperforming PageRank with respect to penalization of abusive users while still inducing plausible rankings: TunkRank. It is certainly similar to PageRank but it makes a much better job when confronted with “cheating”: aggressive marketers are almost indistinguishable from common users –which is, of course, desirable; and spammers just manage to grab a much smaller amount of the global available prestige and reach lower positions –although they still manage to be better positioned than average users. In addition to that, the ranking induced by TunkRank certainly agrees with that of PageRank, specially at the very top of the list, meaning that many users achieving good positions with PageRank should also get good positions with TunkRank. Thus, TunkRank is a highly recommendable ranking method to apply to social networks: it is simple, it induces plausible rankings, and severely penalizes spammers when compared to PageRank.

    You can read a summary version in his blog post, descriptively titled “Research on a 1.8M Twitter user graph. Conclusion: TunkRank is your best option.”

I’m excited that an idea I came up with on a whim (or perhaps out of excessive idealism) has taken on such a life of its own. And hey, I do work for a company that is into real-time search and that knows a thing or two about adversarial information retrieval. Hopefully I’ll find a way to apply TunkRank–or at least its intuition–in my own work. In the meantime, I offer those who have already done so my congratulations and gratitude.
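For readers who haven’t seen the original proposal, the measure itself is simple: a user’s influence is the sum, over their followers, of (1 + p times that follower’s influence) divided by the number of accounts the follower follows, where p is the assumed probability that a follower acts on a tweet. Here is a minimal sketch of that recurrence computed by straightforward iteration; the value of p, the iteration count, and the toy graph are all illustrative.

```python
# A toy sketch of the TunkRank recurrence: influence(X) is the sum, over
# followers Y of X, of (1 + p * influence(Y)) / |accounts Y follows|.
# The graph, p, and iteration count below are illustrative.

def tunkrank(friends, p=0.05, iterations=50):
    """friends maps each user to the set of accounts that user follows."""
    # Build the reverse map: for each user, who follows them.
    followers = {u: set() for u in friends}
    for user, followees in friends.items():
        for followee in followees:
            followers.setdefault(followee, set()).add(user)

    influence = {u: 0.0 for u in followers}
    for _ in range(iterations):
        influence = {
            x: sum((1.0 + p * influence[y]) / max(len(friends.get(y, ())), 1)
                   for y in followers[x])
            for x in followers
        }
    return influence

# Toy example: alice follows bob; bob follows alice and carol; carol follows alice.
toy = {"alice": {"bob"}, "bob": {"alice", "carol"}, "carol": {"alice"}}
print(tunkrank(toy))
```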


Build Your Own NYT Linked Data Application

Regular readers may recall hearing about the New York Times Annotated Corpus (which is the basis for the HCIR Challenge), and about the Times’s decision to publish its tags as Linked Open Data. Given that linked data applications are still a bit exotic, NYT semantic technologist and Noisy Community regular Evan Sandhaus published a tutorial and example application to help you build your own. If you’d like to get your feet wet in the semantic web (and can forgive the mixed metaphor), this is an excellent opportunity.
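If you’d like a taste of what working with linked data looks like before diving into Evan’s tutorial, here is a minimal sketch using the Python rdflib library. The resource URL is a placeholder, not an actual NYT URI–substitute one from the tutorial.

```python
# A minimal sketch of loading and inspecting a Linked Open Data resource
# with rdflib. The URL below is a placeholder, not a real NYT resource URI;
# substitute one from the tutorial.
from rdflib import Graph

g = Graph()
g.parse("http://example.org/some-topic.rdf")  # placeholder URI

# Print every (subject, predicate, object) triple in the fetched document.
for subj, pred, obj in g:
    print(subj, pred, obj)
```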


Guest Post: Information Retrieval using a Bayesian Model of Learning and Generalization

Dinesh Vadhia, CEO and founder of “item search” company Xyggy, has been an active member of the Noisy Community for at least a year, and it is with pleasure that I publish this guest post by him, University of Cambridge / CMU Professor Zoubin Ghahramani, and University of Cambridge / Gatsby Computational Neuroscience Unit researcher Katherine Heller. I’ve annotated the post with Wikipedia links in the hope of making it more accessible to readers without a background in statistics or machine learning.

People are very good at learning new concepts after observing just a few examples. For instance, a child will confidently point out which animals are “dogs” after having seen only a couple of examples of dogs before in their lives. This ability to learn concepts from examples and to generalize to new items is one of the cornerstones of intelligence. By contrast, search services currently on the internet exhibit little or no learning and generalization.

Bayesian Sets is a new framework for information retrieval based on how humans learn new concepts and generalize.  In this framework a query consists of a set of items which are examples of some concept. Bayesian Sets automatically infers which other items belong to that concept and retrieves them. As an example, for the query with the two animated movies, “Lilo & Stitch” and “Up”, Bayesian Sets would return other similar animated movies, like “Toy Story“.

How does this work? Human generalization has been intensely studied in cognitive science, and various models have been proposed based on some measure of similarity and feature relevance. Recently, Bayesian methods have emerged both as models of human cognition and as the basis of machine learning systems.

Bayesian Sets – a novel framework for information retrieval

Consider a universe of items, where the items could be web pages, documents, images, ads, social and professional profiles, publications, audio, articles, video, investments, patents, resumes, medical records, or any other class of items we may want to query.

An individual item is represented by a vector of features of that item.  For example, for text documents, the features could be counts of word occurrences, while for images the features could be the amounts of different color and texture elements.

Given a query consisting of a small set of items (e.g. a few images of buildings) the task is to retrieve other items (e.g. other images) that belong to the concept exemplified by the query.  To achieve the task, we need a measure, or score, of how well an available item fits in with the query items.

A concept can be characterized by using a statistical model, which defines the generative process for the features of items belonging to the concept.  Parameters control specific statistical properties of the features of items.  For example, a Gaussian distribution has parameters which control the mean and variance of each feature. Generally these parameters are not known, but a prior distribution can represent our beliefs about plausible parameter values.

The score

The score used for ranking the relevance of each item x given the set of query items Q compares the probabilities of two hypotheses. The first hypothesis is that the item x came from the same concept as the query items Q. For this hypothesis, compute the probability that the feature vectors representing all the items in Q and the item x were generated from the same model with the same, though unknown, model parameters. The alternative hypothesis is that the item x does not belong to the same concept as the query examples Q. Under this alternative hypothesis, compute the probability that the features in item x were generated from different model parameters than those that generated the query examples Q. The ratio of the probabilities of these two hypotheses is the Bayesian score at the heart of Bayesian Sets, and can be computed efficiently for any item x to see how well it “fits into” the set Q.
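In symbols, the score is the ratio of the marginal likelihoods under the two hypotheses described above (this is just a restatement of the preceding paragraph, with parameters integrated out):

```latex
\mathrm{score}(x) \;=\; \frac{p(x, Q)}{p(x)\,p(Q)} \;=\; \frac{p(x \mid Q)}{p(x)},
\qquad\text{where}\qquad
p(x \mid Q) \;=\; \int p(x \mid \theta)\, p(\theta \mid Q)\, d\theta .
```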

This approach to scoring items can be used with any probabilistic generative model for the data, making it applicable to any problem domain for which a probabilistic model of data can be defined. In many instances, items can be represented by a vector of features, where each feature can either be present or absent in the item. For example, in the case of documents the features may be words in some vocabulary, and a document can be represented by a binary vector x where element j of this vector represents the presence or absence of vocabulary word j in the document. For such binary data, a multivariate Bernoulli distribution can be used to model the feature vectors of items, where the jth parameter in the distribution represents the frequency of feature j. Using the beta distribution as the natural prior, the score can be computed extremely efficiently.
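To make the preceding paragraph concrete, here is a minimal sketch of the Bernoulli-beta version of the score, following the published Bayesian Sets derivation. The prior heuristic (setting the beta parameters from scaled feature means) and all variable names are illustrative, not a description of any production implementation.

```python
import numpy as np

def bayesian_sets_scores(X, query_rows, alpha=None, beta=None):
    """Log-scores of all items given a query set, for binary feature vectors.

    X          : (n_items, n_features) binary matrix, one item per row.
    query_rows : indices of the query items within X.
    alpha,beta : beta prior parameters per feature; if omitted, they are set
                 from scaled feature means (an illustrative heuristic).
    """
    X = np.asarray(X, dtype=float)
    if alpha is None or beta is None:
        mean = X.mean(axis=0)
        alpha = 2.0 * mean + 1e-6
        beta = 2.0 * (1.0 - mean) + 1e-6

    Q = X[query_rows]
    N = Q.shape[0]
    sum_q = Q.sum(axis=0)          # per-feature counts within the query set
    alpha_post = alpha + sum_q     # posterior beta parameters given the query
    beta_post = beta + N - sum_q

    # log score(x) = c + x . q : linear in x, so scoring the whole corpus
    # reduces to a single (sparse) matrix-vector multiplication.
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_post) - np.log(beta))
    q = np.log(alpha_post) - np.log(alpha) - np.log(beta_post) + np.log(beta)
    return c + X @ q

# Toy usage: 5 items, 4 binary features; query with items 0 and 1.
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 1]])
print(bayesian_sets_scores(X, [0, 1]))
```

Because the score is linear in an item’s feature vector, ranking an entire corpus against a query amounts to one matrix-vector product, which is what makes the exact computation fast enough for large collections.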

Automatically learns

An important aspect of Bayesian Sets is that it automatically learns which features are relevant from queries consisting of two or more items. For example, a movie query consisting of “The Terminator” and “Titanic” suggests that the concept of interest is movies directed by James Cameron, and therefore Bayesian Sets is likely to return other movies by Cameron. We feel that the power of queries consisting of multiple example items is unexploited in most search engines. Searching using examples is natural and intuitive for many situations in which the standard text search box is too limited to express the user’s information need, or infeasible for the type of data being queried.

Uses

The Bayesian Sets method has been applied to diverse problem domains including: unlabelled image search using low-level features such as color, texture and visual bag-of-words; movie suggestions using the MovieLens and Netflix ratings data; music suggestions using last.fm play count and user tag data; finding researchers working on similar topics using a conference paper database; searching the UniProt protein database with features that include annotations, sequence and structure information; searching scientific literature for similar papers; and finding similar legal cases, New York Times articles and patents.

Apart from web and document search, Bayesian Sets can also be used for ad retrieval through content matching, building suggestion systems (“if you liked this you will also like these”, which is about understanding the user’s mindset instead of the traditional “people who liked your choice also liked these”), and finding similar people based on profiles (e.g., for social networks, online dating, recruitment and security). All these applications illustrate the broad range of problems for which the patent-pending Bayesian Sets provides a powerful new approach to finding relevant information. Specific details of engineering features for particular applications can be provided in a separate post (or comments).

Interactive search box

An important aspect of our approach is that the search box accepts text queries as well as items, which can be dragged in and out of the search box. An implementation using patent data is at http://www.xyggy.com/patent.php. Enter keywords (e.g., “earthquake sensor”) and items relevant to the keywords are displayed. Drag an item of interest from the results into the search box and the relevance changes. When two or more items are added to the search box, the system discovers what they have in common and returns better results. Items can be toggled in/out of the search by clicking the +/- symbol, and items can be completely removed by dragging them out of the search box. Each change to an item in the search box automatically retrieves new relevant results. A future version will allow for explicit relevance feedback. Certain data sets also lend themselves to a faceted search interface, and we are working on a novel implementation in this area.

In our current implementation, items are dragged into the search box from the results list, but it is easy to see how they could be dragged from anywhere on the web or intranet.  For example, a New York Times reader could drag an article or image of interest into the search box to find other items of relevance. There is a natural affinity between an interactive search box as described and the new generation of touch devices.

Summary

Bayesian Sets demonstrates that intelligent information retrieval is possible, using a Bayesian statistical model of human learning and generalization. This approach, based on sets of items, encapsulates several novel principles. First, retrieving items based on a query can be seen as a cognitive learning problem, and we have used our understanding of human generalization to design the probabilistic framework. Second, retrieving items from large corpora requires fast algorithms, and the exact computations for the Bayesian scoring function are extremely fast. Finally, the example-based paradigm for finding coherent sets of items is a powerful new alternative and complement to traditional query-based search.

Finding relevant information from vast repositories of data has become ubiquitous in modern life.  We believe that our approach, based on cognitive principles and sound Bayesian statistics, will find many uses in business, science and society.


Get Unvarnished!

Earlier this week, I read about Unvarnished on TechCrunch and was extremely curious about this “Yelp for LinkedIn” making a bold play in the online reputation space. My curiosity should be no surprise to folks who have read my recent posts about distributed trust networks and solicited reviews. Anyway, I decided to go straight to the source and persuaded Unvarnished CEO Peter Kazanjy to invite me to the beta.

My impression so far: they’ve done a great job of collecting profiles, but the reviews themselves are pretty sparse. Moreover, most of the reviews I’ve seen so far are positive–hardly the bloodbath that the blogosphere has been predicting. Membership is relatively non-anonymous (you need to sign in through Facebook Connect), but your actual reviews are posted anonymously.

Since Unvarnished is trying to collect reviews, it’s not surprising that the way to join the beta…is to review someone. If you want to try it out and don’t mind reviewing me (anonymously) as the price of entry, let me know, and I’ll send you an invite (we’ll have to connect on Facebook first).

P.S. No, this is not an April Fools’ joke. At least not on my part!


CFP: HCIR 2010

The 4th Annual Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2010) will be held in conjunction with the IIiX 2010 conference in New Brunswick, New Jersey on August 22, 2010. We’re pleased to announce that our keynote speaker will be Dan Russell from Google. New for this year, we will also be running an HCIR Challenge based on The New York Times Annotated Corpus!

Web Site

Workshop Chairs

Program Chair

  • Rob Capra, University of North Carolina at Chapel Hill

Local Arrangements Chair

Sponsors

Background

HCIR combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities. The HCIR workshop has run annually since 2007. The workshop unites academic researchers and industrial practitioners working at the intersection of HCI and IR to develop more sophisticated models, tools, and evaluation metrics to support activities such as interactive information retrieval and exploratory search. It provides an opportunity for attendees to informally share ideas via posters, small group discussions and selected short talks.

New for 2010: the HCIR Challenge!

New this year, we will be running the HCIR Challenge! The aim of the challenge is to encourage HCIR researchers and practitioners to build and demonstrate effective information access systems. Challenge participants will have no-cost access to a large collection of almost two million newspaper articles with rich metadata generously provided for use in this challenge by The New York Times. The focus of participation is building systems (or using existing ones) to help people search the collection interactively. Entries will be judged by an expert panel based on HCIR criteria (specifically: effectiveness, efficiency, control, transparency, guidance, fun) and also judged by workshop attendees at the event. More information on the challenge will be made available on the workshop website.

Format

We invite 4-page papers that will be reviewed by an international program committee. Papers fall into two categories: position papers describing an idea, an opinion, or early-stage research; and research papers describing a completed research study, an implemented system, or a review of prior research. Papers will be judged based on relevance to HCIR, and idea diversity across all submissions may also be considered. Revised versions of accepted papers will be published on the workshop website. The workshop time will be used for what participants have told us they found most valuable in previous events: posters and directed group discussions.

We will select 4-6 papers for presentation in a workshop panel. All other attendees are strongly encouraged to present posters during the morning “poster boaster” session. Authors of selected HCIR Challenge papers will also have an opportunity to present their work orally at the event.

Our target is to have 50-75 participants.

Possible topics for discussion and presentation at the workshop include, but are not limited to:

  • Novel interaction techniques for information retrieval.
  • Modeling and evaluation of interactive information retrieval.
  • Exploratory search and information discovery.
  • Information visualization and visual analytics.
  • Applications of HCI techniques to information retrieval needs in specific domains.
  • Ethnography and user studies relevant to information retrieval and access.
  • Scale and efficiency considerations for interactive information retrieval systems.
  • Relevance feedback and active learning approaches for information retrieval.

Demonstrations of systems and prototypes are particularly welcome.

Important Dates

  • Mon June 14, 2010: Submission deadline for position papers (midnight Pacific Time)
  • Fri July 16, 2010: Decisions sent to authors
  • Fri July 23, 2010: Deadline for accepted participants to register
  • Fri July 30, 2010: Submission deadline for camera-ready copies

Workshop Fees and Travel Support

After careful consideration, we have decided to implement a $75 workshop fee. This will help offset the workshop costs and allow us to defray expenses for 1-2 graduate students.

We also appreciate the support of our corporate sponsors. It is our hope that the continuing success of this workshop will attract additional funding in future years. If your company or organization is interested in sponsoring travel scholarships, please let us know as soon as possible.