Dinesh Vadhia, CEO and founder of “item search” company Xyggy, has been an active member of the Noisy Community for at least a year, and it is with pleasure that I publish this guest post by him, University of Cambridge / CMU Professor Zoubin Ghahramani, and University of Cambridge / Gatsby Computational Neuroscience Unit researcher Katherine Heller. I’ve annotated the post with Wikipedia links in the hope of making it more accessible to readers without a background in statistics or machine learning.
People are very good at learning new concepts after observing just a few examples. For instance, a child will confidently point out which animals are “dogs” after having seen only a couple of examples of dogs before in their lives. This ability to learn concepts from examples and to generalize to new items is one of the cornerstones of intelligence. By contrast, search services currently on the internet exhibit little or no learning and generalization.
Bayesian Sets is a new framework for information retrieval based on how humans learn new concepts and generalize. In this framework a query consists of a set of items which are examples of some concept. Bayesian Sets automatically infers which other items belong to that concept and retrieves them. As an example, given a query consisting of the two animated movies “Lilo & Stitch” and “Up”, Bayesian Sets would return other similar animated movies, like “Toy Story”.
How does this work? Human generalization has been intensely studied in cognitive science and various models have been proposed based on some measure of similarity and feature relevance. Recently, Bayesian methods have emerged as models of both human cognition and as the basis of machine learning systems.
Bayesian Sets – a novel framework for information retrieval
Consider a universe of items, where the items could be web pages, documents, images, ads, social and professional profiles, publications, audio, articles, video, investments, patents, resumes, medical records, or any other class of items we may want to query.
An individual item is represented by a vector of features of that item. For example, for text documents, the features could be counts of word occurrences, while for images the features could be the amounts of different color and texture elements.
Given a query consisting of a small set of items (e.g. a few images of buildings) the task is to retrieve other items (e.g. other images) that belong to the concept exemplified by the query. To achieve the task, we need a measure, or score, of how well an available item fits in with the query items.
A concept can be characterized by using a statistical model, which defines the generative process for the features of items belonging to the concept. Parameters control specific statistical properties of the features of items. For example, a Gaussian distribution has parameters which control the mean and variance of each feature. Generally these parameters are not known, but a prior distribution can represent our beliefs about plausible parameter values.
The score used for ranking the relevance of each item x given the set of query items Q compares the probabilities of two hypotheses. The first hypothesis is that the item x came from the same concept as the query items Q. For this hypothesis, compute the probability that the feature vectors representing all the items in Q and the item x were generated from the same model with the same, though unknown, model parameters. The alternative hypothesis is that the item x does not belong to the same concept as the query examples Q. Under this alternative hypothesis, compute the probability that the features in item x were generated from different model parameters than those that generated the query examples Q. The ratio of the probabilities of these two hypotheses is the Bayesian score at the heart of Bayesian Sets, and can be computed efficiently for any item x to see how well it “fits into” the set Q.
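In symbols (following the notation of the Bayesian Sets paper, with θ denoting the unknown model parameters), the ratio described above is:

```latex
\mathrm{score}(x) \;=\; \frac{p(x, Q)}{p(x)\, p(Q)} \;=\; \frac{p(x \mid Q)}{p(x)},
\qquad
p(x \mid Q) = \int p(x \mid \theta)\, p(\theta \mid Q)\, d\theta .
```

The numerator integrates over parameters shared between x and Q; the denominator treats them as independent, so the ratio is large exactly when x “fits into” the set Q.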
This approach to scoring items can be used with any probabilistic generative model for the data, making it applicable to any problem domain for which a probabilistic model of data can be defined. In many instances, items can be represented by a vector of features, where each feature is either present or absent in the item. For example, in the case of documents the features may be words in some vocabulary, and a document can be represented by a binary vector x where element j of this vector represents the presence or absence of vocabulary word j in the document. For such binary data, a multivariate Bernoulli distribution can be used to model the feature vectors of items, where the jth parameter in the distribution represents the frequency of feature j. Using the beta distribution as the natural conjugate prior, the score can be computed extremely efficiently in closed form.
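As a rough sketch of that closed form: the following implements the Bernoulli-Beta score from the Bayesian Sets paper. The function name and the heuristic default prior (pseudo-counts proportional to empirical feature frequencies) are our own illustrative choices, not details of Xyggy’s system.

```python
import numpy as np

def bayesian_sets_score(X, query_idx, alpha=None, beta=None):
    """Score every item in the binary matrix X (n_items x n_features)
    against the query set, under a Bernoulli likelihood with a Beta
    prior per feature. Returns one log-score per item; higher means a
    better fit with the query. The default prior is a common heuristic
    (pseudo-counts scaled to feature means), not the only choice."""
    X = np.asarray(X, dtype=float)
    if alpha is None:
        m = X.mean(axis=0)
        alpha = 2.0 * m + 1e-6          # prior pseudo-counts of "present"
        beta = 2.0 * (1.0 - m) + 1e-6   # prior pseudo-counts of "absent"
    Q = X[query_idx]                     # feature vectors of the query items
    N = Q.shape[0]
    alpha_t = alpha + Q.sum(axis=0)      # posterior pseudo-counts given Q
    beta_t = beta + N - Q.sum(axis=0)
    # The log score is linear in x: log score(x) = c + x . q
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    q = (np.log(alpha_t) - np.log(alpha)
         - np.log(beta_t) + np.log(beta))
    return c + X @ q
```

Because the log score reduces to a constant plus an inner product x·q, ranking an entire corpus amounts to a single (sparse) matrix-vector multiplication, which is why the method scales to large collections.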
An important aspect of Bayesian Sets is that it automatically learns which features are relevant from queries consisting of two or more items. For example, a movie query consisting of “The Terminator” and “Titanic” suggests that the concept of interest is movies directed by James Cameron, and therefore Bayesian Sets is likely to return other movies by Cameron. We feel that the power of queries consisting of multiple example items is unexploited in most search engines. Searching using examples is natural and intuitive for many situations in which the standard text search box is too limited to express the user’s information need, or infeasible for the type of data being queried.
The Bayesian Sets method has been applied to diverse problem domains including: unlabelled image search using low-level features such as color, texture and visual bag-of-words; movie suggestions using the MovieLens and Netflix ratings data; music suggestions using last.fm play count and user tag data; finding researchers working on similar topics using a conference paper database; searching the UniProt protein database with features that include annotations, sequence and structure information; searching scientific literature for similar papers; and finding similar legal cases, New York Times articles and patents.
Apart from web and document search, Bayesian Sets can also be used for ad retrieval through content matching, building suggestion systems (“if you liked this you will also like these”, which is about understanding the user’s mindset rather than the traditional “people who liked your choice also liked these”) and finding similar people based on profiles (e.g. for social networks, online dating, recruitment and security). All these applications illustrate the vast range of problems for which the patent-pending Bayesian Sets provides a powerful new approach to finding relevant information. Specific details of engineering features for particular applications can be provided in a separate post (or comments).
Interactive search box
An important aspect of our approach is that the search box accepts text queries as well as items, which can be dragged in and out of the search box. An implementation using patent data is at http://www.xyggy.com/patent.php. Enter keywords (e.g., “earthquake sensor”) and items relevant to the keywords are displayed. Drag an item of interest from the results into the search box and the relevance changes. When two or more items are added to the search box, the system discovers what they have in common and returns better results. Items can be toggled in/out of the search by clicking the +/- symbol, and can be removed entirely by dragging them out of the search box. Each change to an item in the search box automatically retrieves new relevant results. A future version will allow for explicit relevance feedback. Certain data sets also lend themselves to a faceted search interface, and we are working on a novel implementation in this area.
In our current implementation, items are dragged into the search box from the results list, but it is easy to see how they could be dragged from anywhere on the web or intranet. For example, a New York Times reader could drag an article or image of interest into the search box to find other items of relevance. There is a natural affinity between an interactive search box as described and the new generation of touch devices.
Bayesian Sets demonstrates that intelligent information retrieval is possible, using a Bayesian statistical model of human learning and generalization. This approach, based on sets of items, encapsulates several novel principles. First, retrieving items based on a query can be seen as a cognitive learning problem, and we have used our understanding of human generalization to design the probabilistic framework. Second, retrieving items from large corpora requires fast algorithms, and the exact computations for the Bayesian scoring function are extremely fast. Finally, the example-based paradigm for finding coherent sets of items is a powerful new alternative and complement to traditional query-based search.
Finding relevant information from vast repositories of data has become ubiquitous in modern life. We believe that our approach, based on cognitive principles and sound Bayesian statistics, will find many uses in business, science and society.
67 replies on “Guest Post: Information Retrieval using a Bayesian Model of Learning and Generalization”
Yes, the Greiff paper was a question about the mathematics of Bayesian sets, not a question about relevance feedback.
And I still think that the only difference between query by example and relevance feedback is at the stage of the process at which they are executed. Conceptually, I mean. Marking a document as relevant is, imho, no different than saying “add this document to the next query that I execute”. But maybe I’m just splitting hairs.
But I do think our conversation is becoming complex enough, and with enough subtle subthreads, that it’s becoming difficult to continue properly. Perhaps in person? Will you be at SIGIR?
Oh, and @Daniel: there is a middle ground between taking exactly the features that are given to you on the one hand, and trying to draw water from a stone on the other. Look at some of the old Della Pietra work on feature induction. That’s where you create new features out of existing raw data, in a task-driven manner.
Re feature induction / selection: I hear you. I can imagine applying a technique like streamwise feature selection toward this end. But even there the richness of the raw input matters.
I probably won’t make it to SIGIR this year, but I hope you guys carry this conversation forward there–and continue it at the HCIR workshop a few weeks later!
But even there the richness of the raw input matters.
Yes, but if you’ve got the audio of each of the songs listed above, then you should have enough raw input to help with my WCS seeking task. The song itself contains the song itself.
I’m going to try and make it to IIiX as well as HCIR.
@ jeremy & daniel wrt feature engineering
The following are the data sets that we created demos from at Xyggy:
– Content-based image search using unlabelled and labelled images with Corel and Flickr pictures
– last.fm listener playcount and tag data by song to provide a music suggestion service (“if you liked this you will also like these”, which is about understanding the user’s mindset instead of the traditional “people who liked your choice also liked these”)
– Netflix ratings data to provide a movie suggestion service
– Patents using patent bibliographic data only
– Legal cases with citations
– New York Times annotated corpus consisting of 1.8m articles from 1987 to 2007
The feature vectors are defined and created prior to the search indexes. Some of the above data sets are rich (e.g. images, patents and NYT) and others, such as last.fm and Netflix, would be difficult to classify as rich data. Rich or not, all the above demos worked very well. With Flickr images we also built a version that combined the low-level features (e.g. color, texture, etc.) with available text data (labels, tags, user annotations and so on). This mixing of types could also be applied to music/audio data. With the last.fm data two versions were built: one with playcount data only and the other with playcount plus tag data. With text data, you can create new ‘custom’ features based on concepts or semantic relationships. In general, it rarely hurts to have many features — both directly from the raw data and custom ones — as long as they are at least plausibly relevant to search. As can be seen, there is plenty of flexibility and creativity available for feature engineering using Bayesian Sets.
I’m planning to be in the SF area for most of the summer. If you’re around perhaps we can meet up there.
Thank you, Dinesh and Katherine, for all your patient explanations!
Katherine, let me know when you get here! I’m around.
Great! Dinesh forwarded your email to me. I’ll let you know when I’m in town 🙂
I posted a reply on Google’s Research blog that references my concerns about whether this method should be compared to human concept learning until it can better deal with polysemy.
Here are the details of my test. I was after patents related to regression of disease, such as the regression of cancer. The search term was “regression”, and this is what my search looked like after I added four patents intended to refine the meaning of regression in this context.
5414019 Regression of mammalian carcinomas
5356817 Method for detecting the onset, progression and regression of gynecologic cancers
4544305 Aiding the regression of neoplastic diseases
4757056 Method for tumor regression in rats, mice and hamsters
When you do this, you will see that 6 of the top 12 patents returned use the meaning of regression in the statistical sense rather than the disease sense. I would be interested in the bloggers’ comments on this anomaly. In fairness, when I run the search for “regression” on Google, it also gives the more frequent usage related to statistics as the predominant result. I would need to qualify it as “regression cancer disease” before Google would understand what I was after. Yet humans would easily be able to learn what I was after in my search with just a few added items, by just looking at the titles as in this example. The problem appears to be that “statistical regression” was mentioned in the methods section of some of these patents even though the predominant idea in all of them was “regression” in the disease sense.
I do not believe that Google’s research blog will publish my last comment, which was an answer to you, because I mentioned the name of your product. All that I said was that a “bag of words” approach will not be powerful enough to deal with polysemy, whereas an approach that can handle interactions would be. In other words, the statistical interaction of frequency counts of “regression” and “tumor” would be a feature that would uniquely separate the two meanings of “regression” in this example even with a small training sample size. In order to do this, though, one would need massive parallel processing capabilities to compute the very large number of interaction features.
The patent demo uses bibliographic data only which generally represents about 10% of the available textual information (for non-design patents). We are using a sophisticated bag-of-words method for the feature vectors but using all available textual data would improve the search results further. With Bayesian Sets you have the flexibility to define the features of the feature vectors to fit your needs. For example, you can use tri-grams instead of bag-of-words and also include concepts and semantic relationships.
Wrt the demo, experiment by toggling on/off the keyword search line in the search box as well as toggling each item on/off to improve relevance. Also, take a look at the psychological research literature cited in the Bayesian Sets paper.
I appreciate your response. I did have a look at the Bayesian Sets paper. I do not know enough about this method to know if it would perform better under the circumstances that you list. I am not sure if the trigram would be generally sensitive to the type of interaction that I mention. I am also not sure if adding the complete text would help with this anomaly that I find. These are empirical questions. However, I congratulate you and the rest of the team on a very interesting approach to a very difficult problem.
I would, though, again mention that this does not seem to be a model of concept or category learning, such as when a child learns the category of ‘dog’. It seems more to be a model for cued recall of the contents of previously stored categories. It is more like asking a professor who is an expert on patents to give me a reference to all patents that concern the concept of ‘dog’ or ‘regression’. If she stumbles on ‘regression’ because of the multiple meanings, you might give a couple of examples and then she suddenly will know that you mean regression related to disease or tumor. Humans can obviously do this type of task with very few examples, but the original learning of the category actually seems to require a much larger training sample size. Please see this paper for example:
Hoffman_&_Rehder_10.pdf
This paper presents a couple of experiments with adult humans where category learning requires at least 150-250 training trials (and in some cases it may not have even asymptoted yet).
Human category learning is obviously a complex issue. In terms of how quickly adults learn new categories, that really depends on the category being learned. See for example Nosofsky, Gluck et al.’s replication of the Shepard, Hovland, and Jenkins data: http://love.psy.utexas.edu/~love/papers/love_etal_2004.pdf — figure 3. Some categories people learn very quickly, while others are much more complicated and it’s not even clear that they ever completely learn them.
There has been a lot of recent work on Bayesian models for human category learning and generalization, by Tenenbaum, Griffiths and others; we had a paper recently: http://www.gatsby.ucl.ac.uk/~heller/prior_knowledge_nips_revision.pdf — and we have another paper in submission which compares human generalization from one versus two training examples — and these Bayesian models seem to match up with human data quite well. While much of the discussion in these papers focuses on the prior, the basic way a category is modeled, or learned, is the same way in which we model a category, or concept, in Bayesian Sets.
The congratulations are really appreciated. Hopefully, down the road we or others will generate empirical data that compares the results of using different feature vectors on the same data set. Katherine has jumped in to answer the category learning portion of the question.
Thanks for the references. I agree that human category learning is complex and the actual rate of learning must depend upon individual difference factors as well as the nature of the category, such as the number of features. The same may be said about animal and machine learning of categories. A complex category with many features like ‘dog’ has features that overlap with many other similar categories like ‘cat’ and ‘goat’ etc. If you have an example of where something like this can be learned by a human learner with just a couple of training trials, I would be very interested.
I am just not sure that category learning is happening with one or two examples in Xyggy, as it seems to me that the cued recall of the previously stored category is what is being aided by the one or two examples in the query. That is not to discount Xyggy as being a potentially new and interesting retrieval system that may have benefits beyond LDA and Google Sets. I do not know enough about Bayesian Sets to comment on that. I do know a lot about how many trials it takes a supervised machine learning system to accurately learn a category with many features that overlap with similar categories, as that has been our business for many years and in multitudes of applications. I can assure you that this cannot be accomplished with the typical high-dimensional, multicollinear business data that we see daily with just a couple of training examples.
The entire query in Xyggy is formed from examples; the text query is just a quick way of indexing a few examples.
In terms of “category learning”, if you give people even one or two examples they will use those examples to learn, or generalize, about the category. They may not always be able to generalize completely correctly, but that doesn’t mean that they aren’t learning. Just because someone is confused about whether a whale is a mammal or a fish doesn’t mean they don’t know anything about mammals or fish.
References for people learning novel categories from one or just a couple of examples: Xu and Tenenbaum (2007), “Word learning as Bayesian inference,” Psychological Review; and Pinker (1999), How the Mind Works.
I agree with what you said, that the process of category learning is happening with one or two examples. The more interesting questions are how long it takes to reach peak performance, whether you have a model for this process that performs as well as human learners (or better) in reaching its peak performance, and whether this process actually works well in a test environment without labeled data. Everything else is theoretical and academic, as I am not sure that we yet have an understanding of ‘How the Mind Works’ and whether Bayesian learning will ultimately prove to be helpful here.