After my recent posts about Google’s similarity browsing for images, a colleague reached out to me to educate me about some of the recent advances in image retrieval. This colleague is involved with an image retrieval startup and felt uncomfortable posting comments publicly, so we agreed that I would paraphrase them in a post under my own name. I thus accept accountability for the post, but cannot take credit for expertise or originality.
Some of the discussion in the comment threads mentioned scale-invariant feature transform (SIFT), an algorithm to detect and describe local features in images. What I don’t believe anyone mentioned is that this approach is patented–certainly a concern for people with commercial interest in image retrieval.
There’s also the matter of scaling in a different sense–that is, handling large sets of images. People interested in this problem may want to look at “Scalable Recognition with a Vocabulary Tree” by David Nistér and Henrik Stewénius. They map image features to “visual words” using a hierarchical k-means approach. While mapping image retrieval to text retrieval approaches is not new, their large-vocabulary approach was novel and made significant improvement to scalability, as well as being robust to occlusion, viewpoint and lighting change. The paper has been highly cited.
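To make the "visual words" idea concrete, here is a toy sketch of the vocabulary-tree mechanics, not Nistér and Stewénius's actual implementation (which uses large branching factors, inverted files, and TF-IDF weighting). Descriptors are recursively clustered with plain k-means, and a new descriptor is quantized to a word by greedily descending the tree; all names and parameters here are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=10, rng=None):
    """Plain Lloyd's k-means: returns centroids and point assignments."""
    rng = rng if rng is not None else np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_tree(X, branch=3, depth=2, rng=None):
    """Recursively cluster descriptors; each level refines the partition."""
    if depth == 0 or len(X) < branch:
        return None  # leaf: quantization stops here
    centers, labels = kmeans(X, branch, rng=rng)
    children = [build_tree(X[labels == j], branch, depth - 1, rng)
                for j in range(branch)]
    return {"centers": centers, "children": children}

def quantize(node, x, path=()):
    """Greedy descent: the root-to-leaf path is the descriptor's visual word."""
    if node is None:
        return path
    j = int(np.linalg.norm(node["centers"] - x, axis=1).argmin())
    return quantize(node["children"][j], x, path + (j,))
```

The point of the hierarchy is that quantizing a descriptor costs only `branch × depth` distance computations rather than a comparison against every word, which is what lets the vocabulary grow very large.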
But there are problems with this approach in practice. For example, images from cell phone cameras are low-quality and blurry, and Nistér and Stewénius’s approach is unfortunately not resilient to blur. Accuracy and latency are also challenges.
In general, some of the vision literature on which features work best doesn’t seem to hold up well outside the lab. One likely reason is that the images used in such experiments are of much higher quality than those encountered in the field–particularly images from cell phone cameras.
An alternative to SIFT is “gist”, an approach based on global descriptors. It is not resilient to occlusion or rotation, but it scales much better than SIFT, and it may serve well for duplicate detection. Undetected duplicates are, in my view, a deal-breaker for applications like similarity browsing–and they are certainly a problem for Google’s current approach.
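The global-descriptor idea can be sketched very simply. Real GIST pools oriented Gabor-filter responses over a coarse spatial grid; the stand-in below just averages pixel intensity over a grid, which is far cruder but exercises the same retrieval mechanics: one fixed-length vector per image, compared by Euclidean distance. The function names and the threshold are illustrative assumptions, not part of any published system:

```python
import numpy as np

def global_descriptor(img, grid=4):
    """Crude stand-in for GIST: mean intensity over a coarse spatial grid.
    (Real GIST pools Gabor responses; the comparison step is the same.)"""
    h, w = img.shape
    cells = [img[i * h // grid:(i + 1) * h // grid,
                 j * w // grid:(j + 1) * w // grid].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)

def near_duplicates(descs, query, thresh=0.05):
    """Indices of images whose descriptor lies within `thresh` of the query."""
    dists = np.linalg.norm(descs - query, axis=1)
    return np.where(dists < thresh)[0]
```

Because every image reduces to one short vector, a collection can be scanned (or indexed with any nearest-neighbor structure) far more cheaply than matching sets of local SIFT features, which is the scalability advantage mentioned above.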
In short, image retrieval is still a highly active area, and different approaches are optimized for different problems. I was delighted to have a recent guest post from AJ Shankar of Modista about their approach, and I encourage others to contribute their thoughts.
6 replies on “More Thoughts on Image Retrieval”
Web-scale scalability is the primary challenge for image search today. GIST descriptors describe scenes well and are becoming popular. Our approach at Xyggy is that image features should range from the simple (e.g., color and texture, as in our image demo) to the complex (e.g., color, texture, edge, structure, layout, etc.), including GIST. If available, textual metadata can be added to the image features too.
Could you explain what is meant by “duplicate detection is a deal-breaker for applications like similarity browsing”?
Dinesh, I should have been more specific. I meant that, for applications like Google Image Search whose collections contain large numbers of duplicate and near-duplicate images, similarity browsing is pretty useless without deduping, since it mostly shows you duplicate images of the one you start from.
With very large image collections, identical images should be removed but near-identical ones should be retained. There are users who will want to see near-identical images and others who won’t. Ideally, you want to offer an interface that allows the user to control/select the amount of similarity required.
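One way to realize that slider is greedy “leader” grouping: each result joins the first earlier result within the current threshold, so raising the threshold merges more near-identical images into one representative, while a near-zero threshold collapses only exact duplicates. This is a hypothetical sketch of the interface idea, not any vendor’s implementation:

```python
import numpy as np

def collapse_results(descs, thresh):
    """Greedy leader grouping over descriptor vectors.

    Returns {leader_index: [member_indices]}; the leaders are what
    the interface would display, with members folded underneath.
    """
    leaders = []
    groups = {}
    for i, d in enumerate(descs):
        for L in leaders:
            if np.linalg.norm(descs[L] - d) < thresh:
                groups[L].append(i)  # near-duplicate of an earlier result
                break
        else:
            leaders.append(i)        # new representative image
            groups[i] = [i]
    return groups
```

A single pass like this is linear in the number of leaders per result, which is adequate for collapsing one page of results; a whole-collection dedupe would want an index rather than pairwise scans.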
Indeed, that’s a nice first step. Even better would be some clustering–ideally some kind of faceting–in order to make exploration effective.
You should look at Idée Inc. in Toronto and their TinEye reverse image search system.
Thanks–the email I paraphrased did cite some vendors, but I deliberately excised them.