After my recent posts about Google’s similarity browsing for images, a colleague reached out to educate me about some recent advances in image retrieval. This colleague is involved with an image retrieval startup and felt uncomfortable posting comments publicly, so we agreed that I would paraphrase them in a post under my own name. I thus accept accountability for the post, but cannot take credit for its expertise or originality.
Some of the discussion in the comment threads mentioned the scale-invariant feature transform (SIFT), an algorithm that detects and describes local features in images. What I don’t believe anyone mentioned is that this approach is patented, which is certainly a concern for anyone with a commercial interest in image retrieval.
There’s also the matter of scaling in a different sense–that is, handling large sets of images. People interested in this problem may want to look at “Scalable Recognition with a Vocabulary Tree” by David Nistér and Henrik Stewénius. They map image features to “visual words” using a hierarchical k-means approach. While mapping image retrieval onto text retrieval techniques is not new, their large-vocabulary approach was novel: it significantly improved scalability while remaining robust to occlusion and to changes in viewpoint and lighting. The paper has been highly cited.
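To make the vocabulary-tree idea concrete, here is a minimal sketch of hierarchical k-means quantization. This is not the authors’ implementation–the function names, the tiny Lloyd-style k-means, and the parameters (branching factor, depth) are all my own simplifications for illustration. The key point it shows is that a descriptor is quantized by walking the tree, so lookup cost grows with depth rather than with vocabulary size.

```python
import numpy as np

def kmeans(points, k, iters=10, rng=None):
    # Tiny Lloyd-style k-means: random init from the data, then
    # alternate nearest-center assignment and center recomputation.
    rng = rng or np.random.default_rng(0)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(0)
    return centers, labels

def build_tree(points, branch=3, depth=2, rng=None):
    # Recursively cluster the descriptors: each node splits its
    # points into `branch` children, up to `depth` levels.
    if depth == 0 or len(points) < branch:
        return None
    centers, labels = kmeans(points, branch, rng=rng)
    children = [build_tree(points[labels == j], branch, depth - 1, rng)
                for j in range(branch)]
    return {"centers": centers, "children": children}

def quantize(tree, desc):
    # Walk the tree greedily; the path of branch indices taken
    # from root to leaf serves as the descriptor's "visual word".
    path, node = [], tree
    while node is not None:
        j = int(((node["centers"] - desc) ** 2).sum(1).argmin())
        path.append(j)
        node = node["children"][j]
    return tuple(path)
```

In a real system the vocabulary is far larger (the paper uses on the order of a million leaves), and images are then indexed by their bags of visual words exactly as documents are indexed by terms.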
But there are problems with this approach in practice. For example, images from cell phone cameras are low-quality and blurry, and Nistér and Stewénius’s approach is unfortunately not resilient to blur. Accuracy and latency are also challenges.
In general, some of the conclusions in the vision literature about which features work best don’t seem to hold up outside the lab. The reason may be that the images used for such experiments are of much higher quality than those encountered in the field–particularly cell phone images.
An alternative to SIFT is “gist”, an approach based on global descriptors. This approach is not resilient to occlusion or rotation, but it scales much better than SIFT, and it may serve well for duplicate detection. Failing to handle duplicates is, in my view, a deal-breaker for applications like similarity browsing–and it certainly is a problem for Google’s current approach.
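To illustrate how a global descriptor supports cheap duplicate detection, here is a deliberately simplified sketch. The real gist descriptor pools oriented Gabor-filter responses over a spatial grid; the stand-in below just averages intensities over a coarse grid, and the function names and threshold are my own inventions. What it shows is the scaling property: each image reduces to one short vector, and near-duplicates are simply vectors that lie close together.

```python
import numpy as np

def tiny_global_descriptor(img, grid=8):
    # Simplified global descriptor: mean intensity over a grid x grid
    # partition of the image. (Real gist pools oriented filter
    # responses instead of raw intensities.)
    h, w = img.shape
    img = img[: h - h % grid, : w - w % grid]  # crop to a multiple of grid
    cells = img.reshape(grid, img.shape[0] // grid,
                        grid, img.shape[1] // grid).mean(axis=(1, 3))
    desc = cells.ravel()
    # Normalize so the comparison is insensitive to global brightness.
    return (desc - desc.mean()) / (desc.std() + 1e-8)

def is_duplicate(img_a, img_b, threshold=0.5):
    # Near-duplicate images have nearby descriptors; the threshold
    # here is an arbitrary choice for illustration.
    d = np.linalg.norm(tiny_global_descriptor(img_a)
                       - tiny_global_descriptor(img_b))
    return bool(d < threshold)
```

Because each image maps to a single fixed-length vector, a large collection can be screened for duplicates with standard nearest-neighbor indexing rather than pairwise local-feature matching.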
In short, image retrieval is still a highly active area, and different approaches are optimized for different problems. I was delighted to have a recent guest post from AJ Shankar of Modista about their approach, and I encourage others to contribute their thoughts.