The following is a guest post by Rich Marr. Rich is the Director of Engineering at Pixsta, where he’s been working on Empora.com, a consumer-facing site that enables browsing of fashion products according to image similarity (much like Modista). Pixsta is a growing start-up focused on turning our R&D team’s ongoing search and image processing work into workable products. The post is entirely his, with the exception of links that I have added so that readers can find company sites and Wikipedia entries.
It’s an often-heard urban myth that Eskimos have many words for snow, but that we only have one. This idea rings true because there’s value in being able to make precise distinctions when dealing with something important to you. You can find specialised vocabularies in cultures and sub-cultures all over the world, from surfers to stock brokers. When there’s value in describing something, you’ll usually find someone has created a word to do the job.
In search we often come across problems caused by insufficient vocabulary. People have an inconvenient habit of describing things in different ways, and some types of document are just plain difficult to describe.
This vocabulary problem has spawned armies of semantic search start-ups, providing search results based on inferred meaning rather than keyword matching. But semantic systems are text-driven, which means vocabulary problems remain. For example, you might overhear some lyrics and then use a search engine to look up who wrote the song, but how would you identify a piece of unknown instrumental music? Most people don’t have the vocabulary to describe music in a way that can identify it. This type of problem is addressed by search tools that use Media As a Search Term, which I’ll abbreviate to MAST.
MAST applications attempt to fill the vocabulary gap by extracting meaning from the query media. These apps break down into two rough groups, one concerned with identification (e.g. SnapTell, TinEye, MusicBrainz, Shazam, and the field of biometrics) and the other concerned with similarity search (e.g. Empora, Modista, Incogna, and Google Similar Images).
These apps use media-specific methods to interpret objects and extract data in a meaningful form for the given context. Techniques used here include wavelet decomposition, Fourier transforms, machine learning techniques, and a whole load of good old-fashioned pixel scraping.
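As a toy illustration of what this extraction stage produces, here is a minimal Python sketch that reduces an image to a coarse colour histogram, one of the simplest pixel-level descriptors. This is purely illustrative and isn’t how any of the systems named above actually work; real systems use far richer features.

```python
import numpy as np

def colour_histogram(image, bins=4):
    """Toy descriptor: a coarse RGB histogram, normalised so images
    of different sizes can be compared directly."""
    # image is an H x W x 3 array of 8-bit RGB values
    hist, _ = np.histogramdd(
        image.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / hist.sum()

# A stand-in for a real photo: random pixels
image = np.random.randint(0, 256, size=(120, 80, 3))
print(colour_histogram(image).shape)  # (64,) -- one descriptor vector per image
```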
The interpreted data is then made available in a searchable index: usually either a vector space that judges similarity using distance, or a conventional search index containing domain-specific ‘words’ extracted from the media collection. Both of these indexing mechanisms are a known quantity to the programmers, which leaves the main challenge as the extraction of useful meaning, conceptually similar to using natural language processing (NLP) to interpret text.
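To make the first option concrete, here is a minimal sketch of distance-based lookup in a vector-space index, assuming every item in the collection has already been reduced to a fixed-length descriptor vector like the one above. The brute-force search shown here is only for illustration; production systems use approximate nearest-neighbour structures to stay fast at scale.

```python
import numpy as np

def nearest(query, index, k=3):
    """Rank items in a vector-space index by Euclidean distance to the query."""
    distances = np.linalg.norm(index - query, axis=1)
    return np.argsort(distances)[:k]

# index: one descriptor vector per item in the collection
index = np.random.rand(1000, 64)
query = np.random.rand(64)
print(nearest(query, index))  # indices of the 3 closest items
```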
The challenge of extracting useful meaning largely comes down to establishing context, i.e. what exactly the user intends when they request an item’s identity or want to see a ‘similar’ item. What properties of a song identify it as the same song? Should live versions of the same song also match studio versions? Is the user more interested in the shape of a pair of shoes, or the colour, or the pattern?
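One hypothetical way a system can let context shift the meaning of ‘similar’ is to weight some dimensions of the descriptor more heavily than others. The split into ‘shape’ and ‘colour’ dimensions below is purely illustrative, not a description of how any real system partitions its features.

```python
import numpy as np

def weighted_distance(a, b, weights):
    """Distance that lets the application emphasise some properties over others."""
    return np.sqrt(np.sum(weights * (a - b) ** 2))

# Pretend the first 32 dimensions describe shape and the last 32 describe colour;
# these weights make shape matter four times as much as colour.
shape_first = np.concatenate([np.full(32, 2.0), np.full(32, 0.5)])
a, b = np.random.rand(64), np.random.rand(64)
print(weighted_distance(a, b, shape_first))
```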
Framed in the context of the difficulties of NLP, it’s clear that there’s not likely to be an immediate leap in the capabilities of these apps, but rather a gradual evolution. That said, these technologies are already good enough to surprise people and they’re quickly finding commercial use, which adds more resources and momentum. As our researchers chip away at these big challenges you’ll find MAST systems appearing in more and more places and becoming more and more important to the way people acquire and manage information.