The following is a guest post by Rich Marr. Rich is the Director of Engineering at Pixsta, where he’s been working on Empora.com, a consumer-facing site that enables browsing of fashion products according to image similarity (much like Modista). Pixsta is a growing start-up focused on turning our R&D team’s ongoing search and image processing work into workable products. The post is entirely his, with the exception of links that I have added so that readers can find company sites and Wikipedia entries.
It’s an often-heard urban myth that Eskimos have many words for snow, but that we only have one. This idea rings true because there’s value in being able to make precise distinctions when dealing with something important to you. You can find specialised vocabularies in cultures and sub-cultures all over the world, from surfers to stock brokers. When there’s value in describing something, you’ll usually find someone has created a word to do the job.
In search we often come across problems caused by insufficient vocabulary. People have an inconvenient habit of describing things in different ways, and some types of document are just plain difficult to describe.
This vocabulary problem has spawned armies of semantic search start-ups that provide search results based on inferred meaning rather than keyword matching. But semantic systems are text-driven, which means the vocabulary problems remain. For example, you might overhear some lyrics and then use a search engine to look up who wrote the song, but how would you identify a piece of unknown instrumental music? Most people don't have the vocabulary to describe music in a way that can identify it. This type of problem is addressed by search tools that use Media As a Search Term, which I'll abbreviate to MAST.
MAST applications attempt to fill the vocabulary gap by extracting meaning from the query media. These apps break down into two rough groups, one concerned with identification (e.g. SnapTell, TinEye, MusicBrainz, Shazam, and the field of biometrics) and the other concerned with similarity search (e.g. Empora, Modista, Incogna, and Google Similar Images).
These apps use media-specific methods to interpret objects and extract data in a meaningful form for the given context. Techniques used here include wavelet decomposition, Fourier transforms, machine learning, and a whole load of good old-fashioned pixel scraping.
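To make the Fourier-transform idea concrete, here is a toy sketch (not any product's actual pipeline) of turning a raw audio signal into a crude searchable feature: a naive DFT followed by picking the strongest frequency bin. Everything here, including the one-number "fingerprint", is an invented simplification for illustration.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform magnitudes (O(n^2), fine for a sketch).
    Returns magnitudes for the first n//2 bins (the non-redundant half)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def dominant_bin(signal):
    """A crude one-number 'fingerprint': the index of the strongest frequency bin.
    Real fingerprinting systems extract far richer, time-localised features."""
    mags = dft_magnitudes(signal)
    return max(range(len(mags)), key=mags.__getitem__)

# A pure tone completing 5 cycles over 64 samples should peak at bin 5.
tone = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(dominant_bin(tone))  # → 5
```

A real system would slice the signal into overlapping windows and extract many such features per window, but the shape of the problem is the same: raw media in, compact searchable numbers out.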
The interpreted data is then made available in a searchable index, usually either a vector space that judges similarity by distance, or a conventional search index containing domain-specific 'words' extracted from the media collection. Both of these indexing mechanisms are a known quantity to programmers, which leaves the main challenge as the extraction of useful meaning, conceptually similar to using natural language processing (NLP) to interpret text.
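The vector-space option can be sketched in a few lines of Python. This is a minimal linear-scan nearest-neighbour index; the item names and feature values are invented for illustration, and production systems would use approximate structures (k-d trees, locality-sensitive hashing) rather than a full scan.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class VectorIndex:
    """Toy vector-space index: stores (id, feature-vector) pairs and
    answers similarity queries by a linear nearest-neighbour scan."""
    def __init__(self):
        self.items = []  # list of (item_id, vector)

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def query(self, vector, k=3):
        """Return the k item ids closest to the query vector."""
        ranked = sorted(self.items, key=lambda iv: euclidean(iv[1], vector))
        return [item_id for item_id, _ in ranked[:k]]

# Hypothetical fashion-product features, e.g. (hue, saturation, shape score).
index = VectorIndex()
index.add("red-heels",    (0.9, 0.8, 0.1))
index.add("red-flats",    (0.9, 0.6, 0.2))
index.add("blue-sneaker", (0.2, 0.5, 0.9))

print(index.query((0.85, 0.75, 0.15), k=2))  # → ['red-heels', 'red-flats']
```

The 'conventional index' alternative would instead quantise each feature into a discrete token (a domain-specific 'word') and reuse an ordinary inverted index.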
The challenge of extracting useful meaning is based largely around establishing context, i.e. what exactly the user intends when they request an item’s identity, or want to see a ‘similar’ item. What properties of a song identify it as the same? Should live versions of the same song also match studio versions? Is the user more interested in the shape of a pair of shoes, or the colour, or the pattern?
Framed in the context of the difficulties of NLP, it's clear that there's unlikely to be an immediate leap in the capabilities of these apps, but rather a gradual evolution. That said, these technologies are already good enough to surprise people and they're quickly finding commercial use, which adds more resources and momentum. As our researchers chip away at these big challenges you'll find MAST systems appearing in more and more places and becoming more and more important to the way people acquire and manage information.
for example you might overhear some lyrics and then use a search engine to look up who wrote the song but how would you identify a piece of unknown instrumental music?
Yamaha research solved this problem in 1993. MSR (Chris Burges) did it in 1998. Shazam/Avery Wang in 2000. Nokia research in 2001. Fraunhofer in 2002. Google in 2009.
What properties of a song identify it as the same? Should live versions of the same song also match studio versions?
In my dissertation work five years ago, I used locally-sequential harmonic similarity as the basis for same-ness, and was able to identify not only live vs. studio versions, but also do a fair job on identifying composed variations on a piece… things like cover songs, but where the covering artist takes it in a vastly new direction (for example, “Me First and the Gimme Gimmes” sing a version of “Somewhere Over the Rainbow” in a fast, hard-driving punk style that acoustically sounds quite different from the original.)
Tim Crawford has a good writeup of some of these issues, and some good anecdotes/examples based on the work that we did:
Crawford_AfterSearchIsOver_ExtAbst.pdf
You’re correct, though, in pointing out that the dimensions or facets that are important to the user, in terms of similarity, are not the same dimensions that the system designer has specified. It is an interesting research problem: How can a user express or choose from among different kinds of similarity? How does one make this process relevance driven?
Instead of guessing what “similar” means, why don’t we offer the user our guesses as available options instead of answers?
An orange is indeed like an apple in many respects (fruit, round, nutritious). We can’t know who will want or expect what connections.
why don’t we offer the user our guesses as available options instead of answers
I totally agree. I would call this “explanatory search”.
And I think this problem even extends beyond music/media. I think we could do the same thing for text-based web search. The web search engine’s approach to relevance is based on mass-market tastes and preferences. You should be able to steer that away, if you so desire, if your information need doesn’t match the mass market taste.
I was definitely thinking of this challenge from a text/search perspective, which is where I live and play with data. I can’t even imagine how to address the same challenges in the music/media world, where similarity is so much more ephemeral. NLP-based methods are useful…until they aren’t.
Thanks for the comments guys.
@jeremy, thanks for the background
@david, the comparison to NLP was an indirect one. Both sets of technologies try to infer useful forms of meaning from ‘natural’ input data.
You’re right that where possible users should be offered the ability to cut through the data themselves, but there are UI design challenges there when the criteria being used in the search aren’t meaningful to users. Most attempts at the problem so far have settled on the browsing model where various forms of nearest neighbours are shown ordered by similarity. Obviously where data is meaningful (colour for example) users can use facets, filters or any other tools from the text search arsenal.
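The "facets where data is meaningful, similarity ordering elsewhere" pattern described above can be sketched as follows. This is a hypothetical combination for illustration (field names and catalogue data are invented): a human-meaningful facet filter (colour) is applied first, then the survivors are ordered by distance to the query item's feature vector.

```python
def browse(items, query_vec, colour=None, k=3):
    """Facet-then-rank: filter by a human-meaningful facet (colour),
    then order the survivors by squared distance to the query vector."""
    candidates = [it for it in items if colour is None or it["colour"] == colour]
    candidates.sort(key=lambda it: sum((a - b) ** 2
                                       for a, b in zip(it["vec"], query_vec)))
    return [it["id"] for it in candidates[:k]]

# Invented catalogue: each item has a colour facet and an opaque feature vector.
catalogue = [
    {"id": "shoe-1", "colour": "red",  "vec": (0.1, 0.9)},
    {"id": "shoe-2", "colour": "red",  "vec": (0.8, 0.2)},
    {"id": "shoe-3", "colour": "blue", "vec": (0.1, 0.9)},
]
print(browse(catalogue, (0.15, 0.85), colour="red", k=2))  # → ['shoe-1', 'shoe-2']
```

The facet gives users control in terms they understand, while the opaque similarity features do their work silently in the ordering.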
but there are UI design challenges there when the criteria being used in the search aren’t meaningful to users.
Then make them meaningful to the users.
I don’t think we should make it our central goal to expose raw features to the user. Instead, we should expose some sort of “macro” handle on top of those features, that give the user some control that they otherwise would not have.
Concrete example: A few years ago Yahoo! introduced MindSet. It was a way to allow you to factor your search results into more commercial vs. more informational results.
I still think we haven’t seen enough of this type of thing.
Just a link to the comment thread for an earlier post that explored some of these issues.
A very nice guest post. Thanks for bringing it to us. 🙂
@jeremy, thanks for the example. I hope to be able to comment in the near future.
@Richard — no problem. I’m up for discussion anytime.