Omar Alonso recently pointed me to work he and his colleagues at A9 did on relevance evaluation using Mechanical Turk. Perhaps anticipating my predilection for wordplay, the authors showed off some of their own:
Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.
Yes, TERC for TREC. In any case, their results show lots to be thankful for:
- Fast Turnaround. We have uploaded an experiment requiring thousands of judgments and found all the HITs completed in a couple of days. This is generally much faster than an experiment requiring student assessors; even creating and running an online survey can take longer.
- Low Cost. Many typical tasks, such as judging the relevance of a single query-result pair based on a short summary, are completed for payment of one cent. (Obviously, tasks that require more detailed work require higher payment.) In our example, we could have all our 2500 judgments completed by 5 separate workers for a total cost of $125.
- High Quality. Although individual performance of workers varies, low cost makes it possible to get several opinions and eliminate the noise. As described in Section 5, there are many ways to improve the quality of the work.
- Flexibility. The low cost makes it possible to obtain many judgments, and this in turn makes it possible to try many different methods for combining their assessments. (In addition, the general crowdsourcing framework can be used for a variety of other kinds of experiments — surveys, etc.)
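To make the "Low Cost" and "High Quality" bullets above a bit more concrete, here is a minimal sketch (mine, not the authors' pipeline) of the cost arithmetic and of collapsing redundant worker judgments into a single label by majority vote. The queries, documents, and labels in it are made-up placeholders.

```python
from collections import Counter

# "Low Cost": 2,500 query-result pairs, judged by 5 workers each, at $0.01
# per judgment -- the $125 figure quoted above.
pairs, workers_per_pair, payment = 2500, 5, 0.01
print(f"Total cost: ${pairs * workers_per_pair * payment:.2f}")  # $125.00

# "High Quality": with several cheap opinions per pair, individual worker
# noise can be averaged out by taking the most common label.
def majority_vote(labels):
    """Return the most frequent label among one pair's redundant judgments."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical judgments keyed by (query, result) pair.
judgments = {
    ("ipod nano", "doc-17"): ["relevant", "relevant", "not relevant",
                              "relevant", "relevant"],
    ("ipod nano", "doc-42"): ["not relevant", "not relevant", "relevant",
                              "not relevant", "not relevant"],
}
consensus = {pair: majority_vote(labels) for pair, labels in judgments.items()}
print(consensus)
```

Majority voting is only the simplest way to combine assessments; the flexibility point is that once judgments are this cheap, you can experiment with whatever aggregation scheme you like.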
Other folks, particularly Panos Ipeirotis, have worked extensively with Mechanical Turk in their research. At the risk of political incorrectness, today I'd like to thank these folks for the successful exploitation of digital natives to explore new worlds of research.
3 replies on “Mechanical Turkey”
We did this at Dolores Labs earlier this year. Here’s one public blog post we wrote — http://blog.doloreslabs.com/2008/04/search-engine-relevance-an-empirical-test/
though we didn't rigorously evaluate the data in that particular study (as we did for NLP: http://blog.doloreslabs.com/2008/09/amt-fast-cheap-good-machine-learning/).
hah, they cited our blog post!
Well, I’m glad that the IR community is growing more open to cost-effective evaluation strategies that don’t require trained assessors. I’m hopeful that this openness will in turn lead to work on interactive IR becoming more mainstream, since–at least as I understand it–the bottleneck of such research is the cost of evaluation.