Otis just wrote a post in which he cited the Open Relevance Project, an embryonic effort by the Lucene project to build a free, public information retrieval evaluation framework analogous to the TREC conference. Not surprisingly, he sees this as an opportunity for Lucene to prove that it is just as good as the commercial enterprise search engines.
On one hand, I’m delighted to see an attempt to make a TREC-like infrastructure more widely accessible. While the Linguistic Data Consortium and the University of Glasgow may only be charging enough to cover their costs, perhaps there are more efficient ways to manage corpora today. Indeed, other alternatives include publishing torrents and Amazon’s public data sets. If the bottleneck is licensing costs, then perhaps there should be a call to donate data–or to assemble collections from public domain sources.
On the other hand, if the goal of this project is to help companies evaluate competing search offerings, then I think its proponents are chasing the wrong problem. Lest you think I’m biased because of my affiliation with one of those commercial search vendors Otis taunts in his post, I encourage you to check out a post that Jeff Dalton (who is certainly pro-Lucene) wrote a year ago, entitled “Open Source Search Engine Evaluation: Test Collections”. In it, he raises a number of issues that go beyond data availability. One of them is the evaluation of interactive retrieval, an area where even TREC has struggled.
I understand the desire for Lucene advocates to prove that Lucene is just as good as or better than the commercial search engines–it’s not that different from the desire every vendor has to make competitive claims about his or her own technology. To Otis’s credit, he recognizes that relevance isn’t the only criterion worthy of assessment–he also suggests extending the Open Relevance Project to include the non-functional metrics of efficiency and scalability. But he still seems to accept an evaluation framework that would treat search engines as out-of-the-box relevance ranking engines.
I dare say I have a little bit of experience with how companies make decisions about search technology, so let me offer my perspective. Companies build search applications to support specific tasks and information needs. For example, ecommerce sites want to help users find what they are looking for, as well as to target those users with their marketing strategies. Manufacturing companies want to optimize their own part reuse, as well as to make sense of their supply chains. Staffing agencies want to optimize utilization of their consultants and minimize their own costs. Etc.
All of the above rely on search applications to meet their needs. But I don’t think they’d be swayed by a TREC-style relevance bake-off. That’s why companies (and vendors) trumpet success in the form of metrics that reflect task performance (and there are often standard key performance indicators for the various application areas) rather than information retrieval performance. Yes, non-functional requirements like efficiency and scalability matter too–but they presume the functional requirements. If an application can’t meet the functional needs, it really doesn’t matter how quickly it processes queries, or how many documents it can index. Moreover, many companies ask for a proof of concept as part of the sales process. Why? Because they recognize that their needs are idiosyncratic, and they are even skeptical of vendors who have built similar solutions in their space. They see success stories and satisfied customers as positive–but not definitive–evidence.
To summarize: the quest to open up TREC may be of great interest to information retrieval researchers, but I’m highly skeptical that it will create a practically useful framework for comparing search technologies. I think it would be more useful to set up public frameworks where applications (both vendor-sponsored and open-source) can compete on how effectively they help users complete information seeking tasks that are representative of practical applications. I’d love to see a framework like Luis Von Ahn’s “games with a purpose” used for such an endeavor. I would happily participate in such an effort myself, and I’m pretty sure I could drag my employer into it.
45 replies on “Copying TREC is the Wrong Track for the Enterprise”
Daniel, do you have a set of KPI’s that can be used to evaluate enterprise search solutions and that can be objectively measured?
I had such a conversation a few weeks back when people were asking me how to evaluate enterprise search solutions, going beyond subjective evaluations.
I completely agree that interactivity is key, and that no one, TREC, web, enterprise, or otherwise, has a good handle on it.
But what you seem to be suggesting is that just because every enterprise solution is tailored, we cannot do any sort of relevance-ranking evaluation at all.
Certainly, we would want such an evaluation to comprise only a part of a suite of metrics. But I don’t understand why you’re against anything trecish whatsoever. Even TREC acknowledges that the standard queries they use are “average”, and therefore various tracks have arisen for “HARD” topics and “robust” topics. So you can tailor TREC to your task.
A slightly related question: What about general web search, as opposed to enterprise search? Can we do something TRECish there? What are your thoughts?
I see my post was slightly misinterpreted – sorry about that. Explanation:
Otis, my apologies if I misinterpreted your intent. But in that case I’ll make a broader point, which is that I don’t think relevance bake-offs are particularly useful except for information retrieval researchers–and I wish even they would take more of an HCIR approach. Though, as Ellen Voorhees points out, it won’t be easy.
Panos, you have to be more specific about what you are calling an enterprise search solution. Different solutions call for different metrics. For example, consider a company that installs a self-service help desk application (which is essentially a way for users to search for information) in order to reduce the load on human operators. Key performance indicator: the reduction in calls to the operators. Easy to test in the field. Hard to test in the lab. That’s why I look to frameworks like GWAP for inspiration to create a more realistic evaluation framework for information seeking support systems. I have ideas here that I think are quite doable, but don’t have the resources to bring to fruition. Maybe someday.
Jeremy, I’m not against anything TREC-ish whatsoever. In fact, I think there is a way to get a best of both worlds, as I wrote in a position paper last year. As for web search vs. enterprise search, at least the former has the advantage of a single corpus (at least in theory). I still would rather see a framework that assumes interaction than yet more MAP comparisons. But I know a lot of information retrieval researchers feel differently. Still, you’re just talking about a re-implementation of what TREC already does. Which is fine, but doesn’t particularly excite me.
(Cross-commented on Otis’s blog — I still don’t have a good feel for the right thing to do with cross-blog threaded conversations!)
While search technology is interesting to the search researchers and developers (i.e. TREC and SIGIR types), good customers only care about their applications.
So many variables are in play in an application that it’s almost impossible to run a “fair” a priori test, because you learn about the app more as you go. Often customers have no idea what they want and need the help of people like Otis (engineer + app developer) to figure that out. Unlike more mature technology, it’s often unclear what’s possible, and because it’s user facing, you often don’t know what will work until you try it on your users.
A bit of a tangent, but I don’t know how to handle cross-blog threaded conversations either. You’d think there would be a nice way to share a comment thread among multiple posts.
I’ve been thinking about the following hack. WordPress lets you export a comment feed for each post. What I don’t know is if there’s a simple way to add an RSS feed to a single post. If there is, then a post should be able to embed another post’s comment feed. A bit crude, but a lot better than what we have today.
If anyone has suggestions about how to add an RSS feed to a single post, please let me know!
But MAP != TREC. Any more than Prec@10 == TREC, or Prec@Interpolated0.0 == TREC, or R-Prec == TREC, etc.
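For readers less familiar with the measures being traded here, a minimal Python sketch of Prec@k, R-Prec, and average precision (MAP is its mean over queries) may help. It assumes binary relevance judgments and a ranked list of document ids; the function names and the toy data are illustrative assumptions, not anything TREC or its tooling defines.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(ranking, relevant, 2))
print(r_precision(ranking, relevant))
print(average_precision(ranking, relevant))
```

All three are computed from the same ranked list and the same judgments; they differ only in how they weight rank positions.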
When I hear TREC, I hear a certain procedure, more than anything else: (1) Start with an info need, (2) Define the criterion(ia) by which success can be declared (i.e. by which that info need is considered “satisfied”), and then (3) measure how difficult it is to attain that satisfaction.
Why would one not want to do that?
And how is that really different from what Bob Carpenter says, about customers only caring about their applications? If one search system makes it easier/quicker/better than another system, for the customer to find the information that they need to make their applications work better, isn’t that the goal? Don’t we all share that same goal? I don’t see quite the same distinction between customers and SIGIR types.
Fair point. But, a few exceptions notwithstanding (the interactive track wasn’t exactly a success), TREC = batch testing against a vanilla interface. That is a serious limitation.
Daniel & Bob:
Regarding cross-posting – I’d imagine keeping the discussion tied with the original post would make the most sense. Taking the discussion elsewhere feels a bit like thread hijacking – http://www.urbandictionary.com/define.php?term=Thread%20Jacking&defid=2010833
Regarding taking more of an HCIR approach – can HCIR alone work well if the underlying engine doesn’t provide relevant results? (again: near commodity, but different for different engines and different setups)
Look at what the Music Information Retrieval community is trying to do, with M2K and MIREX. The idea is that you have a whole system, a whole chain of interfaces, retrieval algorithms, music signal processing algorithms, etc. Then, not only do you have batch-testing at different cut-points in the whole system (e.g. how well does this algorithm vs. that algorithm do at the task of polyphonic note transcription), but through M2K you can swap various algorithms in and out, and run “what if” scenarios.
To translate that into our task, you could do something like let a user work with a single HCIR front-end and exploratory data algorithm for 5 rounds, hit the “pause” button, and then swap in a bunch of different algorithms for the 6th round.. and see how well each of the algorithms does in that context.
There is no reason why the interface has to be vanilla. You could “prime” this evaluation with whatever interface you wanted. But at some point, you can make a cut, and swap in a bunch of different algorithms, and evaluate, in a side-by-side manner, each of those.
Or, you could do the same thing with interfaces. Run for 5 rounds, as normal, and give everyone access to those same initial 5 rounds. Then, on the 6th round, swap in the new interface. And see how easy it is for users to find the information they’re after, in each of the various new interfaces.
The point is, there is almost always some way in which search systems can be factored, so as to be able to evaluate (and compare) them. You can compare systems by populating different interfaces with the same data, or the same interface with different data, etc.
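The “pause and swap” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the session state after the shared rounds is reduced to a set of documents the user has already seen, each candidate algorithm is a function from (query, seen documents) to a ranked list, and the score is precision at k over unseen results. The `Ranker` type, the function name, and the toy rankers are hypothetical; nothing here is M2K's or TREC's actual machinery.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: a ranker sees the query and the session state so far.
Ranker = Callable[[str, Set[str]], List[str]]


def compare_at_cut(
    query: str,
    seen: Set[str],
    relevant: Set[str],
    rankers: Dict[str, Ranker],
    k: int = 5,
) -> Dict[str, float]:
    """Hand every ranker the same frozen session and score its next round.

    Documents already seen in earlier rounds are filtered out, so each ranker
    is judged only on what it adds after the cut point.
    """
    scores = {}
    for name, ranker in rankers.items():
        fresh = [doc for doc in ranker(query, seen) if doc not in seen][:k]
        scores[name] = sum(doc in relevant for doc in fresh) / k
    return scores


# Toy rankers and judgments, invented for illustration only.
def ranker_a(query: str, seen: Set[str]) -> List[str]:
    return ["d1", "d7", "d8", "d9", "d2"]


def ranker_b(query: str, seen: Set[str]) -> List[str]:
    return ["d7", "d8", "d3", "d4", "d5"]


scores = compare_at_cut(
    query="polyphonic transcription",
    seen={"d1", "d2"},
    relevant={"d1", "d2", "d7", "d8", "d9"},
    rankers={"A": ranker_a, "B": ranker_b},
    k=3,
)
print(scores)
```

The same harness would work for swapping interfaces instead of algorithms, provided the interface's effect can be captured as a function of the shared session state.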
Maybe I’m being naive, though.
I know I can’t dictate the social norms of the blogosphere–my approach when I have a lot to say about someone else’s blog post is to comment briefly there, but to write a longer post on my own blog that cites and links to it. I thought that was a pretty common approach. Regardless, I think a shared comment thread would address the situation when there’s a meaningful discussion that spans multiple posts–including multiple posts on the same blog.
Re HCIR and relevance, I like to cite the old masters:
Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance.
William Goffman, On relevance as a measure, 1964.
Search engines need to act in predictable ways. For a certain class of queries, that means responding to queries with results that anyone who had made them would surely consider relevant, because the query makes the information need obvious. For others, it means providing the scent users can follow to elaborate their information needs.
Jeremy, I don’t object to that sort of testing methodology. But I don’t see how it would play well with TREC.
But at the point that you press “pause”, and do the swapping in/out of various algorithms/interfaces/etc., that is TREC. Non?
If I understand you correctly, then you’re agreeing with an approach I advocate in my aforementioned position paper: employ user studies to determine which systems are most helpful to users; correlate user success to one or more system measures; and then evaluate these system measures in a repeatable, cost-effective process that does not require user involvement. Oui?
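The middle step, correlating user success with a system measure, might look like the following Python sketch. It assumes you already have, per system, one number from a user study (say, a task completion rate) and one from a batch run (say, MAP), and it uses a hand-rolled Pearson correlation. The function name is mine and the figures are made up purely to show the mechanics.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system numbers, not real results.
task_completion_rate = [0.52, 0.61, 0.58, 0.70]  # from a user study
system_measure = [0.21, 0.27, 0.24, 0.33]        # e.g. MAP from a batch run
print(pearson(task_completion_rate, system_measure))
```

A system measure that correlates strongly with user success across many systems is a candidate for the cheap, repeatable test; one that does not is a warning that the batch numbers are measuring something users do not feel.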
Daniel – could that type of functionality, that type of data collection and analysis/evaluation not become a part of Open Relevance Project?
I don’t think we disagree on methodology. Where I’m struggling is with how this is really all that different from TREC, to begin with. TREC has always included things like starting with a topic description, and then allowing user(s) to manually formulate the query. It was that whole distinction between “automatic” and “manual” in the early ad hoc TREC tracks. The “manual” could be anything, including HCIR-supported interfaces, if so desired. But it was still all TREC.
See here for more discussion of the manual runs, though there is probably an even better reference somewhere:
Trec_8.pdf
Point taken–I didn’t realize that they were (or used to be–TREC 8 was a decade ago) that flexible for the ad hoc task. TREC is amenable, at least in theory, to testing interfaces for interactive retrieval, as long as you submit final results and mark them as manual. I’m curious how that works out in practice. I still think a game framework could be more effective–and more cost-effective.
Let me turn the question around: what does TREC supply for such an effort? Data collections and relevance judgments? The blessing of a neutral party? Or, put another way, why do library and information scientists not seem, in general, to be enthusiastic about TREC?
Otis, the Open Relevance Project could go that way. But my impression from the site is that what they have in mind is to batch-test for precision and ranked precision variants in the top 20 results. Perhaps they are open-minded, but the TREC folks also solicit proposals for new tracks. All it would take in either case is a critical mass of people and resources. A shared, funded agenda. That’s easier said than done.
Well, I think LIS folks still don’t like TREC, because even though they allow for user-interactivity-driven evaluation, it is still/usually only a single-round of interactivity.. no evolution or longer-session evaluation happens. And I think LIS wants to see more organic-ness than that. imho.
And for such an effort, yes, TREC only supplies collections and relevance judgments. Oh, and the topics themselves (the statements of user info needs). Which teams are then free to manually/HCIR interpret however they want.
Which I think argues that while LIS and, for that matter, commercial vendors can conform to TREC, it isn’t necessarily ideal. Perhaps a framework with a better fit would inspire more participation.
No, it’s not ideal. But until we come up with that better framework, it’s not a bad system, either. Or, at least, it’s not so bad, that we shouldn’t at least try to standardize on some basic comparisons. That’s how scientific research progresses.. you know things are more fundamentally interconnected than your experiments are able to capture, just like information seeking sessions are/should be more interactive and evolving than TREC accounts for. But that doesn’t stop the scientist from doing his or her best to isolate various variables, hold as much constant as possible, and test those variables.
I remember at SIGIR 2004, having a conversation with some well-known folks from Google, Ask, Yahoo, and Live Search. The fellow from Ask was asking.. clamoring.. for G, Y, and M to submit to some rigorous, 3rd-party driven, TREC-style evaluation measures. Just to see whether Prec@1, @5, @10, etc was really any better or worse among the various engines. The others just sort of smiled and shook their heads.. knowing that it was never going to happen. And while they didn’t say it, it was obviously for political, rather than scientific, reasons.
Anyway, here is Karen SJ’s take on the matter:
2006j_sigirforum_sparck_jones.pdf
It’s a good read, if you haven’t seen it already.
For web search, I think this sort of comparison makes sense, since the majors (G, Y, and M) focus primarily on achieving high precision in their top-ranked results relative to an objective notion of relevance. I presume the test queries would be chosen to make relevance as objective as possible. One might debate the particulars of how to model result quality (e.g., should there be extra points for a “best” result as opposed to just a relevant one), but those are minor details. Moreover, there’s no reason to ask the engines for permission; anyone can just run the queries. Why didn’t the Ask guy just do that?
I think automated querying like that is against the TOS of these engines. And besides, even if he could “get away with it”, the point was not to do it for his own satisfaction, but to come up with an unbiased, objective, publicizable experiment.
If he was just doing it all himself, then even if he made the results public, he could be accused of cherry-picking the queries to make Ask look good, etc.
Those terms of service are honored in the breach–besides, there are APIs which would cover retrieving the top 20 results. Moreover, a good-faith public effort would be a great straw man. Make the experiment completely transparent and make the code reusable, so people could repurpose it to perform their own tests. If he wanted to invest the effort, he could do it. I suspect he just didn’t want to do all that work to get underwhelming results. And I wouldn’t blame him.
BTW, check out http://www.langreiter.com/exec/yahoo-vs-google.html?q=relevance if you want to eyeball the similarity of Google and Yahoo search results.
I don’t know enough about his motivation 5 years ago to know how much effort he was or wasn’t willing to put in. But I’m not sure what you mean by underwhelming results. You mean, underwhelming of Ask vs. the others? Even if Ask wasn’t/isn’t competitive, I still agree with his basic premise, and personally would like to see all of the search engines engage in fully-transparent, open evaluation.
Yeah, I’ve seen that Y vs. G comparison visualization. Pretty cool. But it still doesn’t get me to the interesting info. For example, take a look at this search:
There is hardly any overlap between the two search engines, especially in the top 10. But if I look at the absolute top result in each engine, I would say that each of those results are *highly* relevant. Though different.
This visualization doesn’t give me that sense. Moreover, it doesn’t help me get a sense of how well each engine would have worked, over my last 200 queries.
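One way to get past eyeballing a single query is to compute overlap over a whole query log. The Python sketch below assumes you have each engine's ranked URLs per query (how you obtain them is out of scope here) and uses Jaccard overlap of the top k; the function names and toy URLs are illustrative assumptions. Note that overlap says nothing about relevance: two engines can share no results and both be right.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_overlap(a: Sequence[str], b: Sequence[str], k: int = 10) -> float:
    """Jaccard overlap between the top-k results of two engines for one query."""
    set_a, set_b = set(a[:k]), set(b[:k])
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mean_overlap(log: Dict[str, Tuple[List[str], List[str]]], k: int = 10) -> float:
    """Average top-k overlap across a query log of (engine A, engine B) results."""
    if not log:
        raise ValueError("empty query log")
    return sum(top_k_overlap(a, b, k) for a, b in log.values()) / len(log)


# Toy query log, invented for illustration only.
query_log = {
    "relevance": (["u1", "u2", "u3"], ["u3", "u4", "u5"]),
    "trec": (["u6", "u7", "u8"], ["u6", "u7", "u8"]),
}
print(top_k_overlap(*query_log["relevance"], k=3))
print(mean_overlap(query_log, k=3))
```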
We’re really drifting off topic, now, aren’t we? Sorry 🙂
No, I didn’t mean to slight Ask. I meant that I don’t think there would be much of a difference among the search engines’ performance on relevance, and hence that no one would really care. I don’t think Google, Yahoo, or Microsoft are trying to differentiate on relevance at this point. Google certainly has won that branding war.
If Yahoo or Microsoft felt they could win a “Pepsi challenge” and make a dent in Google’s brand equity, I’m sure they’d invest in hosting one, and they could surely get a respected third party to administer it. Ask could do the same. It’s not hard to run a blind test, and you wouldn’t even need to assess individual search results. You could have participants see two unbranded result sets side by side and select which one they preferred, or indicate no preference. And you could do this in a natural setting, paying / persuading people to install a browser plug-in and having them evaluate their own natural search queries. You wouldn’t even have to store information about the queries, the results, or the IP addresses–so there would be no privacy issues.
Though, now that I think about it, participants might cheat in order to favor one of the engines, running the same queries against the search engines to defeat the test’s blinding. So perhaps the test does need to be administered in a lab setting.
Still, my point is that the test is quite easy to do, given a little bit of funding. I see the lack of such funding as strong circumstantial evidence that no one would win such a contest.
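To make “easy to do” concrete, here is a minimal Python sketch of how the side-by-side preferences could be analyzed. It assumes each trial yields one of three outcomes (prefers A, prefers B, no preference), drops the no-preference trials, and applies a two-sided sign test. The choice of a sign test and the function name are my assumptions; a real study would want a statistician's input on design and power.

```python
from math import comb


def sign_test_p_value(prefers_a: int, prefers_b: int) -> float:
    """Two-sided sign test: chance of a split at least this lopsided
    if participants were really choosing between A and B at random.

    Trials with no stated preference should be excluded before calling this.
    """
    n = prefers_a + prefers_b
    if n == 0:
        return 1.0
    k = min(prefers_a, prefers_b)
    # Probability of k or fewer successes out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# An even split gives no evidence of a difference; a sweep gives strong evidence.
print(sign_test_p_value(5, 5))
print(sign_test_p_value(10, 0))
```

A large p-value across a decent number of trials would be the quantitative version of “no one would win such a contest.”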
Daniel – “Moreover, a good-faith public effort would be a great straw man. Make the experiment completely transparent and make the code reusable, so people could repurpose it to perform their own tests.” — I think that’s at least one of the goals of ORP.
… and that’s actually what I refer to in my comment: http://www.jroller.com/otis/entry/followup_open_relevance_project#comment-1242694268000
Otis, your blog is rejecting all of my comments as spam, so I’ll post a response here instead.
Endeca has tools that help people see how reconfiguring their applications will change them (a sort of integrated application development environment), as well as batch tools to perform regression and performance tests. I assume (though don’t know) that other commercial vendors offer their own tools for similar purposes.
Would customers appreciate a more comprehensive and perhaps standardized evaluation suite? Perhaps. But we haven’t been getting requests for it (as far as I know), and our customers are not shy about asking for enhancements. Nor are our engineers and product managers shy about proactively putting work on the road map. So we are unlikely to invest much ourselves in the near term.
Could we influence the ORP to do free work for us without our committing significant work to building it? I doubt it. I would think that the ORP is looking for people willing to write code, not vendors making feature requests.
Let me frame the question differently: who are the potential beneficiaries of the ORP, and how would they benefit? If the benefits are significant, why not ask them to help fund the effort? I realize you’re not involved in the ORP yourself, but I just want you to understand why they might not be attracting a groundswell of interest / investment. The academics are content with TREC, and commercial vendors don’t have an urgent need for better evaluation tools.
Web search is different, because the corpus and the user experience are standard–at least among the big players.
I have put my thoughts, as one of the co-founders of ORP at http://lucene.grantingersoll.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise-the-noisy-channel/
Summary: It’s way early to be reading too much into what ORP is or isn’t. For now, just know my desire (I can’t speak to others, but I suspect we are in line) for it is to have an open means for the Lucene community to compare notes on relevance testing in a meaningful way. At some point, it might go beyond that, but who knows! I know we are definitely interested in lots of ideas and welcome contributions over at firstname.lastname@example.org (in all likelihood, we will have our own mailing list in the near future).
Grant, as I just commented on your blog (this scattering problem is absurd!), I’m sorry if I created confusion through a cascade of extrapolations. I don’t mean to criticize your attempt to meet the needs of the Lucene community, and I wish you success in that endeavor.
>Grant, as I just commented on your blog (this scattering problem is absurd!),
Yeah, I agree. Don’t you just love everyone fighting for control of the conversation?
> I’m sorry if I created confusion through a cascade of extrapolations. I don’t mean to criticize your attempt to meet the needs of the Lucene community, and I wish you success in that endeavor.
No need to apologize. I agree with a lot of the comments here. Batch testing is not the be all, end all of relevance testing. It is merely a few data points that help fill in the bigger picture. (See http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-Search). It is mostly useful, IMO, for testing and comparing changes to existing algorithms.
I think one of the things ORP can bring to the table that is different is an ongoing evaluation capability. In other words, we can crowdsource the work and it can be happening all the time, not just once or twice a year.
At any rate, as I said before, ideas are welcome! It will be an interesting experiment and one that anyone can participate in.
I meant that I don’t think there would be much of a difference among the search engines’ performance on relevance, and hence that no one would really care. I don’t think Google, Yahoo, or Microsoft are trying to differentiate on relevance at this point. Google certainly has won that branding war.
See, now *that* is interesting! If there were an objective, 3rd party measurement out there that conclusively showed that Google was no better than anyone else, it might not “kill” Google, but it would take a lot of the wind out of the sails.
Google’s market share keeps increasing, right? Which means less diversity. And therefore less overall innovation. Making sure that more than 1 or 2 or even 3 search companies stay viable is important to the overall health of our information economy. And so even a result that showed Y or A or M are just as relevantly good as Google, but offer different/diverse results, would go a long way in breaking that marketing/branding stranglehold that G currently enjoys. That can only be good for the marketplace.
The query I issued above on that comparison engine is a perfect example. Both the top result on G and Y were 100% relevant. But both were 100% different from each other, too.
So you can see why G would not want to submit to this open evaluation, and why this guy 5 years ago, from Ask, would. G has nothing to gain, and Ask does.
>Grant, as I just commented on your blog (this scattering problem is absurd!),
Yeah, I agree. Don’t you just love everyone fighting for control of the conversation?
I don’t see it as people fighting for control of the conversation, so much as the post/comment model not being a good one for these sorts of conversations. It’s an artificial hierarchy, and we need something better.
BTW, we did have this conversation before:
Part of the reason why I comment so much, and didn’t have my own blog for years and years, is because I was less interested in fighting for control of any conversation, and more interested in the conversation.
I’ll have to say, though, that after being encouraged by many to start a blog, and after finally having done it a few months ago, and after writing about 40 posts, I still enjoy the threaded conversation more than I do the blogging itself.
On GYMA and relevance: but I agree with Daniel. It’s not only about relevance, but about all kinds of little details. For example, I really WANT to use Yahoo search. But I am always disappointed when I search for an address and Yahoo doesn’t detect it, doesn’t show me a map like Google does. So I go to Google for that. On the other hand, I love how I can use Yahoo as a smart Java API search that gives me options to go to different versions of the Javadoc. This type of stuff falls in the large user satisfaction bucket.
Consider the irony: comparing web search engines on relevance is easy to do and even makes some sense, but wouldn’t be likely to change any minds–it’s a bit late to take the wind out of Google’s sails (or sales). In contrast, the enterprise, which could benefit a lot from better competitive analysis, isn’t (at least in my view) amenable to a TREC-like evaluation approach. Sigh.
Re blogs and threads: I just want the multiple posts participating in a cross-post conversation to be able to include a widget with each other’s comment streams. That assumes that all parties consent, of course; otherwise, it’s like plagiarizing someone else’s blog. But I’d certainly be happy if Otis and Grant made these comments readable on their related posts, and I’d do the same for theirs here if I could (and if they were amenable).
it’s a bit late to take the wind out of Google’s sails (or sales)
Well, again, this was someone from Ask who wanted to do it 5 years ago.
But still, I think Grant has a good point: Even if equally relevant, G does one type of thing well, and Y another. That fact alone, demonstrated and heavily publicized, might not take any wind away.. but at least it might stop more wind from constantly being added.
Am I the only one who fears an information monoculture?
I don’t like an information monoculture either, even if Google creates one with the best intentions. But I fight battles I can win. Maybe it might have made a difference 5 years ago, but fighting Google with parity–or even incremental improvement–today is a lost cause. That’s what I was after in this post.
Well, then the relevance evaluation should include a metric of relevant information diversity. “Look,” the evaluation could say, “by using only 1 search engine, you are missing all this useful information.”
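A crude version of that “look what you are missing” number is easy to sketch in Python. The snippet assumes binary relevance judgments over the pooled results of all engines for a query; the function name and the toy data are hypothetical, and this is not the measure used by any TREC diversity task.

```python
from typing import Dict, Sequence, Set


def missed_by_single_engine(
    results_by_engine: Dict[str, Sequence[str]],
    relevant: Set[str],
) -> Dict[str, Set[str]]:
    """For each engine, the relevant documents that some engine returned
    but this one did not."""
    found = {name: set(results) & relevant for name, results in results_by_engine.items()}
    found_by_any = set().union(*found.values()) if found else set()
    return {name: found_by_any - docs for name, docs in found.items()}


# Toy results and judgments, invented for illustration only.
missed = missed_by_single_engine(
    {"G": ["a", "b", "x"], "Y": ["c", "b", "y"]},
    relevant={"a", "b", "c"},
)
print(missed)
```

Averaging the size of each engine's missed set over a query log would give a single number per engine for how much useful information a one-engine habit leaves behind.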
The TREC 2009 web track has a diversity task;
see the track guidelines at http://plg.uwaterloo.ca/~trecweb/
The TREC 2008 and 2009 legal track had/has an extensive interactive task. You can read about the 2008 version in the legal track overview in the 2008 proceedings (in the publications section of the TREC web site).
As an exceedingly biased commenter (I am the project manager of TREC at NIST), I would claim that TREC is and has been far more than ‘just MAP’ for a long time. But I would also claim that ‘just MAP’ is in fact useful. At the risk of repeating myself ad nauseam, Cranfield (or relevance ranking or whatever you want to call it) does not make claims that the ability to rank documents by relevance is sufficient for user applications, only that it is necessary.
Ellen, I concede that I’m oversimplifying, and Jeremy did call that out earlier (back around comment #16). And my bad for not remembering that the legal track has an ad hoc interactive component.
In any case, I don’t doubt that batch tests (whether using MAP or other measures) have advanced the science of information retrieval. TREC has been and continues to be a wild success for IR.
That said, I’m curious to hear your perspective on how you see TREC’s role with respect to commercial search offerings. Should companies be participating? How do you explain the current level of participation?
There are companies that do participate in TREC, though most often participation is with research versions of a system rather than a COTS version.
(Somewhere in this thread, either on this blog or one of the others, it was asked how to see who participated in which tracks. The TREC overview paper lists the participants in that year, but not broken down by track. Appendix A of the proceedings lists all the runs submitted to a track and the organization that submitted it.)
Why do we not get more companies participating in TREC? I can only hypothesize.
– TREC makes all results public. I think there are companies who feel that it is too risky to participate because doing damage control would be too much of an effort if they happen to make a blunder in a run they submit (or their results did not meet some other expectation). People can get the collections after the fact and do their own runs in the privacy of their own lab. Why not allow anonymous runs? Because TREC is a research conference whose whole point is to advance the state of the art. Anonymous runs do not contribute to the common good (though I concede anonymous runs might help the pools). On a more practical note, it is not clear that NIST could guarantee anonymity since submissions would probably be FOIA-able.
– Participation takes resources (mostly staff time). Some organizations do not have the extra resources to commit to “extraneous” activities. This is especially true if the organization feels that none of the tasks in that TREC is a close enough match to the task their system targets. I am in complete agreement with the assertion that a TREC task is not equivalent to a real user task. I think there are good reasons not to attempt real user tasks in TREC (chief among them being generalizability), but do agree that TREC tasks are abstractions.
But obviously I am not at a company that is not participating in TREC, while many readers of this blog are. So let me turn it around: why don’t you (all) participate in TREC?
As for why I don’t push Endeca to participate, you’ve nailed it in that second reason. Right or wrong, I don’t feel that any of the tasks, as conducted and evaluated, is a close enough match to the tasks our technology targets. My team focuses on interaction, while most of our engineers focus on making our software faster, more scalable, and simpler to use. In fairness, I do need to take a closer look at the legal interactive track. But even there I’m not sure if the tasks are particularly suitable for evaluating interactive interfaces themselves, or how much work it would be to participate.
The other consideration is that there isn’t much competitive upside to participation if most companies aren’t participating. Customers are unlikely to be swayed. That does make it hard to justify devoting resources.