<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: A Utilitarian View of IR Evaluation</title>
	<atom:link href="http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/</link>
	<description></description>
	<lastBuildDate>Wed, 16 May 2012 21:42:15 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
	<item>
		<title>By: Daniel Tunkelang</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-79</link>
		<dc:creator>Daniel Tunkelang</dc:creator>
		<pubDate>Sat, 31 May 2008 23:29:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-79</guid>
		<description>Giri, thanks for the links! While I&#039;m intrigued by the user satisfaction measures in the SIGIR &#039;07 poster, I&#039;m more interested in the objective measures of task effectiveness and efficiency. The Kaki paper seems more on target, especially for evaluating a recall-oriented task.</description>
		<content:encoded><![CDATA[<p>Giri, thanks for the links! While I&#8217;m intrigued by the user satisfaction measures in the SIGIR &#8217;07 poster, I&#8217;m more interested in the objective measures of task effectiveness and efficiency. The Kaki paper seems more on target, especially for evaluating a recall-oriented task.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: giridhar kumaran</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-75</link>
		<dc:creator>giridhar kumaran</dc:creator>
		<pubDate>Thu, 29 May 2008 17:19:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-75</guid>
		<description>&lt;i&gt;...accurately, quickly, and confidently as possible. Holding the interface constant, are there evaluation measures that correlate to how well users perform on these three criteria? ...&lt;/i&gt;&lt;br/&gt;&lt;br/&gt;I think measures that factor in time as well as the amount of relevant material retrieved will be useful to evaluate at least the first two criteria you have mentioned. You might want to look at papers by Mika Kaki, especially &lt;a HREF=&quot;http://portal.acm.org/citation.cfm?id=1028014.1028072&amp;coll=&amp;dl=&amp;CFID=29991117&amp;CFTOKEN=21361174&quot; REL=&quot;nofollow&quot;&gt;this&lt;/a&gt; one. Another interesting read would be &lt;a HREF=&quot;http://dis.shef.ac.uk/mark/publications/my_papers/SIGIR2007-b.pdf&quot; REL=&quot;nofollow&quot;&gt;this&lt;/a&gt; SIGIR 2007 poster on correlations between IR measures and user satisfaction.</description>
		<content:encoded><![CDATA[<p><i>&#8230;accurately, quickly, and confidently as possible. Holding the interface constant, are there evaluation measures that correlate to how well users perform on these three criteria? &#8230;</i></p>
<p>I think measures that factor in time as well as the amount of relevant material retrieved will be useful to evaluate at least the first two criteria you have mentioned. You might want to look at papers by Mika Kaki, especially <a HREF="http://portal.acm.org/citation.cfm?id=1028014.1028072&#038;coll=&#038;dl=&#038;CFID=29991117&#038;CFTOKEN=21361174" REL="nofollow">this</a> one. Another interesting read would be <a HREF="http://dis.shef.ac.uk/mark/publications/my_papers/SIGIR2007-b.pdf" REL="nofollow">this</a> SIGIR 2007 poster on correlations between IR measures and user satisfaction.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Tunkelang</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-59</link>
		<dc:creator>Daniel Tunkelang</dc:creator>
		<pubDate>Sat, 17 May 2008 01:20:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-59</guid>
		<description>I agree that, at least in many circumstances, I&#039;m interested in retrieving documents that are relevant to an information need. I&#039;ll put aside the use case I described, where the optimal outcome is to quickly ascertain that no such documents exist.&lt;br/&gt;&lt;br/&gt;But there is a difference between a task goal and a query goal. As I perform a sequence of queries to meet my information need, I&#039;m not necessarily concerned with retrieving relevant documents on each query. Rather, I&#039;d like to be learning how to improve my query strategy to ultimately complete my task as successfully as possible.&lt;br/&gt;&lt;br/&gt;So yes, relevance is a necessary criterion. But not necessarily in the way that the Cranfield/TREC experiments measure it.</description>
		<content:encoded><![CDATA[<p>I agree that, at least in many circumstances, I&#8217;m interested in retrieving documents that are relevant to an information need. I&#8217;ll put aside the use case I described, where the optimal outcome is to quickly ascertain that no such documents exist.</p>
<p>But there is a difference between a task goal and a query goal. As I perform a sequence of queries to meet my information need, I&#8217;m not necessarily concerned with retrieving relevant documents on each query. Rather, I&#8217;d like to be learning how to improve my query strategy to ultimately complete my task as successfully as possible.</p>
<p>So yes, relevance is a necessary criterion. But not necessarily in the way that the Cranfield/TREC experiments measure it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Elsas</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-58</link>
		<dc:creator>Jon Elsas</dc:creator>
		<pubDate>Fri, 16 May 2008 17:31:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-58</guid>
		<description>&quot;The problem is the subjective nature of relevance itself.&quot;&lt;br/&gt;&lt;br/&gt;I agree that this is part of the problem, Daniel.  That&#039;s the reason why the relevance assessor is the same person who has the information need and crafts the query/topic.  There is no absolute (topical) relevance.  A measure like AP is evaluating how well a system returns relevant documents &lt;b&gt;according to one person&#039;s idea of relevance at the time the assessment was made&lt;/b&gt;.  There always are and always will be disagreements between assessors, but that&#039;s the point (and what makes our job interesting)!&lt;br/&gt;&lt;br/&gt;Assessors are typically not asked to penalize redundant documents, documents with misspellings or grammatical errors, documents from known unreliable sources, etc.  These are all facets of what a user may consider useful or not... but they are not traditionally considered aspects of topical relevance.&lt;br/&gt;&lt;br/&gt;Relevance is a necessary but not sufficient criterion for utility.</description>
		<content:encoded><![CDATA[<p>&#8220;The problem is the subjective nature of relevance itself.&#8221;</p>
<p>I agree that this is part of the problem, Daniel.  That&#8217;s the reason why the relevance assessor is the same person who has the information need and crafts the query/topic.  There is no absolute (topical) relevance.  A measure like AP is evaluating how well a system returns relevant documents <b>according to one person&#8217;s idea of relevance at the time the assessment was made</b>.  There always are and always will be disagreements between assessors, but that&#8217;s the point (and what makes our job interesting)!</p>
<p>Assessors are typically not asked to penalize redundant documents, documents with misspellings or grammatical errors, documents from known unreliable sources, etc.  These are all facets of what a user may consider useful or not&#8230; but they are not traditionally considered aspects of topical relevance.</p>
<p>Relevance is a necessary but not sufficient criterion for utility.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Tunkelang</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-57</link>
		<dc:creator>Daniel Tunkelang</dc:creator>
		<pubDate>Fri, 16 May 2008 16:29:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-57</guid>
		<description>Jon, I&#039;ll concede that just because I like the Turpin study doesn&#039;t mean it&#039;s right, and I appreciate you keeping me honest. The end doesn&#039;t justify the means.&lt;br/&gt;&lt;br/&gt;Still, I can&#039;t leave your assertion that &quot;retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system&quot; unchallenged. My concern is not just the subjective nature of relevance assessment--if that were the only issue, then we could see the average judgment of assessors as an &quot;objective&quot; relevance measure. The problem is the subjective nature of relevance itself. Ultimately, relevant information is just whatever information I want / need right now.</description>
		<content:encoded><![CDATA[<p>Jon, I&#8217;ll concede that just because I like the Turpin study doesn&#8217;t mean it&#8217;s right, and I appreciate you keeping me honest. The end doesn&#8217;t justify the means.</p>
<p>Still, I can&#8217;t leave your assertion that &#8220;retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system&#8221; unchallenged. My concern is not just the subjective nature of relevance assessment&#8211;if that were the only issue, then we could see the average judgment of assessors as an &#8220;objective&#8221; relevance measure. The problem is the subjective nature of relevance itself. Ultimately, relevant information is just whatever information I want / need right now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Elsas</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-55</link>
		<dc:creator>Jon Elsas</dc:creator>
		<pubDate>Fri, 16 May 2008 13:30:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-55</guid>
		<description>Correction: my recollection of systems tested within the MAP range of 0.7 - 0.9 was a little off: they evaluated systems with MAP values of 0.55, 0.65, ... 0.95.  This is still well above the performance of the best ad hoc systems at TREC, typically in the 0.3-0.35 range.&lt;br/&gt;&lt;br/&gt;What implication does this have?  Hard to say for sure, but it&#039;s roughly the difference between systems having a relevant document in the top two ranks 75% of the time vs. 30% of the time (assuming 1-2 relevant documents per query).  AP, like Reciprocal Rank, is very sensitive at the high end of the range, so a difference between MAP values of 0.85 and 0.95 is due to very small perturbations in the rankings.  Differences between 0.25 and 0.35, on the other hand, require a much larger shuffle of the document ranks.</description>
		<content:encoded><![CDATA[<p>Correction: my recollection of systems tested within the MAP range of 0.7 &#8211; 0.9 was a little off: they evaluated systems with MAP values of 0.55, 0.65, &#8230; 0.95.  This is still well above the performance of the best ad hoc systems at TREC, typically in the 0.3-0.35 range.</p>
<p>What implication does this have?  Hard to say for sure, but it&#8217;s roughly the difference between systems having a relevant document in the top two ranks 75% of the time vs. 30% of the time (assuming 1-2 relevant documents per query).  AP, like Reciprocal Rank, is very sensitive at the high end of the range, so a difference between MAP values of 0.85 and 0.95 is due to very small perturbations in the rankings.  Differences between 0.25 and 0.35, on the other hand, require a much larger shuffle of the document ranks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon Elsas</title>
		<link>http://thenoisychannel.com/2008/05/15/a-utilitarian-view-of-ir-evaluation/comment-page-1/#comment-54</link>
		<dc:creator>Jon Elsas</dc:creator>
		<pubDate>Fri, 16 May 2008 11:08:00 +0000</pubDate>
		<guid isPermaLink="false">http://thenoisychannel.com/?p=29#comment-54</guid>
		<description>Oh, where to begin!&lt;br/&gt;&lt;br/&gt;The Turpin user study you cite, &quot;User performance versus precision measures for simple search tasks&quot;, is interesting but FAR from a definitive nail in the coffin for MAP.  Two major &quot;flaws&quot; (I really hesitate to use that word) I see -- first, the MAP values of the systems they&#039;re testing are in the range of 0.7 - 0.9 (if I remember correctly), and they aim to show that there&#039;s no noticeable difference in the systems from the users&#039; point of view.  This MAP range is really well above the range of MAP scores we see in TREC-style evaluations, which are usually closer to 0.3-0.5 for most tasks.  Second, measuring user satisfaction in a controlled lab environment with search tasks dictated by someone other than the users is... well... unsatisfying.  I&#039;m sure I don&#039;t need to convince you of that.&lt;br/&gt;&lt;br/&gt;The real issue, IMHO:&lt;br/&gt;&lt;br/&gt;Relevance is not only subjective, but it&#039;s also just part of the picture when talking about utility &amp; user satisfaction.  The Cranfield/TREC methodology allows for subjectivity -- queries and relevance judgements are given by a single person, with the intent that a relevant document is relevant for that assessor, not absolutely relevant for everyone.  However, Cranfield and evaluation measures like MAP only look at relevance, not all the other factors that make a system truly useful: authority, diversity, recency, and many others.&lt;br/&gt;&lt;br/&gt;It has been useful for us IR researchers to focus on relevance -- retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system.  Although I don&#039;t think we&#039;ve solved this problem, we all need to be aware that it is not the only criterion for a successful IR system.</description>
		<content:encoded><![CDATA[<p>Oh, where to begin!</p>
<p>The Turpin user study you cite, &#8220;User performance versus precision measures for simple search tasks&#8221;, is interesting but FAR from a definitive nail in the coffin for MAP.  Two major &#8220;flaws&#8221; (I really hesitate to use that word) I see &#8212; first, the MAP values of the systems they&#8217;re testing are in the range of 0.7 &#8211; 0.9 (if I remember correctly), and they aim to show that there&#8217;s no noticeable difference in the systems from the users&#8217; point of view.  This MAP range is really well above the range of MAP scores we see in TREC-style evaluations, which are usually closer to 0.3-0.5 for most tasks.  Second, measuring user satisfaction in a controlled lab environment with search tasks dictated by someone other than the users is&#8230; well&#8230; unsatisfying.  I&#8217;m sure I don&#8217;t need to convince you of that.</p>
<p>The real issue, IMHO:</p>
<p>Relevance is not only subjective, but it&#8217;s also just part of the picture when talking about utility &#038; user satisfaction.  The Cranfield/TREC methodology allows for subjectivity &#8212; queries and relevance judgements are given by a single person, with the intent that a relevant document is relevant for that assessor, not absolutely relevant for everyone.  However, Cranfield and evaluation measures like MAP only look at relevance, not all the other factors that make a system truly useful: authority, diversity, recency, and many others.</p>
<p>It has been useful for us IR researchers to focus on relevance &#8212; retrieving relevant documents is undoubtedly the leading indicator of the effectiveness of a system.  Although I don&#8217;t think we&#8217;ve solved this problem, we all need to be aware that it is not the only criterion for a successful IR system.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

