Does Metadata Matter?

I’m thrilled at the discussion that my call for devil’s advocacy has incited. Keep it coming, and let me know if you’d like to contribute a guest post.

But it’s also nice to find strong views elsewhere in the blogosphere. This morning I saw a post at The Findability Project entitled “Metadata Schmetadata, Relevance and Reality”, in which the authors argue that they don’t need metadata.

Specifically, they say:

Working on this project, we have evaluated what we need from metadata as part of enterprise search implementation. Our conclusion? We don’t need metadata.

Or better said, we don’t need to add metadata for a Google Search Appliance (GSA) to accomplish what we want to accomplish with enterprise search.

I posted the following comment, which is currently pending moderation:

An interesting article. But perhaps I missed an explanation of how you performed your evaluation. Did you assign tasks to your users and compare their effectiveness on the two systems? Did you ask them to express their subjective satisfaction with the system? Did you have some productivity measure external to the system, such as efficiency at completing projects?

It may be that a simple out-of-the-box ranked search approach, with no annotation, manual or automatic, of your documents, is exactly what your organization needs. But it’s very hard to generalize from your experience without understanding better what exactly you were evaluating.

I am on board with the argument that 100% manual annotation offers poor return on investment. But that’s a straw-man argument. I would think that the real question is whether you want fully automated metadata generation, a semi-automated approach, or none at all.

And, as per my comment, it seems hard to justify any design decisions about enterprise search without success metrics, even imperfect ones.

In any case, look forward to more such posts, as I strive to increase the diversity of views at The Noisy Channel, even if I have to import them!

By Daniel Tunkelang

High-Class Consultant.

15 replies on “Does Metadata Matter?”

If you have a bunch of Word documents, you might get away with no metadata, but what happens when you start bringing in multimedia content? How are you supposed to find those items without metadata to help you, since by its nature multimedia lacks the obvious contextual (word-based) clues of a document?

What’s more, if you look at the success of Web 2.0 tools and folksonomies (tagging, tag clouds, rating systems, and so forth), you see that even informal taxonomies (which are essentially metadata, even if not in the purest sense of the word) help people find and evaluate the quality of their findings.

A search engine may present you with a static list of documents, but how are you to know that the Pharma presentation from one person is much more popular than the one from the guy who was fired two weeks ago without this type of information?

Metadata in all its guises helps bring more clarity to the search results, and in some cases, actually helps expose those results.


Arguing whether metadata is useful or not seems to be a pretty useless endeavor. It’s a bit like discussing whether a textbook benefits from having an index at the back.

The key point, in my mind, is how the metadata is created. Even if manual annotation is feasible, the taxonomy problem is frequently just shifted one level up, leading to a nice metadata-metadata hall of mirrors.

I believe progress will only really be made in this field when metadata (context) can be mechanically *inferred* from the source with statistically acceptable levels of success.

OpenCalais seems to be making a lot of progress in this direction.



I’m as excited as the next guy about automatic metadata extraction. Technologies like OpenCalais and Dapper represent tremendous potential for bringing structure to the wild and unwieldy Internet. But the core problem that should be solved is automatic metadata generation that preserves the semantics of the content, not reverse engineering (with sophisticated algorithms) a poorer version of the context that originally existed.
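To make “fully automated” concrete: this is nothing like OpenCalais’s actual technology, just a naive, frequency-based sketch of what automatic tag generation means at its simplest. The stopword list and document are made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, just for the sketch.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to",
             "in", "is", "for", "with", "that", "from"}

def suggest_tags(text, k=5):
    """Naively suggest tags: the k most frequent non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(k)]

doc = ("Enterprise search with metadata: metadata helps enterprise search "
       "rank documents, and search users benefit from metadata.")
print(suggest_tags(doc, 3))  # the three most frequent content words
```

Real extractors add entity recognition, disambiguation, and confidence scores on top of this kind of counting, but the input/output contract (text in, candidate metadata out) is the same.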

Naturally, people won’t spend time annotating content because it’s tedious and time-consuming. Often the incentives for putting in the extra effort are missing. That is why all the major content management systems should provide mechanisms for automatic content tagging according to W3C Semantic Web standards. We’re already on our way to building semantics into publication platforms, and we need to continue in that direction.

I guess CMS vendors need the right incentives. If the increasing compliance with other W3C standards, like XHTML and CSS, is partly driven by SEO incentives, how can we apply that to Semantic Web standards?


Note: Brian Lawlor at The Findability Project published my comment and responded to it. I’ll re-post his comment here, where the conversation is unmoderated:

Daniel, your observations are spot-on about the evaluation of user need and experience. Our evaluations of whether our users can find what they are looking for are, admittedly, limited to basic user-experience surveys, as well as a considerable number of first-hand observations of users conducting actual searches. But what we have done is not as thorough or precise as the basic frameworks you suggest. We readily agree that one cannot generalize in any definitive fashion about metadata from our experience. We are not inviting others to do so. We are, however, reporting on our experience and what we do think legal services and other non-profits can and should question, including whether the initial cost and ongoing investment in creating and managing metadata is a practical way to go.

What we are discovering is that our users are finding what they are looking for without the apparent need to add metadata to our targets, most of which do not have added metadata, or to which we could not add it if we wanted to. (For example, our targeted Google Sites content.) What is left unstated in the post are other things we are exploiting in the Google Search Appliance, including an array of filters and collections for narrowing search results in a way very easily understood by our users, selective use of Keymatch, and a set of OneBox modules that are very effective in helping our users find some of the most common things they are searching for. In the context of our project, those options strike us as more efficient and less costly ways for us to “help” the search.


It’s good that he took the time to respond, but based on the content of the response, there seems to me to be no empirical data to support his claim.

It’s fine to use and even state anecdotal evaluation results (I think we all do at times), but a published concrete statement should be based on empirical results. Without an empirical result set, I think it’s incorrect to state that what they are doing is better for the users.

If you are only given one method to use, you cannot assess whether another method is better or not.

So given the response (which I’m glad you got), I’m disturbed by how someone outside our domain might interpret this conclusion, which is unsubstantiated given their methodology (or lack thereof).


I agree, and that’s why I commented on his post in the first place. I understand that not everyone has the time to perform rigorous evaluation, but anyone who publishes controversial conclusions is asking to be held to a higher standard.


Points well made, and taken. This discussion in response to my original post is certain to prompt another post at our project site to amplify some on the larger context of that post about metadata. For now, a few observations from the civilian perspective of our modest non-profit organization:

On the empirical arguments, as they relate to search design, I agree completely. It is more than fair to criticize the unadorned conclusion “we don’t need metadata” and to question and highlight its lack of empirical basis. And I thought the comments above were very responsive and relevant to Daniel’s basic criticism. But it was not an unadorned conclusion. And was it “controversial” and/or “disturbing”? Huh? As a response to our larger project, describing a practical, affordable, manageable enterprise search solution for a non-profit organization of our scale and purpose — you know, that pesky thing mentioned in the first paragraph of the post about “context” — I don’t see much.

By way of context, elsewhere we documented (empirically, by the way) what our intended GSA targets are, all of which are text-based documents. (I loved the implicit trivialization in Ron’s comment, “If you have a bunch of Word documents…” If you like that, you’re going to really love our 67% WordPerfect collection!) Currently, we have nothing intentionally targeted in our structural taxonomy that is not a text-based object. We do anticipate adding a limited number of audio and video files to the mix down the road, and we have always understood that in those cases we would manually add metadata. That’s not a big deal on our project. It would be if we were to attempt “100% manual annotation” of our targets, which Daniel and others agree is a non-starter. Another piece of the project context is our creation of specific, peer-reviewed document collections, among several others, that we now maintain in our domain’s Google Sites, also targeted by our GSA. (We are huge fans of Google Apps generally, which is a phenomenal, zero-cost option for non-profit organizations like ours, and we are very pleased with the results we are getting from crawling our Google Sites, as well as our Google discussion groups.)

Ron’s other point about the significance of “folksonomies, tagging, tag clouds, rating systems” is dead-on. We’re totally on board with that. We have that on our radar but it is not a current priority given the time and resources we can bring to our particular project. Daniel, high-fives for automated metadata generation! But do we really need it on a project like ours? At what cost?

There’s the rub. I am simplifying here for the sake of discussion, but what can a ragtag non-profit that serves the legal needs of low-income clients do to implement an effective enterprise search solution? You nailed us on empirical search design. But we came up with an actual enterprise search solution for a type of organization that cannot readily do so. And given how that solution overwhelmingly is about text-based targets, do WE need metadata?

That’s a fair question to ask in our world and on our project. You know our conclusion. That is not the same thing as saying there is no need for metadata, the proposition that the comments seem directed toward.



I appreciate the time you took and the impassioned defense of the post but I think you miss the point of at least my comment.

I am not suggesting you must have a formal test procedure comparing x, y, & z. There are many reasons why that is not reasonable (money being only one of them), and many researchers make comments on their work that are not backed by formal study and empirical results. But when they do, they state that the results are anecdotal and not based on a set of rigorous tests resulting in something quantifiable.

To make the statements I find troubling acceptable, all one has to do is state that, though you did not do empirical testing, anecdotally your users seem to be doing fine without metadata.

Questioning whether you need metadata is fine as well, but you do not know how much better your user experience might be with it, since you have not done the tests to prove that hypothesis. So making it sound like not needing metadata is a definitive result is incorrect.

Hope that makes sense to you,




Brian, I first want to commend you for joining the discussion here. And I think what happened here is that, even though you restricted your conclusions to the context of your project, you made the bold and less qualified statement: “Because improvements in search algorithms are such that metadata is not needed to help the search.” I hope you understand that, for some of us, those are fighting words! 🙂


Christopher, that does make perfect sense to me. I should have made it clear in the original post that we came to our conclusion based on anecdotal evidence, however worthy it may be, and I did not. Lesson learned, my friend.


Thanks Brian.

I think the beauty of the community Daniel is fostering here is that it allows for many points of view. While we might not always agree, it’s a great place to hash ideas out. In the end we’re all working towards a similar goal: better experiences for users.

I also want to applaud your project, which I should have done before; the mission your team has set is very admirable. It’s gratifying to see technology being put to use to help people who might not otherwise have access to the same advances we do.


It is slightly beside the point, but I happened to read about Drupal implementing support for RDFa, a method for embedding semantic metadata in XHTML pages. A similar RDF module is already in use by OpenCalais.

A problem with many production systems today (including CMSs) is that so much valuable metadata is discarded along the way. Just compare a digital multitrack music recording, which contains metadata about all the individual instruments, to a CD with its meager two-channel mix. Reverse engineering metadata costs a lot more than just preserving what already exists.

Metadata matters, and it doesn’t have to be painful.


Indeed, one of the best things we can do to make metadata less painful is to embrace the fact that content is naturally semi-structured. I know that for some people, the big question is RDF vs. XML. But let’s start by agreeing that a data model should at least support a mix of structured and unstructured data, neither forced into a rigid schema nor de-boned into a blob of unstructured text.
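As a minimal sketch of that idea (hypothetical field names, nothing more than a filter over dicts): a record carries a few typed fields alongside an unstructured body, and a query can combine a structured filter with free-text matching.

```python
# Hypothetical semi-structured document records: structured fields
# (author, tags) plus an unstructured free-text body.
docs = [
    {"author": "Alice", "tags": ["pharma"], "body": "Q3 pharma market overview."},
    {"author": "Bob",   "tags": ["legal"],  "body": "Pharma compliance checklist."},
]

def search(docs, text=None, tag=None):
    """Combine a structured filter (tag) with an unstructured full-text match."""
    hits = []
    for d in docs:
        if tag is not None and tag not in d["tags"]:
            continue
        if text is not None and text.lower() not in d["body"].lower():
            continue
        hits.append(d["author"])
    return hits

print(search(docs, text="pharma"))               # both bodies mention pharma
print(search(docs, text="pharma", tag="legal"))  # the structured tag narrows the hits
```

The point is that neither side of the record had to be forced into the other’s shape: the tags stay queryable, and the body stays free text.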


This sentence needs to be the first slide, statement, chant at the beginning of every meeting of SemWeb practitioners.

“Let’s start by agreeing that a data model should at least support a mix of structured and unstructured data, neither forced into a rigid schema nor de-boned into a blob of unstructured text.”

I use a mixture of both and while I’m still not convinced RDF is the ideal solution for storing general knowledge (like in Wolfram Alpha) it is ideal for many tasks.

I’m also a member of the Data Portability group and a big believer in lightweight semantic structures like Microformats.

I still see NLP/ML techniques as the most scalable method of acquiring semantic data which can then be put into a formalized structure.
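A toy sketch of that pipeline, with the NLP step faked by a regex and the triple format merely illustrative (not real RDF): unstructured text goes in, (subject, predicate, object) structures come out.

```python
import re

def extract_triples(text):
    """Toy 'NLP' step: pull 'X is a Y' patterns from free text and
    emit (subject, predicate, object) triples."""
    pattern = re.compile(r"(\w+) is a (\w+)")
    return [(subj, "is_a", obj) for subj, obj in pattern.findall(text)]

triples = extract_triples("OpenCalais is a service. RDF is a standard.")
print(triples)  # [('OpenCalais', 'is_a', 'service'), ('RDF', 'is_a', 'standard')]
```

A real system would replace the regex with statistical extraction and map the output into a formal vocabulary, but the shape of the hand-off is the same.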

I’m also hoping that, whatever the structure, RDF or otherwise, we support full first-order logic for inferencing. I think it’s a mistake, and lessens the quality of the structured data, to allow ambiguous language into the inference side of things.

So in a nutshell: More metadata good. Lightweight semantic structures like Microformats good. Formal inferencing language good. NLP/ML acquisition good. This topic very good.

Just my 2 cents.

