Functional Requirements for Bibliographic Records

This is the first of what I hope will be many guest posts. Our guest blogger is Kelley McGrath, a Cataloging and Metadata Services Librarian at Ball State University Libraries, and at my request she’s supplying a perspective that I feel is crucial for anyone interested in HCIR–that of an actual librarian who deals with the realities of cataloging technologies.

THE PROBLEM

Do people really judge books by their covers? If not, then why is the Penguin edition of A Passage to India rated 5 stars while the Penguin Classics edition gets only 3? A recent BBC blog entry asks this question and concludes that bookstore and library records for books (and other things) are designed largely to support inventory tasks and are based on identifiers like ISBNs that refer to particular editions. The BBC points out that what we sometimes really need are what they call “cultural identifiers” for cultural artifacts: identifiers that point to one place even if you’re talking about the large-print or Spanish translation and I’m reading the standard English paperback.

A POSSIBLE SOLUTION

In fact, libraries and the publishing world are well aware of this problem. Since I’m in the library world and work primarily with cataloging moving images (film, TV, video), I’m going to talk from that perspective. The library world’s proposed solution is an entity-relationship conceptual model called the Functional Requirements for Bibliographic Records (FRBR, often pronounced “ferber”). FRBR divides the bibliographic universe into four main entities, which, from most abstract to most concrete, are the following (a rough code sketch of the model follows the list):

Work: This is the BBC’s “cultural artifact” or the abstract commonality of something that is considered the same essential creation. In most cases, this is clear-cut, but there can be disagreements about where to draw the boundaries. For example, is Gus Van Sant’s frame-by-frame remake of Psycho a new work or is it an expression of Hitchcock’s work?

Expression: These are versions that vary in content in some significant way. If you actually want to get your hands on something, differences in expression are important. Examples of expressions include language translations (dubbed versions, subtitles), accessibility modifications (captions, audio descriptions), widescreen vs. full screen, colorized versions of black-and-white films, and theatrical release vs. uncut/unrated/director’s cut versions. Not all differences in expression may be practical to track (e.g., the various video versions of Star Wars).

Manifestation: This is more or less the same as what libraries or bookstores keep track of now—generally, a particular published edition or the set of items that have the same characteristics for ordering purposes, e.g., the Warner Home Video DVD with a certain ISBN released in a certain year. From a library perspective VHS vs. DVD vs. Blu-ray, as well as publisher names and publication dates, are manifestation-level attributes.

Item: This is the particular DVD that has a certain barcode on it that you’ve checked out from the library and neglected to bring back on time so now the library wants to charge you a fine.
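
To make the hierarchy concrete, here’s a minimal sketch of the four entities as a data model in Python. The specific attributes are illustrative assumptions drawn from the examples above, not part of the FRBR specification:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Work:
    """The abstract creation, e.g., Hitchcock's Psycho."""
    title: str
    director: str  # a work-level attribute for moving images
    expressions: List["Expression"] = field(default_factory=list)

@dataclass
class Expression:
    """A version that varies in content, e.g., a dubbed widescreen cut."""
    language: str
    aspect_ratio: str  # "widescreen" vs. "full screen"
    cut: str           # "theatrical" vs. "director's cut"
    manifestations: List["Manifestation"] = field(default_factory=list)

@dataclass
class Manifestation:
    """A published edition, e.g., a 2008 Warner Home Video DVD."""
    publisher: str
    year: int
    format: str  # "DVD", "VHS", "Blu-ray"
    items: List["Item"] = field(default_factory=list)

@dataclass
class Item:
    """The physical copy with the barcode, the one accruing fines."""
    barcode: str
    checked_out: bool = False
```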

Particularly as larger, shared library catalogs have become more common, the multi-level FRBR model could be used to present options to users in a more succinct and usable manner. Wouldn’t it be easier to see one basic overview record for Hamlet, with choices for versions and availability, rather than a long list of records for different editions of Hamlet with little on the initial hit list page to differentiate them?

STEPS TOWARD THE SOLUTION

So the library world has the challenge of seeing if we can get from where we are now to a FRBR-based model. What we have are records at the manifestation (published edition) level, e.g., the 2-disc special edition released by Paramount on DVD in 2008. This record probably includes information from other levels of the FRBR model such as the director’s name (work level) or the fact that it’s the full screen version (expression level). Unfortunately, these various bits of information are intermingled, not clearly identified in terms of what level they apply to, and often given in free text notes which are hard to analyze or turn into controlled data.

To get from where we are now to a world where the multi-level FRBR model is truly useful, there are a few things I think we need. Again, I am going to talk in terms of film and video, as that is what I work with, and I am only going to talk in terms of two levels, work and manifestation (published edition). The two levels are a more practical first step. For most materials this may be a viable approach even in the long run: the characteristics that identify the expression (version) have to be identified and verified for each new manifestation (published edition) anyway, and most of them can be coded in machine-readable form, so expressions (versions) could be calculated automatically, as the sketch below illustrates.
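
Here is a minimal sketch of that idea, under the assumption that manifestation records carry coded, version-identifying attributes (the attribute names are hypothetical): expressions simply fall out as groups of manifestations that agree on those values.

```python
from collections import defaultdict

# Version-identifying attributes, assumed to be coded on each
# manifestation record (the attribute names here are hypothetical).
EXPRESSION_ATTRIBUTES = ("language", "aspect_ratio", "cut")

def compute_expressions(manifestations):
    """Group manifestation records into expressions (versions): two
    manifestations belong to the same expression if they agree on all
    version-identifying attributes."""
    expressions = defaultdict(list)
    for m in manifestations:
        key = tuple(m[attr] for attr in EXPRESSION_ATTRIBUTES)
        expressions[key].append(m)
    return dict(expressions)

manifestations = [
    {"language": "en", "aspect_ratio": "widescreen", "cut": "theatrical",
     "publisher": "Paramount", "year": 2008, "format": "DVD"},
    {"language": "en", "aspect_ratio": "widescreen", "cut": "theatrical",
     "publisher": "Paramount", "year": 2010, "format": "Blu-ray"},
    {"language": "es", "aspect_ratio": "full screen", "cut": "theatrical",
     "publisher": "Paramount", "year": 2008, "format": "DVD"},
]
print(compute_expressions(manifestations))  # two expressions, not three
```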

1. Work sets

The existing manifestation (published edition) records have to somehow be related to the work or works they contain. This is generally done by grouping the records into work sets or by linking the relevant manifestations to work records.

OCLC, a large nonprofit membership organization that provides library services and research, has developed what is probably the best-known clustering-based FRBR algorithm. OCLC’s algorithm uses several approaches, the most common of which keys on primary author and title (sketched below). Because library rules consider most moving images to be works of mixed responsibility without a primary author, this approach works less well for them. The sets the algorithm creates are sometimes closer to the expression (version) level than to the work level, and the data displayed to the end user is derived algorithmically from the set of manifestations.
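
To give a flavor of the approach (this is an illustration, not OCLC’s actual algorithm), here’s a crude work-set clustering keyed on a normalized primary author and title; the normalization rules are simplifying assumptions:

```python
import re
from collections import defaultdict

def work_key(record: dict) -> tuple:
    """Build a crude work-set key from primary author and title.
    Real algorithms also consult authority files and uniform titles."""
    def normalize(s: str) -> str:
        s = s.lower()
        s = re.sub(r"[^a-z0-9 ]", "", s)    # drop punctuation
        s = re.sub(r"^(a|an|the) ", "", s)  # drop leading articles
        return " ".join(s.split())
    return (normalize(record.get("author", "")), normalize(record["title"]))

def cluster_manifestations(records: list) -> dict:
    """Group manifestation records into candidate work sets."""
    work_sets = defaultdict(list)
    for record in records:
        work_sets[work_key(record)].append(record)
    return dict(work_sets)

# Records without a primary author (most films) collapse onto title alone,
# which is why this style of clustering works less well for moving images.
records = [
    {"author": "Forster, E. M.", "title": "A Passage to India"},
    {"author": "Forster, E. M.", "title": "A passage to India."},
    {"author": "", "title": "Psycho"},  # Hitchcock's? Van Sant's? Can't tell.
]
print(cluster_manifestations(records))
```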

LibraryThing, a social book cataloging site, suggests possible combinations but relies on human intervention to create its work clusters, based on what founder Tim Spalding calls the “cocktail party” test: would two people in casual conversation think they’re talking about the same book? LibraryThing does include basic work-level records, parts of which appear to be surfaced from the manifestation clusters and parts of which are entered manually by users.

Both OCLC and LibraryThing offer services that create work sets on the fly based on an ISBN (xISBN and thingISBN, respectively). Manifestation (published edition) records that include multiple works can be particularly problematic for these services.
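
As an illustration, here’s a minimal sketch of calling LibraryThing’s thingISBN service from Python; the endpoint and the XML response shape follow the service’s public documentation, but treat both as assumptions:

```python
import urllib.request
import xml.etree.ElementTree as ET

def thing_isbn(isbn: str) -> list:
    """Return the ISBNs that LibraryThing places in the same work cluster.

    The endpoint and the <idlist>/<isbn> response shape follow the
    service's documentation, but treat both as assumptions.
    """
    url = f"http://www.librarything.com/api/thingISBN/{isbn}"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [element.text for element in tree.getroot().iter("isbn")]

print(thing_isbn("0441172717"))  # example ISBN; any edition's ISBN works
```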

2. Work Identifiers

As the BBC noted, we need some stable way to identify and refer to works. Both OCLC and LibraryThing offer work identifiers of a sort. OCLC’s database is more comprehensive, but its identifiers are less reliably at the work level. Ed Summers recently pointed out that these identifiers only provide human-readable data and not data useful for machines.

3. Work Records

In the long run there are serious limitations to the clustering approach and to displaying work-level information created by extracting and analyzing data in sets of manifestation (published edition) records. For one thing, if information is automatically generated from clusters, it is difficult to correct errors that sneak in. This problem may be particularly prevalent in the library world due to the practice of copying much information on new records from previous editions without verification. It is also redundant to re-enter and store all this data in multiple manifestation records when it would be more efficient to assess and maintain it in a single work-level record. In addition, a single work-level record would present consistent information to all users. Currently, manifestation records may be more or less complete and may give conflicting information about the same work, so what a user finds is arbitrarily influenced by the particular records that happen to be in the library catalog being searched.

A group that I am part of, OLAC (Online Audiovisual Catalogers), has created a task force to examine what it might mean to create work-level records for moving images in a library context and also to what extent we might be able to leverage existing library data. We did a project to try to extract work-level data from existing manifestation (published edition) records. Lynne Bisko and I have an article about our experience in The Code4Lib Journal. Our basic idea is that by extracting what we can from existing library records, possibly in combination with information from external data sources, we can create basic, good-enough work-level records that can be corrected and maintained by human beings. These work-level records could then be linked to our existing manifestation-level records. This data could be used to create an interface to provide better access to moving images both in terms of the work (director, cast, original date, language, and country of production, place and time period of setting, etc.) and in terms of the characteristics of the expression and manifestation that will help users easily select the best items for their needs (language(s) of audio and subtitle tracks, captioning, DVD vs. VHS vs. Blu-ray, availability, etc.).
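
As a rough illustration of the extraction idea (not the task force’s actual code), here is a sketch using the pymarc library to pull candidate work-level fields out of a file of manifestation records; the choice of fields and the file name are simplifying assumptions:

```python
from pymarc import MARCReader  # pip install pymarc

def first_subfield(record, tag, code):
    """Return the first $code subfield of the first `tag` field, or None."""
    for field in record.get_fields(tag):
        subfields = field.get_subfields(code)
        if subfields:
            return subfields[0]
    return None

def extract_work_data(marc_path):
    """Pull candidate work-level fields out of manifestation records.
    The field choices (245 title, 700 names, 008 date) are simplifying
    assumptions; real records need much more careful parsing."""
    works = []
    with open(marc_path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # skip records pymarc couldn't parse
                continue
            # Added-entry personal names; for video, directors and cast
            # usually appear here rather than in a main-entry 100 field.
            names = [f.get_subfields("a")[0]
                     for f in record.get_fields("700") if f.get_subfields("a")]
            fixed = record.get_fields("008")
            works.append({
                "title": first_subfield(record, "245", "a"),
                "names": names,
                # Bytes 7-10 of the 008 hold a date, usually the publication
                # date of this edition rather than the original release date.
                "date": fixed[0].data[7:11] if fixed else None,
            })
    return works

print(extract_work_data("manifestations.mrc"))  # hypothetical file name
```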

The Guardian Gets Openness

Now that the Guardian Open Platform is live, I wanted to share some first impressions. Full disclosure: the Guardian is an Endeca customer. Still, my impressions are my own.

What the Guardian has released are a Content API and a Data Store: sets of data made publicly available for free. Here is the gem:

The APIs will feature ‘full fat’ feeds with full articles and other content including video, audio and photo galleries, some one million pieces of content published on guardian.co.uk from 1999-2008.
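
Here’s a minimal sketch of what querying the Content API looks like from Python. The endpoint, parameters, and response shape follow the API’s public documentation (the free “test” key is rate-limited and meant for experimentation), but treat the specifics as assumptions rather than as part of today’s announcement:

```python
import json
import urllib.parse
import urllib.request

def search_guardian(query, api_key="test"):
    """Search the Guardian Content API and return (title, url) pairs.
    Endpoint and response shape are assumptions based on the API's
    published documentation."""
    params = urllib.parse.urlencode({"q": query, "api-key": api_key})
    url = f"https://content.guardianapis.com/search?{params}"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return [(item["webTitle"], item["webUrl"])
            for item in payload["response"]["results"]]

for title, url in search_guardian("open platform"):
    print(title, url)
```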

Of course, the Guardian’s decision to open up its APIs invites inevitable comparisons to the New York Times and its own recent API releases. But I think the Guardian is taking its effort a significant step further. The New York Times has released its full archival content only under non-commercial terms. Its article search and newswire APIs are nice, but they aren’t full fat feeds. Perhaps the closest comparison would be to Reuters Spotlight, but that too is a non-commercial effort.

What the Guardian has done right is to offer openness in the context of commercial use. Here is the relevant section of their terms and conditions:

8. Advertising and Commercial Use

(a) If requested, you will as a condition of your licence to publish OPG Content, display on Your Website any advertisement that we supply to you together with the relevant OPG Content. You shall comply with our instructions regarding the position, form and size of such advertisements on Your Website. Such instructions may be notified to you directly or posted on the OPG Site.

(b) You may attach third party advertising to Your Website, which includes OPG Content, without accounting to us for any share in the revenue generated by such advertising, provided that:
• You do not associate OPG Content, directly or indirectly, with advertisements or advertisers that could be regarded by us as illegal or discriminatory.
• You comply with any additional restrictions that we may introduce from time to time as part of the OPG Terms.

(c) You may not syndicate or otherwise charge a fee for access to OPG Content.

That strikes me as eminently reasonable.

I’ve been looking forward to this launch for a while–unfortunately, my inside knowledge meant that I couldn’t be entirely open myself! But today I’m proud to see the Guardian continuing its tradition of leading the way in online media.

Find Out More about the Guardian Open Platform!

If you’re reading this, then it’s at least 9am GMT, and you should be able to learn more about the Guardian Open Platform at http://www.guardian.co.uk/open-platform. That’s a bit early on this side of the pond, but I promise to share more details and impressions once I’m awake and have a chance to gather them!

Guardian Launching Open Platform

The Guardian, an internationally acclaimed newspaper (and a long-time Endeca client!) that has been a major force in the United Kingdom for 180 years, is launching an open platform tomorrow. The Guardian has led the media in openness, making the unprecedented decision last fall to offer the full text of its articles in its RSS feeds.

I’ll report more about the new platform once there are more details that I can share publicly.

A New Kind of Marketing (NKM)

The blogosphere is abuzz with hype about Wolfram Alpha. Stephen Wolfram writes:

It’s going to be a website: www.wolframalpha.com. With one simple input field that gives access to a huge system, with trillions of pieces of curated data and millions of lines of algorithms.

We’re all working very hard right now to get Wolfram|Alpha ready to go live.

I think it’s going to be pretty exciting. A new paradigm for using computers and the web.

That almost gets us to what people thought computers would be able to do 50 years ago!

And Nova Spivack shares his own excitement:

Stephen was kind enough to spend two hours with me last week to demo his new online service — Wolfram Alpha (scheduled to open in May)….

In a nutshell, Wolfram and his team have built what he calls a “computational knowledge engine” for the Web. OK, so what does that really mean? Basically it means that you can ask it factual questions and it computes answers for you….

Think about that for a minute. It computes the answers. Wolfram Alpha doesn’t simply contain huge amounts of manually entered pairs of questions and answers, nor does it search for answers in a database of facts. Instead, it understands and then computes answers to certain kinds of questions.

I haven’t seen this much excitement about a search-related product since the pre-launches of Cuil and Powerset, and we know how those played out. In fairness to Wolfram, however, he did bring us Mathematica, which is a more than legitimate claim to fame.

However, I’m not so persuaded by his more recent accomplishment of publishing A New Kind of Science, a best-seller and 1,200-page coffee table book. Here’s what Wikipedia tells us about its critical reception:

NKS received extensive media publicity for a scientific book, generating scores of articles in such publications as The New York Times, Newsweek, Wired, and The Economist. It was a best-seller and won numerous awards. NKS was reviewed in a large range of scientific journals. Several themes emerged. On the positive, many reviewers enjoyed the quality of the book’s production, and the clear way Wolfram presented many ideas. Many reviewers, even those who engaged in other criticisms, found aspects of the book to be interesting and thought-provoking. On the negative, many reviewers criticized Wolfram for his lack of modesty, poor editing, lack of mathematical rigor, and the lack of immediate utility of his ideas. Concerning the ultimate importance of the book, a common attitude was that of either skepticism or “wait and see”.

If Wolfram has built a breakthrough tool to support information seeking, then he should let it prove itself by unveiling it and letting other people test it. We aren’t talking about some kind of esoteric science that only a few intellectuals can hope to understand. Rather, his product purports to be some kind of search / answer / knowledge engine. It’s 2009, and we’re all used to the general vision. What we’re holding our breath for is execution.

I’m open to the possibility that Wolfram has built something that will change the world. But I’m extremely skeptical, and this hype campaign hardly instills confidence. Apparently he told Nova that the product will launch in May. Two months: not so long to wait to see how well reality matches the hype.

Apologies to Google Reader Users

For some reason that I have still not diagnosed, readers who view the RSS feed for this blog in Google Reader are seeing a handful of bogus entries in the feed–something like this. I don’t know why those entries are showing up, let alone why they seem to congregate at the front of the feed. If anyone has suggestions on how to diagnose or resolve the issue, I’d greatly appreciate it. In the meantime, I apologize for the inconvenience and annoyance.

Is Global the New Local?

I was just reading a nice article by Mike Elgan in Computerworld entitled “Why global is the new ‘local’”.

He starts off by talking about the transformations happening in radio:

“Local” radio stations are going national, and even international. That sounds like an opportunity for the stations — they can now reach a larger potential audience for advertisers. But in reality, it’s a problem. The whole radio business model is built around pandering to local community groups, small businesses, area schools and, above all, local listeners. So how do you pander to the old audience without alienating the new one?

He then goes on to explain how the same problem applies to newspapers:

Now you can get local news anywhere. Look, for example, at Lodi, Calif., a medium-size city of about 63,000 people. (You may recall the town from a 1969 Creedence Clearwater Revival song.)

Search Google News for “Lodi” and there it is: more than 4,000 news stories, organized roughly by importance. Getting Lodi news on Google is faster, cheaper, more comprehensive and, well, better than the local Lodi paper. You can get Lodi news even if you’re in Timbuktu. And, of course, you can get county, state, national and international news everywhere. Even if you’re stuck in Lodi.

And here is the money shot:

What’s really going on is that the Internet is punishing inefficiency.

His analysis strikes me as brutally accurate. As much as I criticize the ad-supported model in general and Google’s role in devaluing online content in particular, I think that Elgan does a great job of explaining what may be one of the news industry’s biggest contributions to its own malaise. Indeed, for all of the hype about hyperlocal news, I suspect that the winners in this market will be news providers or aggregators that don’t focus on local news but rather let users find whatever they want.

In an unsuccessful City Council run, Tip O’Neill received the famous advice from his father that “all politics is local.” That was surely true in the 1930s, but the world has changed a bit in seven decades.

Fittingly, Elgan concludes his article:

Nothing is local anymore. And it’s a huge opportunity. The new mantra should be: Cover local events exclusively, but for a global audience.

Google’s Marissa Mayer on Privacy vs. Transparency

TechCrunch posted a transcript of Charlie Rose interviewing Google Vice President of Search Products and User Experience Marissa Mayer.

Here’s an excerpt I found particularly interesting:

Charlie Rose:
This is a broader philosophical question I want to talk about later. But I mean is there some point in which we know too much about people?

Marissa Mayer:
Well I think that in all cases it’s a tradeoff, right, where you give up some of your privacy in order to gain some functionality, and so we really need to make those tradeoffs really clear to people, what information are we using and what’s the benefit to them? And then ultimately leave it to user choice so the user can decide. And you have to be very transparent about what information you have about that user and how it’s being used.

Charlie Rose:
But it’s also seems to me clearly a product of age and generation, how willing you are to give up privacy and to allow transparency, clearly.

Marissa Mayer:
Sure, absolutely…

That’s a great attitude. I only wish Charlie Rose had fact-checked Google’s actual policy when it comes to transparency. Indeed, Google’s lack of transparency with advertisers, who are its bread and butter, recently cost them $761 and a bunch of bad press. While I’m sure Google can afford the judgment (less than 2.5 shares of GOOG stock at the time of this writing), I hope they see this experience as an opportunity to review their principles.

And, of course, don’t get me started on the lack of transparency in their approach to relevance! For those who haven’t been regular readers, here are two of my recent posts about Google:

Can You Digg It?

According to an article in the LA Times, USocial CEO and founder Leon Hill is bragging that they are “gaming Digg” by letting advertisers buy votes. Sound familiar? When will people figure out that anonymous social voting schemes that don’t offer users control over the social lens are just begging to be gamed?

It’s as Ben Franklin said, “Experience is the best teacher, but a fool will learn from no other.” Emphasis mine.

Jason Adams Explains TunkRank

Jason Adams, who recently won the TunkRank implementation challenge, explains on his blog how he built TunkRank.com. He implemented the algorithm in Ruby using Merb, MySQL, Capistrano, nginx, and ActiveRecord. For more details, check out his post!
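
For those curious about the algorithm itself: TunkRank scores a user’s influence as the expected number of people who will read a tweet that user posts, counting retweets, so the influence of X is the sum, over each follower F of X, of (1 + p · Influence(F)) / |Following(F)|, where p is the probability of a retweet. Here’s a minimal sketch of the fixed-point iteration in Python (a toy graph, not Jason’s Ruby implementation):

```python
def tunkrank(followers, following_counts, p=0.05, iterations=50):
    """Compute TunkRank influence scores by fixed-point iteration.

    Influence(X) = sum over each follower F of X of
                   (1 + p * Influence(F)) / |Following(F)|

    where p is the probability that a reader retweets.
    """
    influence = {user: 0.0 for user in followers}
    for _ in range(iterations):
        influence = {
            user: sum((1.0 + p * influence[f]) / following_counts[f]
                      for f in followers[user])
            for user in followers
        }
    return influence

# Toy graph: followers[x] lists who follows x; following_counts[x] is how
# many accounts x follows. alice follows no one, so her count is unused.
followers = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
following_counts = {"alice": 0, "bob": 1, "carol": 2}
print(tunkrank(followers, following_counts))
```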

Note: he just added a follow-up post: The Road Ahead for TunkRank.