This is the first of what I hope will be many guest posts. Our guest blogger is Kelley McGrath, a Cataloging and Metadata Services Librarian at Ball State University Libraries, and at my request she’s supplying a perspective that I feel is crucial for anyone interested in HCIR: that of an actual librarian who deals with the realities of cataloging technologies.
Do people really judge books by their covers? If not, then why is it that the Penguin version of Passage to India is rated 5 stars while the Penguin Classic version only gets 3? A recent BBC blog entry asks this question and concludes that the problem is that bookstore and library records for books (or other things) are designed largely to support inventory tasks and are based on identifiers, like ISBNs, that relate to particular editions. The BBC points out that sometimes what we really need are what they call “cultural identifiers”: identifiers for cultural artifacts that point to one place even if you’re talking about the large print edition or the Spanish translation and I’m reading the standard English paperback.
A POSSIBLE SOLUTION
In fact, libraries and the publishing world are well aware of this problem. Since I’m in the library world and I work primarily with cataloging moving images (film, TV, video), I’m going to talk from that perspective. The library world’s proposed solution is an entity-relationship conceptual model called the Functional Requirements for Bibliographic Records (FRBR, often pronounced “ferber”). FRBR divides the bibliographic universe into four main entities, which from most abstract to most concrete are:
Work: This is the BBC’s “cultural artifact” or the abstract commonality of something that is considered the same essential creation. In most cases, this is clear-cut, but there can be disagreements about where to draw the boundaries. For example, is Gus Van Sant’s shot-for-shot remake of Psycho a new work or is it an expression of Hitchcock’s work?
Expression: These are versions that vary in content in some significant way. If you actually want to get your hands on something, differences in expression are important. Examples of expressions are things like language translations (dubbed versions, subtitles), accessibility modifications (captions, audio descriptions), widescreen vs. full screen, colorized versions of black and white films, and theatrical release vs. uncut/unrated/director’s cut versions. Not all differences between expressions may be practical to track (e.g., the various video versions of Star Wars).
Manifestation: This is more or less the same as what libraries or bookstores keep track of now—generally, a particular published edition or the set of items that have the same characteristics for ordering purposes, e.g., the Warner Home Video DVD with a certain ISBN released in a certain year. From a library perspective VHS vs. DVD vs. Blu-ray, as well as publisher names and publication dates, are manifestation-level attributes.
Item: This is the particular DVD that has a certain barcode on it that you’ve checked out from the library and neglected to bring back on time so now the library wants to charge you a fine.
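As a rough illustration, the four levels can be pictured as a nested data structure. The sketch below is only one possible way to write the hierarchy down; the class and field names are assumptions made for this example and are not taken from the FRBR report, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    barcode: str  # one physical copy, e.g. the DVD you checked out


@dataclass
class Manifestation:
    publisher: str
    year: int
    carrier: str  # e.g. "DVD", "VHS", "Blu-ray"
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    description: str  # e.g. "widescreen, English audio"
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def all_items(self) -> List[Item]:
        """Walk down from the work to every physical copy."""
        return [
            item
            for expression in self.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items
        ]


# Hypothetical sample data: one work, one version, one edition, two copies.
example_work = Work(
    title="Psycho",
    expressions=[
        Expression(
            description="widescreen, English audio",
            manifestations=[
                Manifestation(
                    publisher="Example Home Video",  # made-up publisher
                    year=2008,
                    carrier="DVD",
                    items=[Item("B0001"), Item("B0002")],
                )
            ],
        )
    ],
)
```

Starting from the work, a catalog could then list every copy with `example_work.all_items()` without the user having to page through separate edition records.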
Particularly as larger, shared library catalogs have become more common, the multi-level FRBR model could be used to present options to users in a more succinct and usable manner. Would it not be easier to see one basic overview record for Hamlet and choices for versions and availability rather than a long list of records of different editions of Hamlet with not much information on the initial hit list page to differentiate them?
STEPS TOWARD THE SOLUTION
So the library world has the challenge of seeing if we can get from where we are now to a FRBR-based model. What we have are records at the manifestation (published edition) level, e.g., the 2-disc special edition released by Paramount on DVD in 2008. This record probably includes information from other levels of the FRBR model such as the director’s name (work level) or the fact that it’s the full screen version (expression level). Unfortunately, these various bits of information are intermingled, not clearly identified in terms of what level they apply to, and often given in free text notes which are hard to analyze or turn into controlled data.
To get from where we are now to a world where the multi-level FRBR model is truly useful, there are a few things I think we need. Again, I am going to talk in terms of film and video, as that is what I work with. I am only going to talk in terms of two levels, work and manifestation (published edition). I think two levels are a more practical first step. In addition, for most materials this may be a viable approach even in the long run. The characteristics that identify the expression (version) have to be identified and verified for each new manifestation (published edition) anyway, and most of them can be coded in machine-readable form, so expressions (versions) could be calculated automatically.
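To make the idea of calculating expressions concrete, here is a minimal sketch. It assumes each manifestation record already carries a few coded, version-defining attributes; the field names (`audio_languages`, `subtitle_languages`, `aspect_ratio`, `cut`) are hypothetical and do not correspond to any particular cataloging format. Records whose coded attributes match are treated as the same expression, whatever their carrier or publisher.

```python
from collections import defaultdict

# Hypothetical coded attributes that distinguish one version from another.
EXPRESSION_FIELDS = ("audio_languages", "subtitle_languages", "aspect_ratio", "cut")


def expression_key(record):
    """Build a hashable key from the version-defining attributes of a record."""
    key = []
    for name in EXPRESSION_FIELDS:
        value = record.get(name)
        if isinstance(value, (list, set, tuple)):
            # Order of languages should not matter.
            value = tuple(sorted(value))
        key.append(value)
    return tuple(key)


def group_expressions(records):
    """Group manifestation ids by their computed expression key."""
    groups = defaultdict(list)
    for record in records:
        groups[expression_key(record)].append(record["id"])
    return dict(groups)
```

In this sketch, carrier and publisher are deliberately left out of the key, so a DVD and a VHS tape with the same audio, subtitles, aspect ratio, and cut fall into one group.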
1. Work sets
The existing manifestation (published edition) records have to somehow be related to the work or works they contain. This is generally done by grouping the records into work sets or by linking the relevant manifestations to work records.
OCLC, a large nonprofit, membership, computer library service and research organization, has developed what is probably the best-known clustering-based FRBR algorithm. OCLC’s algorithm uses several approaches, with the most commonly occurring one based on primary author and title. Because library rules consider most moving images to be works of mixed responsibility without a primary author, this approach works less well for them. The sets that are created by this algorithm are sometimes closer to the expression (version) level than the work level, and the data that is displayed to the end user is derived algorithmically from the set of manifestations.
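The general idea of clustering on author and title can be sketched as follows. This is a simplified illustration under my own assumptions, not OCLC’s actual algorithm: the normalization rules and the record fields (`id`, `author`, `title`) are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, and drop a leading article."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    words = text.split()
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    return " ".join(words)


def work_key(record):
    """Key a record by normalized primary author (if any) and title."""
    return (normalize(record.get("author", "")), normalize(record["title"]))


def cluster(records):
    """Group manifestation ids into candidate work sets."""
    sets = defaultdict(list)
    for record in records:
        sets[work_key(record)].append(record["id"])
    return dict(sets)
```

With a key like this, two editions of a novel that differ only in punctuation or a leading article end up in the same set. Two films that share a title but have no primary author, such as an original and its remake, also end up in the same set, which illustrates why an author-and-title key is a weaker fit for moving images.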
LibraryThing, a social book cataloging site, suggests possible combinations, but relies on human intervention to create its work clusters based on what founder Tim Spalding calls the “cocktail party” test. This test asks whether two people would think they’re talking about the same book in casual conversation. LibraryThing does include basic work-level records, parts of which appear to be surfaced from the manifestation clusters and parts of which are entered manually by users.
Both OCLC and LibraryThing offer services that create work sets on the fly based on an ISBN (xISBN and ThingISBN). Manifestation (published edition) records that include multiple works can be particularly problematic for these services.
2. Work Identifiers
As the BBC noted, we need some stable way to identify and refer to works. Both OCLC and LibraryThing offer work identifiers of a sort. OCLC’s database is more comprehensive, but its identifiers are less reliably at the work level. Ed Summers recently pointed out that these identifiers provide only human-readable data, not data that machines can use.
3. Work Records
In the long run there are serious limitations with the clustering approach and with displaying work-level information created by extracting and analyzing data in sets of manifestation (published edition) records. For one thing, if information is automatically generated from clusters, it is difficult to correct errors that may sneak in. This problem may be particularly prevalent in the library world due to the practice of copying much information on new records from previous editions without verification. It is also redundant to re-enter and store all this data in multiple manifestation records when it would be more efficient to assess and maintain it in a single work-level record. In addition, a single work-level record would present consistent information to all users. Currently, manifestation records may be more or less complete and may give conflicting information about the same work, so what a user finds is arbitrarily influenced by the particular records that happen to be in the library catalog being searched.
A group that I am part of, OLAC (Online Audiovisual Catalogers), has created a task force to examine what it might mean to create work-level records for moving images in a library context and also to what extent we might be able to leverage existing library data. We did a project to try to extract work-level data from existing manifestation (published edition) records. Lynne Bisko and I have an article about our experience in The Code4Lib Journal. Our basic idea is that by extracting what we can from existing library records, possibly in combination with information from external data sources, we can create basic, good-enough work-level records that can be corrected and maintained by human beings. These work-level records could then be linked to our existing manifestation-level records. This data could be used to create an interface to provide better access to moving images both in terms of the work (director, cast, original date, language, and country of production, place and time period of setting, etc.) and in terms of the characteristics of the expression and manifestation that will help users easily select the best items for their needs (language(s) of audio and subtitle tracks, captioning, DVD vs. VHS vs. Blu-ray, availability, etc.).
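One simple way to picture the extraction step is a majority vote over the manifestation records in a work set. The sketch below is an illustration under stated assumptions, not the method described in the article: the field names in `WORK_FIELDS` are hypothetical, and it assumes the records have already been grouped by work. It keeps the most common value for each field and flags disagreements so that a human can review them.

```python
from collections import Counter

# Hypothetical work-level fields to pull out of manifestation records.
WORK_FIELDS = ("director", "original_year", "country")


def draft_work_record(manifestations):
    """Build a draft work record from the manifestation records of one work."""
    work = {}
    for name in WORK_FIELDS:
        values = [m[name] for m in manifestations if m.get(name)]
        if not values:
            continue  # no record supplies this field; leave it for a human
        counts = Counter(values)
        value, support = counts.most_common(1)[0]
        work[name] = {
            "value": value,
            "support": support,            # how many records agree
            "conflicts": len(counts) > 1,  # True if records disagree
        }
    return work
```

The `conflicts` flag is the design choice that matters here: the result is meant to be a good-enough starting point that people correct and maintain, not a final answer.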