Metadata Interest Group Meeting at ALA Midwinter 2010

Links to presentation slides are available on the ALA presentation wiki

eXtensible Catalog: Metadata Services Toolkit
Jennifer Bowen, University of Rochester

Jennifer Bowen presented on the Metadata Services Toolkit (MST), one of three modules under development by the University of Rochester as part of the eXtensible Catalog (XC – http://www.extensiblecatalog.org). There is also a new non-profit being developed to support the eXtensible Catalog: the eXtensible Catalog Organization.

Three main categories of XC:

  • User interface
  • Metadata management tools – this presentation focused on the MST
  • Connectivity

Bowen displayed a diagram showing the different layers of the XC architecture. It uses different toolkits for its different functions. The OAI toolkit allows MARC records to be extracted from an ILS via OAI-PMH (many ILSs are not OAI-harvestable on their own). The NCIP toolkit handles circulation data, and the Drupal toolkit provides the user interface. OAI-PMH was selected as the method to pull records out of ILSs because it is a standard protocol in wide use and can be synchronized to automatically pull in updates, new records, and deletes; in general it is easy to use and easy to replicate.
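
The incremental nature of OAI-PMH is what makes that synchronization possible: a harvester can ask only for records added, changed, or deleted since a given date, and follow resumption tokens through large result sets. The sketch below is a minimal, hypothetical illustration of such a harvest loop in Python; the endpoint URL and metadata prefix are placeholders, not the XC OAI toolkit's actual configuration.

    # Sketch of incremental OAI-PMH harvesting; the endpoint and metadata
    # prefix below are placeholders, not XC's actual configuration.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="marc21", from_date=None):
        """Yield each <record> element, following resumption tokens."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # only records added/changed/deleted since this date
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            tree = ET.parse(urllib.request.urlopen(url))
            for record in tree.iter(OAI + "record"):
                yield record  # deleted records carry status="deleted" in their header
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # Example: pull everything changed since the last harvest
    # for rec in harvest("http://example.org/oai", from_date="2010-01-01"):
    #     process(rec)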

Bowen provided some background on the MST:

  • It is based on ideas from Diane Hillmann, Stuart Sutton, and Jon Phipps.
  • It enables libraries to automatically process batches of metadata.
  • It can be used by other front-end systems (e.g., Summon, Primo); it is designed to serve as a middle layer for any system that can harvest data.
  • It’s a pipeline: it will continue to feed data and keep changes current.
  • It has services to clean up and normalize inconsistent metadata; the current focus is on MARC and DC, but it can ultimately be extended to any XML metadata.
  • It has its own faceted interface.
  • But it is not a metadata editor; records cannot be edited or changed within it. Problems can be cleaned up in the original repository and then reimported.

There are four major functions of the toolkit:

  1. Add repositories
  2. Schedule harvests
  3. Orchestrate services
  4. Browse records

There are also four major services in the MST. Again, the current focus is on MARC, but ultimately they should work with many different types of metadata.

  1. Normalization of MARCXML and DC
    o Metadata stays in the same schema
    o Can correct frequent errors, e.g., OCLC numbers formatted in different ways (see the sketch after this list)
    o Prepares metadata for use in other applications
  2. Transformation
    o Automates the transformation from one schema to another
    o Can incorporate any existing XSLT crosswalks
    o Right now, Rochester is focused on transforming from MARC to the XC schema
    o Maintains the relationships between input records, e.g., bib records and holdings records
    o The XC schema is a FRBR-based schema, so relationships are very important: one input record results in several output records
    o Creates additional work/expression records for MARC analytics (7XX)
  3. Aggregation
    o Aggregates records that represent the same resource; the focus is currently at the manifestation level of FRBR.
    o Manages relationships between FRBR levels
    o Enables automated synchronization of updates for records at each FRBR level
  4. Authority control
    o This area was covered in a different talk at ALA
    o Works with MARCXML and Dublin Core.
    o Matches headings against a MARCXML authority file and populates records with identifiers
    o A tool for reviewing likely and unlikely matches is under development this year.
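
To make the normalization service (item 1 above) concrete, here is a minimal, hypothetical Python sketch of the kind of rule it applies: rewriting variant OCLC number forms in MARCXML 035 $a subfields to one canonical form. The function and the matching rules are invented for illustration and are not the MST's actual implementation.

    # Hypothetical normalization rule: canonicalize OCLC numbers in 035 $a.
    # The record stays in MARCXML, in keeping with the schema-preserving idea.
    import re
    import xml.etree.ElementTree as ET

    MARC = "{http://www.loc.gov/MARC21/slim}"

    def normalize_oclc_numbers(record):
        """Rewrite every 035 $a so variants like 'ocm12345', 'OCoLC12345',
        and '(OCoLC)ocn12345' all become '(OCoLC)12345'."""
        for field in record.iter(MARC + "datafield"):
            if field.get("tag") != "035":
                continue
            for sub in field.iter(MARC + "subfield"):
                if sub.get("code") != "a" or not sub.text:
                    continue
                m = re.search(r"(?:\(OCoLC\))?\s*(?:ocm|ocn|ocl7|OCoLC)?\s*0*(\d+)",
                              sub.text, re.IGNORECASE)
                if m:
                    sub.text = "(OCoLC)" + m.group(1)
        return record

    # record = ET.parse("bib.xml").getroot()
    # normalize_oclc_numbers(record)
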

Bowen proceeded to give a demonstration of the MST in action.

  • In the example, she had repositories linked to a Cornell DSpace instance and a Voyager test database, which were set up to harvest. The user can specify the repository, frequency, time, etc. for harvesting.
  • Can run services on harvested metadata. Current services in MST are normalization and transformation. Other services can be added.
  • The Browse Records tab allows catalogers to review records (it is not a public interface); records can be reviewed by repository, harvest, schema, and errors.
    o Reminder: errors cannot be fixed in the MST; they need to be fixed in the native repository
    o Records can be viewed in the original schema as brought in, along with their successors after transformation
  • Log of harvesting and services
  • Different groups and permissions can be set up in the MST, along with different methods of authentication (email, LDAP)
  • The MST is relatively straightforward to install

Current status of XC/MST

  • Available for free download via www.extensiblecatalog.org
  • Initial MST services prepare MARC data for reuse in the FRBRized XC user interface, and the resulting records contain some RDA elements
  • Ongoing development includes testing with a range of data, performance work, code refinement and documentation

Summary: The MST can serve as an open-source platform to manipulate XML metadata. It automates the processing of large batches of metadata, synchronizes metadata from multiple repositories, facilitates the manipulation and aggregation of FRBRized data, and in general lowers the bar for metadata processing.

The presentation ended with some questions about XC. There is ongoing work to bring in authority records and add identifiers to them, and work to provide more work-level matching across FRBR levels. The aggregation service currently matches at the manifestation level. Bowen also reminded users that XC does not deal with acquisitions data, as it is not an ILS; rather, it takes data out of the ILS.

Massive Metadata Mashup
Roy Tennant, OCLC

Tennant presented on research he has been involved in that grew out of his own personal interests, to quote, “This is an ad-hoc project that defies natural laws and civil society.”

First, he answered the question, “What is a massive metadata mashup?” It is taking different piles of metadata, smashing them together, and processing them: MARC records, storage data, and HathiTrust metadata. It is an attempt to answer the question of how we manage print and digital together: what do we keep in print in relation to our digital collections?

Background on the project

  • HathiTrust is a collaborative organization for storing books digitized by Google: http://www.hathitrust.org/
  • This research supports the Cloud Library Project
    o Research libraries have offsite storage
    o What is the overlap between offsite storage and content in HathiTrust?
    o Will this intersection create new operational efficiencies? For which libraries and under what conditions? How soon? What impacts?
  • HathiTrust exposed data about their content early on: a set of 13 elements output as a tab-delimited file, which included:
    o Identifiers
    o Access
    o Rights
    o Control numbers
  • Tennant transformed the simple HathiTrust metadata into XML and created a simple database on a personal server, loading new data each month: http://roytennant.com/proto/hathi/ (a sketch of this kind of conversion appears after this list)
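
As an illustration of that last step, a minimal sketch of converting a tab-delimited metadata export into simple per-row XML records might look like the following. The column names and file layout are invented for this example; they are not HathiTrust's actual 13-element export.

    # Hypothetical sketch: turn a tab-delimited metadata export into simple XML,
    # one <record> per row. Column names are invented for illustration.
    import csv
    import xml.etree.ElementTree as ET

    COLUMNS = ["identifier", "access", "rights", "control_number"]  # assumed subset

    def rows_to_xml(tsv_path, xml_path):
        root = ET.Element("records")
        with open(tsv_path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                rec = ET.SubElement(root, "record")
                for name, value in zip(COLUMNS, row):
                    ET.SubElement(rec, name).text = value
        ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

    # rows_to_xml("hathi_full.tsv", "hathi.xml")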

Workflow

  • Download HathiTrust metadata
  • Enhance with more OCLC numbers (the goal is to have an OCLC number in every HathiTrust record): link up via ISBN/ISSN/LCCN to an OCLC number (see the sketch after this list)
    o Extract OCLC numbers to match to WorldCat data, currently saving for later use
  • Explode into individual XML records
  • Index into proto database
  • Tools used: Perl, XML, Swish-e, XSLT, xsltproc
  • Data structure
    o WorldCat specific data
    o Hathi specific data
    o Storage information data
  • Looking at the data: some clean-up issues can be found through reports, and odd items are reported to HathiTrust
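
A hypothetical sketch of the OCLC-number enhancement step referenced above: given lookup tables keyed by ISBN, ISSN, and LCCN (standing in here for the WorldCat-derived data the project actually matched against), fill in missing OCLC numbers on records that lack one.

    # Hypothetical enhancement step: fill in missing OCLC numbers by matching
    # on ISBN, ISSN, or LCCN. The lookup dicts stand in for WorldCat-derived data.
    def add_oclc_numbers(records, isbn_to_oclc, issn_to_oclc, lccn_to_oclc):
        """records: list of dicts with optional 'oclc', 'isbn', 'issn', 'lccn' keys.
        Returns the count of records that gained an OCLC number."""
        added = 0
        for rec in records:
            if rec.get("oclc"):
                continue  # already has an OCLC number
            for key, table in (("isbn", isbn_to_oclc),
                               ("issn", issn_to_oclc),
                               ("lccn", lccn_to_oclc)):
                oclc = table.get(rec.get(key))
                if oclc:
                    rec["oclc"] = oclc
                    added += 1
                    break
        return added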

Discoveries from process and lessons learned

  • Collection growth: Hathi has almost doubled in the space of a year
  • 85% in copyright, 15% (762,000 vols) in public domain
  • The humanities currently dominate HathiTrust
  • HathiTrust counts by volumes, so after the process matches volumes to MARC records, the 5.2 million volumes yield 2-3 million MARC records
  • Trying to find the intersection between a collection in the library and the HathiTrust
    o Example: Can a library think about moving items to storage, or getting rid of items, if those titles are in the HathiTrust? A graphical representation as a Venn diagram showed the overlap
    o Overlaps will grow over time
  • Lessons learned
    o Identifiers are essential: in this case using OCLC numbers
    o Standards are great, but not always necessary. In this case, they really didn’t apply, as there was no need to validate
    o When processing large amounts of data, always check your work – one small mistake can have very large consequences – build in checks along the way
    o You may need to revisit initial assumptions
    o Never underestimate the power of the prototype

The question and answer session that followed reiterated that this is an experimental project attempting to generate broad principles for all libraries.

Blogged by Kristin Martin
