Metadata Interest Group Meeting ALA 2010: Linked Data

The Metadata Interest Group met on Sunday, June 28, and had two speakers.

Summaries of the presentations are below. A link to the full presentations is available at: http://connect.ala.org/node/107906

Linked Data and Controlled Vocabularies on the Web
Rebecca Guenther, Library of Congress

Ms. Guenther described a project underway at the Library of Congress to provide access to its controlled vocabularies using the Resource Description Framework (RDF). First, she gave an overview of controlled vocabularies and their uses. Controlled vocabularies control values, reduce ambiguity, provide for synonym control, allow for validation, and establish formal relationships among terms. They can be simple, like enumerated lists (e.g., a drop-down menu), or complex (e.g., full thesauri with multiple relationships). The Library of Congress (LC) maintains standards that contain controlled vocabularies, including:

  • LCSH/NAF
  • TGM
  • MARC controlled lists (e.g., ISO 639-2 language codes)
  • MODS/METS/MIX/PREMIS controlled lists

Controlled vocabularies are currently represented in a variety of ways:

  • Metadata formats such as MARC authority records
  • XML schemas (e.g., enumerated lists)
  • RDF/XML and RDFS (i.e., semantic web)
  • SKOS (Simple Knowledge Organization System)
  • MADS (Metadata Authority Description Schema, the MODS companion for authority records)

Guenther focused on the use of SKOS at http://id.loc.gov. SKOS is an RDF application used to express knowledge organization systems such as classifications, thesauri, and taxonomies. It allows distributed, decentralized management of SKOS vocabularies through linked data-inspired applications, and it requires that each concept have a uniform resource identifier (e.g., an HTTP URI). The data model in place at id.loc.gov provides a concept scheme; logical groupings of concepts; and labeling, annotation, and associative properties.

SKOS was selected because its defined element set is more relevant to controlled vocabularies than RDF or OWL (Web Ontology Language) alone. MARC authority records are easy to transform into SKOS in a way that shows broader and narrower relationships, and SKOS enables web services that use the URIs.
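
As a rough illustration of that kind of mapping, the sketch below turns a simplified authority record into SKOS-style triples. The record is a hypothetical Python dictionary rather than real MARC, the identifiers and base URI are invented, and the field handling assumes the usual MARC 21 authority conventions (150 for the heading, 450 for see-from tracings, 550 with a $w code of g or h for broader and narrower terms). It is not the conversion LC actually runs.

```python
# Sketch only: a simplified, hypothetical authority record, not real MARC.
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/authorities/"  # placeholder base URI (assumption)

# MARC 21 authority 5XX $w codes: g = broader term, h = narrower term.
RELATION_BY_W = {"g": SKOS + "broader", "h": SKOS + "narrower"}


def record_to_skos(record):
    """Return (subject, predicate, object) triples for one record."""
    subject = BASE + record["id"]
    triples = [(subject, SKOS + "prefLabel", record["150"])]
    # 450 fields are see-from tracings, i.e. non-preferred labels.
    for label in record.get("450", []):
        triples.append((subject, SKOS + "altLabel", label))
    # 550 fields point at other headings; $w says how they relate.
    # A tracing without a recognized code is treated as skos:related.
    for w_code, target_id in record.get("550", []):
        predicate = RELATION_BY_W.get(w_code, SKOS + "related")
        triples.append((subject, predicate, BASE + target_id))
    return triples


example = {
    "id": "x0001",                        # made-up identifier
    "150": "Metadata",
    "450": ["Data about data"],
    "550": [("g", "x0002"), ("", "x0003")],
}

for triple in record_to_skos(example):
    print(triple)
```

Each printed tuple corresponds to one RDF statement; a real pipeline would serialize these as RDF/XML or another RDF syntax rather than printing them.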

Guenther also provided some additional information about “linked data,” a feature of the semantic web in which links are made between resources. It goes beyond hypertext links because it allows links between concepts. Wikipedia describes it as a “term used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web.” Id.loc.gov is a web service for shared vocabularies. It should reduce maintenance and make comprehensive information about controlled terms openly available, and it has been a testing ground for semantic web technologies.
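
To make “dereferenceable URIs” a little more concrete, here is a minimal sketch of how a client might ask for the RDF behind a concept URI using HTTP content negotiation. The concept URI is a placeholder, and the sketch assumes the server honors an Accept header of application/rdf+xml; the request is only built, not sent, so nothing here depends on how id.loc.gov actually responds.

```python
import urllib.request

# Hypothetical concept URI; substitute a real identifier before use.
CONCEPT_URI = "http://id.loc.gov/authorities/EXAMPLE"


def build_rdf_request(uri):
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"}
    )


request = build_rdf_request(CONCEPT_URI)
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))

# To actually dereference the URI (needs network access):
# with urllib.request.urlopen(request) as response:
#     rdf_xml = response.read().decode("utf-8")
```

The same URI requested by a browser would normally return an HTML page, which is the point of the approach: one identifier serves both people and programs.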

Id.loc.gov went live in April 2009, with more vocabularies added in 2010: LCSH, TGM, MARC code list for relators, and PREMIS controlled vocabularies. Data is open and continuously updated, and can be bulk downloaded in RDF. Searches can retrieve terms by ID or label, and multiple vocabularies can be searched at once. A demonstration of the site showed the visualization tree, the suggested terminology tab, and links to similar concepts in other vocabularies, such as the French RAMEAU subject headings.

The data from id.loc.gov has been put to use in several projects, including:

  • University of Pennsylvania online books
  • University of Virginia auto suggest feature
  • Freebase.org
  • National libraries of Sweden and France

In the future LC will be adding some additional vocabularies:

  • MARC code list for language (ISO 639-2) and other ISO 639 lists
  • MARC code list for countries and geographic areas
  • Additional PREMIS controlled vocabularies
  • Name authorities (a challenge because they don’t fit into SKOS very well, so LC is looking at a different markup)

Other future avenues include a MADS OWL schema to enable identification of facets within names and subjects, expanded information on subdivisions, and additional relator terms to enhance existing relationships.

The technical infrastructure for id.loc.gov:

  • Django (Python)
  • LCSH uses MySQL, with SKOS RDF generated at the time of request; it mainly operates like a relational database, with MARC mapped to tables
  • Everything else is in an RDF triplestore (a Python library that uses MySQL), with XML converted to SKOS RDF/XML before ingest
  • Programmatic queries using SPARQL
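
For readers new to triplestores, the sketch below shows the basic idea behind a programmatic query: matching a pattern against stored triples, which is roughly what a simple SPARQL query does. It uses a plain Python set rather than the triplestore library id.loc.gov runs on, and the ex: identifiers and labels are invented for the example.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# A tiny in-memory "triplestore": a set of (subject, predicate, object).
# The ex: identifiers are invented for this example.
TRIPLES = {
    ("ex:animals", SKOS + "prefLabel", "Animals"),
    ("ex:cats", SKOS + "prefLabel", "Cats"),
    ("ex:dogs", SKOS + "prefLabel", "Dogs"),
    ("ex:cats", SKOS + "broader", "ex:animals"),
    ("ex:dogs", SKOS + "broader", "ex:animals"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Roughly: SELECT ?c WHERE { ?c skos:broader ex:animals }
narrower = [
    s for s, _, _ in match(TRIPLES, predicate=SKOS + "broader", obj="ex:animals")
]
print(narrower)
```

A real SPARQL endpoint adds joins across several patterns, filters, and a standard query syntax, but each pattern in a query is resolved in much this way.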


VIVO: A Research-Focused Discovery Tool

Sara Russell-Gonzalez, University of Florida

Russell-Gonzalez discussed VIVO, an open source semantic web application that enables the discovery of research and researchers. It is designed for researchers, students, administrators, and donor/funding agencies, and it provides profiles for researchers. Originally developed at Cornell University Libraries to support the life sciences, it was redesigned in 2007 as a semantic web application and can cover all disciplines. The University of Florida became involved through a 2009 NIH grant to create National Networking with VIVO. VIVO is designed in part to answer the following questions:

  • Researchers don’t visit the library now that resources are online, so how do you know what your researchers are doing and how can you be involved in the research process?
  • How can researchers form collaborations with researchers in other disciplines or students learn about potential advisors?
  • How can administrators know their strengths and weaknesses for strategic planning?

VIVO gathers data from a variety of sources, all of it public. As much as possible is done automatically, drawing from internal and external sources. Because each school is different, each has its own local VIVO instance. Local sources can include the institutional repository, human resources databases, the institutional grants database, the faculty reporting tool, etc. National sources are mostly abstracts and indexes, like PubMed. All data is mapped into an RDF structure. Compliance with semantic web standards enables a national network across all VIVO instances around the country.

Data is stored in RDF triples, with reflexive relationships (i.e., relationships are reflected in both directions). Consequently, the data store grows quickly in size. The VIVO core ontology is used to describe people, organizations, activities, publications, events, interests, grants, and other relationships, with support for local extensions and FOAF (Friend of a Friend). HTTP URIs identify objects, and the data is available through SPARQL endpoints.
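
A small sketch may help show why storing relationships in both directions makes the data grow quickly. The property names below are hypothetical and are not taken from the VIVO core ontology; the point is only that each statement with a paired property produces a second, mirrored triple.

```python
# Hypothetical property pairs; not taken from the VIVO ontology.
INVERSE_OF = {
    "ex:authorOf": "ex:hasAuthor",
    "ex:memberOf": "ex:hasMember",
}
# Make the lookup work in both directions.
INVERSE_OF.update({v: k for k, v in INVERSE_OF.items()})


def add_with_inverse(store, subject, predicate, obj):
    """Add a triple and, if the predicate has an inverse, its mirror."""
    store.add((subject, predicate, obj))
    inverse = INVERSE_OF.get(predicate)
    if inverse is not None:
        store.add((obj, inverse, subject))


store = set()
add_with_inverse(store, "ex:person1", "ex:authorOf", "ex:article1")
add_with_inverse(store, "ex:person1", "ex:memberOf", "ex:dept1")
add_with_inverse(store, "ex:person1", "ex:label", "A. Researcher")

for triple in sorted(store):
    print(triple)
print(len(store), "triples stored for 3 statements")
```

The benefit of the extra triples is that a profile page for the article or the department can list the person without a reverse lookup, at the cost of a larger store.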

There are multiple challenges to using the semantic web approach, such as determining the level of granularity, scalability as the database grows, provenance of the data, and keeping data up-to-date. Disambiguation, particularly of authors, may be one of the biggest challenges. From a political standpoint, determining when data should be removed is another issue (e.g., what happens when a faculty member leaves?).

There is one year left on the project with some upcoming enhancements:

  • Adding the ability to produce CVs and biosketches
  • Forming collaborations with publishers to bring in additional external data sources
  • Developing visualization capabilities

VIVO is still looking for schools to get involved, for data providers, and for application developers to interface with VIVO. The first national conference for VIVO is August 12-13 at the New York Hall of Science. VIVO’s website is http://vivoweb.org/

Reported by Kristin Martin
