ALA Annual 2019 Presentation and Q&A

Hello ALCTS Metadata Interest Group blog! Rick Fitzgerald and Grace Thomas here – we are librarians at the Library of Congress and recently gave a talk about describing web archives at the ALA Annual ALCTS Metadata Interest Group Meeting!

We first want to thank Anna Neatrour and the ALCTS Metadata Interest Group board for having us and, second, everyone who took time out of their busy conference schedule to attend our talk. Web archives are becoming increasingly prevalent in libraries, archives, and scholarly research, so we are excited about the interest in our work. Anna and Tillay invited us to share our slides for anyone who missed the session or would like to review them.

Additionally, we wanted to address some questions at the end for anyone who wasn’t able to attend. Please forgive us for paraphrasing the questions and also for paraphrasing our own answers!

Q&A:

Q: Quality review for web archives is challenging because of the scale. How does your program approach it?

A: We can’t look at everything, so we do as much as we can. This is an issue throughout the web archiving community and there have been efforts to explore automated quality review, including a workshop held by the International Internet Preservation Consortium (IIPC) this year dedicated solely to brainstorming quality review solutions. If those tactics advance into some kind of software, we would love to implement it, but for now we look at reports from the crawler and click through as much as we can.

Q: You mentioned datasets, and retaining provenance information and scope notes is an issue in the community. How does your program handle this?

A: This is also a community-wide issue, and we have varying levels of provenance information. First, our curatorial data in Digiboard is the record of selection for a particular URL (in it, the selecting librarian must assign the URL to a collection and provide a justification). Second, once a URL is approved to go to crawl, our team assigns scopes telling the crawler what it is allowed to crawl as part of this URL (for example, social media or CDNs that might host embedded content found through the main URL) and what it is restricted from crawling (we don’t want to crawl all of social media, perhaps just a particular related profile or page). Third, we have the crawl logs from the crawler, which have very rich metadata showing how a certain page came to be captured: for example, response codes from the server at the time of crawl, the MIME type of the crawled resource, the capture timestamp, and the size of the resource. Since we do not have a legal mandate to crawl, our (very) complicated permissions process makes releasing the crawl logs publicly impossible right now. However, with time, perhaps more of this provenance data can become part of publicly released datasets. For now, check out the ones we have publicly available here.
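To make the crawl-log fields mentioned above a little more concrete, here is a minimal Python sketch of reading one log line. The whitespace-separated column order, the sample line, and the names `CrawlLogEntry` and `parse_crawl_log_line` are all assumptions for illustration; they are not the Library's actual log format or tooling.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CrawlLogEntry(NamedTuple):
    captured_at: datetime   # capture timestamp
    status_code: int        # response code from the server at crawl time
    size_bytes: int         # size of the crawled resource
    url: str                # the resource that was captured
    discovery_path: str     # how the crawler reached this resource
    referrer: str           # the page that pointed to it
    mime_type: str          # MIME type of the crawled resource


# Hypothetical sample line; the column order is an assumption.
SAMPLE_LINE = (
    "2019-06-24T15:04:05.123Z 200 5120 "
    "https://example.org/logo.png LE https://example.org/ image/png"
)


def parse_crawl_log_line(line: str) -> CrawlLogEntry:
    """Split one log line into named provenance fields."""
    parts = line.split()
    if len(parts) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(parts)}")
    timestamp, status, size, url, path, referrer, mime = parts[:7]
    captured_at = datetime.strptime(
        timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return CrawlLogEntry(
        captured_at=captured_at,
        status_code=int(status),
        # Some crawlers write "-" when no size is known; treat it as 0 here.
        size_bytes=0 if size == "-" else int(size),
        url=url,
        discovery_path=path,
        referrer=referrer,
        mime_type=mime,
    )


entry = parse_crawl_log_line(SAMPLE_LINE)
print(entry.url, entry.status_code, entry.mime_type)
```

Keeping the fields in a named structure like this is one way a dataset could carry provenance alongside the captured content, if such logs ever became releasable.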

Q: Is Digiboard open-source?

A: Unfortunately, no. Digiboard is a home-grown tool that works specifically for our scale (tens of thousands of URLs crawled at varying frequencies), complicated permissions process, complicated selection process (with more than 200 potential selecting librarians), organizational structure, and current method of quality review. If you wish to begin web archiving, there are subscription-based services that take care of all the behind-the-scenes technical work (maintaining and running the crawler, indexing the content, maintaining the indexes, maintaining a version of the Wayback Machine and specific access points for collections, etc.). Many national and regional libraries, archives, and university libraries throughout the world successfully use these kinds of services to perform web archiving!

Q: How will the sidecar records relate to the minimal records?

A: The sidecar MODS XML files will sit on the same server as the minimal MODS XML files (separate files). During the ETL (Extract Transform Load) process to convert the information from MODS XML into the Library’s Solr index for loc.gov, the two files will be merged into the pages you see on https://www.loc.gov/websites/ based on identical ID numbers.
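As a rough illustration of the merge-by-ID step described above, the following Python sketch combines a minimal record and a sidecar record that share an identifier into one dictionary that could be handed to an indexer. The sample records, the identifier value, the flattening of each top-level MODS element into a list of strings, and the function names are assumptions made for this example; the actual ETL process and Solr schema are not described in this post.

```python
import xml.etree.ElementTree as ET

# The MODS version 3 XML namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample records sharing one made-up identifier.
MINIMAL = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<identifier>example-0001</identifier>"
    "<titleInfo><title>Example Site</title></titleInfo>"
    "</mods>"
)
SIDECAR = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<identifier>example-0001</identifier>"
    "<subject><topic>Elections</topic></subject>"
    "</mods>"
)


def parse_record(xml_text):
    """Return (record_id, fields) for one MODS record."""
    root = ET.fromstring(xml_text)
    record_id = (root.findtext("mods:identifier", namespaces=MODS_NS) or "").strip()
    fields = {}
    for child in root:
        tag = child.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        if tag == "identifier":
            continue
        text = "".join(child.itertext()).strip()
        fields.setdefault(tag, []).append(text)
    return record_id, fields


def merge_records(minimal_docs, sidecar_docs):
    """Merge sidecar fields into minimal records with the same ID."""
    merged = {}
    for xml_text in minimal_docs:
        record_id, fields = parse_record(xml_text)
        merged[record_id] = {"id": record_id, **fields}
    for xml_text in sidecar_docs:
        record_id, fields = parse_record(xml_text)
        if record_id not in merged:
            continue  # no matching minimal record, so nothing to merge into
        for tag, values in fields.items():
            merged[record_id].setdefault(tag, []).extend(values)
    return merged


documents = merge_records([MINIMAL], [SIDECAR])
print(documents["example-0001"])
```

Keying everything on the shared identifier is what lets the two files stay separate on disk while still producing a single combined record at load time.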

For more information about the backlog we released last year, please see the Library of Congress Signal blog post: More Web Archives, Less Process, written by Grace. Also, if you are interested in getting updates on our work as we write about it, or any other digital library news from the Library of Congress, bookmark The Signal!

For any other questions, please do not hesitate to send us an email; you can find our addresses at the end of the slide deck. Thank you again for giving us a platform to share our work and best of luck with future interest group activities!

Posted in ALA Annual 2019

CC:DA Liaison Blog Post

After my recent appointment as the Metadata Interest Group’s liaison to the Committee on Cataloging: Description and Access (CC:DA), I reached out to outgoing liaison Jessica Hayden to inquire about her experience.  One of her principal recommendations was to use Metadata Interest Group (MIG) channels like the blog to grow awareness of RDA revision proposals and other CC:DA business relevant to MIG members, and to use these channels to get feedback and ideas from a broader segment of stakeholders within the community of metadata librarians.

But first a bit of background.  CC:DA “is the body within the American Library Association responsible for developing official ALA positions on additions to and revisions to RDA: Resource Description and Access.” For the past two years, changes to the RDA standard have been frozen because of the RDA Toolkit Restructure and Redesign (commonly known as the 3R Project), which will not only change the look, feel, and functionality of the RDA Toolkit, but also incorporate much of the IFLA Library Reference Model (LRM) into the text and structure of RDA.  In April, a “stabilized” English version of the new RDA text was released as part of the Beta RDA Toolkit. Neither this stabilized text nor the Beta RDA Toolkit will supplant the current version until the RDA Steering Committee (RSC) declares the 3R Project complete, which in all likelihood will not happen before the end of 2019.  In the meantime, the focus of the 3R Project will shift to translations and the development of policy statements, such as LC/PCC-PS.

For now, addition and revision proposals to RDA remain frozen, although minor revision proposals, such as error or typo corrections, may be submitted.  However, the release of the stabilized text provides an excellent opportunity to review what will likely represent a major revision to RDA.  The incorporation of the LRM should be of special interest to our community at the Metadata Interest Group, as this stabilized text represents a further shift towards an entity-based approach, with many ramifications for linked data implementation.  If you have ideas for revisions and additions, I’d love to hear them–feel free to contact me directly at trm2151 [at] columbia [dot] edu.

Timothy R. Mendenhall

Metadata librarian at Columbia University Libraries, performing both traditional MARC cataloging and non-MARC work in the digital library collections. Participant in the PCC.  Formerly a processing archivist and still active as an art cataloger at the Frick Art Reference Library.

Posted in Standards and Guidelines

Nominations are open for leadership roles in the ALCTS Metadata Interest Group

Announcement:

The ALCTS MIG seeks nominations (self-nominations welcomed) for the following offices:

  • Vice-Chair/Chair Elect (Vice-Chair 2019-2020, Chair 2020-2021)
  • Program Co-Chair (2019-2021)
  • Secretary (2019-2021)

These positions are held for two years, and attendance at ALA Annual and ALA Midwinter is expected. Service duties begin July 1 and run through June 2021.

Posted in ALA Annual 2019