Hello ALCTS Metadata Interest Group blog! Rick Fitzgerald and Grace Thomas here – we are librarians at the Library of Congress and recently gave a talk about describing web archives at the ALA Annual ALCTS Metadata Interest Group Meeting!
We first want to thank Anna Neatrour and the ALCTS Metadata Interest Group board for having us and, second, everyone who took time out of their busy conference schedules to attend our talk. Web archives are becoming increasingly prevalent in libraries, archives, and scholarly research, so we are excited about the interest in our work. Anna and Tillay invited us to share our slides for anyone who missed the session or would like to review them.
Additionally, we wanted to address some of the questions asked at the end of the talk, for anyone who wasn’t able to attend. Please forgive us for paraphrasing the questions and also for paraphrasing our own answers!
Q: Quality review for web archives is challenging because of the scale. How does your program approach it?
A: We can’t look at everything, so we do as much as we can. This is an issue throughout the web archiving community and there have been efforts to explore automated quality review, including a workshop held by the International Internet Preservation Consortium (IIPC) this year dedicated solely to brainstorming quality review solutions. If those tactics advance into some kind of software, we would love to implement it, but for now we look at reports from the crawler and click through as much as we can.
Q: You mentioned datasets, and retaining provenance information and scope notes is an issue in the community. How does your program handle this?
A: This is also a community-wide issue, and we have varying levels of provenance information. First, our curatorial data in Digiboard is the record of selection for a particular URL (in it, the selecting librarian must assign the URL to a collection and provide a justification). Second, once a URL is approved to go to crawl, our team assigns scopes telling the crawler what it is allowed to crawl as part of this URL (for example, social media or CDNs that might host embedded content found through the main URL) and what it is restricted from crawling (we don’t want to crawl all of social media, perhaps just a particular related profile or page). Third, we have the crawl logs from the crawler, which contain very rich metadata showing how a certain page came to be captured: for example, the response code from the server at the time of crawl, the MIME type of the crawled resource, the capture timestamp, and the size of the resource. Because we do not have a legal mandate to crawl, our (very) complicated permissions process makes it impossible to release the crawl logs publicly right now. However, with time, perhaps more of this provenance data can become part of publicly released datasets. For now, check out the ones we have publicly available here.
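For readers who like to see the scoping idea in code, here is a minimal sketch in Python. It is not our crawler's configuration: the hosts, the rule list, and the "most specific matching rule wins" behavior are all assumptions made up for illustration. It shows how a seed can allow its own site, a CDN, and one related social media profile while keeping the rest of that social media site out of scope.

```python
from urllib.parse import urlsplit

# Hypothetical scope rules for one seed URL (hosts and paths are invented).
# Each rule is (prefix of "host/path", allowed?).
RULES = [
    ("example.org/", True),                   # the seed site itself
    ("cdn.example.net/", True),               # a CDN hosting embedded content
    ("social.example.com/", False),           # do not crawl all of social media
    ("social.example.com/exampleorg", True),  # but do crawl one related profile
]


def normalize(url: str) -> str:
    """Reduce a URL to 'host/path' so it can be compared against rule prefixes."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + (parts.path or "/")


def in_scope(url: str, rules=RULES) -> bool:
    """Return True if the URL may be crawled.

    Assumption: the longest (most specific) matching prefix decides,
    and a URL that matches no rule is out of scope.
    """
    key = normalize(url)
    matches = [(len(prefix), allowed) for prefix, allowed in rules
               if key.startswith(prefix)]
    if not matches:
        return False
    return max(matches)[1]
```

With these invented rules, in_scope("https://social.example.com/exampleorg/posts/1") is allowed, while in_scope("https://social.example.com/someoneelse") is not.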
Q: Is Digiboard open-source?
A: Unfortunately, no. Digiboard is a home-grown tool that works specifically for our scale (tens of thousands of URLs crawled at varying frequencies), complicated permissions process, complicated selection process (with over 200 potential selecting librarians), organizational structure, and our current method of quality review. If you wish to begin web archiving, there are subscription-based services that take care of all the behind-the-scenes technical work (maintaining and running the crawler, indexing the content, maintaining the indexes, maintaining a version of the Wayback Machine and specific access points for collections, etc.). Many national and regional libraries, archives, and university libraries throughout the world successfully use these kinds of services to perform web archiving!
Q: How will the sidecar records relate to the minimal records?
A: The sidecar MODS XML files will sit on the same server as the minimal MODS XML files, as separate files. During the ETL (Extract, Transform, Load) process that converts the information from MODS XML into the Library’s Solr index for loc.gov, the two files for each record will be matched on their identical ID numbers and merged into the pages you see on https://www.loc.gov/websites/.
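To make the merge step concrete, here is a small Python sketch of the general idea. It is not our ETL code: the sample records, the choice of the MODS identifier element as the shared ID, the field names, and the rule that sidecar values override minimal ones are all assumptions for illustration. The Solr load itself is left out.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}  # the MODS v3 namespace

# Invented sample records; a real record carries far more than this.
MINIMAL = """<mods xmlns="http://www.loc.gov/mods/v3">
  <identifier>example0001</identifier>
  <titleInfo><title>Example Site</title></titleInfo>
</mods>"""

SIDECAR = """<mods xmlns="http://www.loc.gov/mods/v3">
  <identifier>example0001</identifier>
  <abstract>Hypothetical summary added later.</abstract>
</mods>"""

# Hypothetical mapping from MODS paths to flat index field names.
FIELDS = {
    "mods:identifier": "id",
    "mods:titleInfo/mods:title": "title",
    "mods:abstract": "abstract",
}


def extract(xml_text: str) -> dict:
    """Extract step: pull the mapped fields out of one MODS record."""
    root = ET.fromstring(xml_text)
    doc = {}
    for path, field in FIELDS.items():
        element = root.find(path, NS)
        if element is not None and element.text:
            doc[field] = element.text.strip()
    return doc


def merge_by_id(minimal_docs: list, sidecar_docs: list) -> list:
    """Transform step: combine each minimal record with the sidecar sharing its ID.

    Assumption: when both files supply a field, the sidecar value wins.
    """
    sidecars = {doc["id"]: doc for doc in sidecar_docs}
    merged = []
    for doc in minimal_docs:
        combined = dict(doc)
        combined.update(sidecars.get(doc["id"], {}))
        merged.append(combined)
    return merged


merged = merge_by_id([extract(MINIMAL)], [extract(SIDECAR)])
```

Here the single merged document carries the ID and title from the minimal record plus the abstract from the sidecar; a minimal record with no matching sidecar would pass through unchanged.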
For more information about the backlog we released last year, please see the Library of Congress Signal blog post: More Web Archives, Less Process, written by Grace. Also, if you are interested in getting updates on our work as we write about it, or in any other digital library news from the Library of Congress, bookmark The Signal!
For any other questions, please do not hesitate to send us an email; you can find our addresses at the end of the slide deck. Thank you again for giving us a platform to share our work, and best of luck with future interest group activities!