Metadata Librarian’s Little Helper: OpenRefine Reconciliation Services

This is the third in our series of follow-up posts by Midwinter Lightning Talk presenters.

When our archive opened to the public two years ago, the catalog of nearly 5,000 records was findable by keyword search and little else. The data was devoid of authorities and controlled vocabularies, and had not been compiled into finding aids. Attempting a migration from our legacy records management software to ArchivesSpace involved a great deal of cleanup, and authority reconciliation proved to be the most challenging part. A reconciliation service for OpenRefine called Reconcile-csv helped with this task.

Since our legacy metadata lacked authority control, the issues were predictable: corporate and personal names were inconsistent, acronym-filled, and sometimes included related terms in parentheses. Also, the records did not link to each other, so the same name could be formed differently in an authority record, a collection record, and an accession record. Cleaning up the names was going to be a large task, since it meant addressing inconsistencies everywhere that a name appeared. Thankfully, OpenRefine’s reconciliation service can automate some of this work: once you plug in the URL of a service, the application matches data in your spreadsheet against a controlled vocabulary on the web, such as the Library of Congress (LC) Authorities.
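For readers curious about what happens behind that URL, here is a minimal sketch of how a batch of names might be packaged for a reconciliation service. It assumes the common convention in which queries are sent as a JSON object in a single `queries` form field. The endpoint URL and the function name are hypothetical, and OpenRefine builds and sends these requests for you, so this is illustration only.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the URL of the service you actually use.
SERVICE_URL = "https://example.org/reconcile"


def build_reconcile_payload(names, limit=3):
    """Encode a batch of names as a form body with a single `queries` field.

    Each name gets its own key (q0, q1, ...) so that results can be paired
    back with the spreadsheet rows they came from.
    """
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)})


# Made-up sample names; no request is actually sent in this sketch.
payload = build_reconcile_payload(["Smith, Michael", "Dept. of Physics"])
print(payload)
```

The service would answer with candidate matches and scores for each key, which is what OpenRefine displays as matches and suggested matches.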

Our plan was to export batches of 100 names from our catalog’s name authority file, which consisted of 1,700 names. Next, we would reconcile each batch against LC Authorities in OpenRefine. Another great thing about the reconciliation service is that it can pull in other values from the controlled vocabulary, such as URIs. That way, our cleaned authority records could incorporate linked data.
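The batching itself is simple to script. The sketch below is not what we ran; it only illustrates splitting a 1,700-name list into groups of 100, with placeholder names standing in for the real export.

```python
from typing import Iterator, List


def batches(names: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


# Placeholder data standing in for the 1,700-name authority file.
names = [f"Name {i}" for i in range(1700)]
name_batches = list(batches(names))
print(len(name_batches))  # 17 batches of 100 names each
```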

OpenRefine’s reconciliation service is an amazing feature, but our metadata was in such rough shape that this stage took longer than anticipated. The name reconciliation matched or suggested matches for half of our names, and the remaining names either did not exist in LC Authorities, or they were so messy that no match could be found. Also, the names that were matched had to be evaluated for accuracy to make sure that our Michael Smith was the same person as the matched Michael Smith. On average, the reconciliation service took two minutes to run on 100 records, but the evaluation stage took an hour.
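To put those averages in perspective, here is a rough back-of-the-envelope estimate for the full name authority file. It assumes every batch of 100 behaved like the average, which is a simplification.

```python
TOTAL_NAMES = 1700        # size of the name authority file
BATCH_SIZE = 100          # names exported per batch
RECONCILE_MINUTES = 2     # average service run time per batch
EVALUATE_MINUTES = 60     # average manual evaluation time per batch

n_batches = -(-TOTAL_NAMES // BATCH_SIZE)  # ceiling division
total_minutes = n_batches * (RECONCILE_MINUTES + EVALUATE_MINUTES)
total_hours = total_minutes / 60

print(n_batches, total_minutes, round(total_hours, 1))  # 17 1054 17.6
```

Nearly all of that time is human evaluation, not machine matching.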

Our name cleanup also required us to standardize local names according to RDA. Since the bulk of our collection consists of university records, we had many variant spellings of university departments and offices. With an intern’s help, I made these edits in our name database, and added them, with a unique identifier, to a Google sheet that was to become our local name authority.

Now, our name authority records were clean and ready to be imported into ArchivesSpace. But we were far from finished: all of the names within our collection and accession records were still a mess. Did this mean that we had to repeat the entire reconciliation process for these other record types? It did not, thanks to Reconcile-csv, a free reconciliation service developed by Open Knowledge Labs.

Two documents were the by-product of the work that we had just done: (1) a list of names from our catalog matched to LC Authorities with their identifiers, and (2) a list of RDA-formed local names. Putting these together essentially gave us a master CSV of authority-controlled names. Now, we could use Reconcile-csv, which matches data against a local CSV file. So, instead of matching our remaining messy names against the millions of entries in LC Authorities, then manually cross-referencing names against our local name authority, we simply matched against our master spreadsheet of 800 authoritative (local and LC) names that had been reconciled and evaluated for accuracy. This time, our match rate was higher and the matches were more accurate than with the LC Authorities reconciliation. As a result, our evaluation stage took significantly less time: 15 minutes per 100 records instead of an hour.
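The idea of combining two lists into one master file and matching messy names against it can be sketched in a few lines. This is an illustration, not Reconcile-csv's own code: the fuzzy scoring here uses Python's standard `difflib` as a stand-in, and the sample rows, identifiers, column names, and 0.8 threshold are all made up for the example.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical sample rows; the identifiers are placeholders, not real ones.
LC_MATCHED = 'name,id\n"Smith, Michael",lc-EXAMPLE-1\n'
LOCAL_NAMES = 'name,id\nDepartment of Physics,local-EXAMPLE-1\n'


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def build_master(*sources):
    """Combine several row lists, keeping the first row seen for each name."""
    master = {}
    for rows in sources:
        for row in rows:
            master.setdefault(row["name"].strip().lower(), row)
    return list(master.values())


def best_match(messy, master, threshold=0.8):
    """Return (row, score) for the closest name, or (None, score) if too weak."""
    scored = [
        (SequenceMatcher(None, messy.lower(), row["name"].lower()).ratio(), row)
        for row in master
    ]
    score, row = max(scored, key=lambda pair: pair[0])
    return (row, score) if score >= threshold else (None, score)


master = build_master(load_rows(LC_MATCHED), load_rows(LOCAL_NAMES))
row, score = best_match("Departmnt of Physics", master)
print(row["id"] if row else "no match")
```

Because the master list is small and already vetted, a close fuzzy match is far more likely to be the right entity than a match drawn from millions of LC entries, which is why the evaluation went so much faster.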

The Open Knowledge Labs website offers simple three-step instructions for downloading and running Reconcile-csv from the command line. It behaves like any other OpenRefine reconciliation service: you can specify the column to be matched, view matches and suggested matches, and pull in other values from your master spreadsheet, like URIs. Reconcile-csv is great for metadata that requires a lot of authority reconciliation, since compiling a segment of authoritative terms will make that reconciliation go much faster. For that reason, it’s really helpful for subject term reconciliation as well.


Reconcile-csv in GitHub:
Reconcilable data sources for OpenRefine:
How to use OpenRefine reconciliation services:

Greer Martin, Discovery & Metadata Librarian, Illinois Institute of Technology

This entry was posted in ALA Midwinter 2017, Conferences.