Scaling Machine-Assisted Description of Historical Records

One of the questions I’ve been grappling with as part of the Archival Connections research project is simple: Is there a future for the finding aid?  I’m inclined to think not, at least not in the form we are used to.

Looking to the future, I recently had the chance to propose something slightly different: a potential project for funding through an Amazon Research Grant. While the jury is still out (an answer is coming in mid-December), I’d like to share a copy of the proposal, Scaling Machine-Assisted Description of Historical Materials.

The idea I describe there builds on an emergent digital repository and library infrastructure being developed by the University of Illinois Library.  The project would integrate natural language processing and named entity recognition components to index records and provide relational browsing pathways alongside file-system access.  I’ll have more to say about this at the Society of Indiana Archivists meeting tomorrow.
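
To make the idea a bit more concrete, here is a minimal sketch of the kind of named-entity indexing the proposal has in mind. It uses spaCy, which is just one of several possible NLP toolkits; the sample documents and the shape of the index are my own illustration, not anything taken from the proposal itself.

  # A minimal sketch of entity-based indexing, using spaCy as one possible toolkit.
  # The sample documents and the index structure are illustrative only.
  from collections import defaultdict

  import spacy

  # Small English model; install with: python -m spacy download en_core_web_sm
  nlp = spacy.load("en_core_web_sm")

  documents = {
      "box1/folder3/letter_1912.txt": "Jane Addams wrote from Hull House in Chicago about the 1912 campaign.",
      "box2/folder1/memo_1915.txt": "The memo describes a meeting with Jane Addams in Springfield.",
  }

  # Build an inverted index: (entity text, entity type) -> files that mention it.
  index = defaultdict(list)
  for path, text in documents.items():
      for ent in nlp(text).ents:
          index[(ent.text, ent.label_)].append(path)

  # Entities such as ("Jane Addams", "PERSON") now link related files, providing a
  # relational browsing pathway alongside file-system access.
  for (name, label), paths in index.items():
      print(f"{name} ({label}): {paths}")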

Social Feed Manager Takeaways

Later this week, I’ll be introducing the Archival Connections Project at the Society of Indiana Archivists Meeting.  During the first year of this project, one focus of my work was evaluating and developing some recommendations for using Social Feed Manager, a tool developed by George Washington University Libraries.

My full report is here, for those interested:  https://gwu-libraries.github.io/sfm-ui/resources/SFMReportProm2017.pdf.

Without going into too much detail, here is what I feel I learned while working on this report, at least as far as it relates to the Archival Connections project:

First and foremost: Data models matter. As I indicated in the report, SFM’s underlying database and data model are both simple and elegant. Since the application focuses on doing one thing and doing it well, the database translates directly into user interface components that make the application a joy to use.  While the project team hired a usability consultant to improve the app, the tweaks made by the team in response to the consultant’s report simply added polish to an already strong interface.  While I won’t be so impolitic as to compare SFM to other archival tools, the application works well, in part, because the various data objects and the tables that underlie them represent things that exist in the real world, not abstractions or vague concepts that are hard for staff to understand or for programmers to translate into an interface.
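
To illustrate what I mean by objects that map onto the real world, here is a toy sketch of the kind of model SFM presents in its interface: collections made up of seeds, harvested on a schedule. The classes below are my own simplification for illustration, not SFM’s actual schema.

  # A toy illustration of a data model whose objects correspond to things staff
  # already understand. This is a simplification, not SFM's real schema.
  from dataclasses import dataclass, field
  from datetime import datetime
  from typing import List


  @dataclass
  class Seed:
      """A single thing to collect, e.g. a Twitter handle or search term."""
      token: str


  @dataclass
  class Harvest:
      """One run of the harvester against a collection's seeds."""
      started: datetime
      warc_files: List[str] = field(default_factory=list)


  @dataclass
  class Collection:
      """A named set of seeds plus the harvests made from them."""
      name: str
      harvest_type: str  # e.g. "twitter user timeline"
      seeds: List[Seed] = field(default_factory=list)
      harvests: List[Harvest] = field(default_factory=list)


  # Because each object is something an archivist can point to, the interface
  # can expose these objects more or less directly.
  c = Collection(name="State legislators", harvest_type="twitter user timeline",
                 seeds=[Seed("@ExampleLegislator")])
  print(c)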

Second: Archivists should become better API consumers.  One of the things that fascinates me most about SFM is the fact that it connects directly to the Twitter API and slurps up all of the metadata supplied by it.  Thinking broadly, the archival and information professions are doing a lot to build and use our own APIs and data providers, but less to interact with those supplied by the data companies that now order our lives.   For example, do we have a tool that line archivists (as opposed to technical staff) can use to (a) connect to Google Drive, Box.com, Outlook 365, or Facebook, (b) harvest records from those systems, and (c) prep them for deposit in a digital repository?  Not that I am aware of, but we should. Without such tools, we can’t capture records and preservation metadata at or near the point that records are created (h/t David Bearman).
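
To make the point concrete, here is the sort of harvest-and-prep step I have in mind, sketched in Python against a purely hypothetical records API. The endpoint, token, response fields, and deposit layout are all invented for illustration; as far as I know, no such archivist-facing tool exists, which is exactly the problem.

  # A sketch of harvesting records from a (hypothetical) API and prepping them for
  # deposit. The URL, token, and JSON field names are invented for illustration.
  import json
  import pathlib
  from datetime import datetime, timezone

  import requests

  API_URL = "https://records.example.org/api/v1/files"  # hypothetical endpoint
  TOKEN = "replace-with-a-real-token"

  deposit = pathlib.Path("deposit")  # staging area for the digital repository
  deposit.mkdir(exist_ok=True)

  resp = requests.get(API_URL, headers={"Authorization": f"Bearer {TOKEN}"})
  resp.raise_for_status()

  for item in resp.json()["items"]:  # hypothetical response shape
      # Save the record itself ...
      (deposit / item["name"]).write_bytes(requests.get(item["download_url"]).content)
      # ... and capture preservation metadata at or near the point of capture.
      metadata = {
          "source": API_URL,
          "original_id": item["id"],
          "captured_at": datetime.now(timezone.utc).isoformat(),
          "supplied_metadata": item,
      }
      (deposit / (item["name"] + ".metadata.json")).write_text(json.dumps(metadata, indent=2))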

Third: The metadata that APIs supply is a two-edged sword.   Once you dig into the JSON files SFM harvests, you quickly see that Twitter supplies a lot of what the OAIS reference model calls preservation metadata: the dates and times tweets were published, the time the tool captured them, and so on.  As a baseline, such data will help people make future claims about the authenticity of these records or mine them as data.  But given the relative lack of descriptive metadata and the fact that bots and other non-human agents control so many Twitter accounts (not to mention the fact that many users’ handles tell you little to nothing about their real identity), this metadata is not in itself sufficient to say something is authentic or not authentic, or to wring much value from the dataset.  That requires (wait for it . . . ) a person interpreting the records using all of the intelligence they can muster.
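
For readers who have not dug into the harvested JSON themselves, here is a rough sketch of pulling a few of those preservation-relevant fields out of a line-delimited file of tweets. The field names follow Twitter’s tweet objects as I understand them, and the file name is a placeholder.

  # A rough sketch of extracting preservation-relevant fields from harvested tweets.
  # Assumes a line-delimited JSON file (one tweet object per line); the file name is
  # a placeholder, and the field names follow Twitter's tweet JSON as I understand it.
  import json

  with open("tweets.jsonl", encoding="utf-8") as f:
      for line in f:
          tweet = json.loads(line)
          print({
              "id": tweet.get("id_str"),                           # stable identifier
              "published": tweet.get("created_at"),                # when the tweet was posted
              "handle": tweet.get("user", {}).get("screen_name"),  # often says little about the real person
              "source_app": tweet.get("source"),                   # posting client, sometimes a bot giveaway
          })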

Finally: Aggregations matter now more than ever.   I was a bit taken aback a few months ago when the committee charged with revising DACS made no mention of provenance, original order, arranging files, or levels of description in their draft principles.  While their work had much to recommend it, the lack of any mention strikes me as an oversight, and an important one.   My work with SFM has convinced me that aggregations and provenance are even more important when working with records harvested from the cloud.  Given the free-floating, intertwined nature of records found in social media and other ‘cloud’ platforms, it seems to me that the act of capturing records by an archivist results in an aggregation.  For instance, SFM generates a set of tweets, but that set is the result of an archivist’s activity to shape the collection.  This aggregation and the provenance behind it deserve to be described as such, with as much transparency about the archivist’s role as possible.  In short, archivists can and must do a good job of arranging and describing materials at the collection or series level; there is no workaround for this core archival function, even–or perhaps especially–when extracting item-based metadata and records from the platforms that now rule many people’s daily work and social lives.
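
One way to picture what that description might capture is a collection-level record that documents, alongside the usual elements, exactly how the archivist shaped the harvest. The fields below are my own illustration, not a DACS element set and not anything SFM produces.

  # A sketch of a collection-level description that makes the archivist's role in
  # shaping a harvested aggregation explicit. The fields are illustrative only.
  import json

  description = {
      "title": "Tweets of Indiana state legislators, 2017",
      "level": "collection",
      "provenance": {
          "creators": ["Various Twitter account holders (some likely automated)"],
          "aggregated_by": "University Archives",
          "aggregation_method": "Social Feed Manager, Twitter user timeline harvests",
          "selection_criteria": "Accounts identified as belonging to sitting legislators",
          "harvest_schedule": "Weekly, January through December 2017",
      },
      "archivist_interventions": [
          "Chose the seed list of accounts",
          "Set the harvest type and schedule",
          "Excluded retweets from the export",
      ],
  }

  print(json.dumps(description, indent=2))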

Arrangement and Description in the Cloud: A Preliminary Analysis

I’m posting a preprint of some early work related to the Archival Connections project.  This work will be published as a book chapter/proceedings by the ArchiveSchule in Marburg.  In the meantime, here is the preprint:

Installing Social Feed Manager Locally

The easiest way to get started with Social Feed Manager is to install Docker on a local machine, such as a laptop or (preferably) desktop computer with a persistent internet connection.

Running SFM locally for anything other than testing purposes is NOT recommended. It will not be sufficient for a long-term documentation project and would typically allow only one local user to access the application’s web interface under localhost. But it could be useful for experimentation when access to a dedicated web server is not possible or desirable.

After testing the software on several Apple computers, I developed instructions for installing it locally. Those who wish to install the application locally should note that the virtualization software Docker must be installed and running on the host operating system and that no support for local installation is provided, either by GW Libraries or by me.

Once the application is running, it will be available at http://localhost, on the port specified in the configuration (.env) file. So long as Docker and the application remain running and an Internet connection is live, the application will run harvests as scheduled in the user interface and will be available at localhost. The application can be stopped gracefully from the terminal at any time with the command docker-compose stop.

To install SFM locally, a user would take the following steps:

  • Download and install Docker. There are instructions for Mac and for Linux Ubuntu on the project site, and Docker can also be installed for other Linux distributions and operating systems. More information about Docker is in the sidebar.
  • Launch the Docker software. Once it is running, you will see a little whale icon in your task or applications bar.
  • If it is not already installed on your computer, install git from https://git-scm.com/. Be sure to download the version appropriate to your operating system.
  • Open a terminal session (on Linux or Mac) or the Command Prompt (on Windows).
  • Follow the local installation instructions at https://sfm.readthedocs.io/en/latest/install.html#local-installation.
    1. Use the terminal to clone the SFM software from GitHub and place it in a dedicated folder on your computer. (Simply cut and paste the provided text into a terminal, then hit enter.)
    2. Set configuration variables by editing the file .env, using a text editor of your choice. These variables are described fully in the documentation. Key variables that must be set for the application to be usable include the following:
      1. SFM_HOSTNAME and SFM_PORT. Setting these to ‘localhost’ and ‘8080’ respectively makes the user interface available at http://localhost:8080/ on your computer.
      2. SFM_SMTP_HOST, SFM_EMAIL_USER, and SFM_EMAIL_PASSWORD. Once the email variables are set correctly, SFM will email you reminders, password resets, and other notifications. If you are running SFM locally, you may use the credentials for a Gmail or other account, but it is recommended that you create an account just for this application rather than using your personal credentials, since anyone with access to the server will be able to read your password in clear text.
      3. DATA_VOLUME and PROCESSING_VOLUME. As noted in the documentation, SFM’s default settings will save data inside the docker containers, where it is not accessible to you via the usual file system. If you would like direct access to the data, it can be saved on the usual file system. Follow the instructions in the documentation to set the storage location to a local volume or folder. For example:
        DATA_VOLUME=/sfm-data:/sfm-data

        will set the storage location to the folder “sfm-data” on the root of your local file system, whereas:

        DATA_VOLUME=/sfm-data

        will set the sfm-data folder inside the docker containers, where it is accessible only through the docker commands for managing data in containers.

    3. Launch the application using the command docker-compose up -d. (A quick check that the interface is up and reachable is sketched after these steps.)
    4. Optionally, set multiple harvesting streams with the command:
      docker-compose scale twitterrestharvester=x

      where x is the number of simultaneous Twitter harvesting streams that you wish to use. Please note that if SFM exceeds Twitter’s rate limits for a particular credential, it will likely be throttled or cut off.
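
Once the containers are up, it can be handy to confirm that the web interface is actually reachable before scheduling any harvests. Here is a small Python check, assuming SFM_HOSTNAME and SFM_PORT were set to localhost and 8080 as in the example above.

  # A small check that the SFM web interface is reachable after docker-compose up -d.
  # Assumes SFM_HOSTNAME=localhost and SFM_PORT=8080, as in the example above.
  import requests

  try:
      resp = requests.get("http://localhost:8080/", timeout=10)
      print(f"SFM responded with HTTP {resp.status_code}")
  except requests.ConnectionError:
      print("SFM is not reachable yet; the containers may still be starting.")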