Providing Reference to Disk Images

This month I received a reference request that involved a digital collection we acquired some months back.  It’s not an unusual type of reference question but I thought I’d blog about it as a use case for digital archives. The request came from the office of origin so we provided the reference directly.

The requestor wanted to know if the collection contained correspondence about a specific topic.  The date range was 2006-2008 and perhaps also 2012-2013.  Happily, I had three distinctive names to use as search terms.  Also happily, the collection is a hybrid of print and digital material.  My colleague searched the print archives and did not locate the material.  Step two fell to me: digital reference.

The material I was searching consisted of the data files from the donor's laptop. Because the material is under restriction by policy, we had not done much processing beyond triage and understanding the basic nature of the content for our collection description. For the original acquisition, we had captured the data files using Guymager as an E01 disk image.

So, the steps for answering the reference request involved:

  1. Using BitCurator tools to process the E01 disk image.
    1. Running BulkExtractor and then Fiwalk to find the “features” of the data on the disk.  This took a day.
    2. Attempting to run Fiwalk and Reports from BitCurator.  This took a day and stalled my computer (BitCurator running in VirtualBox timed out).
  2. Switching to Autopsy directly on my Windows machine to reprocess the E01 disk files and search for the specific terms using Autopsy’s feature extraction tool.
    1. This took a day but was successful.
    2. I was able to preview the files located by the search and see that almost all the hits were valid returns and would help to answer the reference request.
    3. Using the Export feature, I selected the relevant hits and exported a set of files for later.
  3. Using Autopsy, I also located the PST and OST files on the disk image and exported them to my temporary reference storage area on my off-line computer.
  4. Using Mozilla Thunderbird, I planned to search the email and attachments for more files that might help answer the reference request.
  5. In order to search the email files, I needed to transform the PST and OST files into the non-proprietary format MBOX.  This took about 2 hours.
  6. Once I had the files in MBOX format, I set up a temporary User Account and then, following instructions on how to open MBOX files in Thunderbird off-line, put the MBOX files into the appropriate User – Appdata – folders and restarted Thunderbird.
  7. Thunderbird took a while to parse the email – there were 2 years of messages and attachments in the main “Inbox” file.  I also loaded the off-line email folders “Personal” file (with 10 years of selected and filed messages and attachments).  This took the better part of a day.
  8. Once I had access to the email messages in Thunderbird (think Eudora Reader if you are not familiar with Thunderbird), I was able to repeat my search using my reference terms and located another several dozen hits that included attachments.
  9. Finally, after I had copied the messages / attachments into my temporary reference question storage folder, I went through and de-duped the results from the 5 search sets.
  10. In order to provide the search results, we decided to provide read-only PDF versions to the requestor – all the data files were amenable to that kind of transformation; none were audio, video, spreadsheets with formulas, etc.  We did notice that many of the early files (from 2005-2007) had “auto update” dates that were changing when opened for export to PDF.
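
For anyone who would rather script the email search and the de-duping (steps 8 and 9) than repeat them by hand, here is a minimal Python sketch. It is not the process described above, which relied on Autopsy and Thunderbird; it only assumes the mail has already been converted to MBOX. The function names `search_mbox` and `dedupe_by_hash`, and any search terms or paths you pass in, are placeholders for illustration. Note the limits: it searches subjects and text parts only, so binary attachments such as Word files are not examined, and "duplicate" here means byte-for-byte identical files.

```python
import hashlib
import mailbox
from pathlib import Path


def search_mbox(mbox_path, terms):
    """Return (subject, matched_terms) for each message mentioning any term.

    Matching is a case-insensitive substring test over the subject line and
    every text/* part. Non-text attachments are not searched.
    """
    lowered = [term.lower() for term in terms]
    hits = []
    box = mailbox.mbox(mbox_path, create=False)  # do not create a missing file
    try:
        for message in box:
            subject = str(message.get("Subject", ""))
            chunks = [subject]
            for part in message.walk():
                if part.get_content_maintype() != "text":
                    continue
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset label: fall back to UTF-8.
                    chunks.append(payload.decode("utf-8", errors="replace"))
            text = "\n".join(chunks).lower()
            matched = [t for t, low in zip(terms, lowered) if low in text]
            if matched:
                hits.append((subject, matched))
    finally:
        box.close()
    return hits


def dedupe_by_hash(paths):
    """Keep the first file seen for each distinct SHA-256 content digest."""
    first_seen = {}
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        first_seen.setdefault(digest, Path(path))
    return list(first_seen.values())
```

Calling `search_mbox` once per MBOX file and then passing all exported files to `dedupe_by_hash` would roughly mirror combining several search sets into one. Hashing file contents rather than comparing file names means the same attachment saved under two different names is still caught.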

How does this compare with your process for providing reference to data that is captured as a disk image?  I’d love to get your feedback.  Contact me via the link on this page.

Engaging with Digital Archives Open Source Communities

Digitalbevaring.dk is a Creative Commons Attribution 2.5 Denmark license. Image: Jørgen Stamp.

In addition to the work we are doing here at the MIT Libraries, we are also engaged with the global community of archivists and preservation folks developing open source tools for digital archives, digital forensics, and digital preservation.  Here are a few of the ways in which we are participating in communities to enhance, improve, and consider future integration and development of these tools.

ArchivesSpace Community

MIT Libraries is a Charter Member of the ArchivesSpace Community with Tom Rosko, Head of Institute Archives and Special Collections, as our official representative. ArchivesSpace is the next-generation archives information management application. Kari Smith is also a member of the Technical Advisory Committee (TAC) and the Chair of the TAC documentation sub-group.  These groups assist with prioritizing and improving the technical development and technical documentation for the software system.  Just recently, she was appointed to the cross-council Features Prioritization team, which includes development and program management staff from ArchivesSpace as well as a few members from the User Advisory Committee and the TAC.

BitCurator Community and BitCurator Access

As the only digital forensics environment and tool set created specifically for collecting institutions, the BitCurator environment and BitCurator Access projects are critical for digital archives.  We participated in the BitCurator-in-a-Box assessment prior to the release of ver 1.0 and also hosted Cal Lee and Kam Woods at the Digital Sustainability Lab and for a presentation to MIT Libraries staff and regional colleagues.  Kari participates in the BitCurator Consortium monthly teleconferences, during which users of BitCurator Tools discuss their use and thoughts about enhancements to the environment.  She was also recently appointed to the BitCurator Access Advisory Board.  BitCurator Access software tools will assist collecting institutions (libraries, archives, and museums) in providing web-based and local access to born-digital materials held on disk images.  Additionally, the Andrew W. Mellon funded project led by Cal Lee at UNC Chapel Hill will explore transforming and using digital forensics metadata in collecting environments, and redaction of file items, metadata and hidden data from disk images.

Archivematica users community

Although not a formal sub-group of the work by Artefactual Systems on the Archivematica digital preservation system, we are engaged with the intersections and integration of using Archivematica with BitCurator and ArchivesSpace.  Kari and Nancy McGovern, Head of Digital Preservation and Curation at MIT Libraries, are working on projects and with other users of Archivematica to discuss case studies and explore how data objects and metadata can travel between the software systems for optimal use, automation, and to meet our archives and preservation requirements. Much of this work is happening under the aegis of the Digital Sustainability Lab at the Libraries.

Jessica Venlet – Library Fellow for Digital Archives

In October we welcomed Jessica Venlet to the Institute Archives as our two-year Library Fellow for Digital Archives [see prior post].  Since joining us, she has been learning about the MIT Libraries by meeting with staff from across the departments, engaging in work with the Digital Sustainability Lab, and jumping into existing digital archives infrastructure work as well as conducting background research and analysis in some new areas.  Her work will expand and build upon the already established foundations of our digital archives program.

Her Archive Hour blog serves two purposes for her professional development during the Fellow experience.  As she writes on her blog’s About page,

“The first is a structured way to keep up on my professional reading and place my work within the professional landscape. Each week, I set aside “coffee hours” during which I will read a technical report, article from an information science journal, or blog posts from peers. After reading, I’ll post a short response. The second part of this blog will be a record of projects I’ll be working on as part of the MIT Institute Archives and Special Collections team.”   [https://archivehour.wordpress.com/about/]

Jessica is a 2014 graduate of the University of Michigan, School of Information, with an MIS degree and a specialization in Library and Information Science and Preservation of Information.  She has a BA in English literature from Aquinas College.  Jessica was most recently a digital processing assistant at the University of Michigan’s Bentley Historical Library and previously spent time as an electronic records intern at the Michigan State University Library.
