This month I received a reference request that involved a digital collection we acquired some months back. It’s not an unusual type of reference question, but I thought I’d blog about it as a use case for digital archives. The request came from the office of origin, so we provided the reference directly.
The requestor wanted to know if the collection contained correspondence about a specific topic. The date range was 2006-2008 and perhaps also 2012-2013. Happily, I had three distinctive names that I could use as search terms. Also happily, the collection is a hybrid of print and digital material. My colleague searched the print archives and did not locate the material. Step two fell to me: digital reference.
The material I was searching was the data files from the donor’s laptop. Because the material is under restriction by policy, we had not done much processing beyond triage and understanding the basic nature of the content for our collection description. For the original acquisition, we had captured the data files using Guymager as an E01 disk image.
So, the steps for answering the reference request involved:
- Using BitCurator tools to process the E01 disk image.
- Running BulkExtractor and then Fiwalk to find the “features” of the data on the disk. This took a day.
- Attempting to run Fiwalk and Reports from BitCurator. This took a day and stalled out my computer (BitCurator running in VirtualBox timed out).
- Switching to Autopsy directly on my Windows machine to reprocess the E01 disk files and search for the specific terms using Autopsy’s feature extraction tool.
- This took a day but was successful.
- I was able to preview the files located by the search and see that almost all the hits were valid returns and would help to answer the reference request.
- Using the Export feature, I selected the relevant hits and exported a set of files for later.
- Using Autopsy, I also located the PST and OST files on the disk image and exported them to my temporary reference storage area on my off-line computer.
- Using Mozilla Thunderbird, I planned to search the email and attachments for more files that might help answer the reference request.
- In order to search the email files, I needed to transform the PST and OST files into the non-proprietary MBOX format. This took about 2 hours.
- Once I had the files in MBOX format, I set up a temporary User Account and then, following instructions on how to open MBOX files in Thunderbird off-line, put the MBOX files into the appropriate User – Appdata – folders and restarted Thunderbird.
- Thunderbird took a while to parse the email: there were 2 years of messages and attachments in the main “Inbox” file. I also loaded the off-line email folder’s “Personal” file (with 10 years of selected and filed messages and attachments). This took the better part of a day.
- Once I had access to the email messages in Thunderbird (think Eudora Reader if you are not familiar with Thunderbird), I was also able to repeat my search using my reference terms and located another several dozen hits that included attachments.
- Finally, after I had copied the messages / attachments into my temporary reference question storage folder, I went through and de-duped the results from the 5 search sets.
- In order to provide the search results, we decided to provide read-only PDF versions to the requestor – all the data files were amenable to that kind of transformation; none were audio, video, spreadsheets with formulas, etc. We did notice that many of the early files (from 2005-2007) had “auto update” dates that were changing when opened for export to PDF.
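For anyone who would rather script the last two stages than do them by hand, here is a minimal Python sketch of the email search and the de-duplication. This is not what I actually ran (I used Thunderbird and went through the results manually); it assumes the PST/OST files have already been converted to MBOX and that the exported hits sit in one folder. The function names (`search_mbox`, `dedupe_by_hash`) and any paths are hypothetical, and it relies only on Python’s standard `mailbox` and `hashlib` modules.

```python
import hashlib
import mailbox
from pathlib import Path


def message_text(message):
    """Collect the subject plus all decoded text/* parts of one message."""
    pieces = [str(message.get("Subject", ""))]
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue  # binary attachments are not searched in this sketch
        payload = part.get_payload(decode=True)
        if not payload:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            pieces.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label in the message; fall back to UTF-8.
            pieces.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(pieces)


def search_mbox(mbox_path, terms):
    """Return (subject, date, matched_terms) for messages containing any term."""
    lowered = [term.lower() for term in terms]
    hits = []
    for message in mailbox.mbox(mbox_path, create=False):
        text = message_text(message).lower()
        matched = [term for term in lowered if term in text]
        if matched:
            subject = str(message.get("Subject", ""))
            date = str(message.get("Date", ""))
            hits.append((subject, date, matched))
    return hits


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large attachments do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_by_hash(folder):
    """Split files under folder into (first copies, byte-identical repeats)."""
    first_seen = {}
    duplicates = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        file_hash = sha256_of(path)
        if file_hash in first_seen:
            duplicates.append(path)
        else:
            first_seen[file_hash] = path
    return list(first_seen.values()), duplicates
```

Two caveats. The search only looks at the subject line and text parts, so terms that appear only inside binary attachments (Word files, PDFs) would be missed. And hashing only catches byte-identical copies: a document whose “auto update” date field changed when it was opened would hash differently and still need a manual look.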
How does this compare with your process for providing reference to data that is captured as a disk image? I’d love to get your feedback. Contact me via the link on this page.