Image: Jørgen Stamp, Digitalbevaring.dk, licensed under Creative Commons Attribution 2.5 Denmark.
In an effort to inform my colleagues and keep myself up to date about the tools and systems we are using or assessing for our digital archives collections, I have been giving open presentations to MIT Libraries staff.
The first presentation was an overview of our needs, ecosystem, and a brief summary of each of the seven tools/systems to discuss. Here is the presentation file: DATools_MIT_overview_pt1
Other presentations covered the basics of ArchivesSpace, Archivematica, Access to Memory (AtoM), and BitCurator, and how we are thinking about where they fit in our Digital Archives Ecosystem.
Today I had the pleasure of contributing to the Radio Boston program, hosted by Meghna Chakrabarti and with Bina Venkataraman as her guest.
For information about how to preserve your digital data, follow up with these resources:
Personal Archiving at the Library of Congress
The Signal, blog at the Library of Congress
Taking time each year to go through your digital files and decide what to keep and what to delete can be easier if you do it during Preservation Week. Check for programs, workshops, and activities happening in your local area. Preservation Week is usually the last week in April; in 2016 it will be April 24-30.
In April, for Preservation Week 2015 at MIT Libraries, Jessica Venlet, Fellow for Digital Archives, and I hosted an activity, Connect the Clues! Exploring Personal Archives. Read her Connect the Clues! blog post about the activity, which links to the presentation slides and resources handout. Check out these tips for archiving your content from social media sites, steps for finding, selecting, storing, and managing your digital files, and some information on legal and digital estate planning: Tips for Personal Digital Archiving.
Thanks to Radio Boston for hosting a great segment on a topic that affects all creators and users of digital data!
This month I received a reference request that involved a digital collection we acquired some months back. It’s not an unusual type of reference question but I thought I’d blog about it as a use case for digital archives. The request came from the office of origin so we provided the reference directly.
The requestor wanted to know if the collection contained correspondence about a specific topic. The date range was 2006-2008 and perhaps also 2012-2013. Happily, I had three unique names that I could use as search terms. Also happily, the collection is a hybrid of print and digital material. My colleague searched the print archives and did not locate the material. Step two fell to me and digital reference.
The material I was searching was the data files from the donor's laptop. Because the material is under restriction by policy, we had not done much processing beyond triage and understanding the basic nature of the content for our collection description. For the original acquisition, we had captured the data files using Guymager as an E01 disk image.
So, the steps for answering the reference request involved:
- Using BitCurator tools to process the E01 disk image.
- Running BulkExtractor and then Fiwalk to find the “features” of the data on the disk. This took a day.
- Attempting to run Fiwalk and Reports from BitCurator. This took a day and stalled out my computer (running BitCurator in VirtualBox timed out).
- Switched to using Autopsy directly on my Windows machine to reprocess the E01 disk files and search for the specific terms using Autopsy’s feature extraction tool.
- This took a day but was successful.
- I was able to preview the files located by the search and see that almost all the hits were valid returns and would help to answer the reference request.
- Using the Export feature, I selected the relevant hits and exported a set of files for later.
- Using Autopsy I also located the PST and OST file from the disk image and exported them to my temporary reference storage area on my off-line computer.
- Using Mozilla Thunderbird, I planned to search the email and attachments for more files that might help answer the reference request.
- In order to search the email files, I needed to transform the PST and OST files into the non-proprietary MBOX format. This took about 2 hours.
- Once I had the files in MBOX format, I set up a temporary User Account and then, following instructions on how to open MBOX files in Thunderbird off-line, put the MBOX files into the appropriate User – Appdata – folders and restarted Thunderbird.
- Thunderbird took a while to parse the email – there were 2 years of messages and attachments in the main “Inbox” file. I also loaded the off-line email folder's “Personal” file (with 10 years of selected and filed messages and attachments). This took the better part of a day.
- Once I had access to the email messages in Thunderbird (think Eudora Reader if you are not familiar with Thunderbird), I was also able to repeat my search using my reference terms and located another several dozen hits that included attachments.
- Finally, after I had copied the messages / attachments into my temporary reference question storage folder, I went through and de-duped the results from the 5 search sets.
- In order to provide the search results, we decided to provide read-only PDF versions to the requestor – all the data files were amenable to that kind of transformation; none were audio, video, spreadsheets with formulas, etc. We did notice that many of the early files (from 2005-2007) had “auto update” dates that were changing when opened for export to PDF.
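For readers who would rather script the MBOX search and the de-duping step than do them by hand, here is a minimal sketch in Python using only the standard library. It is not what I ran for this request (I used Thunderbird and went through the results manually). The function names `search_mbox` and `dedupe_by_hash` are made up for illustration, and the sketch assumes the PST/OST files have already been converted to MBOX and the hits exported to a single folder.

```python
import hashlib
import mailbox
from pathlib import Path


def search_mbox(mbox_path, terms):
    """Yield (subject, date, matched_terms) for each message mentioning any term.

    Hypothetical helper: a case-insensitive substring match over a few
    headers and the text parts of each message. Non-text attachments
    (PDF, Word, etc.) are NOT searched.
    """
    lowered = [t.lower() for t in terms]
    box = mailbox.mbox(mbox_path, create=False)  # do not create a missing file
    try:
        for msg in box:
            haystack = [str(msg.get(h, "")) for h in ("Subject", "From", "To")]
            for part in msg.walk():
                if part.get_content_maintype() == "text":
                    payload = part.get_payload(decode=True) or b""
                    # Assumption: decode as UTF-8 and replace anything odd,
                    # rather than honoring each part's declared charset.
                    haystack.append(payload.decode("utf-8", errors="replace"))
            text = "\n".join(haystack).lower()
            hits = [t for t, low in zip(terms, lowered) if low in text]
            if hits:
                yield str(msg.get("Subject", "")), str(msg.get("Date", "")), hits
    finally:
        box.close()


def dedupe_by_hash(folder):
    """Return {sha256: [paths]} for files whose exact content occurs more than once."""
    seen = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen.setdefault(digest, []).append(path)
    return {digest: paths for digest, paths in seen.items() if len(paths) > 1}
```

Calling `list(search_mbox("Inbox", ["name one", "name two"]))` would list the matching messages, and `dedupe_by_hash("reference_export")` would group byte-identical files (both paths here are placeholders). Hashing only catches exact copies: the same message exported once from Autopsy and once from Thunderbird may differ in format and would still need a manual check.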
How does this compare with your process for providing reference to data that is captured as a disk image? I’d love to get your feedback. Contact me via the link on this page.