Guide to “Checklist for USA federal data backups”

Introduction

This guide provides further information and context for the “Checklist for USA federal data backups.”

Identify the data you are working with

  • ID-1: Data and publications produced by the US federal government are typically in the public domain, so you can freely use, adapt, and copy them without copyright concerns. Other sources, such as publications produced by state and local governments or data produced by non-US countries or NGOs, may not be in the public domain.
  • ID-2: Many sources reuse data that was originally produced by the US federal government, such as census or environmental data.
  • ID-3: Document this information in your README file and/or methods section. Giving only the URL of the dataset is not sufficient to find it later, particularly if the URL is changed or removed. Instead, give the full name of the dataset, the name of the program that produced the data, and how you obtained the data. For instance, for the Walkability Index dataset, you can use the data.gov entry to find the specific EPA office that produced it: “U.S. Environmental Protection Agency, Office of Sustainable Communities.” Also document both the URL and the title of the webpage you used to obtain the data (note that data.gov entries such as this example point you to agency websites to download data) and the date you downloaded the data. For more complex and interactive data products, additionally document how you downloaded or obtained the data you are using, including the specific queries, search terms, or scripts you used. See the section on citing data for more on data citations in bibliographies.
  • ID-4: As part of your README and/or methods section, note what the dataset contains and which specific data you are using from it. For example, if you are using the Cambridge, Massachusetts data from the Walkability Index dataset, document this. List which data you are using and for which part of your analysis. This will help with future reproducibility and documentation of your work, particularly if the scope or content of the dataset changes in the future.
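
As an illustration of ID-3 and ID-4, the sketch below renders one README entry from the fields those items ask you to record. The class name, field names, and entry layout are assumptions made for this example, not a required format, and the URL, page title, and date in the sample record are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetProvenance:
    """Fields suggested by ID-3 and ID-4; the names are illustrative only."""
    dataset_name: str
    producer: str
    source_url: str
    page_title: str
    downloaded_on: date
    how_obtained: str
    subset_used: str
    used_for: str


def readme_entry(p: DatasetProvenance) -> str:
    """Render one plain-text README block describing a dataset."""
    lines = [
        f"Dataset: {p.dataset_name}",
        f"Produced by: {p.producer}",
        f"Obtained from: {p.page_title} ({p.source_url})",
        f"Date downloaded: {p.downloaded_on.isoformat()}",
        f"How obtained: {p.how_obtained}",
        f"Data used: {p.subset_used}",
        f"Used for: {p.used_for}",
    ]
    return "\n".join(lines)


example = DatasetProvenance(
    dataset_name="Walkability Index",
    producer="U.S. Environmental Protection Agency, Office of Sustainable Communities",
    source_url="https://example.org/walkability-download",  # placeholder URL
    page_title="Example download page title",               # placeholder title
    downloaded_on=date(2024, 12, 5),                         # placeholder date
    how_obtained="Direct download from the agency page linked from data.gov",
    subset_used="Rows for Cambridge, Massachusetts",
    used_for="Neighborhood comparison in the analysis",
)
print(readme_entry(example))
```

Keeping the entry as plain text means it can be pasted directly into a README and read without any tooling.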

Confirm data availability

  • DA-1: To check whether the Internet Archive has archived a web page, navigate to the Wayback Machine and enter the URL of the page you are interested in. Check the calendar of saved web pages for a recent copy of the page and, if there is one, check whether the dataset you are using can be downloaded from the archived page. Also double-check the date and version of any data you find; the archived copy may be out of date or may not be the same version you are using. If you are not sure of the specific URL, you can also search the Internet Archive’s collection of government web pages by keyword. Searching in the Internet Archive also searches pages captured by the End of Term archive project. Additionally, the Federal Depository Library Program maintains a web archive in conjunction with Archive-It: https://archive-it.org/home/FDLPwebarchive.
  • DA-2: For example, US Census data is archived by ICPSR. Other major datasets may also have duplicate copies or mirrors, which you can find through professional networks or by searching. Ask your library if you would like help with searching.
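
The Wayback Machine check described in DA-1 can also be scripted. The sketch below assumes the Internet Archive's public availability endpoint at https://archive.org/wayback/available and the JSON layout it returned at the time of writing; the function names are hypothetical, and the endpoint's behavior may change. It only reports whether a snapshot exists, so you still need to open the archived page and confirm that the dataset itself is downloadable and is the version you expect.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[dict]:
    """Return the closest snapshot's URL and timestamp, or None if absent."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return {"url": snap["url"], "timestamp": snap["timestamp"]}
    return None


def check_wayback(page_url: str, timestamp: Optional[str] = None) -> Optional[dict]:
    """Query the service over the network and parse the response."""
    with urlopen(availability_query(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check_wayback("https://example.org/data")` returns a small dictionary when a capture exists and `None` otherwise; compare the returned timestamp with the date of the data you are actually using.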

Making backups

  • BU-1: If the page is not already saved, you can enter the web page’s URL in the “Save Page Now” box on the Wayback Machine’s homepage to save a snapshot of the page. Select “save outlinks” to also save the pages linked from it, including linked data; note that interactive tools may not be backed up or may not function correctly in the archive. You do not need an Internet Archive account to do this. The Wayback Machine also provides browser extensions that allow you to check and save pages as you browse. In addition, for the time period around the presidential election and transition, you can nominate specific governmental webpages (whether they host data or not) to be included in the End of Term archive, which takes a snapshot of government webpages as they appear at the end of a presidential term. Interactive databases (particularly those accessed through interfaces such as Tableau) are difficult to capture. To nominate interactive or complex databases, use the End of Term database nomination form.
  • BU-2: To back up a link that leads to code using the Software Heritage project, go to https://archive.softwareheritage.org/save/ and enter the URL and version control system. You can also browse already-added code. For backing up your own code, see RR-1, below.
  • BU-3: You may be able to find this information in data.gov, which lists and documents US federal government datasets.
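
If you have many pages or repositories to submit, the save requests in BU-1 and BU-2 can be sketched in a script. This example assumes two request patterns that are not described in this guide: a GET request to https://web.archive.org/save/ followed by the page URL, and a POST request to Software Heritage's origin save API. Both services may rate-limit, require authentication for heavier use, or change these patterns, so treat the helper names and URL layouts as assumptions to verify against each service's own documentation. The simple Wayback request does not enable the “save outlinks” option.

```python
from urllib.request import Request, urlopen


def wayback_save_url(page_url: str) -> str:
    """Assumed Save Page Now pattern: the page URL appended to /save/."""
    return "https://web.archive.org/save/" + page_url


def software_heritage_save_url(repo_url: str, visit_type: str = "git") -> str:
    """Assumed Software Heritage save endpoint for a repository origin.

    visit_type names the version control system, for example "git".
    """
    return (
        "https://archive.softwareheritage.org/api/1/origin/save/"
        f"{visit_type}/url/{repo_url}/"
    )


def submit(url: str, method: str) -> int:
    """Send the request over the network and return the HTTP status code."""
    req = Request(url, method=method, headers={"User-Agent": "data-backup-sketch"})
    with urlopen(req, timeout=120) as resp:
        return resp.status


# Hypothetical usage (performs real network requests, so it is left commented out):
# submit(wayback_save_url("https://example.org/data"), "GET")
# submit(software_heritage_save_url("https://example.org/repo.git"), "POST")
```

A successful status code only means the request was accepted; check the Wayback Machine or Software Heritage afterwards to confirm that the capture contains what you need.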

Maintaining re-usability and reproducibility

  • RR-1: If your code is on GitHub, there are connectors between GitHub and the data repositories Zenodo and Dataverse. Archiving your GitHub repository in a data repository gives your code a citable persistent identifier and allows it to be more easily associated with other elements of your research project or scholarship. If your code is not in a version control system, you can deposit zip files or other source code files in a data repository; be sure to document what the code is for, its dependencies, and how it is installed, as well as license and contact information and the other metadata points listed above for datasets. More information: Turing Way, Research Software Alliance (ReSA), Software Preservation Network, Software Sustainability Institute (SSI).
  • RR-2: To get the web archive link from the Internet Archive, navigate to the Wayback Machine, search for the URL of your web page, and then choose the most recent or most appropriate date from the calendar view. You will get an Internet Archive link that includes the original URL. You can use this link in your citation, with or without the original URL, along with the date of capture.
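
Because the archive link in RR-2 embeds both the capture time and the original URL, the pieces needed for a citation can be recovered from the link itself. The sketch below assumes the common link layout https://web.archive.org/web/ followed by a 14-digit timestamp and the original URL; the function names and the citation wording are illustrative, not a prescribed citation style.

```python
import re
from datetime import datetime
from typing import Tuple

# Assumed layout: https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_PATTERN = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_url(archived_url: str) -> Tuple[str, datetime]:
    """Return the original URL and the capture time encoded in the link."""
    match = WAYBACK_PATTERN.match(archived_url)
    if match is None:
        raise ValueError(f"Not a recognized Wayback Machine link: {archived_url}")
    timestamp, original_url = match.groups()
    return original_url, datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def citation_fragment(title: str, archived_url: str) -> str:
    """Combine title, original URL, capture date, and archive link."""
    original_url, captured = split_wayback_url(archived_url)
    return (
        f"{title}. {original_url}. "
        f"Archived {captured.date().isoformat()} at {archived_url}."
    )


link = "https://web.archive.org/web/20241205120000/https://example.org/data"
print(citation_fragment("Example Dataset", link))
```

Adapt the output to whatever citation style your venue requires; the point is that the capture date and original URL travel with the archive link.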

We’d be honored for you to reuse or link to our content, and we’d appreciate it if you’d credit the MIT Libraries as the source. Please cite as:

Guide to “Checklist for USA Federal Data Backups” by Data Management Services. Copyright © 2024-12-05 MASSACHUSETTS INSTITUTE OF TECHNOLOGY is licensed under a Creative Commons Attribution 4.0 International License except where otherwise noted. [https://creativecommons.org/licenses/by/4.0/]. Access at https://libraries.mit.edu/data-management/store/backups/checklist-usa-guide/