Checklist for USA federal data backups
Introduction
The United States (US) federal government collects, aggregates, and disseminates a large volume of information and data. This content is used by researchers, policymakers, and many others for various purposes.
Protecting access to US federal government data between and during presidential administrations is important. Data can disappear because of government shutdowns, broken links, and policy shifts.
This checklist provides steps you can take to ensure the government data you use in your research remains accessible to you and others.
As you use the checklist, you can find further information and context in our accompanying guide. Anchored links to the guide are indicated by ♥.
Identify the data you are working with
- Identify and document the data produced, hosted, or maintained by the US federal government that you are using and want to safeguard. ♥ [ID-1]
- Consider if you are using a model or data source that is based on US federal government-produced data. ♥ [ID-2]
- It may be useful to make a table as you document how you accessed the data. Include the dataset title, URL, which specific agency and program produced it, the date you accessed it, and any additional access method information. ♥ [ID-3]
- Document what the dataset contains and what data you are using. ♥ [ID-4]
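One way to keep the table suggested above is a small CSV inventory. The following is a minimal Python sketch, not part of the original checklist: the column names, the `make_entry` and `write_inventory` helpers, and the example values (including the example.gov URL) are all hypothetical and should be adapted to your own project.

```python
import csv
import io
from datetime import date

# Column names are illustrative assumptions based on the fields the checklist lists.
FIELDS = ["title", "url", "agency", "program",
          "date_accessed", "access_method", "contents_used"]


def make_entry(title, url, agency, program, access_method,
               contents_used, date_accessed=None):
    """Build one inventory row; the access date defaults to today."""
    return {
        "title": title,
        "url": url,
        "agency": agency,
        "program": program,
        "date_accessed": (date_accessed or date.today()).isoformat(),
        "access_method": access_method,
        "contents_used": contents_used,
    }


def write_inventory(entries, handle):
    """Write the rows as CSV, with a header, to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)


# Hypothetical example row; none of these values describe a real dataset.
entry = make_entry(
    title="Example Dataset Title",
    url="https://example.gov/data/example.csv",
    agency="Example Agency",
    program="Example Program",
    access_method="Direct download",
    contents_used="Columns A and B",
    date_accessed=date(2025, 1, 15),
)

buffer = io.StringIO()
write_inventory([entry], buffer)
inventory_csv = buffer.getvalue()
print(inventory_csv)
```

To keep the inventory on disk, pass a file opened with `open("inventory.csv", "w", newline="")` instead of the in-memory buffer.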
Confirm data availability
- Check if the data has already been deposited in a non-governmental data repository. If the data is already preserved in a reliable place, making a backup of the data may not be necessary. Places to check:
  - Non-interactive datasets hosted on government websites may already be backed up in the Internet Archive’s Wayback Machine, which captures webpages. ♥ [DA-1]
  - Some large data products are duplicated by non-profits or research projects. Check any that are commonly used in your community. ♥ [DA-2]
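The Wayback Machine check can also be scripted. The sketch below assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and its documented JSON reply with an `archived_snapshots.closest` entry. The function names are hypothetical, and you should confirm the current behavior against the Internet Archive's own documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's Wayback availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the query URL; timestamp is optional, e.g. '20250115'."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed reply, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_wayback(page_url, timeout=30):
    """Ask the Wayback Machine whether a page has a capture (needs network)."""
    with urlopen(availability_query(page_url), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

Calling `check_wayback("https://example.gov/data")` returns a snapshot URL if a capture exists and `None` otherwise. A capture of the landing page does not guarantee that the data files linked from it were captured too, so open the snapshot and confirm.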
Making backups
- Back up governmental webpages and non-interactive datasets that are hosted on them in the Internet Archive’s Wayback Machine, or in projects such as the End of Term Archive. ♥ [BU-1]
- For code that is in a version control system and on the web, use the Software Heritage project to back it up. ♥ [BU-2]
- If the data are not complex, restricted, or very large (>1 TB), you can make a local copy. For the data to be useful to you, your team, or your community, include as much information about the data as possible so that it is findable and reusable in the future. For any data you copy, include: ♥ [BU-3]
  - Actual, complete title of dataset
  - Agency name that produced the data
  - Program or office name
  - Website URLs, including both the data.gov URL, if applicable, and the URL where the data are hosted
  - Date downloaded
  - Method of access (you may have already recorded this when identifying the data)
  - File names for data downloaded
  - Identifiers associated with the dataset, e.g.,
    - DOI
    - If you are using data.gov, open the Data.json Metadata and look for the “identifier” value
    - Anything else that looks like an identifier and might help you identify the data in the future
  - Note the license and any access and sharing restrictions. Do not share restricted data or data containing personally identifiable information (PII). See Confidentiality & intellectual property | Data management.
  - Additionally, save a current copy of the federal webpage that points to the raw data in the Internet Archive. ♥ [BU-1]
  - Additional things to document:
    - Coverage dates
    - Size
    - Format
    - Version
    - Description
    - GeoLocation/Spatial coverage
    - Related Items
    - Checksum
- Consider putting a copy of the raw data that you just backed up in a data repository. For more information about data repositories, see Find a data repository | Data management.
- For larger or interactive projects with field-wide importance, you may wish to consult with colleagues about how your field is preserving this data.
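For a local copy, the checksum, size, and descriptive fields listed above can be stored in a small metadata file next to the data. The following is a minimal sketch under stated assumptions: SHA-256 is one reasonable checksum choice (the checklist does not name an algorithm), and the `.metadata.json` suffix, the field names, and the helper functions are hypothetical. The demo runs on a throwaway temporary file rather than real federal data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum by reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_path, metadata):
    """Save the metadata plus file name, size, and checksum beside the data file."""
    data_path = Path(data_path)
    record = dict(metadata)
    record["file_name"] = data_path.name
    record["size_bytes"] = data_path.stat().st_size
    record["checksum_sha256"] = sha256_of(data_path)
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


# Demo with a temporary stand-in file; all metadata values are placeholders.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "example.csv"
    data_file.write_bytes(b"hello")
    sidecar_path = write_sidecar(data_file, {
        "title": "Example Dataset Title",
        "agency": "Example Agency",
        "program": "Example Program",
        "source_url": "https://example.gov/data/example.csv",
        "date_downloaded": "2025-01-15",
        "identifier": "example-identifier",
        "license": "unknown",
    })
    saved = json.loads(sidecar_path.read_text(encoding="utf-8"))

print(saved["checksum_sha256"])
```

Recomputing the checksum later and comparing it with the stored value is a simple way to confirm that the copy has not been altered or corrupted.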
Maintaining re-usability and reproducibility
Whew! Well done! Now that you’ve done all that for the data you use from the feds, you might be thinking about the data that you produce. Here are some more ways to take care of your future self and your research materials, whether they come from the government, someplace else, or your own lab:
- Storage and backups
- Documenting your data
- Longer term storage and access for data and code ♥ [RR-1]
- Data citation: In your article, cite the data and code you have used and produced in support of your argument or thesis in ways that enable credit and findability:
  - Author or creator: the entity/entities responsible for creating the data
  - Date of publication: the date the data was published or otherwise released to the public
  - Title: the title of the dataset or a brief description of it if it’s missing a title
  - Publisher: entity responsible for hosting the data (like a repository or archive)
  - Persistent URL (such as a DOI): a link that points to the data
  - Use web archive links in your citation for URLs that are not persistent identifiers (e.g., a DOI). ♥ [RR-2]
  - Date accessed: since most data are published without versions, it’s important to note when you accessed the data in case newer releases are made over time.
- Ask your local experts for help, resources, and guidance
  - MIT-affiliated researchers can reach out to Data Management Services at MIT for help or advice, at data-management@mit.edu.
  - If you’re not at MIT, go to your library and look for data or research services. You may be pleasantly surprised!
- Help make this resource better
  - If you have suggestions or additions for this page, please fill out this form.
  - If you found these resources helpful or you republish them on your site, please let us know by filling out this form.
We’d be honored for you to reuse or link to our content, and we’d appreciate it if you’d credit the MIT Libraries as the source. Please cite as:
Checklist for USA Federal Data Backups by Data Management Services. Copyright © 2024-12-05 MASSACHUSETTS INSTITUTE OF TECHNOLOGY is licensed under a Creative Commons Attribution 4.0 International License except where otherwise noted. [https://creativecommons.org/licenses/by/4.0/]. Access at https://libraries.mit.edu/data-management/store/backups/checklist-usa/