File formats for long-term access
As technology changes, researchers should plan for both hardware and software obsolescence and consider the longevity of their file format choices to ensure long-term readability and access.
File formats more likely to be accessible in the future have the following characteristics:
- Open (non-proprietary): Files should be stored in open formats, where possible. Open formats can be used by a variety of proprietary, free, and open-source software tools rather than just a single piece of software. These formats are far more likely to remain usable over the long term even if the software that created them is not available or no longer functional.
- Common usage by the research community: A file format relied upon by a large user group gives its users many more options. Bear in mind how widely a format is used and supported in the wider world, and find out what organisations similar to yours are doing so that you can share best practice in selecting formats. Wide adoption of a format can give you more confidence in your preservation strategy.
- Documented: File formats with published documentation and standards will be easier to preserve and access in the future. Open formats are more likely to have publicly available documentation. Check whether the format is listed in PRONOM, the file format registry maintained by the National Archives of the United Kingdom.
- Lossless: Lossy formats discard some of the original data during encoding, usually to reduce file size, and the discarded data cannot be recovered. In contrast, a “lossless” format preserves the original data in full, so there is no loss of quality after the file is decompressed.
- Unencrypted: If the encryption key, passphrase, or password to a file is lost, there may be no way to retrieve the data from the file later, rendering it unusable to others.
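To make the “lossless” characteristic concrete, here is a minimal sketch of a round trip through a lossless compressor, using Python's standard `zlib` module. The sample bytes are hypothetical and stand in for the contents of a research data file; this is an illustration of the idea, not a preservation workflow.

```python
import zlib

# Hypothetical sample data standing in for the contents of a research file.
original = b"sample_id,reading\n" + b"A1,0.125\n" * 1000

# zlib implements DEFLATE, a lossless compression scheme.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the decompressed bytes are identical to the original bytes.
is_lossless = restored == original

# The compressed copy is smaller, yet nothing was thrown away.
is_smaller = len(compressed) < len(original)

print(is_lossless, is_smaller)
```

A lossy format cannot pass this kind of check, because the data it discards during encoding is not stored anywhere in the file.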
Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format. Note that in some cases, migrating data to an open format may cause loss of data or metadata.
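The risk of loss during migration can be checked by comparing the migrated copy against the original. The sketch below is an illustration only: the records, field names, and helper functions are hypothetical, and CSV is used simply as an example of an open format. It exports typed values to CSV with Python's standard `csv` module, reads them back, and lists the fields that no longer match.

```python
import csv
import io

# Hypothetical records as they might exist inside the original software,
# with typed values (int, float, and None for a missing reading).
records = [
    {"sample_id": "A1", "count": 12, "reading": 0.125},
    {"sample_id": "A2", "count": 7, "reading": None},
]
FIELDS = ["sample_id", "count", "reading"]


def export_to_csv(rows):
    """Write rows to CSV text (hypothetical helper for this sketch)."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text):
    """Read CSV text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


migrated = read_csv(export_to_csv(records))

# Collect every (sample_id, field) pair whose value changed in migration.
changed = [
    (old["sample_id"], key)
    for old, new in zip(records, migrated)
    for key in FIELDS
    if old[key] != new[key]
]

print(migrated[0]["count"])  # the integer 12 comes back as the string "12"
print(changed)
```

In this sketch the values themselves survive, but the type information does not: numbers come back as text, and the missing reading becomes an empty string. This is one reason to keep a copy in the original software format and to document data types alongside the migrated file.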
The Library of Congress provides several resources that can help you decide how to prepare your data for long-term preservation:
- Sustainability of Digital Formats website
- Recommended Formats Statement
- Summary of Digital Format Preferences
Repositories may also provide a list of, or guidance on, preferred formats.
Here are some examples of recommended file formats for different content types.
| Content type | Recommended file formats |
|---|---|
| Text | |
| Tabular data | |
| Statistical data | |
| Geospatial | |
| Image | |
| Audio | |
| Video | |
Note: This is not an exhaustive list of formats suitable for preservation and access. The ability to support the preservation of a particular file format will vary across institutions and data repositories. If you are curious about a particular file format, the Library of Congress (LOC) documents a comprehensive list of file formats. The documentation for each file format includes a section on “sustainability factors” describing “the suitability of particular digital formats for the purposes of preserving digital information as an authentic resource for future generations”.
* If you deposit your data in a repository, your files may be migrated to newer formats so that they remain usable by future researchers.