MIT Libraries

Data Management and Publishing

 

Organizing Your Files

File Version Control

Keeping track of versions of documents and datasets is critical. Strategies include:

  • Directory Structure Naming Conventions
  • File Naming Conventions

Always record every change to a file, no matter how small. Discard obsolete versions only after making backups.

Directory Structure Naming Conventions

When organizing files, the name of the top-level directory should include the project title, a unique identifier, and the date (year).

The substructure should follow a clear, documented naming convention; for example, one folder for each run of an experiment, each version of a dataset, and/or each person in the group.
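The directory convention above can be sketched in a few lines of Python. This is only an illustration under assumed choices: the function names, the underscore and hyphen separators, the zero-padded run number, and the sample project "Coral Growth" with identifier "P0042" are all hypothetical and not prescribed by this guide.

```python
from pathlib import PurePosixPath


def top_level_name(title: str, identifier: str, year: int) -> str:
    """Join project title, unique identifier, and year into one folder name."""
    # Assumed convention: lowercase the title and replace spaces with hyphens.
    slug = "-".join(title.lower().split())
    return f"{slug}_{identifier}_{year}"


def run_dir(title: str, identifier: str, year: int, run_number: int) -> PurePosixPath:
    """Return the subfolder path for one run of an experiment."""
    top = top_level_name(title, identifier, year)
    # Zero-padded run numbers keep folders in order when sorted by name.
    return PurePosixPath(top) / f"run-{run_number:03d}"


print(run_dir("Coral Growth", "P0042", 2024, 7))
# prints: coral-growth_P0042_2024/run-007
```

Whatever pattern a group chooses, writing it down once (for instance in a README in the top-level folder) is what makes the convention "documented".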

File Naming Conventions

  • Reserve the 3-letter file extension for application-specific codes, for example, formats like .wrl, .mov, and .tif.
  • Identify the activity or project in the file name.
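As a rough sketch of these two rules, the Python function below builds a file name that identifies the project and activity and keeps the extension for the format code alone. The function name, the `_v03` version suffix, the separators, and the sample values are assumptions made for illustration, not a convention defined by this guide.

```python
import re


def build_file_name(project: str, activity: str, version: int, extension: str) -> str:
    """Build a name like project_activity_v03.tif from its parts."""

    def clean(part: str) -> str:
        # Collapse anything that is not a letter or digit into a hyphen,
        # so the only dot in the name is the one before the extension.
        return re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-").lower()

    # The extension carries only the application-specific format code.
    ext = extension.lower().lstrip(".")
    # Assumed convention: a two-digit version number tracks file versions.
    return f"{clean(project)}_{clean(activity)}_v{version:02d}.{ext}"


print(build_file_name("Coral Growth", "site survey", 3, ".tif"))
# prints: coral-growth_site-survey_v03.tif
```

Putting the version number in the name, rather than overwriting the file, is one simple way to combine file naming with the version tracking described earlier on this page.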

File Renaming

Use free tools to help you:

File Naming Conventions for Specific Disciplines

Many disciplines have recommendations, for example:

Data Identifiers for Sharing Your Data

The information at the beginning of this page will help you organize your datasets for your own use. If you want to share or cite your data, however, consider using a more sophisticated naming scheme. You'll want to put your datasets where other people can access them, and give your datasets identifiers that can be referenced easily.

Data identifiers must be globally unique and persistent. That is to say, they must not be repeated elsewhere and they must not change over time.

There are many different schemes:

  • PURL -- A PURL is a Persistent Uniform Resource Locator. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL resolution service associates the PURL with the actual URL and returns that URL to the client.
  • DOI -- A DOI (Digital Object Identifier) is a name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks.
  • ACCESSION -- Accession numbers used by the National Center for Biotechnology Information (NCBI) are unique and citable.
  • InChI -- The IUPAC International Chemical Identifier (InChI™) is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources thus enabling easier linking of diverse data compilations.
  • URI -- A Uniform Resource Identifier (URI) is a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols.
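The indirection behind a PURL, described in the list above, can be illustrated with a small in-memory model. This is a sketch only: the `PurlResolver` class and the example.org addresses are hypothetical, and a real resolution service answers HTTP requests rather than looking up a Python dictionary.

```python
class PurlResolver:
    """Toy stand-in for a PURL resolution service."""

    def __init__(self) -> None:
        # Maps each persistent identifier to the resource's current URL.
        self._targets: dict[str, str] = {}

    def register(self, purl: str, url: str) -> None:
        """Create a PURL or update the location it points to."""
        self._targets[purl] = url

    def resolve(self, purl: str) -> str:
        """Return the actual URL currently associated with the PURL."""
        return self._targets[purl]


resolver = PurlResolver()
purl = "https://purl.example.org/dataset/42"  # hypothetical identifier

resolver.register(purl, "https://old-host.example.org/data/42.csv")
# The dataset moves; only the resolver's record changes, not the PURL.
resolver.register(purl, "https://new-host.example.org/archive/42.csv")

print(resolver.resolve(purl))
# prints: https://new-host.example.org/archive/42.csv
```

Because citations quote the PURL and never the underlying location, the identifier stays persistent even when the data is relocated.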

 

For advice on a data management project, contact:

data-management@mit.edu




Text licensed under Creative Commons, unless otherwise noted. All other media all rights reserved unless otherwise noted.