
Research Data Management

Metadata & Documentation

Documenting Your Research 

Documentation is an important part of data management. Your data is only useful to yourself and others if you have adequately described your dataset and documented your processes. This includes describing when, why and how the data was collected or generated, what the variables mean, how it was analyzed and how the final dataset was created. 

Documentation is best started at the beginning of your research journey and maintained throughout the project to ensure accuracy and thoroughness.

ReadMe Files

A readme file provides information about a dataset and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. ReadMe files are usually formatted as text files to prolong their lifespan and ensure accessibility. There is no single standard for readme files, but they should include:

  • Data and file overview: for each file, the file name, a short description of the data it contains, and the date the file was created
  • Licenses or restrictions placed on the data
  • Methodological information, including a description of the methods used for data collection/generation and processing
  • Data-specific information for each dataset or file (as appropriate), including:
    • Variable list, including full names and definitions of column headings for tabular data
    • Units of measurement
    • Definitions for codes or symbols used to record missing data
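The elements above can be sketched as a small script that assembles a plain-text readme. This is only an illustration, not a standard template: the function name `build_readme`, the section headings, and all sample values (file name, license, variable, missing-data code) are assumptions made for the example.

```python
def build_readme(title, files, license_text, methods, variables, missing_codes):
    """Return plain-text readme content covering the elements listed above.

    The parameter names and section headings are illustrative assumptions,
    not part of any readme standard.
    """
    lines = [f"README: {title}", ""]

    # Data and file overview: name, short description, creation date.
    lines.append("DATA AND FILE OVERVIEW")
    for name, description, created in files:
        lines.append(f"  {name}: {description} (created {created})")

    # Licenses or restrictions placed on the data.
    lines += ["", "LICENSES AND RESTRICTIONS", f"  {license_text}"]

    # Methods for data collection/generation and processing.
    lines += ["", "METHODOLOGICAL INFORMATION", f"  {methods}"]

    # Variable list with definitions and units, then missing-data codes.
    lines += ["", "DATA-SPECIFIC INFORMATION", "  Variables:"]
    for name, definition, unit in variables:
        lines.append(f"    {name}: {definition} [unit: {unit}]")
    lines.append("  Missing data codes:")
    for code, meaning in missing_codes.items():
        lines.append(f"    {code}: {meaning}")

    return "\n".join(lines) + "\n"


# Hypothetical sample values, for illustration only.
readme = build_readme(
    title="Example survey dataset",
    files=[("results.csv", "Cleaned survey responses", "2022-04-01")],
    license_text="CC BY 4.0",
    methods="Online questionnaire; responses cleaned before analysis.",
    variables=[("age", "Age of respondent at time of survey", "years")],
    missing_codes={"-99": "Question not answered"},
)
print(readme)
```

Saving the returned string as a plain text file (for example, README.txt next to the data) keeps it readable without special software.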

Find more information on ReadMe files in the Guide to writing "readme" style metadata by the Research Data Management Service Group at Cornell University.

Codebooks and Data Dictionaries

Codebooks and data dictionaries are two forms of structured documentation used to define variables. They are related in function but differ in form, focus, and approach.

Codebooks

A codebook is a document commonly included with datasets in the social and behavioral sciences to help readers understand the contents and structure of those datasets. Codebooks open with front matter, such as the study title, names of the principal investigators, and an introduction to the data. They may also include methodological information if that is not documented elsewhere. However, the main content of a codebook is detailed definitions and descriptions of the variables in the dataset.

Codebooks are commonly included with studies where lengthy questionnaires, surveys, or similar instruments are used and result in large numbers of variables, often named with opaque alphanumeric codes. For each coded variable, a codebook offers the question text, what the data values mean (e.g. 1 = good, 2 = fair, etc., also called value labels), and sometimes additional information such as summary statistics or notes and comments about that variable.

Data Dictionaries

Data dictionaries are, in contrast, typically in tabular/spreadsheet form. A typical data dictionary might contain columns for variable name (exactly as it appears in the dataset), a more descriptive human-readable variable name, unit of measurement, allowed values, a definition of the variable, and additional explanation, comments, or notes for each variable. Data dictionaries are not exclusively intended for quantitative empirical data, but they are more suited for that purpose than codebooks, since they foreground the units and allowed/expected values of variables.

If either of these forms of documentation is suitable for your study and dataset(s), it is good practice to create and maintain it and to include it with your data when sharing. These documents are crucial when a research project has variables that are difficult to understand or need explanation.
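As a rough sketch, a data dictionary with the columns described above can be written as a CSV file using Python's standard `csv` module. The column names, the helper `write_data_dictionary`, and the two sample variables are assumptions made for illustration; adapt them to your own dataset.

```python
import csv
import io

# Column names are an assumed layout based on the description above.
COLUMNS = [
    "variable_name",   # exactly as it appears in the dataset
    "label",           # more descriptive, human-readable name
    "unit",
    "allowed_values",
    "definition",
    "notes",
]

# Hypothetical variables, for illustration only.
rows = [
    {
        "variable_name": "q12",
        "label": "Self-rated health",
        "unit": "",
        "allowed_values": "1 = good; 2 = fair",
        "definition": "Respondent's rating of their own health",
        "notes": "",
    },
    {
        "variable_name": "height_cm",
        "label": "Height",
        "unit": "cm",
        "allowed_values": "positive number",
        "definition": "Measured standing height",
        "notes": "Rounded to the nearest centimetre",
    },
]


def write_data_dictionary(rows, handle):
    """Write one row per variable, with a header row, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer is used here; a real file opened with
# open(path, "w", newline="") works the same way.
buffer = io.StringIO()
write_data_dictionary(rows, buffer)
print(buffer.getvalue())
```

Keeping the dictionary in a plain CSV file means it can be opened in any spreadsheet program and stored alongside the data it describes.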

File Naming & Versioning

Keep file names short and descriptive, and agree on and follow consistent conventions with your team. Here are some general guidelines and examples:

  • Agree upon a file naming convention early with your team when planning data management
  • Use a short, unique, and descriptive identifier such as an acronym of your project name or grant #. This will make your files easy to find.
    • Add a key term summarizing the content of the file to the file name, such as GrantProposal, Questionnaire, etc.
    • Don't repeat file name information from the folder above: 
      • DO: Survey >> Results OR Survey >> ConsentForms
      • DON'T: Survey >> SurveyResults OR Survey >> SurveyConsentForms
  • Dates: Always use YYYYMMDD or YYYY-MM-DD format for dates. This format is easiest for people to read and for systems to sort in chronological order
  • Use _ (underscores), - (hyphens), and/or CamelCase to delimit words, and avoid special characters, as different computer systems handle them differently
  • Where appropriate you may also wish to include researcher/author initials or location information in the file name
  • Keep track of versions either by updating the date and time in the file name or by using a numbering system such as v01 or v01-01 ... v01-03 ... v03-02 to track file versions within different stages of the project.
    • Use leading zeros so that a computer sorts the versions in the correct order
  • Try to keep file hierarchies shallow
    • no more than 4 levels deep
    • try to limit the number of files to around 10 files per folder

Examples

DO: SSHRC_Proposal_2022-04-01_v02.docx

DON'T: finaldraft1 or finalfinaldraft3
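The guidelines above can be sketched in a few lines of Python. The helpers `build_file_name` and `has_safe_characters` are hypothetical names introduced for this example, and the underscore-separated layout is just one convention a team might agree on. The character check only looks at which characters are used; it cannot judge whether a name is descriptive.

```python
import re
from datetime import date


def build_file_name(project, content, when, version, extension):
    """Join identifier, key term, ISO date, and a zero-padded version number."""
    return f"{project}_{content}_{when.isoformat()}_v{version:02d}.{extension}"


# Letters and digits separated by single underscores or hyphens,
# followed by one file extension. This pattern is an assumed convention.
NAME_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*\.[A-Za-z0-9]+")


def has_safe_characters(name):
    """Return True if the name avoids spaces and special characters."""
    return NAME_PATTERN.fullmatch(name) is not None


name = build_file_name("SSHRC", "Proposal", date(2022, 4, 1), 2, "docx")
print(name)                                      # SSHRC_Proposal_2022-04-01_v02.docx
print(has_safe_characters(name))                 # True
print(has_safe_characters("final draft!.docx"))  # False

# Plain text sorting puts v10 before v2 unless versions are zero-padded.
print(sorted(["v10", "v2", "v1"]))     # ['v1', 'v10', 'v2']
print(sorted(["v10", "v02", "v01"]))   # ['v01', 'v02', 'v10']
```

The last two lines show why leading zeros matter: file browsers usually sort names as text, so unpadded version numbers end up out of order once they reach two digits.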


