Documenting Your Research
Documentation is an important part of data management. Your data is only useful to yourself and others if you have adequately described your dataset and documented your processes. This includes describing when, why and how the data was collected or generated, what the variables mean, how it was analyzed and how the final dataset was created.
Documentation is best started at the beginning of your research journey and maintained throughout the project, to ensure accuracy and thoroughness.
ReadMe Files
A readme file provides information about a dataset and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. ReadMe files are usually formatted as text files to prolong their lifespan and ensure accessibility. There are no standards for readme files, but they should include:
Find more information on ReadMe files in the Guide to writing "readme" style metadata by the Research Data Management Service Group at Cornell University.
Codebooks and Data Dictionaries
Codebooks and data dictionaries are two forms of structured documentation used to define variables. They are related in function but differ in form, focus, and approach.
Codebooks
A codebook is a document commonly included with datasets in the social and behavioral sciences, intended to help readers understand the contents and structure of those datasets. Codebooks open with front matter, such as the study title, names of the principal investigators, and an introduction to the data. They may also include methodological information, if that is not documented elsewhere. However, the main content of a codebook is detailed definitions and descriptions of the variables in the dataset.
Codebooks are commonly included with studies where lengthy questionnaires, surveys, or similar instruments are used and result in large numbers of variables, often named with opaque alphanumeric codes. For each coded variable, a codebook offers the question text, what the data values mean (e.g. 1 = good, 2 = fair, etc., also called value labels), and sometimes additional information such as summary statistics or notes and comments about that variable.
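As a rough illustration of value labels, the sketch below stores the labels for one coded variable and uses them to decode raw responses. The variable name Q12_HEALTH and the label 3 = poor are hypothetical; only 1 = good and 2 = fair come from the example above.

```python
# Value labels for one coded variable, as a codebook might list them.
# "Q12_HEALTH" and the label for 3 are hypothetical, for illustration only.
VALUE_LABELS = {
    "Q12_HEALTH": {1: "good", 2: "fair", 3: "poor"},
}


def decode(variable, value, labels=VALUE_LABELS):
    """Return the label for a coded value, or None if it is not documented."""
    return labels.get(variable, {}).get(value)


responses = [1, 2, 2, 3, 9]  # 9 has no entry in the value labels
decoded = [decode("Q12_HEALTH", value) for value in responses]
print(decoded)
```

Values that decode to None point to codes that still need an entry in the codebook.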
Data Dictionaries
Data dictionaries are, in contrast, typically in tabular/spreadsheet form. A typical data dictionary might contain columns for variable name (exactly as it appears in the dataset), a more descriptive human-readable variable name, unit of measurement, allowed values, a definition of the variable, and additional explanation, comments, or notes for each variable. Data dictionaries are not exclusively intended for quantitative empirical data, but they are more suited for that purpose than codebooks, since they foreground the units and allowed/expected values of variables.
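The columns described above can be sketched as a small CSV file. This is a minimal example, not a standard: the column headings, the two variables, and the helper names are all assumptions made for illustration.

```python
import csv
import io

# Column headings follow the description above; the exact names are an assumption.
COLUMNS = ["variable_name", "label", "unit", "allowed_values", "definition", "notes"]

# Hypothetical variables, not taken from any real dataset.
entries = [
    {"variable_name": "temp_c", "label": "Air temperature",
     "unit": "degrees Celsius", "allowed_values": "-50 to 60",
     "definition": "Air temperature measured at the site", "notes": ""},
    {"variable_name": "site_id", "label": "Site identifier",
     "unit": "", "allowed_values": "S01 to S12",
     "definition": "Code for the sampling site", "notes": "See site list"},
]


def write_data_dictionary(rows, handle):
    """Write the data dictionary rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def undocumented(dataset_columns, rows):
    """List dataset columns that have no entry in the data dictionary."""
    documented = {row["variable_name"] for row in rows}
    return [name for name in dataset_columns if name not in documented]


buffer = io.StringIO()
write_data_dictionary(entries, buffer)
print(buffer.getvalue())
print(undocumented(["temp_c", "site_id", "humidity"], entries))
```

Checking the dataset's column names against the dictionary, as the second helper does, is one way to keep the two in step while the project is under way.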
If either of these forms of documentation is suitable for your study and dataset(s), it is good practice to create and maintain it and to include it with your data when you share it. Both are crucial documentation when a research project has variables that are difficult to understand or need explanation.
Keep file names short and descriptive, and agree on consistent naming conventions with your team and follow them. Here are some general guidelines and examples:
DO: SSHRC_Proposal_2022-04-01_v02.docx
DON'T: finaldraft1 or finalfinaldraft3
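One way a team might apply such a convention is to generate and check file names with a short script. The sketch below assumes a pattern inferred from the DO example (project, description, ISO date, two-digit version, extension); the function names and the pattern itself are hypothetical, and your team's convention may differ.

```python
import re
from datetime import date

# Assumed pattern: Project_Description_YYYY-MM-DD_vNN.ext
PATTERN = re.compile(
    r"^[A-Za-z0-9]+(_[A-Za-z0-9]+)*_\d{4}-\d{2}-\d{2}_v\d{2}\.[a-z0-9]+$"
)


def build_file_name(project, description, day, version, extension):
    """Join the parts with underscores, using an ISO date and a two-digit version."""
    parts = [project, description, day.isoformat(), f"v{version:02d}"]
    return "_".join(parts) + "." + extension


def follows_convention(name):
    """Return True if the name matches the assumed pattern."""
    return PATTERN.match(name) is not None


name = build_file_name("SSHRC", "Proposal", date(2022, 4, 1), 2, "docx")
print(name)
print(follows_convention(name))
print(follows_convention("finalfinaldraft3"))
```

Writing the date as year-month-day means that files sort chronologically when listed by name.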