Data Organisation: Best Practice

Author: Dorothea Seiler Vellame (GitHub: dorotheavellame)

Reviewer: Christopher Tibbs (GitHub: ctibbs, ORCID: 0000-0002-3651-6573).

License: Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).

Course Objectives#

Understand why good data organisation is important: data organisation can help us manage projects, especially where the number of files is high and the project may be collaborative.
Learn what best practice is: clear file names and structure are the main contributors to good data organisation.
Differentiate between better and worse practice: test your knowledge with a quiz.

Why is good data organisation important?#

Data is the cornerstone of research. At the start of a project, data organisation can seem simple enough: you know what your data is called and where it is. Over time, however, things can snowball. New data is added, and new scripts are saved with Final and FinalFINAL in the name.

Here are a few reasons why good data organisation can help your research:

It makes data sharing, now often a requirement from funders, easier.
It can protect against data loss, destruction, or corruption.
It enables compliance with ethical codes, data protection laws, journal requirements, and funder/institutional policy.
It can minimise future work and the confusion of having to work out what’s what (“Your primary collaborator is you in 6 months, and your past self doesn’t answer emails”).

Now that we know that research productivity can be improved by good practice, how do we implement it?

What is best practice in data organisation?#

There are three main components to best practice when it comes to data organisation: file names, folder structure, and READMEs.

File naming conventions#

Here are the general standards to follow for naming files:

Make file names brief and descriptive.
If the date or version is relevant, begin the file name with the date, formatted YYYYMMDD or YYYY_MM_DD. This helps with version control and searching.
File names should not contain spaces; use underscores (_) instead.
File names should not contain special characters as they may break some systems.
Order elements from broad to specific.
Use meaningful abbreviations only.
Use the correct file extensions.

Whichever naming conventions are chosen should be used consistently.
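
The conventions above can also be checked automatically. The following is a minimal sketch in Python, not a definitive tool: the function name `check_file_name` and the exact rules encoded in the patterns are assumptions made for illustration (for example, it also accepts a year and month prefix such as 2025_08), and a real project may want stricter or looser rules.

```python
import re

# Assumed rule: a date prefix is YYYYMMDD, YYYY_MM_DD, or YYYY_MM,
# followed by an underscore.
DATE_PREFIX = re.compile(r"^(\d{8}|\d{4}_\d{2}(_\d{2})?)_")

# Assumed rule: anything other than letters, digits, underscores,
# dots and hyphens counts as a special character.
SPECIAL = re.compile(r"[^A-Za-z0-9_.-]")


def check_file_name(name, require_date=False):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    # Spaces are reported separately, so ignore them here.
    if SPECIAL.search(name.replace(" ", "")):
        problems.append("contains special characters")
    if "." not in name.strip("."):
        problems.append("has no file extension")
    if require_date and not DATE_PREFIX.match(name):
        problems.append("does not start with a date")
    return problems


for candidate in ["2025/08 rainfall at Torbay.csv", "2025_08_rainfall_Torbay.csv"]:
    print(candidate, "->", check_file_name(candidate) or "OK")
```

A check like this could be run over a whole folder before sharing data, which helps keep the chosen conventions consistent.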

Example#

You want the name of a file to be descriptive, starting with when the file was made (if relevant) and then going from broad to specific names.

Say you have incoming data files for a project, and it will be going in the folder for the project. Five files are part of a monthly installment of data which measures the rainfall at 5 different sites of interest. How can we make the name informative?

There are several things to consider:

Is date important?

Here, because we know there will be monthly installments of data, we want the date to be included in each filename. This could be done with month names, but files are listed alphabetically, so month names would not appear in time order. A numeric prefix with the year followed by the month keeps the files in time order.
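
The effect of the prefix on file order can be seen in a short Python sketch. The file names here are hypothetical and only serve to illustrate alphabetical sorting:

```python
# Hypothetical file names using month names
month_names = [
    "July_rainfall.csv",
    "August_rainfall.csv",
    "September_rainfall.csv",
    "October_rainfall.csv",
]

# The same hypothetical files using a numeric year and month prefix
numeric_prefix = [
    "2025_07_rainfall.csv",
    "2025_08_rainfall.csv",
    "2025_09_rainfall.csv",
    "2025_10_rainfall.csv",
]

# Alphabetical sorting, as a file browser would typically list them
sorted_names = sorted(month_names)
sorted_numeric = sorted(numeric_prefix)

print(sorted_names)    # August, July, October, September: not in time order
print(sorted_numeric)  # 07, 08, 09, 10: in time order
```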

What detail distinguishes this data from other data that may end up in the same folder?

If the data already sits in the project folder, then the project name would not be necessary, as all data will be relevant to the project. However, here, details such as what is collected could be useful: rainfall and the site of data collection.

This could result in a name: 2025/08 rainfall at Torbay.csv

Think format

The name above contains spaces, a slash, and a redundant ‘at’. When working with code, file names containing spaces or special characters can be awkward to read in, because those characters can have other meanings (a slash, for example, is a path separator on many systems). It is therefore better practice to use underscores instead:

2025_08_rainfall_Torbay.csv

It is now clear what the file contains without having to open it, as was the aim.
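
If many files need naming in the same way, the name can be assembled in code so that the convention is applied consistently. This is a minimal Python sketch under stated assumptions: the helper `build_file_name` and its parameters are hypothetical, and it uses the year and month format from the example above.

```python
from datetime import date


def build_file_name(when, measurement, site, extension="csv"):
    """Join date, measurement and site with underscores, broad to specific."""
    parts = [when.strftime("%Y_%m"), measurement, site]
    # Replace any spaces inside the individual parts with underscores
    cleaned = [part.strip().replace(" ", "_") for part in parts]
    return "_".join(cleaned) + "." + extension


name = build_file_name(date(2025, 8, 1), "rainfall", "Torbay")
print(name)  # 2025_08_rainfall_Torbay.csv
```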

Folder structure#

An organised folder structure can make finding files easier. Folder names should follow the same conventions as file names. Structure will be slightly dependent on the research in question but should aim to build on the following:

Research_Project_Name/
│
├── README.md (a README describing the research project and what is contained)
│
├── data/
│   ├── README_data.md (a README describing the data in each folder)
│   │
│   ├── raw/
│   │   └── <raw data files>
│   │
│   ├── processed/
│   │   └── <processed data files>
│   │
│   └── metadata/
│       └── <metadata files>
│
├── scripts/
│   └── <scripts to carry out any data analysis>
│
└── results/
    └── <results figures or tables> (could be split into file type folders)

Subfolders within a research project are broadly split by type (e.g. a data file or a script), and by use (e.g. raw vs processed data). Avoid having too many layers in your folder hierarchy, as this makes it more difficult to navigate.
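
A structure like the one above can be created by hand, or with a short script so that every new project starts from the same layout. The following Python sketch is one possible way to do this. The function name `create_project_skeleton` is an assumption for illustration, and the folder names simply mirror the tree shown above.

```python
from pathlib import Path


def create_project_skeleton(root):
    """Create the folders and empty README files for a new project."""
    root = Path(root)
    folders = ["data/raw", "data/processed", "data/metadata", "scripts", "results"]
    for folder in folders:
        # parents=True also creates the project folder and data/ as needed
        (root / folder).mkdir(parents=True, exist_ok=True)
    for readme in ["README.md", "data/README_data.md"]:
        # touch() creates an empty file and leaves an existing one unchanged
        (root / readme).touch(exist_ok=True)
    return root
```

Calling `create_project_skeleton("Research_Project_Name")` would create the layout in the current working directory. Existing folders and files are left in place, so the function can be rerun safely.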

README files#

In addition to good folder and file name organisation, README files can help. They are often plain text files that describe the contents of a folder. The project-level README could include a description of the project, whereas the data README could describe the files in more detail than is possible in a file name, such as where the raw data was generated.

README files should be formatted in a consistent way:

Use plain text to write them.
Format the files in an understandable way.
Include important information about each file.

Optional: you could include a README changelog to track how your files have changed.
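
To keep README files consistent, a first draft can be generated from the contents of a folder and then filled in by hand. This is a minimal Python sketch, assuming a hypothetical helper `draft_data_readme`. The layout of the draft and the placeholder text in angle brackets are illustrative choices, not a required format.

```python
from pathlib import Path


def draft_data_readme(folder, description="<describe where and how this data was generated>"):
    """Return plain text for a README that lists every file in a folder."""
    folder = Path(folder)
    lines = [f"README for {folder.name}", "", description, "", "Files:"]
    for path in sorted(folder.iterdir()):
        # Skip subfolders and any existing README files
        if path.is_file() and not path.name.startswith("README"):
            lines.append(f"- {path.name}: <describe contents, units, and source>")
    return "\n".join(lines) + "\n"
```

The function only returns the text, so it can be reviewed and edited before it is saved, for example as README_data.md in the data folder.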

Which of these filenames does not follow best practice?

Which of these could be included in a data README.txt?