FAIR (meta)data#
Author: Christopher Tibbs (GitHub: ctibbs, ORCID: 0000-0002-3651-6573).
Reviewer: Dorothea Seiler Vellame (GitHub: dorotheavellame).
Licence: Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
Course objectives#
This short course will help you to:
Understand the importance of metadata.
Become familiar with the FAIR Principles.
Recognise that FAIR stands for Findable, Accessible, Interoperable, and Reusable.
Understand why it is important to make data FAIR.
Determine how to make your own data FAIR.
Recognise the distinction between FAIR data and open data.
(Meta)data#
What do we mean by data?#
When you think of the term research data, what comes to mind? Many people, when they think of research data, simply think of tables or databases full of numerical values (i.e., numerical data), but there is more to research data than that. Research data can be numerical data such as databases or spreadsheets, but it can also be questionnaires, transcripts, surveys, images, audio or video recordings, simulations, archival records, and much more.
The Concordat on Open Research Data (July 2016) defines research data as: ‘Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of its form (eg, print, digital, or physical). These might be quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, modelling, interview or other methods, or information derived from existing evidence. Data may be raw or primary (e.g., direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g., cleaned-up or as an extract from a larger data set), or derived from existing sources where the rights may be held by others. Data may be defined as ‘relational’ or ‘functional’ components of research, thus signalling that their identification and value lies in whether and how researchers use them as evidence for claims’.
So throughout this course, when you read the term data, think of types of data that you use in your own research.
What do we mean by metadata?#
Data on their own are meaningless. Data require contextual information to be useful, and this is where metadata come in. Metadata are “data about data” i.e., data that describe other data.
Extensive metadata that describe the research data are as important as the research data. For example, look at the data provided below. The data are numerical data formatted into seven columns and five rows. Can you tell what the data represent?
| 2025 | 1 | 7.4 | 1.7 | 12 | 38.8 | 67.8* |
| 2025 | 2 | 9.0 | 3.9 | 2 | 40.2 | 68.0* |
| 2025 | 3 | 12.5 | 3.9 | 5 | 17.8 | 119.0* |
| 2025 | 4 | 15.8 | 5.5 | 1 | 56.6 | 216.7* |
| 2025 | 5 | 18.8 | 7.3 | 0 | 36.2 | 280.1* |
Caption: an example of data without any metadata highlighting that data on their own are meaningless.
Looking at the first column, you could guess that this is the year, but what about the other columns? What does the * in the final column mean?
On their own, the data are meaningless and unusable as there is no context provided. It isn’t clear what each column represents, or what any potential units of measurement are, or what the * in the final column means.
Now look at the second set of data provided. There is contextual information provided, which makes it clear that these are weather data. There is information on the location at which the data were collected, and each column now has a heading, with appropriate units, and the column headings are also fully described below the data. How missing data and estimated data are handled is also clearly explained, which helps to clarify what the * in the final column means.
Armagh
Location: 287800E 345800N (Irish Grid), Lat 54.352 Lon -6.649, 62 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by —.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
| yyyy | mm | tmax (degC) | tmin (degC) | af (days) | rain (mm) | sun (hours) |
|---|---|---|---|---|---|---|
| 2025 | 1 | 7.4 | 1.7 | 12 | 38.8 | 67.8* |
| 2025 | 2 | 9.0 | 3.9 | 2 | 40.2 | 68.0* |
| 2025 | 3 | 12.5 | 3.9 | 5 | 17.8 | 119.0* |
| 2025 | 4 | 15.8 | 5.5 | 1 | 56.6 | 216.7* |
| 2025 | 5 | 18.8 | 7.3 | 0 | 36.2 | 280.1* |
Where the data consist of:
year (yyyy)
month (mm)
mean daily maximum temperature (tmax)
mean daily minimum temperature (tmin)
days of air frost (af)
total rainfall (rain)
total sunshine duration (sun)
Caption: The same dataset as above but now accompanied by metadata, highlighting the importance of metadata in making data understandable and usable. Data extract taken from the Met Office Historic station data.
With the accompanying metadata, the dataset can be easily understood and is therefore able to be used. While this is a very simple example, it does highlight the importance of good metadata accompanying data.
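The following is a minimal sketch of how the documented conventions let a program, not just a person, interpret the table. The column descriptions and markers are taken from the metadata above; the names `COLUMNS`, `Cell`, `parse_cell`, and `parse_row` are hypothetical, and accepting `---` as a second missing-data marker is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Column codes and descriptions copied from the metadata above.
COLUMNS = {
    "yyyy": "year",
    "mm": "month",
    "tmax": "mean daily maximum temperature (degC)",
    "tmin": "mean daily minimum temperature (degC)",
    "af": "days of air frost (days)",
    "rain": "total rainfall (mm)",
    "sun": "total sunshine duration (hours)",
}

# "\u2014" is the dash marker shown in the metadata above;
# "---" is an assumed alternative spelling of the same marker.
MISSING_MARKERS = {"\u2014", "---"}


@dataclass
class Cell:
    value: Optional[float]   # None when the value is missing
    estimated: bool = False  # True when the value carried a "*"


def parse_cell(text: str) -> Cell:
    """Interpret one cell using the documented markers."""
    text = text.strip()
    if text in MISSING_MARKERS:
        return Cell(value=None)
    estimated = "*" in text
    # Strip the "*" (estimated) and "#" (automatic sensor) markers.
    number = text.rstrip("*#")
    return Cell(value=float(number), estimated=estimated)


def parse_row(line: str) -> dict:
    """Map whitespace-separated cells onto the documented column codes."""
    return dict(zip(COLUMNS, map(parse_cell, line.split())))


row = parse_row("2025 1 7.4 1.7 12 38.8 67.8*")
print(COLUMNS["sun"], "->", row["sun"])
```

Without the metadata, none of the rules in `parse_cell` could have been written: the `*` would have been an unexplained character rather than a flag for estimated values.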
The FAIR Principles#
The FAIR Principles were defined in The FAIR Guiding Principles for scientific data management and stewardship (Wilkinson et al., 2016) as guidance for those wishing to enhance the reusability and reproducibility of their data.
The FAIR Principles emphasise good data management practices and are split into four components:
Findable
Accessible
Interoperable
Reusable
Each of these components applies to both data and metadata, and to both humans and machines, i.e., data and metadata should be Findable, Accessible, Interoperable, and Reusable by both humans and machines.
We will discuss each of these four components further in the sections below.
🔍 F = Findable
Findable means that the data can be found by others (both by humans and machines). Your data are not going to be reused by others if no-one can find them, or even knows that the data exist. For example, if your data are simply stored on your laptop/computer/hard-drive, then they are not findable by others: no-one knows that the data exist, let alone is able to reuse them.
Your data and metadata should be easy to find for both humans and machines, and the easiest way to make your data findable is to deposit the data and corresponding metadata into a research data repository.
Ideally this would be a reputable discipline-specific repository (e.g., depositing social and economic data into the UK Data Service), but if there are no dedicated repositories for your discipline, you can always use a multi-disciplinary repository such as Zenodo or Figshare. You can use re3data to identify a suitable repository for your data.
Repositories are indexed by search engines and they also provide a search functionality, both of which help to ensure that your data are findable by others. Repositories not only help ensure that your data are findable, but they also manage all of the long-term data curation ensuring that your data are preserved for the future.
Additionally, repositories assign a unique persistent identifier (such as a Digital Object Identifier or DOI) to your data. This identifier can be used to link to and cite your data, which helps ensure that the data can be found and cited.
A persistent identifier is not the same as a standard web address. Web addresses can change: for example, if the web address of a data repository were to change, then the old web address would no longer link to the dataset. The DOI, however, is separate from the web address, and because the record behind it can be updated to point to the new location, it would continue to link to the dataset. This is what makes it persistent.
From Wilkinson et al., 2016: To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
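As a small illustration of how a DOI is used as a link, the sketch below turns a DOI written in any of a few common forms into a resolver link under `https://doi.org/`. The function name `doi_to_url` is hypothetical, and the check that a DOI starts with `10.` is deliberately minimal; it does not validate the DOI or handle characters that would need URL encoding.

```python
def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI (illustrative helper)."""
    doi = doi.strip()
    # Accept a bare DOI, a "doi:" prefix, or an existing resolver link.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # All DOIs begin with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


# The DOI of the dataset used in the Activity at the end of this course.
print(doi_to_url("doi:10.5255/UKDA-SN-853809"))
```

The point of citing the resolver link rather than the repository's own page address is that the link stays the same even if the repository reorganises its website.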
🔐 A = Accessible
Accessible means that it is clear how others can access the data. Your data are not going to be reused by others if no-one can actually access the data. For example, if your data are simply stored on your laptop/computer/hard-drive, even if you tell others where the data are stored, the data are still not accessible to others if they do not have access to your laptop/computer/hard-drive.
Once someone finds your data, they need to know how they can access the data. Are the data publicly available, or are any authentication and authorisation procedures required?
The easiest way to make your data accessible is to deposit the data and corresponding metadata into a research data repository that uses standard communication protocols such as HTTPS. The data and metadata should also be accompanied by clear instructions on how the data can be accessed. If the data are openly available in the repository, then they will be accessible to anyone. However, if the data cannot be made publicly available (e.g., for ethical reasons), then the data may need to have access restricted, and in this case the metadata describing the data should still be publicly available, and the metadata should contain details on the procedure for requesting access to the restricted-access data. Ideally the data should be accessible to both humans and machines.
From Wilkinson et al., 2016: To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
🔗 I = Interoperable
Interoperable means that the data should be able to interoperate with existing applications and workflows. Your data are not going to be reused by others if no-one can actually integrate your data with their own workflows or existing tools.
For example, if someone has access to your data, but your data are in an obscure or obsolete format, then the data are not interoperable, and will not be able to be reused by others.
The easiest way to make your data interoperable is to ensure that your data are in a standard format that is widely adopted within your research domain. This way, others will readily be able to integrate your data into their own workflows and tools.
From Wilkinson et al., 2016: To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
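To make the idea of a standard, open format concrete, the sketch below writes two rows of the weather example from earlier in this course as CSV using only the Python standard library. The header names with unit suffixes and the separate `sun_estimated` column are illustrative choices, not a community standard; check which conventions your own research domain uses.

```python
import csv
import io

# Illustrative header: units are carried in the column names, and the
# "*" marker from the original table becomes its own boolean column.
header = ["yyyy", "mm", "tmax_degC", "tmin_degC", "af_days",
          "rain_mm", "sun_hours", "sun_estimated"]
rows = [
    [2025, 1, 7.4, 1.7, 12, 38.8, 67.8, True],
    [2025, 2, 9.0, 3.9, 2, 40.2, 68.0, True],
]


def to_csv(header: list, rows: list) -> str:
    """Serialise a table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(header, rows)
print(csv_text)

# Any CSV-aware tool can read the result back without special software.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Moving the `*` marker into its own column keeps the numeric column purely numeric, which tends to make the file easier for other tools to ingest.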
♻️ R = Reusable
Reusable means the data are easy to understand and it is clear how the data can be reused.
Your data are not going to be reused by others if no-one can actually understand or read your data. For example, if someone has access to your data, but your data are not documented or the variables or units of measurement are not defined, then the data are not reusable.
The easiest way to make your data reusable is to ensure that your data are accompanied by metadata (e.g., a title, creator, description, keywords, dates, file formats, funders etc.), are clearly documented, all variables are defined, and how the data were processed and analysed is explained (e.g., clearly detailing how missing data are handled).
It is also important to make it clear how the data can be reused by others, and the best way to do that is to license the data. A licence determines how others can use, modify, and distribute your data, and the Creative Commons licences provide a convenient method to license datasets. A licence ensures that you get credit for your work, while still allowing others to use, and build upon that work.
From Wilkinson et al., 2016: To be Reusable:
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
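The metadata fields listed above can be kept alongside the data as a small machine-readable record. The sketch below is one possible layout, not a formal metadata standard: the field names in `REQUIRED_FIELDS` mirror the examples in the text, and every value in `record` is a placeholder for a hypothetical dataset.

```python
import json

# Field names mirror the examples given in the text; this is not a formal schema.
REQUIRED_FIELDS = ["title", "creator", "description", "keywords",
                   "dates", "file_formats", "funders", "licence"]

# All values below are placeholders for a hypothetical dataset.
record = {
    "title": "Example survey responses",
    "creator": "A. Researcher",
    "description": "Anonymised responses to an example questionnaire.",
    "keywords": ["example", "survey"],
    "dates": {"collected": "2025-01", "published": "2025-06"},
    "file_formats": ["csv", "txt"],
    "funders": ["Example Funder"],
    "licence": "CC-BY-4.0",
}


def missing_fields(record: dict) -> list:
    """List the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


print("Missing fields:", missing_fields(record))
print(json.dumps(record, indent=2))
```

Repositories usually collect this information through their own deposit forms, so a record like this is most useful as a checklist to prepare before depositing.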
Why make data FAIR#
Making your data FAIR benefits the entire research community and wider society. FAIR data help to maintain research integrity and enable validation of results. By sharing data, you can raise your research profile, which can lead to more citations and to potential new research collaborations.
FAIR data not only benefit others; they also make it easier for your future self to understand your own data.
Essentially, making data FAIR helps to maximise the impact of your research.
How to make data FAIR#
It is important to remember that the FAIR Principles are guidelines and not rules - they are a continuum of steps that you can adopt to make your data more reproducible.
There are a number of easy steps you can take to make your data FAIR:
Make your data Findable:
Deposit your data into a reputable repository.
Add metadata to provide as much context as possible for the dataset e.g., use a descriptive title; add a description of the dataset, keywords, research field, and other relevant tags; add any research funders.
Ensure your dataset is assigned a persistent identifier such as a DOI.
Make your data Accessible:
Manage access to your data and if necessary clearly define a process for how access to the dataset can be requested.
Make your data Interoperable:
Adopt standard, and where possible open, file formats.
Use controlled vocabularies.
Make your data Reusable:
Document your data e.g., provide a codebook/data dictionary that explains all of the variables, measurements, and units; describe the methodology and how the data were processed.
Include links to any associated datasets or publications.
Assign a reuse licence to the dataset.
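The steps above can be tracked as a simple checklist. The sketch below paraphrases them into short labels and reports how many have been completed for each principle; the names `CHECKLIST` and `fair_progress` are hypothetical, and the count is only a rough progress indicator, not a formal FAIR assessment.

```python
# Steps paraphrased from the list above, grouped by principle.
CHECKLIST = {
    "Findable": [
        "deposited in a repository",
        "descriptive metadata added",
        "persistent identifier assigned",
    ],
    "Accessible": [
        "access conditions or request process stated",
    ],
    "Interoperable": [
        "standard (ideally open) file formats used",
        "controlled vocabularies used",
    ],
    "Reusable": [
        "data documented",
        "related datasets or publications linked",
        "reuse licence assigned",
    ],
}


def fair_progress(done: set) -> dict:
    """Return (completed, total) step counts for each principle."""
    progress = {}
    for principle, steps in CHECKLIST.items():
        completed = sum(1 for step in steps if step in done)
        progress[principle] = (completed, len(steps))
    return progress


# Example: a dataset that has been deposited, given a DOI, and licensed.
done = {
    "deposited in a repository",
    "persistent identifier assigned",
    "reuse licence assigned",
}
progress = fair_progress(done)
for principle, (completed, total) in progress.items():
    print(f"{principle}: {completed}/{total} steps")
```

Reporting counts rather than a single pass or fail reflects the idea that the FAIR Principles are a continuum: each completed step makes the data a little more FAIR.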
FAIR Data and Open Data#
The Open Data Handbook defines open data as ‘data that can be freely used, reused and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike’.
However, it is important to point out that FAIR data do not have to be open. None of the FAIR principles necessitate data being open. What they do require, however, is clarity and transparency around the conditions governing access and reuse.
Note
FAIR data ≠ Open data
Therefore, data that cannot be made fully open (e.g., sensitive data that cannot be shared publicly for ethical reasons) can still be FAIR, as long as the metadata describing the research data are publicly available and there is a clear process in place for requesting access (e.g., by signing a data sharing agreement).
If you are collecting personal data, be sure to include explicit consent for data sharing in the consent form and make it clear in the participant information sheets how the data will be managed and shared. Protect participants by collecting only the data that are necessary for the research, and anonymise the data. If the data cannot be fully anonymised, and you have consent, regulate access to the shared data. For more information on managing research ethics, contact the University’s Research Ethics and Governance team.
Summary#
FAIR data are Findable, Accessible, Interoperable, and Reusable.
Findable means that the data should be registered in a searchable repository and be assigned a persistent identifier.
Accessible means that the data should be available via a standardised protocol e.g., HTTPS, so that anyone with an internet connection can access the data. Accessible also means that even if the research data themselves are not available, the metadata describing the research data should be available.
Interoperable means that the data should be able to be seamlessly integrated with existing datasets and tools.
Reusable means that the data should be clearly documented and clearly licensed.
As open as possible, and as closed as necessary.
It is much easier to make your data FAIR if you plan for this from the beginning of your project. For example, ensuring that your data are adequately documented is much easier to do if you start documenting the data as you go from the beginning of your project, rather than having to go back and document all of the data at a later date. Incorporating the steps required to make your data FAIR throughout your project will save you time when it comes to publishing the research as the data will already be in a format/condition that is suitable for sharing.
Finally, it is important to remember that FAIR is not black and white. FAIR is a continuum, and every step you take towards making your data FAIR is a positive step towards maximising the impact of your research.
Activity#
Assess the FAIRness of this dataset: https://doi.org/10.5255/UKDA-SN-853809
Hints
Where are the data located? Is it a recognisable repository?
Are you able to access the data? If not, is there a clear process for requesting access?
Are the data available in an open format?
How well are the data documented? For example, is there an accompanying README file?
Is the dataset licensed?
Is there a persistent identifier for the dataset?
Solution
Where are the data located? Is it a recognisable repository?
Are you able to access the data? If not, is there a clear process for requesting access?
The data files are closed access; however, they are available on request by completing a User Agreement form.
Are the data available in an open format?
It is not clear what format the data are available in because they are in a zip file, but the other files are available in standard formats such as .csv, .txt, and .pdf.
How well are the data documented? For example, is there an accompanying README file?
The data are well documented and are accompanied by a number of useful documents, such as a corpus metadata file, a schema file, a glossary file, and a README file.
Is the dataset licensed?
The data are clearly licensed and the licence is applied at the file level. The two closed access data files are licensed using the UK Data Service End User Licence, while the documentation is all available under a Creative Commons Licence.
Is there a persistent identifier for the dataset?
Yes, the data have been assigned a DOI: https://doi.org/10.5255/UKDA-SN-853809
In summary, we can say that this dataset is FAIR.