Lesson plan 10: Finding and reusing data

Reusing data and analysing secondary data not only saves researchers time and energy; it can also fast-track scientific discoveries through shared resources and perspectives, in line with the FAIR principles.

The FAIR elements that this lesson plan deals with are F (Findable), A (Accessible) and R (Reusable). As stated in 'The FAIR Guiding Principles for scientific data management and stewardship' (21), the ultimate goal of FAIR is to optimise the reuse of data. In order to be reusable, data should correspond, on a general level, to all of the FAIR principles, and in particular to the R ones:

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (Meta)data are released with a clear and accessible data usage license

R1.2. (Meta)data are associated with detailed provenance

R1.3. (Meta)data meet domain-relevant community standards
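
As an illustration of how the R sub-principles listed above can guide data selection, the following minimal Python sketch screens a metadata record for a licence, provenance information and a named community standard. The field names (license, provenance, metadata_standard) and the example record are hypothetical; real repositories use their own metadata schemas, so the mapping would need to be adapted.

```python
# Hypothetical mapping from R sub-principles to metadata field names.
# Real repositories use their own schemas; adapt the field names as needed.
REUSE_CHECKS = {
    "R1.1": "license",            # clear and accessible data usage licence
    "R1.2": "provenance",         # detailed provenance
    "R1.3": "metadata_standard",  # domain-relevant community standard
}


def missing_reuse_fields(record):
    """Return the R sub-principles whose field is absent or empty in record."""
    return [
        principle
        for principle, field in REUSE_CHECKS.items()
        if not record.get(field)
    ]


# Invented example record, for illustration only.
example_record = {
    "title": "Example survey dataset",
    "license": "CC BY 4.0",
    "provenance": "",  # present but empty, so it counts as missing
}

print(missing_reuse_fields(example_record))
```

A check like this only shows whether the information is present; whether the licence actually permits the intended reuse still has to be judged by reading it.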

Primary audience(s): Master's and PhD degree students, researchers

Learning outcomes:

  • Can explain the importance of data discovery and reuse

  • Can recognise the concept of 'secondary data' vs. collecting primary data

  • Can discover published datasets in their discipline

  • Can cite data

  • Can develop a strategy to search for data

  • Can articulate the criteria for data selection

  • Can recognise the provenance of data they intend to use

  • Can recognise the importance of the terms and conditions of data reuse

  • Can recognise the importance of data citation when reusing data

Summary of tasks/actions:

  • Speaking of 'good scientific practice': Why is it important to use secondary data rather than collect primary data?

  • Identify a strategy to find data appropriate for a specific research project.

  • Identify a 'trustworthy' data repository: find relevant data in certified repositories, check measures taken by repositories to ensure that data are reusable. What are the criteria that 'trustworthy' data should meet?

  • Look at some examples of datasets and how they express terms for reuse.

  • Look at data citation models: work through a case study that cites multiple datasets from various repositories that provide different data citation models.

  • Sensitise learners to further share new knowledge and new data created during this data reuse process.

Materials/Equipment

  • Computer

  • Internet

Resources:

Why are research data managed and reused?

An interesting point on good scientific practice is made in this blog post from the Finnish Social Science Data Archive, which also briefly describes the benefits of data reuse:

"Reusing data is economic and saves resources. If suitable data are readily available, there is less need to spend time and money to collect new material. Data from large surveys often include material that has not been analysed in the original research. Data reuse helps to avoid duplication of data collection. It can also minimise collection on the hard-to-reach or the vulnerable.

Valuable research data are of no use to the scientific community and future research if original data creators are the only persons to have any information on the data. If they relocate to other organisations or to other tasks, or retire, all information will disappear." (https://www.fsd.tuni.fi/en/services/data-management-guidelines/why-are-research-data-managed-and-reused/)

Time Efficiency Gain:

Pronk, T.E., 2019. The Time Efficiency Gain in Sharing and Reuse of Research Data. Data Science Journal, 18(1), p.10. DOI: http://doi.org/10.5334/dsj-2019-010

The author uses a "mathematical model [...] to calculate the break-even point for time spent sharing in a scientific community, versus time gain by reuse" for a number of scenarios.

"The results indicate that sharing research data can indeed cause an efficiency revenue for the scientific community. However, this is not a given in all modeled scenarios. The scientific community with the lowest reuse needed to reach a break-even point is one that has few sharing researchers and low time investments for sharing and reuse. This suggests it would be beneficial to have a critical selection of datasets that are worth the effort to prepare for reuse in other scientific studies. In addition, stimulating reuse of datasets in itself would be beneficial to increase efficiency in scientific communities." (Pronk 2019)

Review shared research data:

The discovery section of the Data Management Expert Guide from CESSDA (Consortium of European Social Science Data Archives): https://www.cessda.eu/Training/Training-Resources/Library/Data-Management-Expert-Guide/7.-Discover

It includes steps to take during the discovery process and a curated list of different types of social science data sources.

Finding and citing data:

Ball, A., & Duke, M. (2015). 'How to Cite Datasets and Link to Publications'. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
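
For the data citation exercise, a citation can be assembled from a dataset's metadata. The sketch below is a minimal Python illustration that loosely follows a commonly used element order (creators, publication year, title, optional version, publisher, persistent identifier). The function name and all example values, including the DOI, are invented for illustration; individual repositories and journals may prescribe a different citation model.

```python
def format_data_citation(creators, year, title, publisher, identifier, version=None):
    """Build a dataset citation string from basic metadata elements.

    Element order: creators (year). title. [Version x.] publisher. identifier
    This is one possible model, not a standard every repository follows.
    """
    names = "; ".join(creators)
    parts = [f"{names} ({year})", title]
    if version:
        parts.append(f"Version {version}")
    parts.extend([publisher, identifier])
    return ". ".join(parts)


# All values below are hypothetical and serve only as an example.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2020,
    title="Example survey dataset",
    publisher="Example Data Archive",
    identifier="https://doi.org/10.1234/example",
    version="1.0",
)
print(citation)
```

Comparing the output with the citation text suggested on a dataset's landing page is a quick way to see how citation models differ between repositories.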

Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2020). Lost or Found? Discovering Data Needed for Research. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.e38165eb

This study presents evidence from the largest known survey investigating how researchers discover and use data that they do not create themselves.

Surrey Repro Society - Finding and using secondary data (workshop slides) https://osf.io/4yhtg/

List of resources and data repositories for finding secondary data.

An up-to-date list of available registered data repositories can be found at https://www.re3data.org/ and at FAIRsharing.

Still, finding a trustworthy data repository that suits your research needs can be a challenge. A possible solution is to look for certified repositories, be it a core certification or a more formal one. For example, a core certification involves a minimally intensive process whereby data repositories supply evidence that they are sustainable and trustworthy. Alternatively, look for repositories that have been recommended by your community and/or a research infrastructure in your discipline, such as ELIXIR for the Life Sciences.

The CoreTrustSeal certified repositories: https://www.coretrustseal.org/why-certification/certified-repositories/

You could also look for the data catalogue of institutions, such as the data catalogue (https://datacatalogue.cessda.eu/) of the Consortium of European Social Science Data Archives (CESSDA), with guidelines on discovering data (https://www.cessda.eu/Training/Training-Resources/Library/Data-Management-Expert-Guide/7.-Discover/Data-repositories-as-data-resources).

In general, repositories that have reusability and metadata assessment tools, such as Kaggle (https://www.kaggle.com/datasets) and KNB (https://knb.ecoinformatics.org/), are a valuable resource for data reuse.

List of data and metadata standards

Across the research disciplines there are thousands of standards that act as pillars for data reuse. FAIRsharing maps the landscape of community-developed standards, while defining the indicators necessary to monitor their development, evolution and integration, implementation and use in databases, and adoption in data policies by funders, journals and other organisations.

Take-home tasks

  • Exercise on finding "trustworthy" data on a given topic during the class.

  • Use the data found in the exercise above as an example to practise data citation.

  • Find standards relevant to your domain and discipline.

References:


(21) Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).

