
Introduction to Data Linkage


Course Information


This short course is designed to give participants a practical introduction to data linkage and is aimed at both analysts intending to link data themselves and researchers who want to understand more about the linkage process and its implications for analysis of linked data—particularly the implications of linkage error. Day 1 will focus on the methods and practicalities of data linkage (including deterministic and probabilistic approaches) using worked examples. Day 2 will focus more on analysis of linked data, including concepts of linkage error, how to assess linkage quality and how to account for the resulting bias and uncertainty in analysis of linked data. Examples will be drawn predominantly from health data, but the concepts will apply to many other areas. This course includes a mixture of lectures and practical sessions that will enable participants to put theory into practice.

The course covers:

· Overview of data linkage (data linkage systems, benefits of data linkage, types of projects)

· Overview of linkage methods (deterministic and probabilistic, privacy-preserving)

· The linkage process (data preparation, blocking, classification)

· Classifying linkage designs

· Evaluating linkage quality and bias (types of error, analysis of linked data)

· Reporting analysis of linked data

· Practical sessions (no coding required; see below)

By the end of the course participants will:

· Understand the background and theory of data linkage methods

· Perform deterministic and probabilistic linkage

· Evaluate the success of data linkage

· Appropriately report analysis based on linked data

The course is aimed at analysts and researchers who need to gain an understanding of data linkage techniques and of how to analyse linked data. It provides an introduction to data linkage theory and methods for those who might be implementing data linkage or using linked data in their own work. Participants may be academic researchers in the social and health sciences, or may work in government, survey agencies, official statistics, charities or the private sector. The course does not assume any prior knowledge of data linkage. Some experience of using Excel or other software will be useful for the practical sessions.

Preparatory Reading

Recommended (not required):

· Doidge JC, Christen P and Harron K (2020). Quality assessment in data linkage. In: Joined up data in government: the future of data linking methods. https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/quality-assessment-in-data-linkage

· Harron K, Doidge JC & Goldstein H (2020) Assessing data linkage quality in cohort studies, Annals of Human Biology, 47:2, 218-226, DOI: 10.1080/03014460.2020.1742379

· Harron KL, Doidge JC, Knight HE, et al. A guide to evaluating linkage quality for the analysis of linked data. Int J Epidemiol. 2017;46(5):1699–1710. doi:10.1093/ije/dyx177

· Doidge JC, Harron K (2019). Reflections on modern methods: Linkage error bias. International Journal of Epidemiology. 48(6):2050-60. https://doi.org/10.1093/ije/dyz203

· Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. Int J Epidemiol. 2016;45(3):954–964. doi:10.1093/ije/dyv322

· Doidge JC, Harron K. Demystifying probabilistic linkage: Common myths and misconceptions. Int J Popul Data Sci. 2018;3(1):410. doi:10.23889/ijpds.v3i1.410


Day 1

· Overview

· Deterministic linkage algorithms

· Linkage error

· Probabilistic linkage theory and practical demonstration

· Practical considerations (including variable selection, handling missing data and managing processing


· Overview of advanced topics including privacy preservation, string comparators and linkage of multiple files
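
For readers who want a feel for the difference between the deterministic and probabilistic approaches listed above, the following is a minimal Python sketch and is not taken from the course materials. The field names, the example records and the m- and u-probabilities are all hypothetical values chosen for illustration. The match weight follows the usual log-ratio form of probabilistic linkage, simplified to exact agreement or disagreement on each field.

```python
import math

# Hypothetical m- and u-probabilities (illustrative values only):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode": (0.80, 0.02),
}


def deterministic_match(a, b):
    """Link only if every field is present and agrees exactly."""
    return all(a[f] is not None and a[f] == b[f] for f in FIELD_PROBS)


def match_weight(a, b):
    """Sum log2 agreement/disagreement weights over the fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if a[field] is None or b[field] is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            total += math.log2(m / u)  # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


# Two hypothetical records for the same person who has moved house.
rec_a = {"surname": "Smith", "date_of_birth": "1980-02-14", "postcode": "AB1 2CD"}
rec_b = {"surname": "Smith", "date_of_birth": "1980-02-14", "postcode": "EF3 4GH"}

print("Deterministic link:", deterministic_match(rec_a, rec_b))
print("Match weight:", round(match_weight(rec_a, rec_b), 2))
```

In this sketch the exact-match rule rejects the pair because the postcode differs, while the match weight stays positive because agreement on surname and date of birth outweighs the single disagreement. In practice a threshold on the weight decides which pairs are classified as links, and that choice trades off missed matches against false matches.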

Day 2

· Recap: Common myths and misconceptions about probabilistic linkage

· Linkage error bias

· Linkage quality assessment

· Handling linkage error in analysis

· Reporting studies of linked data

· Software demonstration: Splink – open-source toolkit for probabilistic record linkage and deduplication at scale

Course Code


Course Leader

Professor Katie Harron and Dr James Doidge
Start: 15/03/2023
End: 16/03/2023
Places Left: 0
Course Fee: [Read More]
