DOAJ Open Access 2020

Data Linkage of Hashed Data: Derive and Conquer

Josie Plachta Charlie Tomlin Rachel Shipsey

Abstract

Introduction

Linking hashed datasets is much more difficult than linking in-the-clear data. Hashing prevents the use of matching tools that cope with messy data, such as 'contained-within' functions and edit-distance metrics. Hashing of sensitive data received from third parties is becoming more common because of increased data security concerns. Institutions need to be ready to link hashed data with high accuracy; otherwise the quality of outputs from these linked datasets will suffer.

Objectives and Approach

We designed an innovative matching method, Derive and Conquer (D&C). Instead of matching on the full variable, we match on derived variables that contain substrings or patterns of it (e.g. the Soundex code or the first 4 characters of a string). However, using many combinations of these derived variables would require thousands of traditional match keys to be programmed, run, and reviewed. Instead, D&C runs match keys on a derived agreement variable, which amalgamates the information stored in multiple derived variables into one value and so reduces the number of match keys to a manageable amount. D&C runs on distributed computing systems using PySpark to link datasets containing millions of records in a timely manner.

Results

D&C was developed using in-the-clear UK Census and health records, with results comparable to the in-the-clear gold standard. It is currently being tested on hashed data to link UK tax and benefits data to UK health records. 66.4 million records were declared matched, a realistic match rate for the UK population. Research into the linkage quality is ongoing to produce estimates of the bias in the linkage and of its precision and recall. We will be excited to present these results at the Conference in October. These results will be used to improve D&C.

Conclusion / Implications

Using these derived variables, we have been able to overcome the challenge of matching massive hashed datasets with a realistic match rate and in a realistic time frame.
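The idea described under Objectives and Approach can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hash function, the exact derived variables, or how the agreement variable is encoded. The sketch assumes a salted SHA-256 hash, a simplified Soundex, and one possible reading of the agreement variable as a string with one character per derived variable. The field names, the salt, and the accepted patterns in `ACCEPTED` are all hypothetical, and plain Python stands in for PySpark.

```python
import hashlib

# Soundex digit for each consonant (vowels, H, W and Y have no digit).
_CODES = {ch: digit
          for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                 ("L", "4"), ("MN", "5"), ("R", "6"))
          for ch in letters}


def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    out = name[0]
    prev = _CODES.get(name[0], "")
    for ch in name[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            out += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (out + "000")[:4]


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Assumed hashing scheme: salted SHA-256 (not stated in the abstract)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical derived variables, in a fixed order.
FIELDS = ("surname_full", "surname_soundex", "surname_first4", "dob")


def derive_hashed(record: dict) -> dict:
    """Derive the variables in the clear, then hash each one separately."""
    surname = record["surname"].strip().upper()
    derived = {
        "surname_full": surname,
        "surname_soundex": soundex(surname),
        "surname_first4": surname[:4],
        "dob": record["dob"],
    }
    return {field: hash_value(value) for field, value in derived.items()}


def agreement(a: dict, b: dict) -> str:
    """Amalgamate per-field agreement into one value, e.g. '0101'."""
    return "".join("1" if a[f] == b[f] else "0" for f in FIELDS)


# Hypothetical rule set: agreement patterns accepted as matches.
ACCEPTED = {"1111", "0111", "0101"}


def is_match(a: dict, b: dict) -> bool:
    return agreement(a, b) in ACCEPTED


left = derive_hashed({"surname": "Smith", "dob": "1980-01-02"})
right = derive_hashed({"surname": "Smyth", "dob": "1980-01-02"})
print(agreement(left, right))  # 0101: Soundex and date of birth agree
print(is_match(left, right))   # True
```

Because each derived variable is hashed separately, equality of hashes still reveals agreement on the Soundex code even when the full surname differs. Under this reading, a handful of rules over a single agreement value replaces one match key per combination of derived variables.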

Authors (3)

Josie Plachta

Charlie Tomlin

Rachel Shipsey

Citation Format

Plachta, J., Tomlin, C., Shipsey, R. (2020). Data Linkage of Hashed Data: Derive and Conquer. https://doi.org/10.23889/ijpds.v5i5.1447

Journal Information
Publication Year: 2020
Source Database: DOAJ
DOI: 10.23889/ijpds.v5i5.1447
Access: Open Access