FIRLA: a Fast Incremental Record Linkage Algorithm

https://doi.org/10.1016/j.jbi.2022.104094
Under an Elsevier user license
open archive

Highlights

  • Existing record linkage algorithms take a very long time on large datasets.

  • Incremental linkage efficiently employs previous linkage output to cut down on the linking time.

  • Our novel deterministic linkage method (FIRLA) leverages innovative techniques that enable an average speed-up of 2.4× (up to 4×) for standard linkage.

  • The speed-up does not compromise linkage accuracy.

  • Moreover, FIRLA can incrementally link records in just 33% of the time required for linking them from scratch.

Abstract

Record linkage is an important problem studied widely in many domains, including biomedical informatics. A standard version of this problem is to cluster records from several datasets such that each cluster contains records pertaining to just one individual. The datasets are typically huge, so existing record linkage algorithms take a very long time. It is thus essential to develop novel, fast algorithms for record linkage. The incremental version of this problem is to link previously clustered records with new records added to the input datasets.
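
To make the standard problem concrete, the following is a minimal Python sketch of a naive baseline, not the FIRLA algorithm itself. It assumes that records are tuples of string attributes, that the distance between two records is the sum of per-attribute edit distances, and that two records are linked when that distance is at most a threshold. The function names, the sample records, and the threshold value are all hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def record_distance(r1, r2) -> int:
    """Sum of per-attribute edit distances (an illustrative assumption)."""
    return sum(edit_distance(x, y) for x, y in zip(r1, r2))


def link_records(records, threshold):
    """Single-linkage clustering over all pairs, using union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # The quadratic all-pairs loop is what makes the naive approach slow.
    for i, j in combinations(range(len(records)), 2):
        if record_distance(records[i], records[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical records: (first name, last name, date of birth).
records = [
    ("john", "smith", "1980-01-02"),
    ("jon", "smith", "1980-01-02"),
    ("mary", "jones", "1975-06-30"),
    ("marie", "jones", "1975-06-30"),
    ("alan", "turing", "1912-06-23"),
]
clusters = link_records(records, threshold=2)
print(clusters)  # clusters of record indices
```

Every pair of records is compared here, so the work grows quadratically with the number of records, which illustrates why restricting comparisons matters on large datasets.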

We present a novel algorithm that efficiently performs standard and incremental record linkage. The algorithm leverages a set of efficient techniques that significantly restrict the number of record pair comparisons and distance computations. It shows an average speed-up of 2.4× (up to 4×) for the standard linkage problem compared with the state-of-the-art, without any drop in linkage performance. On average, our algorithm can incrementally link records in just 33% of the time required for linking them from scratch.
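
The two ideas mentioned here, reusing previous linkage output and avoiding distance computations, can be illustrated with a small Python sketch. It is not the authors' method: the length-difference filter is a generic lower bound on edit distance that stands in for the paper's unspecified techniques, and the single-linkage rule, the function names, the sample records, and the threshold are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_match(r1, r2, threshold, stats):
    """True if the summed per-attribute edit distance is within threshold."""
    stats["pairs"] += 1
    # Length differences lower-bound the edit distance, so a pair whose
    # bound already exceeds the threshold needs no dynamic program.
    if sum(abs(len(x) - len(y)) for x, y in zip(r1, r2)) > threshold:
        stats["pruned"] += 1
        return False
    return sum(edit_distance(x, y) for x, y in zip(r1, r2)) <= threshold


def link_incrementally(clusters, new_records, threshold, stats):
    """Attach each new record to every existing cluster it matches,
    merging those clusters; old-versus-old pairs are never revisited."""
    clusters = [list(c) for c in clusters]  # leave the input untouched
    for rec in new_records:
        hits = {k for k, c in enumerate(clusters)
                if any(is_match(rec, old, threshold, stats) for old in c)}
        merged = [rec]
        for k in sorted(hits):
            merged.extend(clusters[k])
        clusters = [c for k, c in enumerate(clusters) if k not in hits]
        clusters.append(merged)
    return clusters


# Hypothetical output of an earlier linkage run over five records.
previous = [
    [("john", "smith", "1980-01-02"), ("jon", "smith", "1980-01-02")],
    [("mary", "jones", "1975-06-30"), ("marie", "jones", "1975-06-30")],
    [("alan", "turing", "1912-06-23")],
]
# Hypothetical records added later.
new_records = [
    ("jhon", "smith", "1980-01-02"),
    ("christopher", "wren", "1632-10-20"),
]
stats = {"pairs": 0, "pruned": 0}
updated = link_incrementally(previous, new_records, threshold=2, stats=stats)
print(len(updated), stats)
```

In this toy run, the incremental pass examines 10 record pairs and skips the edit-distance computation for 6 of them, whereas relinking all seven records from scratch with an all-pairs loop would examine 21 pairs. These counts describe only this example and say nothing about the measurements reported for the algorithm.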

Our algorithms achieve comparable or superior linkage performance and outperform the state-of-the-art in terms of linking time in all cases where the number of comparison attributes is greater than two. In practice, more than two comparison attributes are quite common. The proposed algorithm is very efficient and could be used in practice for record linkage applications, especially when records are added over time and the linkage output needs to be updated frequently.

Keywords

Record linkage
Data linkage
Edit distance
Electronic health records
Deterministic linkage

C++ source code is available upon request for non-commercial purposes only.