Summarizing and linking electronic health records

Distributed and Parallel Databases

Abstract

In recent years, several applications have emerged that require access to consolidated information, which has to be computed and provided in near real-time. Traditional record linkage algorithms cannot support such time-critical applications, because they perform the linkage offline and provide the result set only when the entire process has completed. To address this need, we propose the first summarization algorithms that operate in the blocking and matching steps of record linkage to speed up online linkage tasks. Our first method, SkipBloom, efficiently summarizes the participating data sets, using their blocking keys, to allow for very fast comparisons among them. The second method, BlockSketch, summarizes a block to achieve a constant number of comparisons for a submitted query record during the matching phase. Our third method, SBlockSketch, operates on data streams, where the entire data set is not known a priori and there is instead a potentially unbounded stream of incoming data records. Finally, we introduce PBlockSketch, which adapts BlockSketch to privacy-preserving settings. Through extensive experimental evaluation on real-world data sets, we show that our methods outperform the state-of-the-art algorithms for online record linkage in terms of the time needed, the memory used, and the recall and precision rates achieved during the linkage process. Following the evaluation of our approaches, we introduce SFEMRL, a novel framework that uses them to enable the linkage of electronic health records at large scale while respecting patients’ privacy. Under this framework, patient records first undergo a data masking process that perturbs sensitive information in the records' data fields to protect it. Subsequently, they participate in a parallel and distributed ecosystem, whose goal is to persist these records so that they can be queried efficiently and accurately. We demonstrate that the integration of our framework with Map/Reduce offers robust distributed solutions for performing on-demand, large-scale privacy-preserving record linkage tasks in the health domain.

Notes

  1. As long as tails come up, we add the item to each successive list. We terminate this process when we encounter heads.

  2. A concrete class implements the inherited methods of an abstract class and/or interfaces in the Java programming language.

  3. https://github.com/google/leveldb.

  4. http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html.

  5. SkipBloom aims to address Problem 1, BlockSketch targets Problem 2, while SBlockSketch tackles Problem 3.

  6. Henceforth, key and blocking key will be used interchangeably.

  7. SkipBloom locates these Bloom filters by performing a recursive process.

  8. Since we expect duplicate keys, the same key may naturally be stored in multiple Bloom filters of a block.

  9. We assume that the number of distinct blocking keys is n in both A and B.

  10. The distance, either between a pair of records or between a blocking key and a record, is determined by the distances between certain field values, some of which usually make up the blocking key.

  11. The exact number of representatives will be specified later.

  12. A representative, being essentially a blocking key, has only key values.

  13. For instance, LevelDB (see https://github.com/google/leveldb) uses a highly efficient in-memory multi-level data structure, which keeps the number of disk seeks logarithmic in the number of stored blocking keys.

  14. A live block is a block that is stored in main memory.

  15. The Hamming distance between two Bloom filters is equal to the number of positions at which the two filters have different bits.

  16. We use the term masking to refer to data obfuscation and perturbation operations that are applied to protect the plain-text (original) values.

  17. In general, in a multi-party model, there are more than two data custodians involved.

  18. http://www.hhs.gov/ocr/privacy/hipaa/administrative/privacyrule/.

  19. https://ec.europa.eu/info/law/law-topic/data-protection_en.

  20. http://www.phrn.org.au/centre-for-data-linkage/, http://www.cherel.org.au/, https://www.cprd.com/intro.asp, http://ww2.health.wa.gov.au/Articles/A_E/Data-linkage.

  21. A collision is the formation of a record pair in a certain hash table.

  22. Each hash table can be regarded as an independent Bernoulli trial for the collision of a Bloom filter pair.

  23. The collision threshold is essentially the number of collisions that must be counted between a pair of records for the pair to be considered a potential match.

  24. http://dblp.uni-trier.de/xml.

  25. http://dl.ncsbe.gov/index.html?prefix=data/.

  26. https://idash-data.ucsd.edu/community/43.

  27. In terms of the predefined clusters of matching records.

  28. https://github.com/google/leveldb.

  29. Using the double metaphone encoding method, ‘SMITH’ and ‘SMYTH’ are both encoded as ‘SM0’.

  30. The map structure is initialized for each record of Q.

  31. We had 32GB of main memory available.

  32. http://hadoop.apache.org/.

  33. http://cassandra.apache.org/.
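
As a small illustration of Note 15, the Hamming distance between two Bloom filters can be computed by XOR-ing them and counting the set bits. The sketch below assumes, for illustration only, that both filters have the same length and are stored as Python integers; the function name and the toy 8-bit filters are hypothetical and do not come from the article.

```python
def hamming_distance(filter_a: int, filter_b: int) -> int:
    """Count the bit positions at which two equal-length bit vectors differ.

    Assumption: each Bloom filter is packed into a non-negative Python int.
    """
    # XOR leaves a 1 exactly where the two filters disagree.
    return bin(filter_a ^ filter_b).count("1")


# Two toy 8-bit filters (hypothetical values, for illustration only).
f1 = 0b10110010
f2 = 0b10011010

# The filters differ in two positions.
distance = hamming_distance(f1, f2)
```

A smaller distance means the two filters, and hence the values they encode, are more alike.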

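Notes 22 and 23 can be made concrete with a short calculation. If each hash table is treated as an independent Bernoulli trial with the same collision probability, the number of collisions for a pair follows a binomial distribution, so the chance of reaching the collision threshold is a binomial tail. The sketch below is a minimal illustration under that assumption; the names num_tables, p, and threshold, and the values used, are hypothetical and are not the article's parameters.

```python
from math import comb


def prob_reaching_threshold(num_tables: int, p: float, threshold: int) -> float:
    """Return P[X >= threshold] for X ~ Binomial(num_tables, p).

    Assumptions (not from the article): every hash table has the same
    collision probability p, and the tables are independent.
    """
    return sum(
        comb(num_tables, k) * p**k * (1 - p) ** (num_tables - k)
        for k in range(threshold, num_tables + 1)
    )


# Hypothetical setting: 2 hash tables, collision probability 0.5 per table,
# and a pair counts as a potential match after at least 1 collision.
chance = prob_reaching_threshold(num_tables=2, p=0.5, threshold=1)
```

Raising the threshold lowers this probability for every pair, which trades recall for fewer candidate pairs.
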
Author information

Corresponding author

Correspondence to Dimitrios Karapiperis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Karapiperis, D., Gkoulalas-Divanis, A. & Verykios, V.S. Summarizing and linking electronic health records. Distrib Parallel Databases 39, 321–360 (2021). https://doi.org/10.1007/s10619-019-07263-0

  • DOI: https://doi.org/10.1007/s10619-019-07263-0