Summarizing and linking electronic health records

Karapiperis, Dimitrios; Gkoulalas-Divanis, Aris; Verykios, Vassilios S.

doi:10.1007/s10619-019-07263-0

Published: 18 March 2019

Volume 39, pages 321–360 (2021)

Distributed and Parallel Databases

Dimitrios Karapiperis ORCID: orcid.org/0000-0002-3878-5988¹,
Aris Gkoulalas-Divanis² &
Vassilios S. Verykios¹

Abstract

In recent years, several applications have emerged that require access to consolidated information that has to be computed and provided in near real-time. Traditional record linkage algorithms cannot support such time-critical applications, as they perform the linkage offline and provide the result set only when the entire process has completed. To address this need, we propose the first summarization algorithms that operate in the blocking and matching steps of record linkage to speed up online linkage tasks. Our first method, called SkipBloom, efficiently summarizes the participating data sets, using their blocking keys, to allow for very fast comparisons among them. The second method, called BlockSketch, summarizes a block to achieve a constant number of comparisons for a submitted query record during the matching phase. Our third method, SBlockSketch, operates on data streams, where the entire data set is not known a priori and there is instead a potentially unbounded stream of incoming data records. Finally, we introduce PBlockSketch, which adapts BlockSketch to privacy-preserving settings. Through extensive experimental evaluation using real-world data sets, we show that our methods outperform the state-of-the-art algorithms for online record linkage in terms of the time needed, the memory used, and the recall and precision rates achieved during the linkage process. Following the evaluation of our approaches, we introduce SFEMRL, a novel framework that uses them to enable the linkage of electronic health records at large scale while respecting patients’ privacy. Under this framework, patient records first undergo a data masking process that perturbs sensitive information in the records' data fields to protect it. Subsequently, they enter a parallel and distributed ecosystem whose goal is to persist these records so that they can be queried efficiently and accurately. We demonstrate that the integration of our framework with Map/Reduce offers robust distributed solutions for performing on-demand large-scale privacy-preserving record linkage tasks in the health domain.
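
To give a feel for why summarizing blocking keys can make comparisons between data sets cheap, the following is a minimal sketch of a plain Bloom filter over blocking keys. It is not the SkipBloom structure itself, whose details are not given on this page. The class `BloomFilter`, the function `candidate_keys`, the filter size, the number of hash functions, and the use of salted SHA-256 are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over string keys (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits      # assumed size, not taken from the paper
        self.num_hashes = num_hashes  # assumed count, not taken from the paper
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def candidate_keys(keys_a, keys_b):
    """Return the keys of B that may also occur in A, probing a summary of A."""
    summary = BloomFilter()
    for key in keys_a:
        summary.add(key)
    return [key for key in keys_b if summary.might_contain(key)]
```

Calling `candidate_keys(["SM0", "JNS"], ["SM0", "PTRS"])` always returns a list containing `"SM0"`, because a Bloom filter has no false negatives. A key that is absent from A may occasionally be returned as well, so the result is a candidate set that still needs verification.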

Notes

As long as tails come up, we add the item to each successive list. We terminate this process when we encounter heads.
A concrete class implements the inherited methods of an abstract class and/or interfaces in the Java programming language.
https://github.com/google/leveldb.
http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html.
SkipBloom aims to address Problem 1, BlockSketch targets Problem 2, while SBlockSketch tackles Problem 3.
Henceforth, key and blocking key will be used interchangeably.
SkipBloom locates these Bloom filters by performing a recursive process.
Since we expect to have duplicate keys, it is quite natural that the same key may be stored in multiple Bloom filters of a block.
We assume that the number of distinct blocking keys is n in both A and B.
The distance, either between a pair of records or between a blocking key and a record, is determined by the distances between certain field values, some of which usually make up the blocking key.
The exact number of representatives will be specified later.
A representative, being essentially a blocking key, has only key values.
For instance, LevelDB (see https://github.com/google/leveldb) uses a highly efficient in-memory multi-level data structure, which keeps the number of disk seeks logarithmic in the number of stored blocking keys.
A live block is a block that is stored in main memory.
The Hamming distance between two Bloom filters is equal to the number of components in which these filters have different bits.
We use the term masking to refer to data obfuscation and perturbation operations that are applied to protect the plain-text (original) values.
In general, in a multi-party model, there are more than two data custodians involved.
http://www.hhs.gov/ocr/privacy/hipaa/administrative/privacyrule/.
https://ec.europa.eu/info/law/law-topic/data-protection_en.
http://www.phrn.org.au/centre-for-data-linkage/, http://www.cherel.org.au/, https://www.cprd.com/intro.asp, http://ww2.health.wa.gov.au/Articles/A_E/Data-linkage.
A collision is the formation of a record pair in a certain hash table.
Each hash table can be regarded as an independent Bernoulli trial for the collision of a Bloom filter pair.
The collision threshold is essentially the number of collisions that must be counted between a pair of records for the pair to be considered a potential matching pair.
http://dblp.uni-trier.de/xml.
http://dl.ncsbe.gov/index.html?prefix=data/.
https://idash-data.ucsd.edu/community/43.
in terms of the predefined clusters of matching records.
https://github.com/google/leveldb.
Using the double metaphone encoding method, ‘SMITH’ and ‘SMYTH’ are both encoded as ‘SM0’.
The map structure is initialized for each record of Q.
We had 32 GB of main memory available.
http://hadoop.apache.org/.
http://cassandra.apache.org/.
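
Two of the notes above describe small procedures that can be sketched directly: the coin-flipping rule that decides how many lists of a skip list an item joins, and the Hamming distance between two Bloom filters. The sketch below is only an illustration under stated assumptions: the function names, the `max_height` cap, the rule that every item is in the bottom list, and the representation of a Bloom filter as a list of 0/1 values are hypothetical and do not come from the article.

```python
import random


def skip_list_height(rng, max_height=32):
    """Number of lists an item joins: keep going on tails, stop on heads."""
    height = 1  # assumption: every item is stored in the bottom list
    # rng.random() < 0.5 plays the role of a fair coin landing tails.
    while height < max_height and rng.random() < 0.5:
        height += 1
    return height


def hamming_distance(filter_a, filter_b):
    """Count the positions in which two equal-length bit sequences differ."""
    if len(filter_a) != len(filter_b):
        raise ValueError("Bloom filters must have the same length")
    return sum(a != b for a, b in zip(filter_a, filter_b))
```

For example, `hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])` returns 2, since the filters differ in the second and third components. Passing a seeded `random.Random` instance to `skip_list_height` keeps the coin flips reproducible.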

Author information

Authors and Affiliations

Hellenic Open University, Patras, Greece
Dimitrios Karapiperis & Vassilios S. Verykios
IBM Watson Health, Cambridge, MA, USA
Aris Gkoulalas-Divanis

Corresponding author

Correspondence to Dimitrios Karapiperis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Karapiperis, D., Gkoulalas-Divanis, A. & Verykios, V.S. Summarizing and linking electronic health records. Distrib Parallel Databases 39, 321–360 (2021). https://doi.org/10.1007/s10619-019-07263-0

Issue Date: June 2021
