Linking temporal records

Li, Pei; Dong, Xin Luna; Maurino, Andrea; Srivastava, Divesh

doi:10.1007/s11704-012-2002-5

Research Article
Published: 20 May 2012

Volume 6, pages 293–312 (2012)

Frontiers of Computer Science

Abstract

Many data sets contain temporal records that span a long period of time; each record carries a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify the records that describe the same entity over time, so that we can perform interesting longitudinal data analysis. However, existing record linkage techniques ignore temporal information and therefore fall short on temporal data.

This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
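
The two ideas in this paragraph can be illustrated with a small sketch. It is not the authors' algorithm: the abstract gives no formulas, so the exponential form of the decay, the half_life and threshold parameters, the equal attribute weights, and the names Record, disagreement_weight, decayed_similarity, and link_in_time_order are all assumptions made for illustration. The sketch only shows how a value mismatch can count for less as more time elapses between two records, and how records can be processed in time order rather than compared pairwise in isolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical temporal record: two attributes plus a time stamp."""
    name: str
    affiliation: str
    year: int


def disagreement_weight(gap_years: float, half_life: float = 5.0) -> float:
    """Weight given to a value mismatch; halves every `half_life` years.

    The exponential form is an assumption, not taken from the article.
    """
    return 0.5 ** (gap_years / half_life)


def decayed_similarity(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Similarity in [0, 1] where an affiliation mismatch fades with time."""
    gap = abs(a.year - b.year)
    name_sim = 1.0 if a.name == b.name else 0.0
    if a.affiliation == b.affiliation:
        aff_sim = 1.0
    else:
        # A mismatch long after the earlier record is weak evidence of a
        # different entity, because the value may simply have changed.
        aff_sim = 1.0 - disagreement_weight(gap, half_life)
    return 0.5 * name_sim + 0.5 * aff_sim


def link_in_time_order(records, threshold: float = 0.7):
    """Greedy pass over records sorted by time stamp (illustrative only).

    Each record joins the cluster whose most recent record is most
    similar to it, provided the score reaches `threshold`; otherwise it
    starts a new cluster.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: r.year):
        best, best_score = None, threshold
        for cluster in clusters:
            score = decayed_similarity(cluster[-1], rec)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters
```

With these assumed settings, two records that share a name but differ in affiliation one year apart score below the threshold, while the same mismatch ten years apart scores above it, which is the intended effect of the decay. A greedy pass like this is much simpler than clustering methods that make global decisions; it is meant only to make the role of time order concrete.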

Author information

Authors and Affiliations

Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, 20126, Italy
Pei Li & Andrea Maurino
Data Management Department, AT&T Labs-Research, Florham Park, NJ, 07932, USA
Xin Luna Dong & Divesh Srivastava

Corresponding author

Correspondence to Pei Li.

Additional information

Pei Li is a PhD student at Università di Milano Bicocca. Currently she is nearing the completion of her doctoral thesis in Computer Science. Previously, she studied electronic engineering at Beijing University of Posts and Telecommunications, where she received her BS and MS degrees. Her research interests are data integration and record linkage, with special focus on entity resolution with value inconsistency.

Xin Luna Dong is a researcher at AT&T Labs-Research. She received her PhD from the University of Washington in 2007, her Master’s degree from Peking University in China in 2001, and her Bachelor’s degree from Nankai University in China in 1998. Her research interests include databases, information retrieval, and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She is co-chairing the Sigmod/PODS PhD Symposium 2012, the Sigmod New Researcher Symposium 2012, and QDB (Quality of DataBases) 2012; has co-chaired WebDB’10; was a co-editor of the IEEE Data Engineering special issue “Towards quality data with fusion and cleaning”; and has served on the program committees of VLDB’12, Sigmod’12, VLDB’11, Sigmod’11, VLDB’10, WWW’10, ICDE’10, and VLDB’09.

Andrea Maurino is an assistant professor at Università di Milano Bicocca. His research interests cover many areas in the fields of database systems and service science. In the field of data quality, his research interests are record linkage, cooperative information systems, and assessment techniques for data-intensive web applications. In the field of service science, he focuses on the analysis of quality of services and non-functional properties. He is the author of more than 40 papers in international journals and conferences, and of 4 book chapters. He was program co-chair of the QDB’09 workshop and guest editor of IEEE Internet Computing in 2010. He is a reviewer for several journals including Information Systems, Knowledge Data and Engineering.

Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his PhD from the University of Wisconsin, Madison, and his BTech from the Indian Institute of Technology, Bombay. He is a Fellow of the ACM, on the board of trustees of the VLDB Endowment, and an associate editor of the ACM Transactions on Database Systems. He has served as the associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and the program committee co-chair of many conferences, including VLDB 2007. He has presented keynote talks at several conferences, including VLDB 2010. His research interests span a variety of topics in data management.

About this article

Cite this article

Li, P., Dong, X.L., Maurino, A. et al. Linking temporal records. Front. Comput. Sci. 6, 293–312 (2012). https://doi.org/10.1007/s11704-012-2002-5

Received: 01 January 2012
Accepted: 02 February 2012
Published: 20 May 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s11704-012-2002-5
