Abstract
Many data sets contain temporal records that span a long period of time; each record carries a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, so that we can perform longitudinal data analysis. However, existing record linkage techniques ignore temporal information and therefore fall short on temporal data.
This article studies the problem of linking temporal records. First, we apply time decay to capture the effect of elapsed time on the evolution of entity values. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
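The two ideas in the abstract can be illustrated with a minimal sketch. It is not the algorithm from the article: the exponential form of the decay, the `half_life` and `threshold` parameters, the `Record` fields, and the greedy single-pass clustering are all assumptions made for illustration. A value disagreement is weighted less as the time gap between two records grows, and records are assigned to clusters in time order rather than by isolated pairwise comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical schema, loosely modeled on author records such as those in DBLP.
    name: str
    affiliation: str
    year: int


def disagreement_weight(gap_years, half_life=5.0):
    """Assumed exponential decay: the weight of a mismatch halves every half_life years."""
    return 0.5 ** (abs(gap_years) / half_life)


def decayed_similarity(a, b, half_life=5.0):
    """Weighted average of attribute similarities.

    The name always has weight 1. An affiliation mismatch has a weight that
    shrinks with the time gap, so an old mismatch pulls the score down less.
    """
    gap = abs(a.year - b.year)
    name_sim = 1.0 if a.name == b.name else 0.0
    if a.affiliation == b.affiliation:
        aff_sim, aff_weight = 1.0, 1.0
    else:
        aff_sim, aff_weight = 0.0, disagreement_weight(gap, half_life)
    return (name_sim + aff_weight * aff_sim) / (1.0 + aff_weight)


def link_in_time_order(records, threshold=0.6, half_life=5.0):
    """Greedy sketch: visit records by time stamp and attach each one to the
    cluster whose latest record is most similar, or start a new cluster."""
    clusters = []
    for rec in sorted(records, key=lambda r: r.year):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = decayed_similarity(cluster[-1], rec, half_life)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters


records = [
    Record("X. Dong", "UW", 2004),
    Record("Y. Li", "UW", 2005),
    Record("X. Dong", "UW", 2006),
    Record("X. Dong", "AT&T", 2011),
]
clusters = link_in_time_order(records)
```

In this toy input the 2011 record is linked to the earlier "X. Dong" records even though the affiliation changed, because the five-year gap halves the weight of that mismatch; the same mismatch within a single year would score 0.5 and fall below the assumed threshold.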
Author information
Pei Li is a PhD student at Università di Milano Bicocca. Currently she is nearing the completion of her doctoral thesis in Computer Science. Previously, she studied electronic engineering at Beijing University of Posts and Telecommunications, where she received her BS and MS degrees. Her research interests are data integration and record linkage, with special focus on entity resolution with value inconsistency.
Xin Luna Dong is a researcher at AT&T Labs-Research. She received her PhD from the University of Washington in 2007, her Master’s degree from Peking University in China in 2001, and her Bachelor’s degree from Nankai University in China in 1998. Her research interests include databases, information retrieval, and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She is co-chairing the Sigmod/PODS PhD Symposium 2012, the Sigmod New Researcher Symposium 2012, and QDB (Quality of DataBases) 2012; has co-chaired WebDB’10; was a co-editor of the IEEE Data Engineering special issue on Towards quality data with fusion and cleaning; and has served on the program committees of VLDB’12, Sigmod’12, VLDB’11, Sigmod’11, VLDB’10, WWW’10, ICDE’10, and VLDB’09.
Andrea Maurino is an assistant professor at Università di Milano Bicocca. His research interests cover many areas in the field of database systems and service science. In the field of data quality, his research interests are record linkage, cooperative information systems, and assessment techniques for data-intensive web applications. In the field of service science, he focuses on the analysis of quality of services and non-functional properties. He is the author of more than 40 papers in international journals and conferences, as well as 4 book chapters. He was program co-chair of the QDB’09 workshop and guest editor of IEEE Internet Computing in 2010. He is a reviewer for several journals, including Information Systems, Knowledge Data and Engineering.
Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his PhD from the University of Wisconsin, Madison, and his BTech from the Indian Institute of Technology, Bombay. He is a Fellow of the ACM, on the board of trustees of the VLDB Endowment, and an associate editor of the ACM Transactions on Database Systems. He has served as the associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and the program committee co-chair of many conferences, including VLDB 2007. He has presented keynote talks at several conferences, including VLDB 2010. His research interests span a variety of topics in data management.
Cite this article
Li, P., Dong, X.L., Maurino, A. et al. Linking temporal records. Front. Comput. Sci. 6, 293–312 (2012). https://doi.org/10.1007/s11704-012-2002-5