Linking temporal records

  • Research Article
  • Frontiers of Computer Science

Abstract

Many data sets contain temporal records that span a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, so that we can perform longitudinal data analysis. However, existing record linkage techniques ignore temporal information and fall short for temporal data.

This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
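The two ideas in the abstract can be illustrated with a small Python sketch. It is not the article's algorithm: the exponential decay with a `half_life` parameter, the `Record` fields, the threshold of 0.6, and the greedy single-pass clustering are all assumptions chosen for illustration, and the greedy pass is simpler than the global clustering methods the article proposes. The sketch only shows how a disagreement on a changeable attribute (here, affiliation) can count for less as the time gap between two records grows, and how records can be processed in time order.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    affiliation: str
    year: int


def decayed_similarity(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Similarity in [0, 1] where affiliation evidence fades with elapsed time."""
    name_sim = 1.0 if a.name == b.name else 0.0
    aff_sim = 1.0 if a.affiliation == b.affiliation else 0.0
    gap = abs(a.year - b.year)
    # Assumed decay model: the weight of the affiliation comparison
    # halves every `half_life` years of elapsed time.
    weight = 0.5 ** (gap / half_life)
    # Weighted average: name has fixed weight 1, affiliation a decayed weight.
    return (name_sim + weight * aff_sim) / (1.0 + weight)


def link_in_time_order(records, threshold: float = 0.6, half_life: float = 5.0):
    """Greedy pass over records sorted by year.

    Each record joins the cluster whose most recent record is most similar,
    provided the similarity reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for rec in sorted(records, key=lambda r: r.year):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = decayed_similarity(cluster[-1], rec, half_life)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([rec])
        else:
            best.append(rec)
    return clusters


if __name__ == "__main__":
    records = [
        Record("X. Dong", "UW", 2004),
        Record("X. Dong", "AT&T", 2009),
        Record("X. Dong", "UW", 2005),
        Record("J. Smith", "MIT", 2005),
    ]
    for cluster in link_in_time_order(records):
        print([(r.name, r.affiliation, r.year) for r in cluster])
```

Under these assumptions, two records with the same name but different affiliations in the same year score 0.5 and stay apart, while the same disagreement four years apart scores higher and the records are linked. A non-temporal comparison would treat both cases identically.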

Author information

Corresponding author

Correspondence to Pei Li.

Additional information

Pei Li is a PhD student at Università di Milano Bicocca. Currently she is nearing the completion of her doctoral thesis in Computer Science. Previously, she studied electronic engineering at Beijing University of Posts and Telecommunications, where she received her BS and MS degrees. Her research interests are data integration and record linkage, with special focus on entity resolution with value inconsistency.

Xin Luna Dong is a researcher at AT&T Labs-Research. She received her PhD from the University of Washington in 2007, her Master’s degree from Peking University in China in 2001, and her Bachelor’s degree from Nankai University in China in 1998. Her research interests include databases, information retrieval, and machine learning, with an emphasis on data integration, data cleaning, personal information management, and Web search. She is co-chairing the SIGMOD/PODS PhD Symposium 2012, the SIGMOD New Researcher Symposium 2012, and QDB (Quality of DataBases) 2012; has co-chaired WebDB’10; was a co-editor of the IEEE Data Engineering special issue on Towards quality data with fusion and cleaning; and has served on the program committees of VLDB’12, SIGMOD’12, VLDB’11, SIGMOD’11, VLDB’10, WWW’10, ICDE’10, and VLDB’09.

Andrea Maurino is an assistant professor at Università di Milano Bicocca. His research interests cover many areas in the field of database systems and service science. In the field of data quality, his research interests are record linkage, cooperative information systems, and assessment techniques for data-intensive web applications. In the field of service science, he focuses on the analysis of quality of services and non-functional properties. He is the author of more than 40 papers in international journals and conferences, as well as 4 book chapters. He was program co-chair of the QDB’09 workshop and guest editor of IEEE Internet Computing in 2010. He is a reviewer for several journals including Information Systems, Knowledge Data and Engineering.

Divesh Srivastava is the head of the Database Research Department at AT&T Labs-Research. He received his PhD from the University of Wisconsin, Madison, and his BTech from the Indian Institute of Technology, Bombay. He is a Fellow of the ACM, on the board of trustees of the VLDB Endowment, and an associate editor of the ACM Transactions on Database Systems. He has served as the associate Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering, and the program committee co-chair of many conferences, including VLDB 2007. He has presented keynote talks at several conferences, including VLDB 2010. His research interests span a variety of topics in data management.

About this article

Cite this article

Li, P., Dong, X.L., Maurino, A. et al. Linking temporal records. Front. Comput. Sci. 6, 293–312 (2012). https://doi.org/10.1007/s11704-012-2002-5

  • DOI: https://doi.org/10.1007/s11704-012-2002-5