DOI: 10.1145/1998076.1998094
research-article

Detecting and exploiting stability in evolving heterogeneous information spaces

Published: 13 June 2011 Publication History

Abstract

Individuals contribute content to the Web at an unprecedented rate, accumulating immense quantities of (semi-)structured data. According to the Wisdom of the Crowds theory, such information (or parts of it) is constantly overwritten, updated, or even deleted by other users, with the goal of rendering it more accurate or up-to-date. This is particularly true for the collaboratively edited, semi-structured data of entity repositories, whose entity profiles are consistently kept fresh. Therefore, the core information of these profiles that remains stable over time, despite being reviewed by numerous users, is particularly useful for describing an entity.
Based on this hypothesis, we introduce a classification scheme that predicts, on the basis of statistical and content patterns, whether an attribute (i.e., a name-value pair) is going to be modified in the future. We apply our scheme to a large, real-world, versioned dataset and verify its effectiveness. Our thorough experimental study also suggests that reducing entity profiles to their stable parts conveys significant benefits to two common tasks in computer science: information retrieval and information integration.
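
To make the notion of a "stable part" of an entity profile concrete, the following is a minimal sketch of a persistence-threshold heuristic over a versioned profile. It is not the classification scheme of the paper, which is trained on statistical and content patterns to predict future modifications; it only illustrates the data model (versions of name-value pairs) and the idea of reducing a profile to the attributes that have survived most revisions. The names `attribute_lifespans`, `stable_part`, and `min_fraction`, as well as the toy data, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One version of an entity profile: attribute name -> attribute value.
Profile = Dict[str, str]


def attribute_lifespans(versions: List[Profile]) -> Dict[Tuple[str, str], int]:
    """Count, for each name-value pair, the number of versions containing it."""
    lifespans: Dict[Tuple[str, str], int] = {}
    for profile in versions:
        for pair in profile.items():
            lifespans[pair] = lifespans.get(pair, 0) + 1
    return lifespans


def stable_part(versions: List[Profile], min_fraction: float = 0.8) -> Profile:
    """Reduce the latest version to its persistent name-value pairs.

    A pair is kept if it appears, unchanged, in at least `min_fraction`
    of all versions (a hypothetical threshold, not taken from the paper).
    """
    if not versions:
        return {}
    lifespans = attribute_lifespans(versions)
    latest = versions[-1]
    return {
        name: value
        for name, value in latest.items()
        if lifespans[(name, value)] / len(versions) >= min_fraction
    }


# Toy revision history of a single, made-up entity profile.
toy_versions = [
    {"name": "Example City", "population": "100"},
    {"name": "Example City", "population": "120"},
    {"name": "Example City", "population": "120"},
    {"name": "Example City", "population": "150", "mayor": "J. Doe"},
]

print(stable_part(toy_versions))
```

In this toy history only the `name` pair is present in every version, so it is the only attribute retained; the frequently edited `population` and the newly added `mayor` are dropped. A learned classifier, as described in the abstract, would replace the fixed threshold with a prediction of whether each pair is going to be modified.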

      Published In

      JCDL '11: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
      June 2011
      500 pages
      ISBN:9781450307444
      DOI:10.1145/1998076
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. entity evolution
      2. n-gram graphs
      3. stability detection

      Qualifiers

      • Research-article

      Conference

      JCDL '11
      JCDL '11: Joint Conference on Digital Libraries
      June 13 - 17, 2011
      Ottawa, Ontario, Canada

      Acceptance Rates

      Overall Acceptance Rate 415 of 1,482 submissions, 28%

      Cited By

      • (2021) The Four Generations of Entity Resolution. Synthesis Lectures on Data Management 16:2 (1-170). DOI: 10.2200/S01067ED1V01Y202012DTM064. Online publication date: 15-Mar-2021
      • (2020) Blocking and Filtering Techniques for Entity Resolution. ACM Computing Surveys 53:2 (1-42). DOI: 10.1145/3377455. Online publication date: 20-Mar-2020
      • (2020) A Survey on Blocking Technology of Entity Resolution. Journal of Computer Science and Technology 35:4 (769-793). DOI: 10.1007/s11390-020-0350-4. Online publication date: 27-Jul-2020
      • (2019) EMBench++. Semantic Web 10:2 (435-450). DOI: 10.3233/SW-180331. Online publication date: 1-Jan-2019
      • (2013) A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Transactions on Knowledge and Data Engineering 25:12 (2665-2682). DOI: 10.1109/TKDE.2012.150. Online publication date: 1-Dec-2013
      • (2012) dbTrento. ACM SIGMOD Record 41:3 (28-33). DOI: 10.1145/2380776.2380784. Online publication date: 5-Oct-2012
      • (2012) On Generating Benchmark Data for Entity Matching. Journal on Data Semantics 2:1 (37-56). DOI: 10.1007/s13740-012-0015-8. Online publication date: 20-Nov-2012
