DOI: 10.1145/1967486.1967557
research-article

The missing links: discovering hidden same-as links among a billion of triples

Published: 08 November 2010

Abstract

The Semantic Web is constantly gaining momentum, as more and more Web sites and content providers adopt its principles. At the core of these principles lies the Linked Data movement, which demands that data on the Web be annotated and linked across different sources instead of being isolated in data silos. To materialize this vision of a web of semantics, existing resource identifiers should be reused and shared between different Web sites. This is not always the case on the current Semantic Web: more often than not, multiple identifiers are redundantly introduced for the same resource.
In this paper we introduce a novel approach that automatically detects redundant identifiers solely by matching the URIs of information resources. The approach, based on a common pattern among Semantic Web URIs, provides a simple and practical method for duplicate detection. We apply this method to a large snapshot of the current Semantic Web comprising 1.15 billion statements and estimate the number of hidden duplicates in it. The outcomes of our experiments confirm the effectiveness as well as the efficiency of our method, and suggest that URI matching can be used as a scalable filter for discovering implicit same-as links.
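
The abstract does not spell out the URI pattern the method relies on, so the following is only a minimal sketch of what URI-based duplicate filtering could look like, not the authors' algorithm. It assumes, purely for illustration, that two URIs from different hosts are candidate duplicates when their last path segment or fragment matches after lowercasing and stripping a file extension. The names `local_key`, `candidate_pairs`, and the `SUFFIXES` list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit, unquote

# Hypothetical list of file suffixes to strip; not taken from the paper.
SUFFIXES = (".rdf", ".html", ".xml", ".nt")


def local_key(uri: str) -> str:
    """Return a normalized local identifier for a URI (illustrative heuristic)."""
    parts = urlsplit(uri)
    if parts.fragment:
        # Prefer the fragment when one is present.
        raw = parts.fragment
    else:
        # Otherwise use the last non-empty path segment.
        segments = [s for s in parts.path.split("/") if s]
        raw = segments[-1] if segments else ""
    key = unquote(raw).lower()
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return key


def candidate_pairs(uris):
    """Group URIs by local key and return sorted pairs that span different hosts."""
    groups = defaultdict(list)
    for uri in uris:
        key = local_key(uri)
        if key:
            groups[key].append(uri)
    pairs = []
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            # Same-host matches are skipped: the sketch targets cross-source links.
            if urlsplit(a).netloc != urlsplit(b).netloc:
                pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [
        "http://example.org/resource/Berlin",
        "http://example.com/data/berlin.rdf",
        "http://example.org/resource/Paris",
        "http://example.net/page#Paris",
    ]
    for a, b in candidate_pairs(sample):
        print(a, "<->", b)
```

Grouping by a key keeps the work close to linear in the number of URIs, which is why a filter of this kind could plausibly serve as a cheap first pass before a more expensive comparison of the resources themselves.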

          Published In

          iiWAS '10: Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
          November 2010
          895 pages
          ISBN: 9781450304214
          DOI: 10.1145/1967486

          Sponsors

          • IIWAS: International Organization for Information Integration
          • Web-b: Web-b

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. URI matching
          2. entity resolution
          3. large scale information integration
          4. web data integration


