An effective rule miner for instance matching in a web of data

Published: 29 October 2012

Abstract

Publishing structured data and linking it to Linking Open Data (LOD) is an ongoing effort to create a Web of data. Each newly added data source may contain duplicated instances (entities) whose descriptions or schemata differ from those of the existing sources in LOD. To tackle this heterogeneity, several matching methods have been developed to link equivalent entities. Many general-purpose matching methods, which focus on similarity metrics, produce results whose quality varies widely across data source pairs. Dataset-specific methods, on the other hand, rely on heuristic rules or even manual effort to ensure quality, which makes it impossible to apply them to other sources or domains. In this paper, we offer a third choice: a general method for automatically discovering dataset-specific matching rules. In particular, we propose a semi-supervised learning algorithm that iteratively refines matching rules and uses them to find new high-confidence matches. This greatly relieves users of the burden of defining rules while still giving high-quality matching results. We carry out experiments on large-scale real-world data sources in LOD; the results show the effectiveness of our approach in terms of the precision of the discovered matches and the number of previously missing matches found. Furthermore, we discuss several extensions (such as similarity-embedded rules, class restriction, and SPARQL rewriting) that fit the approach to applications with different requirements.
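The iterative idea in the abstract (learn rules from known matches, apply the confident ones to find new matches, repeat) can be illustrated with a toy bootstrapping loop. This is only a sketch under stated assumptions and not the authors' algorithm: the abstract does not specify the rule form or the confidence measure, so the sketch assumes a rule is a property pair `(p, q)` meaning "equal values imply a match", estimates confidence as precision on already-matched entities, and uses hypothetical names (`candidate_pairs`, `rule_confidence`, `mine_matches`, `min_conf`) and made-up data throughout.

```python
from collections import defaultdict


def candidate_pairs(src, dst, p, q):
    """All (a, b) pairs where src[a][p] equals dst[b][q]."""
    index = defaultdict(list)
    for b, props in dst.items():
        if q in props:
            index[props[q]].append(b)
    return {
        (a, b)
        for a, props in src.items()
        if p in props
        for b in index.get(props[p], [])
    }


def rule_confidence(pairs, matches):
    """Assumed measure: precision of the rule on pairs whose left entity
    already has a known match (not taken from the paper)."""
    matched_left = {a for a, _ in matches}
    judged = [pair for pair in pairs if pair[0] in matched_left]
    if not judged:
        return 0.0
    return sum(pair in matches for pair in judged) / len(judged)


def mine_matches(src, dst, seeds, min_conf=0.9, max_iter=10):
    """Alternate between scoring rules on current matches and applying
    confident rules to still-unmatched entities, until nothing new appears."""
    matches = set(seeds)
    props_src = {p for props in src.values() for p in props}
    props_dst = {q for props in dst.values() for q in props}
    rules = {}
    for _ in range(max_iter):
        # Step 1: re-score every candidate rule against the current matches.
        rules = {}
        for p in props_src:
            for q in props_dst:
                pairs = candidate_pairs(src, dst, p, q)
                conf = rule_confidence(pairs, matches)
                if conf >= min_conf:
                    rules[(p, q)] = (conf, pairs)
        # Step 2: apply confident rules to entities that are still unmatched.
        matched_left = {a for a, _ in matches}
        matched_right = {b for _, b in matches}
        new = {
            (a, b)
            for _, pairs in rules.values()
            for a, b in pairs
            if a not in matched_left and b not in matched_right
        }
        if not new:
            break
        matches |= new
    return matches, {rule: conf for rule, (conf, _) in rules.items()}


# Toy data (invented for illustration only).
src = {
    "a1": {"name": "Berlin", "code": "BER"},
    "a2": {"name": "Paris", "code": "PAR"},
    "a3": {"name": "Rome", "code": "ROM"},
}
dst = {
    "b1": {"label": "Berlin", "iata": "BER"},
    "b2": {"label": "Paris", "iata": "PAR"},
    "b3": {"label": "Rome", "iata": "XXX"},
}
matches, rules = mine_matches(src, dst, seeds={("a1", "b1")})
```

Starting from a single seed match, the loop in this toy setting keeps the rules `("name", "label")` and `("code", "iata")` and extends the match set to the remaining entities. The sketch does not enforce one-to-one matching among newly added pairs and uses exact value equality only; a realistic system would need both conflict resolution and similarity-based comparison.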

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN:9781450311564
DOI:10.1145/2396761
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. association rule mining
  2. EM algorithm
  3. instance matching
  4. semi-supervised learning

Qualifiers

  • Research-article

Conference

CIKM'12
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 5
  • Downloads (last 6 weeks): 0

Reflects downloads up to 19 Feb 2025.

Cited By

  • (2023) NELLIE: Never-Ending Linking for Linked Open Data. IEEE Access 11, 84957-84973. DOI: 10.1109/ACCESS.2023.3300694. Online publication date: 2023.
  • (2023) SMAAMA: A named entity alignment method based on Siamese network character feature and multi-attribute importance feature for Chinese civil aviation. Journal of King Saud University - Computer and Information Sciences 35:10, 101856. DOI: 10.1016/j.jksuci.2023.101856. Online publication date: Dec-2023.
  • (2023) Leveraging multimodal features for knowledge graph entity alignment based on dynamic self-attention networks. Expert Systems with Applications: An International Journal 228:C. DOI: 10.1016/j.eswa.2023.120363. Online publication date: 15-Oct-2023.
  • (2023) Entity alignment via graph neural networks: a component-level study. World Wide Web 26:6, 4069-4092. DOI: 10.1007/s11280-023-01221-8. Online publication date: 29-Nov-2023.
  • (2023) ODKG: An Official Document Knowledge Graph for the Effective Management. Knowledge Graph and Semantic Computing: Knowledge Graph Empowers Artificial General Intelligence, 220-232. DOI: 10.1007/978-981-99-7224-1_17. Online publication date: 28-Oct-2023.
  • (2022) Construction and Application of a Knowledge Graph for Gold Deposits in the Jiapigou Gold Metallogenic Belt, Jilin Province, China. Minerals 12:9, 1173. DOI: 10.3390/min12091173. Online publication date: 17-Sep-2022.
  • (2022) Cross-lingual knowledge graph entity alignment based on relation awareness and attribute involvement. Applied Intelligence 53:6, 6159-6177. DOI: 10.1007/s10489-022-03797-6. Online publication date: 6-Jul-2022.
  • (2021) An ontology matching approach for semantic modeling: A case study in smart cities. Computational Intelligence 38:3, 876-902. DOI: 10.1111/coin.12474. Online publication date: 15-Jul-2021.
  • (2021) Multi-information embedding based entity alignment. Applied Intelligence. DOI: 10.1007/s10489-021-02400-8. Online publication date: 16-Apr-2021.
  • (2021) A scalable parallel Chinese online encyclopedia knowledge denoising method based on entry tags and Spark cluster. Applied Intelligence. DOI: 10.1007/s10489-021-02295-5. Online publication date: 20-Mar-2021.
