Active instance matching with pairwise constraints and its application to Chinese knowledge base construction

Lu, Weiming; Dai, Hao; Zhang, Zhenyu; Wu, Chao; Zhuang, Yueting

doi:10.1007/s10115-017-1076-7

Regular Paper
Published: 30 June 2017

Volume 55, pages 171–214, (2018)

Knowledge and Information Systems

Weiming Lu¹,
Hao Dai¹,
Zhenyu Zhang¹,
Chao Wu² &
…
Yueting Zhuang¹

Abstract

Instance matching is the problem of determining whether two instances describe the same real-world entity. It plays a key role in data integration and data cleansing, especially when building a knowledge base. For example, we can regard each article in an encyclopedia as an instance, and a group of articles that refer to the same real-world object as an entity. Articles about Washington should therefore be distinguished and grouped into different entities such as Washington, D.C. (the capital of the USA), George Washington (the first president of the USA), Washington (a state of the USA), Washington (a village in West Sussex, England), Washington F.C. (a football club based in Washington, Tyne and Wear, England) and Washington, D.C. (a novel). In this paper, we propose a novel instance matching approach, Active Instance Matching with Pairwise Constraints, which brings the human into the loop of instance matching. The approach generates candidate pairs in advance to reduce the computational complexity, and then iteratively selects the most informative pairs according to the uncertainty, influence, connectivity and diversity of pairs. We evaluated our approach on two publicly available datasets, AMINER and WIDE-NUS, and then applied it to two large-scale real-world datasets, Baidu Baike and Hudong Baike, to build a Chinese knowledge base. The experiments and practice illustrate the effectiveness of our approach.
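
To make the two-stage idea in the abstract concrete, the following is a minimal sketch and not the authors' method. It assumes token blocking for candidate generation, Jaccard similarity over title tokens as a stand-in matcher, and a greedy selection that combines uncertainty with a crude diversity rule; the influence and connectivity criteria are omitted because the abstract does not define them. All function names (`candidate_pairs`, `jaccard`, `uncertainty`, `select_pairs`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(instances):
    """Token blocking (assumed): pair only instances sharing a title token."""
    blocks = defaultdict(set)
    for idx, title in enumerate(instances):
        for token in title.lower().split():
            blocks[token].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return sorted(pairs)


def jaccard(a, b):
    """Stand-in similarity (assumed): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def uncertainty(sim):
    """1.0 when the similarity is 0.5, falling to 0.0 at similarity 0 or 1."""
    return 1.0 - abs(2.0 * sim - 1.0)


def select_pairs(instances, pairs, batch_size):
    """Greedily take the most uncertain pairs; skip any pair that reuses an
    instance already chosen in this batch (a crude diversity heuristic)."""
    ranked = sorted(
        pairs,
        key=lambda p: uncertainty(jaccard(instances[p[0]], instances[p[1]])),
        reverse=True,
    )
    chosen, used = [], set()
    for i, j in ranked:
        if i in used or j in used:
            continue
        chosen.append((i, j))
        used.update((i, j))
        if len(chosen) == batch_size:
            break
    return chosen


titles = ["Washington D.C.", "George Washington", "Washington state", "Paris France"]
pairs = candidate_pairs(titles)          # only the three Washington titles are paired
queries = select_pairs(titles, pairs, 2)  # pairs a human annotator would be asked about
```

In a full active loop, the labels returned for `queries` would become must-link or cannot-link constraints, and the selection would be repeated on the remaining candidate pairs.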

Acknowledgements

This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020015), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and the Fundamental Research Funds for the Central Universities (No. 2017FZA5016).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou Shi, China
Weiming Lu, Hao Dai, Zhenyu Zhang & Yueting Zhuang
Imperial College London, London, UK
Chao Wu

Corresponding author

Correspondence to Weiming Lu.

About this article

Cite this article

Lu, W., Dai, H., Zhang, Z. et al. Active instance matching with pairwise constraints and its application to Chinese knowledge base construction. Knowl Inf Syst 55, 171–214 (2018). https://doi.org/10.1007/s10115-017-1076-7

Received: 30 May 2016
Revised: 28 May 2017
Accepted: 16 June 2017
Published: 30 June 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s10115-017-1076-7
