Abstract
Instance matching is the problem of determining whether two instances describe the same real-world entity. It plays a key role in data integration and data cleansing, especially when building a knowledge base. For example, we can regard each article in an encyclopedia as an instance, and a group of articles that refer to the same real-world object as an entity. Articles about Washington should therefore be distinguished and grouped into different entities such as Washington, D.C. (the capital of the USA), George Washington (the first president of the USA), Washington (a state of the USA), Washington (a village in West Sussex, England), Washington F.C. (a football club based in Washington, Tyne and Wear, England), and Washington, D.C. (a novel). In this paper, we propose a novel instance matching approach, Active Instance Matching with Pairwise Constraints, which brings humans into the loop of instance matching. The proposed approach generates candidate pairs in advance to reduce the computational complexity, and then iteratively selects the most informative pairs according to the uncertainty, influence, connectivity and diversity of the pairs. We evaluated our approach on two publicly available datasets, AMINER and WIDE-NUS, and then applied it to two large-scale real-world datasets, Baidu Baike and Hudong Baike, to build a Chinese knowledge base. The experiments and the practical application illustrate the effectiveness of our approach.
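To make the two-stage idea in the abstract concrete, the following is a minimal sketch, not the method of the paper. It assumes token blocking for candidate generation and Jaccard similarity as a stand-in matcher, and it ranks pairs by uncertainty only (distance of the similarity from an assumed 0.5 decision boundary). The influence, connectivity and diversity criteria are omitted, and all function names and the toy articles are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens of an instance description."""
    return set(text.lower().split())


def candidate_pairs(instances):
    """Token blocking (an assumption): pair only instances sharing a token."""
    blocks = defaultdict(set)
    for idx, text in enumerate(instances):
        for tok in tokens(text):
            blocks[tok].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of the token sets, used here as a stand-in matcher."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def most_uncertain(instances, pairs, k):
    """Return the k pairs whose similarity is closest to the assumed 0.5 boundary."""
    scored = [
        (abs(jaccard(instances[i], instances[j]) - 0.5), (i, j))
        for i, j in pairs
    ]
    scored.sort()  # ties are broken by the pair indices
    return [pair for _, pair in scored[:k]]


# Hypothetical toy instances, loosely following the Washington example.
articles = [
    "washington state usa",
    "washington dc capital usa",
    "george washington president usa",
    "paris capital france",
]
pairs = candidate_pairs(articles)
queries = most_uncertain(articles, pairs, k=2)
```

In this toy run, blocking keeps four of the six possible pairs, and the two pairs selected for human labeling are those whose similarity lies nearest the boundary. A real system would replace the similarity function with a learned model and combine several selection criteria.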
Notes
State of the LOD Cloud 2014, http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/.
at Nov, 2015.
Acknowledgements
This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020015), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and the Fundamental Research Funds for the Central Universities (No. 2017FZA5016).
Cite this article
Lu, W., Dai, H., Zhang, Z. et al. Active instance matching with pairwise constraints and its application to Chinese knowledge base construction. Knowl Inf Syst 55, 171–214 (2018). https://doi.org/10.1007/s10115-017-1076-7
DOI: https://doi.org/10.1007/s10115-017-1076-7