
Active instance matching with pairwise constraints and its application to Chinese knowledge base construction

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Instance matching is the problem of determining whether two instances describe the same real-world entity. It plays a key role in data integration and data cleansing, especially when building a knowledge base. For example, we can regard each article in an encyclopedia as an instance, and a group of articles that refer to the same real-world object as an entity. Articles about Washington should therefore be distinguished and grouped into different entities such as Washington, D.C. (the capital of the USA), George Washington (first president of the USA), Washington (a state of the USA), Washington (a village in West Sussex, England), Washington F.C. (a football club based in Washington, Tyne and Wear, England), and Washington, D.C. (a novel). In this paper, we propose a novel instance matching approach, Active Instance Matching with Pairwise Constraints, which brings a human into the loop of instance matching. The proposed approach generates candidate pairs in advance to reduce the computational complexity, and then iteratively selects the most informative pairs according to the uncertainty, influence, connectivity and diversity of the pairs. We evaluated our approach on two publicly available datasets, AMINER and NUS-WIDE, and then applied it to two large-scale real-world datasets, Baidu Baike and Hudong Baike, to build a Chinese knowledge base. The experiments and the practical application demonstrate the effectiveness of our approach.
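
The two steps named in the abstract, generating candidate pairs in advance and then selecting informative pairs to ask a human about, can be illustrated with a minimal sketch. It is not the authors' algorithm. The shared-token blocking, the Jaccard similarity, the function names and the toy titles are all assumptions made for illustration, and the selection step ranks pairs by uncertainty only, whereas the paper also considers influence, connectivity and diversity.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(title):
    """Lowercase a title and split it into a set of word tokens."""
    return set(title.lower().replace(",", " ").split())


def candidate_pairs(instances):
    """Blocking (assumed scheme): pair only instances sharing a title token."""
    index = defaultdict(set)
    for instance_id, title in instances.items():
        for token in tokenize(title):
            index[token].add(instance_id)
    pairs = set()
    for ids in index.values():
        # Every two instances in the same block form one candidate pair.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Token-set Jaccard similarity, a stand-in for a learned matcher."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb)


def most_uncertain(pairs, instances, k=1):
    """Return the k pairs whose similarity is closest to 0.5 (assumed threshold)."""
    def distance_to_boundary(pair):
        a, b = pair
        return abs(jaccard(instances[a], instances[b]) - 0.5)
    return sorted(pairs, key=lambda p: (distance_to_boundary(p), p))[:k]


# Hypothetical toy data; the ids and titles are not from the paper's datasets.
instances = {
    "a1": "George Washington",
    "a2": "Washington state",
    "a3": "President George Washington of USA",
    "a4": "Tyne and Wear",
}
pairs = candidate_pairs(instances)
query = most_uncertain(pairs, instances, k=1)
```

With this toy input, blocking keeps three of the six possible pairs, and the pair selected for labeling is the one whose similarity lies nearest the assumed decision boundary. In an active setting, the human answer (must-link or cannot-link) would be fed back as a pairwise constraint before the next selection round.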


Notes

  1. http://dblp.uni-trier.de.

  2. http://citeseer.ist.psu.edu.

  3. http://linkeddata.org/.

  4. State of the LOD Cloud 2014, http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/.

  5. http://www.baike.baidu.com.

  6. http://www.baike.com.

  7. As of November 2015.

  8. https://sourceforge.net/projects/jclal/.

  9. http://www-01.ibm.com/software/commerce/optimization/cplexoptimizer/.

  10. http://www-01.ibm.com/software/commerce/optimization/cplexoptimizer/.

  11. https://www.gnu.org/software/glpk/.

  12. http://www.gurobi.com/.

  13. http://scip.zib.de/.

  14. https://aminer.org/billboard/disambiguation.

  15. http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm.

  16. http://www.ltp-cloud.com/intro/en/.

  17. http://www.cs.waikato.ac.nz/ml/weka/.


Acknowledgements

This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020015), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and the Fundamental Research Funds for the Central Universities (No. 2017FZA5016).

Author information

Corresponding author

Correspondence to Weiming Lu.


About this article


Cite this article

Lu, W., Dai, H., Zhang, Z. et al. Active instance matching with pairwise constraints and its application to Chinese knowledge base construction. Knowl Inf Syst 55, 171–214 (2018). https://doi.org/10.1007/s10115-017-1076-7


  • DOI: https://doi.org/10.1007/s10115-017-1076-7
