Active transfer learning of matching query results across multiple sources

  • Research Article
  • Frontiers of Computer Science

Abstract

Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, and most perform best under supervised learning. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, unique combinations of noisy data and missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share the learned common structure of similarity-scoring problems across multiple sources. Although such techniques reduce the labeling cost so that it is linear in the number of sources, their random sampling strategy does not adequately handle the common problem of sample imbalance. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching across numerous and varied sources.
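
The idea of jointly selecting the most informative samples from all sources can be illustrated with a minimal sketch. The abstract does not state the paper's actual selection criterion or its transfer step, so this example substitutes plain uncertainty sampling (pick the candidate pairs whose predicted match probability is closest to 0.5) and omits knowledge transfer entirely. The names `uncertainty`, `select_queries`, and the source labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def uncertainty(p_match: float) -> float:
    """Return a score in [0, 1]; 1.0 means the classifier is maximally unsure."""
    return 1.0 - 2.0 * abs(p_match - 0.5)


def select_queries(
    scores: Dict[str, List[float]], budget: int
) -> List[Tuple[str, int]]:
    """Pick the `budget` most uncertain candidate pairs across ALL sources.

    `scores` maps a (hypothetical) source name to the predicted match
    probabilities of that source's unlabeled candidate record pairs.
    Returns (source, pair_index) tuples to send to a human labeler.
    """
    # Pool every candidate from every source so the budget is spent jointly,
    # rather than giving each source a fixed per-source quota.
    pool = [
        (uncertainty(p), source, idx)
        for source, probs in scores.items()
        for idx, p in enumerate(probs)
    ]
    # Most uncertain first; ties are broken by source name, then pair index.
    pool.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(source, idx) for _, source, idx in pool[:budget]]


# Illustrative probabilities only; these are not data from the paper.
predicted = {
    "source_a": [0.02, 0.48, 0.97],
    "source_b": [0.55, 0.10],
    "source_c": [0.99, 0.01, 0.70],
}
picked = select_queries(predicted, budget=3)
```

Pooling candidates before ranking means a source whose pairs are all confidently classified receives no labeling effort, which is one simple way a joint budget can differ from random or per-source sampling.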

Author information

Corresponding author

Correspondence to Jie Xin.

Additional information

Jie Xin received her MS in computer science from Queen Mary University of London, England, in 2006. She is currently a PhD candidate at Soochow University. Her research interests include information retrieval and integration, and data mining.

Zhiming Cui is a professor and doctoral supervisor at the Institute of Computer Science and Technology, Soochow University, China, and a CCF senior member. His research areas include intelligent information processing, image processing, distributed computing, and deep Web data mining.

Pengpeng Zhao is an associate professor in the Department of Computer Science and Technology at Soochow University, China. He received his PhD in computer science from Soochow University in 2008. His main research interests are in the study of the management, retrieval, and mining of information on the World-Wide Web.

Tianxu He received his MS in computer science from Soochow University, China in 2006. He is a PhD candidate at Soochow University. His research interests include data mining and knowledge discovery in databases.

About this article

Cite this article

Xin, J., Cui, Z., Zhao, P. et al. Active transfer learning of matching query results across multiple sources. Front. Comput. Sci. 9, 595–607 (2015). https://doi.org/10.1007/s11704-015-4068-3

  • DOI: https://doi.org/10.1007/s11704-015-4068-3
