Crowd-Guided Entity Matching with Consolidated Textual Data

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods that rely on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. Increasingly, databases contain an unstructured textual attribute holding extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, while existing topic models do not work well because there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between pieces of CText along the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help decide the weights of the different sub-topics when doing EM. Our empirical study on two real-world datasets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
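The idea of comparing two pieces of CText along several sub-topic dimensions and combining the results with weights can be illustrated with a minimal sketch. This is not the model proposed in the paper: it assumes the sub-topics have already been extracted (here each record is simply a hypothetical mapping from sub-topic name to a word list), it uses plain cosine similarity on word counts as a stand-in for the per-sub-topic measure, and the `weights` dictionary stands in for weights that, in the paper, are decided with help from the crowd. All names and example values below are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sub-topic contributes no similarity
    return dot / (norm_a * norm_b)


def weighted_subtopic_similarity(rec_a, rec_b, weights):
    """Weighted average of per-sub-topic similarities.

    rec_a, rec_b: dict mapping sub-topic name -> list of words
                  (assumed to come from some earlier sub-topic extraction step).
    weights:      dict mapping sub-topic name -> non-negative weight
                  (hypothetical stand-in for crowd-derived weights).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, w in weights.items():
        sim = cosine(Counter(rec_a.get(topic, [])), Counter(rec_b.get(topic, [])))
        score += w * sim
    return score / total


# Hypothetical records: same "plot" vocabulary, disjoint "cast" vocabulary.
record_a = {"plot": ["wizard", "school", "magic"], "cast": ["radcliffe", "watson"]}
record_b = {"plot": ["wizard", "school", "magic"], "cast": ["grint"]}
crowd_weights = {"plot": 3.0, "cast": 1.0}

similarity = weighted_subtopic_similarity(record_a, record_b, crowd_weights)
print(round(similarity, 3))
```

With these made-up inputs, the sub-topic that carries the larger weight dominates the combined score, which is the effect the crowd-assigned weights are meant to have.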

Author information

Corresponding author

Correspondence to An Liu.

About this article

Cite this article

Li, ZX., Yang, Q., Liu, A. et al. Crowd-Guided Entity Matching with Consolidated Textual Data. J. Comput. Sci. Technol. 32, 858–876 (2017). https://doi.org/10.1007/s11390-017-1769-0

  • DOI: https://doi.org/10.1007/s11390-017-1769-0
