CTextEM: Using Consolidated Textual Data for Entity Matching

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Abstract

Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that use only structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. Increasingly, databases include an unstructured textual attribute containing extra Consolidated Textual information (CText for short) about each record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words; existing topic models do not work well either, since there are no obvious gaps between the various sub-topics in a CText. In this paper, we study how to employ CText in EM. A baseline algorithm that identifies important phrases with high IDF scores in CTexts and then measures the similarity between CTexts based on these phrases does not work well, since it estimates the similarity in a single dimension and neglects that these phrases belong to different topics. To this end, we propose a novel co-occurrence-based topic model to identify the various sub-topics in each CText, and then measure the similarity between CTexts along the multiple sub-topic dimensions. Our empirical study on two real-world data sets shows that our method outperforms state-of-the-art EM methods and Text Understanding models, achieving higher EM precision and recall.
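
The contrast the abstract draws between a one-dimensional IDF baseline and a per-sub-topic comparison can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses single tokens instead of phrases, Jaccard overlap as the similarity measure, and a hand-written `topic_of` mapping in place of the proposed co-occurrence-based topic model. All function names, the parameter `k`, and the toy records are assumptions made for illustration only.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Inverse document frequency of every token in a small corpus."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def important_terms(doc, idf, k):
    """The k tokens of doc with the highest IDF (ties broken alphabetically)."""
    terms = set(doc.lower().split())
    ranked = sorted(terms, key=lambda t: (-idf[t], t))
    return set(ranked[:k])


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def baseline_similarity(d1, d2, idf, k=4):
    """One-dimensional score: overlap of all high-IDF terms, topics ignored."""
    return jaccard(important_terms(d1, idf, k), important_terms(d2, idf, k))


def per_topic_similarity(d1, d2, idf, topic_of, k=4):
    """One score per sub-topic. topic_of is a hypothetical term -> label map
    standing in for the output of a topic model."""
    t1 = important_terms(d1, idf, k)
    t2 = important_terms(d2, idf, k)
    labels = {topic_of[t] for t in t1 | t2 if t in topic_of}
    return {
        label: jaccard(
            {t for t in t1 if topic_of.get(t) == label},
            {t for t in t2 if topic_of.get(t) == label},
        )
        for label in sorted(labels)
    }


# Toy records (invented): the first two describe the same product.
docs = [
    "canon camera 18mp sensor free shipping warranty",
    "canon camera 18mp sensor fast delivery returns",
    "nikon camera 24mp sensor free shipping warranty",
]
topic_of = {
    "canon": "product", "nikon": "product", "18mp": "product", "24mp": "product",
    "free": "logistics", "shipping": "logistics", "warranty": "logistics",
    "fast": "logistics", "delivery": "logistics", "returns": "logistics",
}
idf = idf_scores(docs)
print(baseline_similarity(docs[0], docs[1], idf))
print(baseline_similarity(docs[0], docs[2], idf))
print(per_topic_similarity(docs[0], docs[1], idf, topic_of))
print(per_topic_similarity(docs[0], docs[2], idf, topic_of))
```

On these toy records the single baseline score ranks the first record closer to the third (a different product with the same shipping terms) than to the second, while the per-topic scores separate agreement on the product from agreement on logistics. This is only meant to show why collapsing all important terms into one dimension can mislead matching.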

Acknowledgements

This research is partially supported by the Natural Science Foundation of China (Grant Nos. 61303019, 61402313, 61472263, 61572336), the Postdoctoral Scientific Research Funding of Jiangsu Province (No. 1501090B), the 58th batch of National Postdoctoral Funding (No. 2015M581859), and the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China.

Author information

Corresponding author

Correspondence to Zhixu Li.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Yang, Q. et al. (2016). CTextEM: Using Consolidated Textual Data for Entity Matching. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_8

  • DOI: https://doi.org/10.1007/978-3-319-32025-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-32024-3

  • Online ISBN: 978-3-319-32025-0

  • eBook Packages: Computer Science (R0)
