Abstract
Entity Matching (EM) identifies records that refer to the same entity within or across databases. Existing methods that rely only on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words. Existing topic models do not work well either, since there are no obvious gaps between the various sub-topics in a CText. In this paper, we work on employing CText in EM. A baseline algorithm that identifies important phrases with high IDF scores in the CTexts and then measures the similarity between CTexts based on these phrases does not work well, since it estimates the similarity in one dimension and neglects that these phrases belong to different topics. To address this, we propose a novel cooccurrence-based topic model that identifies the various sub-topics in each CText, and then measure the similarity between CTexts along the multiple sub-topic dimensions. Our empirical study on two real-world data sets shows that our method outperforms state-of-the-art EM methods and Text Understanding models, reaching higher EM precision and recall.
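The baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single tokens stand in for phrases, a plain log-ratio IDF, a fixed `top_k` cutoff, and Jaccard overlap as the similarity measure, none of which are specified in the abstract. All function names are hypothetical.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Compute IDF for every token; docs is a list of token lists."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token once per document
    return {t: math.log(n / df[t]) for t in df}


def important_terms(tokens, idf, top_k):
    """Keep the top_k distinct tokens of one text, ranked by IDF.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(set(tokens), key=lambda t: (-idf[t], t))
    return set(ranked[:top_k])


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def baseline_similarity(tokens_a, tokens_b, idf, top_k=3):
    """Single-score similarity over high-IDF terms, ignoring topics."""
    return jaccard(
        important_terms(tokens_a, idf, top_k),
        important_terms(tokens_b, idf, top_k),
    )


# Toy CTexts (hypothetical data, far shorter than real CTexts).
d1 = "the camera has a zoom lens".split()
d2 = "the camera has a wide lens".split()
d3 = "the phone has a big screen".split()
idf = idf_scores([d1, d2, d3])

print(baseline_similarity(d1, d2, idf))  # 0.5
print(baseline_similarity(d1, d3, idf))  # 0.0
```

The function returns one number per pair of texts, which is the one-dimensional estimate the abstract criticizes: all high-IDF terms are pooled together regardless of which sub-topic they belong to.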
Acknowledgements
This research is partially supported by the Natural Science Foundation of China (Grant Nos. 61303019, 61402313, 61472263, 61572336), the Postdoctoral Scientific Research Funding of Jiangsu Province (No. 1501090B), the 58th batch of the National Postdoctoral Funding (No. 2015M581859), and the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China.
© 2016 Springer International Publishing Switzerland
Cite this paper
Yang, Q. et al. (2016). CTextEM: Using Consolidated Textual Data for Entity Matching. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_8
DOI: https://doi.org/10.1007/978-3-319-32025-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science (R0)