CTextEM: Using Consolidated Textual Data for Entity Matching

Yang, Qiang; Li, Zhixu; Gu, Binbin; Liu, An; Liu, Guanfeng; Zhao, Pengpeng; Zhao, Lei

doi:10.1007/978-3-319-32025-0_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9642)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that use only structured attribute values (such as numeric, date, or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. Nowadays, more and more databases have unstructured textual attributes containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using CText information for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between the various sub-topics in a CText. In this paper, we work on employing CText in EM. A baseline algorithm that identifies important phrases with high IDF scores from CTexts and then measures the similarity between CTexts based on these phrases does not work well, since it estimates the similarity in one dimension and neglects that these phrases belong to different topics. To this end, we propose a novel co-occurrence-based topic model to identify the various sub-topics in each CText, and then measure the similarity between CTexts on the multiple sub-topic dimensions. Our empirical study on two real-world data sets shows that our method outperforms the state-of-the-art EM methods and Text Understanding models by reaching higher EM precision and recall.
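
To make the one-dimensional baseline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes single whitespace-separated tokens stand in for phrases, that the "important" terms are the top-k by IDF, and that similarity is the Jaccard overlap of those term sets; the function names, the value of k, and the toy CTexts are all hypothetical.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Compute IDF = log(N / document frequency) for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n / count) for tok, count in df.items()}


def top_idf_terms(doc, idf, k):
    """Return the k distinct tokens of doc with the highest IDF (ties broken alphabetically)."""
    ranked = sorted(set(doc), key=lambda tok: (-idf[tok], tok))
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def baseline_similarity(doc_a, doc_b, idf, k=3):
    """Single-score similarity: overlap of the two documents' top-IDF terms.

    Assumption: Jaccard overlap is used here only for illustration; the
    abstract does not state which measure the baseline uses.
    """
    return jaccard(top_idf_terms(doc_a, idf, k), top_idf_terms(doc_b, idf, k))


# Toy CTexts (hypothetical data), already tokenized.
ctexts = [
    "the phone has a retina display and a dual camera".split(),
    "a phone with retina display and dual camera".split(),
    "the laptop has a backlit keyboard".split(),
]
idf = idf_scores(ctexts)
sim_same = baseline_similarity(ctexts[0], ctexts[1], idf)
sim_diff = baseline_similarity(ctexts[0], ctexts[2], idf)
print(sim_same, sim_diff)
```

Because all selected terms are pooled into one set, the score cannot tell whether two CTexts agree on one sub-topic and disagree on another, which is the limitation the abstract attributes to this kind of baseline.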

Acknowledgements

This research is partially supported by the Natural Science Foundation of China (Grant Nos. 61303019, 61402313, 61472263, 61572336), the Postdoctoral scientific research funding of Jiangsu Province (No. 1501090B), the National 58 batch of postdoctoral funding (No. 2015M581859), and the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China.

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, China
Qiang Yang, Zhixu Li, Binbin Gu, An Liu, Guanfeng Liu, Pengpeng Zhao & Lei Zhao

Corresponding author

Correspondence to Zhixu Li.

Editor information

Editors and Affiliations

Georgia Institute of Technology, Atlanta, Georgia, USA
Shamkant B. Navathe
University of Texas at Dallas, Richardson, Texas, USA
Weili Wu
University of Minnesota, Minneapolis, Minnesota, USA
Shashi Shekhar
Renmin University, Beijing, China
Xiaoyong Du
Fudan University, Shanghai, China
X. Sean Wang
Rutgers, The State University of New Jer, New Brunswick, New Jersey, USA
Hui Xiong

About this paper

Cite this paper

Yang, Q. et al. (2016). CTextEM: Using Consolidated Textual Data for Entity Matching. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_8

DOI: https://doi.org/10.1007/978-3-319-32025-0_8
Published: 25 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science, Computer Science (R0)
