Abstract
Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods using structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. Increasingly, databases contain an unstructured textual attribute holding extra consolidated textual information (CText) about each record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between pieces of CText along the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help us decide the weights of the different sub-topics in EM. Our empirical study on two real-world datasets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
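To make the idea of comparing CText along several weighted sub-topic dimensions concrete, the following is a minimal illustrative sketch and not the model proposed in the paper. It assumes that sub-topics are already available as hand-written word sets (standing in for the output of a topic model) and that the per-sub-topic weights are given (standing in for crowd-derived weights). The function names, the example sub-topics, and the weights are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector carries no evidence for this sub-topic
    return dot / (norm_a * norm_b)


def subtopic_vectors(text: str, subtopics: dict) -> dict:
    """Project a text onto each sub-topic by counting that sub-topic's words.

    Assumption: `subtopics` maps a name to a set of words; in practice these
    would come from a topic model rather than being written by hand.
    """
    tokens = text.lower().split()
    return {
        name: Counter(t for t in tokens if t in vocab)
        for name, vocab in subtopics.items()
    }


def weighted_similarity(text_a: str, text_b: str,
                        subtopics: dict, weights: dict) -> float:
    """Weighted average of the per-sub-topic cosine similarities.

    Assumption: `weights` holds one non-negative weight per sub-topic,
    here supplied by hand in place of weights decided by the crowd.
    """
    va = subtopic_vectors(text_a, subtopics)
    vb = subtopic_vectors(text_b, subtopics)
    total = sum(weights[name] for name in subtopics)
    if total == 0:
        return 0.0
    score = sum(weights[name] * cosine(va[name], vb[name]) for name in subtopics)
    return score / total


# Hypothetical sub-topics and weights for product descriptions.
SUBTOPICS = {
    "display": {"screen", "inch", "resolution"},
    "battery": {"battery", "mah", "hours"},
}
WEIGHTS = {"display": 0.7, "battery": 0.3}
```

Under these assumptions, two texts that agree on a heavily weighted sub-topic score higher than two texts that agree only on a lightly weighted one, which is the effect that per-sub-topic weighting is meant to capture.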
Cite this article
Li, ZX., Yang, Q., Liu, A. et al. Crowd-Guided Entity Matching with Consolidated Textual Data. J. Comput. Sci. Technol. 32, 858–876 (2017). https://doi.org/10.1007/s11390-017-1769-0
DOI: https://doi.org/10.1007/s11390-017-1769-0