Crowd-Guided Entity Matching with Consolidated Textual Data

Li, Zhi-Xu; Yang, Qiang; Liu, An; Liu, Guan-Feng; Zhu, Jia; Xu, Jia-Jie; Zheng, Kai; Zhang, Min

doi:10.1007/s11390-017-1769-0

Regular Paper
Published: 20 September 2017

Journal of Computer Science and Technology, Volume 32, pages 858–876 (2017)

Zhi-Xu Li¹,², Qiang Yang¹, An Liu¹, Guan-Feng Liu¹, Jia Zhu³, Jia-Jie Xu¹, Kai Zheng¹,⁴ & Min Zhang¹

Abstract

Entity matching (EM) identifies records that refer to the same entity within or across databases. Existing methods that rely on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, and existing topic models cannot work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between pieces of CText on the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help decide the weights of different sub-topics in EM. Our empirical study on two real-world datasets, based on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
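The idea of comparing two pieces of CText along several sub-topic dimensions and combining the results with weights can be illustrated with a minimal sketch. This is not the paper's model: it assumes the sub-topics have already been extracted, represents each record as a hypothetical mapping from sub-topic name to a list of words, uses plain cosine similarity per sub-topic, and takes the weights as given (in the paper they are decided with help from the crowd). The function names `cosine` and `weighted_ctext_similarity` are illustrative only.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_ctext_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Combine per-sub-topic similarities using normalized weights.

    rec_a, rec_b: sub-topic name -> list of words (hypothetical layout).
    weights: sub-topic name -> non-negative weight (assumed to be given,
             for example aggregated from crowd answers).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, w in weights.items():
        sim = cosine(Counter(rec_a.get(topic, [])), Counter(rec_b.get(topic, [])))
        score += (w / total) * sim
    return score


# Two made-up product records that agree on one sub-topic and differ on another.
record_a = {"battery": ["long", "battery", "life"], "screen": ["bright", "screen"]}
record_b = {"battery": ["short", "charge"], "screen": ["bright", "screen"]}
crowd_weights = {"battery": 1.0, "screen": 3.0}

print(weighted_ctext_similarity(record_a, record_b, crowd_weights))
```

Because the weights are normalized, raising the weight of a sub-topic on which two records agree pulls the overall score toward 1, which is the effect the crowd-decided weights are meant to have on sub-topics that matter for matching.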

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia-Jie Xu, Kai Zheng & Min Zhang
Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China
Zhi-Xu Li
School of Computer, South China Normal University, Guangzhou, 510631, China
Jia Zhu
Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, 100872, China
Kai Zheng

Corresponding author

Correspondence to An Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 57 kb)

About this article

Cite this article

Li, ZX., Yang, Q., Liu, A. et al. Crowd-Guided Entity Matching with Consolidated Textual Data. J. Comput. Sci. Technol. 32, 858–876 (2017). https://doi.org/10.1007/s11390-017-1769-0

Received: 01 March 2017
Revised: 09 August 2017
Published: 20 September 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11390-017-1769-0
