Mis-categorized entities detection

Hao, Shuang; Tang, Nan; Li, Guoliang; Feng, Jianhua; Wang, Ning

doi:10.1007/s00778-021-00653-w

Regular Paper
Published: 06 March 2021

The VLDB Journal, Volume 30, pages 515–536 (2021)

Shuang Hao¹,², Nan Tang³, Guoliang Li² (ORCID: orcid.org/0000-0002-1398-0621), Jianhua Feng² & Ning Wang¹

Abstract

Entity categorization, the process of assigning entities to groups, is an important problem with many applications. In practice, however, many entities are mis-categorized, for example entries on Google Scholar and products on Amazon. In this paper, we study the problem of discovering mis-categorized entities within a given group of categorized entities. This problem is inherently hard: all entities within the same group have already been categorized “well” by state-of-the-art solutions, so it is nontrivial to tell them apart. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the largest partition is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in the other partitions, that is, entities that are dissimilar to the entities in the pivot partition. We describe optimizations for applying these rules and discuss how to generate positive and negative rules. In addition, we propose novel strategies to resolve inconsistent rules. Extensive experimental results on real-world datasets show the effectiveness of our solution.
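The two-phase framework described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the entity fields (`authors`, `words`), the two rule definitions and their thresholds, and the choice to flag an entity only when the negative rule holds against every pivot entity (flagging on any pivot entity would be an equally plausible reading).

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def positive_rule(e1, e2):
    # Hypothetical positive rule: sharing at least two co-authors
    # suggests the two entities belong to the same category.
    return len(e1["authors"] & e2["authors"]) >= 2


def negative_rule(e1, e2):
    # Hypothetical negative rule: no shared co-author and almost no
    # overlap in title words suggests the two entities are dissimilar.
    no_shared_author = not (e1["authors"] & e2["authors"])
    return no_shared_author and jaccard(e1["words"], e2["words"]) < 0.1


def partition(entities):
    """Disjoint partitions: connected components of the graph whose
    edges are the entity pairs that satisfy the positive rule."""
    neighbors = defaultdict(set)
    for i, j in combinations(range(len(entities)), 2):
        if positive_rule(entities[i], entities[j]):
            neighbors[i].add(j)
            neighbors[j].add(i)
    seen, parts = set(), []
    for start in range(len(entities)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        parts.append(sorted(component))
    return parts


def detect_miscategorized(entities):
    parts = partition(entities)
    pivot = max(parts, key=len)  # largest partition is taken as correct
    flagged = []
    for part in parts:
        if part is pivot:
            continue
        for idx in part:
            # Assumption: flag only if dissimilar to every pivot entity.
            if all(negative_rule(entities[idx], entities[p]) for p in pivot):
                flagged.append(idx)
    return pivot, sorted(flagged)


entities = [
    {"authors": {"a", "b", "c"}, "words": {"entity", "matching"}},
    {"authors": {"a", "b"}, "words": {"entity", "resolution"}},
    {"authors": {"a", "b", "c"}, "words": {"data", "cleaning"}},
    {"authors": {"x", "y"}, "words": {"protein", "folding"}},
    {"authors": {"a"}, "words": {"entity", "linking"}},
]
pivot, flagged = detect_miscategorized(entities)
print(pivot, flagged)
```

In this toy input, entities 0 to 2 form the pivot partition, entity 3 is flagged, and entity 4 is left undecided because it shares an author with the pivot partition. The quadratic pairwise comparison is the naive baseline; the optimizations mentioned in the abstract are not reproduced here.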

Notes

https://scholar.google.com/citations?view_op=top_venues.
Here, we use exact string matching as an example; approximate matching based on similarity functions can also be used.
We assign each entity a partition ID and use a union-find data structure. If e and \(e'\) are verified to satisfy a positive rule, we set their partition IDs to the same value: assuming the partition ID of e is i, that of \(e'\) is j, and \(i<j\), we change the partition ID of \(e'\) to i.
\(\lambda \) is a constant that determines the approximation ratio. It has been proved in [22] that \(\lambda \) must be greater than 2.
We gathered the Google Scholar data at the end of 2016, so it may be dirtier than the current Google Scholar pages. However, mis-categorized entities still exist in the current version because of the paper assignment method that Google Scholar applies.
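The partition-ID note above can be sketched as a small union-find structure in which the smaller ID always wins a merge. The class name `Partitions` and its method names are hypothetical, and path compression is an added convenience that the note does not mention.

```python
class Partitions:
    """Union-find in which every partition is identified by the
    smallest entity index it contains."""

    def __init__(self, n):
        # Initially each entity forms its own partition.
        self.pid = list(range(n))

    def find(self, e):
        # Walk up to the representative of e's partition.
        root = e
        while self.pid[root] != root:
            root = self.pid[root]
        # Path compression: point every visited node directly at the root.
        while self.pid[e] != root:
            self.pid[e], e = root, self.pid[e]
        return root

    def union(self, e1, e2):
        # Called when (e1, e2) has been verified to satisfy a positive rule.
        i, j = self.find(e1), self.find(e2)
        if i == j:
            return
        if i > j:
            i, j = j, i
        self.pid[j] = i  # the larger partition ID is replaced by the smaller


p = Partitions(5)
p.union(3, 4)
p.union(1, 3)
p.union(0, 2)
ids = [p.find(e) for e in range(5)]
print(ids)
```

Because a merge always keeps the smaller of the two representative IDs, the final ID of a partition does not depend on the order in which positive-rule matches are processed.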

References

Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: SIGKDD (2006)
Aggarwal, C.C.: Outlier ensembles: position paper. ACM SIGKDD Explor. Newslett. 14(2), 49–58 (2013)
Alhelbawy, A., Gaizauskas, R.: Graph ranking for collective named entity disambiguation. In: Annual Meeting of the Association for Computational Linguistics (2014)
Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: ICDE (2009)
Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)
Bellare, K., Iyengar, S., Parameswaran, A.G., Rastogi, V.: Active sampling for entity matching. In: SIGKDD (2012)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1), 5 (2007)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: SIGKDD (2003)
Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: Lof: identifying density-based local outliers. In: SIGMOD (2000)
Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)
Campos, G.O., Zimek, A., Sander, J., Campello, R.J., Micenková, B., Schubert, E., Assent, I., Houle, M.E.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30(4), 891–927 (2016)
Chai, C., Li, G., Li, J., Deng, D., Feng, J.: Cost-effective crowdsourced entity resolution: a partial-order approach. In: SIGMOD
Chang, C.-C., Lin, C.-J.: Libsvm: a library for support vector machines. ACM TIST 2(3), 27 (2011)
Charikar, M., Guruswami, V., Wirth, A.: Clustering with qualitative information. J. Comput. Syst. Sci. 71(3), 360–383 (2005)
Chawla, S., Makarychev, K., Schramm, T., Yaroslavtsev, G.: Near optimal lp rounding algorithm for correlation clustering on complete and complete k-partite graphs. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 219–228. ACM (2015)
Chu, X., Ilyas, I.F., Koutris, P.: Distributed data deduplication. In: PVLDB (2016)
Cucerzan, S.: Large-scale named entity disambiguation based on wikipedia data. In: EMNLP-CoNLL (2007)
Cunningham, P., Delany, S.J.: k-nearest neighbour classifiers. Multiple Classif. Syst. 34(8), 1–17 (2007)
Das, R., Zaheer, M., Dyer, C.: Gaussian LDA for topic models with word embeddings. In: ACL (2015)
Das, S., GC, P.S., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y.: Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In: SIGMOD (2017)
Demaine, E.D., Emanuel, D., Fiat, A., Immorlica, N.: Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361(2–3), 172–187 (2006)
Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 1–13. Springer (2003)
Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: WWW (2012)
Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. In: VLDB (2018)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Eshel, Y., Cohen, N., Radinsky, K., Markovitch, S., Yamada, I., Levy, O.: Named entity disambiguation for noisy text. In: CoNLL (2017)
Fellegi, I., Sunter, A.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Francis-Landau, M., Durrett, G., Klein, D.: Capturing semantic similarity for entity linking with convolutional neural networks. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1256–1261 (2016)
Gabrilovich, E., Markovitch, S. et al.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: IJCAI (2007)
Gentile, A.L., Zhang, Z., Xia, L., Iria, J.: Graph-based semantic relatedness for named entity disambiguation. In: International Conference on Software, Services and Semantic Technologies (2009)
Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: Hands-off crowdsourcing for entity matching. In: SIGMOD (2014)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D. et al.: Approximate string joins in a database (almost) for free. In: VLDB (2001)
Hakimov, S., Oto, S.A., Dogdu, E.: Named entity recognition and disambiguation using linked data and graph-based centrality scoring. In: International workshop on semantic web information management. ACM (2012)
Hao, S., Tang, N., Li, G., Feng, J.: Discovering mis-categorized entities. In: ICDE (2018)
Hao, S., Xu, Y., Tang, N., Li, G., Feng, J.: Cleaning your wrong google scholar entries. In: ICDE demo (2018)
Hazman, M., El-Beltagy, S.R., Rafea, A.: A survey of ontology learning approaches. Database 7, 6 (2011)
He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recognit. Lett. 24(9–10), 1641–1650 (2003)
Hochba, D.S.: Approximation algorithms for np-hard problems. ACM Sigact News 28(2), 40–52 (1997)
Hu, Z., Huang, P., Deng, Y., Gao, Y., Xing, E.: Entity hierarchy embedding. In: ACL (2015)
Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. In: PVLDB (2014)
Karpinski, M., Schudy, W.: Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems. In: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 313–322. ACM (2009)
Kolb, L., Thor, A., Rahm, E.: Dedoop: efficient deduplication with hadoop. In: PVLDB (2012)
Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: Program Committee Workshop on Management of Uncertain Data, p. 3 (2008)
Liaw, A., Wiener, M., et al.: Classification and regression by randomforest. R News 2(3), 18–22 (2002)
Lippmann, R.P.: An introduction to computing with neural nets. IEEE ASSP Mag. 4(2), 4–22 (1987)
McAuley, J., Targett, C., Shi, Q., Van Den Hengel, A.: Image-based recommendations on styles and substitutes. In: SIGIR (2015)
Phua, C., Alahakoon, D., Lee, V.: Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explor. Newslett. 6(1), 50–59 (2004)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: ACM Sigmod Record, vol. 29, pp. 427–438. ACM (2000)
Rish, I. et al.: An empirical study of the Naive Bayes classifier. In: IJCAI workshop (2001)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD (2002)
Singh, R., Meduri, V.V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Synthesizing entity matching rules by examples. In: PVLDB (2017)
Singla, P., Domingos, P.: Entity resolution with Markov logic. In: ICDM (2006)
Steinwart, I., Hush, D., Scovel, C.: A classification framework for anomaly detection. J. Mach. Learn. Res. 6(Feb), 211–232 (2005)
Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., Wang, X.: Modeling mention, context and entity with neural networks for entity disambiguation. In: IJCAI (2015)
Swamy, C.: Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 526–527. Society for Industrial and Applied Mathematics (2004)
Vesdapunt, N., Bellare, K., Dalvi, N.N.: Crowdsourcing algorithms for entity resolution. In: Proceedings of VLDB Endow (2015)
Vilalta, R., Ma, S.: Predicting rare events in temporal domains. In: ICDM (2002)
Wang, J., Kraska, T., Franklin, M.J., Feng, J.: Crowder: crowdsourcing entity resolution. In: PVLDB (2012)
Wang, J., Li, G., Kraska, T., Franklin, M. J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD (2013)
Wang, J., Li, G., Yu, J. X., Feng, J.: Entity matching: How similar is similar. In: PVLDB (2011)
Weiss, G. M., Hirsh, H.: Learning to predict rare events in event sequences. In: KDD (1998)
Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Joint learning of the embedding of words and entities for named entity disambiguation. In: The SIGNLL Conference on Computational Natural Language Learning (2016)
Zimek, A., Campello, R.J., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions a position paper. ACM SIGKDD Explor. Newslett. 15(1), 11–22 (2014)

Acknowledgements

This work was supported by NSF of China (Grant Nos. 61902017, 61925205, 61632016, 61521002, 61661166012), Huawei, TAL Education Group, China Postdoctoral Science Foundation (2019M650468) and China Scholarship Council. Ning Wang’s work was supported in part by the National Key R&D Program of China (2018YFC0809800), the National Basic Research Program of China (973 Program) (Grant No. 2015CB358700), and the Fundamental Research Funds for the Central Universities (Grant No. 2019RC015).

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Shuang Hao & Ning Wang
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Shuang Hao, Guoliang Li & Jianhua Feng
Qatar Computing Research Institute, HBKU, Ar-Rayyan, Qatar
Nan Tang

Corresponding author

Correspondence to Guoliang Li.

About this article

Cite this article

Hao, S., Tang, N., Li, G. et al. Mis-categorized entities detection. The VLDB Journal 30, 515–536 (2021). https://doi.org/10.1007/s00778-021-00653-w

Received: 11 December 2019
Revised: 30 May 2020
Accepted: 14 January 2021
Published: 06 March 2021
Issue Date: July 2021
DOI: https://doi.org/10.1007/s00778-021-00653-w
