Mis-categorized entities detection

  • Regular Paper
  • The VLDB Journal

Abstract

Entity categorization, the process of categorizing entities into groups, is an important problem with many applications. In practice, however, many entities are mis-categorized, for example on Google Scholar and among Amazon products. In this paper, we study the problem of discovering mis-categorized entities in a given group of categorized entities. This problem is inherently hard: all entities within the same group have already been “well” categorized by state-of-the-art solutions, so differentiating them is nontrivial. We propose a novel rule-based framework to solve this problem. It first uses positive rules to compute disjoint partitions of entities, where the largest partition is taken as the correctly categorized partition, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in the other partitions, that is, entities that are dissimilar to the entities in the pivot partition. We describe optimizations for applying these rules and discuss how to generate positive and negative rules. In addition, we propose novel strategies to resolve inconsistent rules. Extensive experimental results on real-world datasets show the effectiveness of our solution.
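The two-phase framework described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the entity attributes (`coauthors`, `venue`), the two example rules, and the decision to flag an entity as soon as it satisfies a negative rule with any pivot entity are all assumptions made for illustration, and the quadratic pairwise loop ignores the optimizations the paper mentions.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Entity = Dict[str, object]
Rule = Callable[[Entity, Entity], bool]


def partition(entities: List[Entity], positive_rules: List[Rule]) -> List[List[int]]:
    """Group entity indices into disjoint partitions using positive rules."""
    parent = list(range(len(entities)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            if find(i) != find(j) and any(
                rule(entities[i], entities[j]) for rule in positive_rules
            ):
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(entities)):
        groups[find(i)].append(i)
    return list(groups.values())


def find_miscategorized(
    entities: List[Entity], positive_rules: List[Rule], negative_rules: List[Rule]
) -> List[int]:
    """Return indices of entities flagged against the pivot partition."""
    parts = partition(entities, positive_rules)
    pivot = max(parts, key=len)  # largest partition is taken as correct
    flagged = []
    for part in parts:
        if part is pivot:
            continue
        for idx in part:
            # Assumption: one satisfied negative rule with any pivot entity suffices.
            if any(
                rule(entities[idx], entities[p])
                for rule in negative_rules
                for p in pivot
            ):
                flagged.append(idx)
    return sorted(flagged)


# Hypothetical rules over publications listed on one author's page.
def share_coauthor(a: Entity, b: Entity) -> bool:
    return bool(a["coauthors"] & b["coauthors"])


def unrelated(a: Entity, b: Entity) -> bool:
    return not (a["coauthors"] & b["coauthors"]) and a["venue"] != b["venue"]


papers = [
    {"coauthors": {"Tang", "Li"}, "venue": "VLDB"},
    {"coauthors": {"Li", "Feng"}, "venue": "SIGMOD"},
    {"coauthors": {"Feng"}, "venue": "ICDE"},
    {"coauthors": {"Smith"}, "venue": "Nature"},
]
flagged = find_miscategorized(papers, [share_coauthor], [unrelated])
print(flagged)
```

In this toy input the first three papers are chained together by shared coauthors and form the pivot partition, while the fourth shares nothing with them and is flagged.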

Notes

  1. https://scholar.google.com/citations?view_op=top_venues.

  2. Here, we use exact string matching as an example. Approximate matching based on similarity functions can also be used.

  3. We assign each entity a partition ID and use a union-find data structure. If e and \(e'\) are verified to satisfy a positive rule, we set their partition IDs to the same value: assuming the partition ID of e is i, that of \(e'\) is j, and \(i<j\), we change the partition ID of \(e'\) to i.

  4. \(\lambda \) is a constant that determines the approximation ratio. It has been proved in [22] that \(\lambda \) must be greater than 2.

  5. We gathered the Google Scholar data at the end of the year 2016, which may be dirtier than the current Google Scholar pages. However, mis-categorized entities still exist in the current version due to the paper assignment method applied in Google Scholar.
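The partition-ID bookkeeping in Note 3 can be sketched as a small union-find structure in which the smaller ID always wins a merge. This is an illustrative sketch, not the authors' code: the class name `PartitionIds` and its methods are hypothetical, and path compression is added as a standard optimization that the note does not mention.

```python
class PartitionIds:
    """Union-find in which every partition is identified by its smallest entity ID."""

    def __init__(self, n: int) -> None:
        # Initially each entity is its own partition, so ID == index.
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        """Return the partition ID of entity x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> None:
        """Merge the partitions of a and b; the smaller ID is kept."""
        i, j = self.find(a), self.find(b)
        if i == j:
            return
        if i > j:
            i, j = j, i
        self.parent[j] = i  # partition j is relabeled to i, with i < j


ids = PartitionIds(5)
ids.union(3, 1)  # entities 3 and 1 satisfy some positive rule
ids.union(4, 3)  # entity 4 joins the same partition transitively
print([ids.find(x) for x in range(5)])
```

Because a merge always attaches the larger root to the smaller one, the partition ID returned by `find` is the minimum entity ID in that partition, which matches the \(i<j\) convention in the note.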

References

  1. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: SIGKDD (2006)

  2. Aggarwal, C.C.: Outlier ensembles: position paper. ACM SIGKDD Explor. Newslett. 14(2), 49–58 (2013)

  3. Alhelbawy, A., Gaizauskas, R.: Graph ranking for collective named entity disambiguation. In: Annual Meeting of the Association for Computational Linguistics (2014)

  4. Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: ICDE (2009)

  5. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Mach. Learn. 56(1–3), 89–113 (2004)

  6. Bellare, K., Iyengar, S., Parameswaran, A.G., Rastogi, V.: Active sampling for entity matching. In: SIGKDD (2012)

  7. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

  8. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1), 5 (2007)

  9. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: SIGKDD (2003)

  10. Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD (2000)

  11. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)

  12. Campos, G.O., Zimek, A., Sander, J., Campello, R.J., Micenková, B., Schubert, E., Assent, I., Houle, M.E.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30(4), 891–927 (2016)

  13. Chai, C., Li, G., Li, J., Deng, D., Feng, J.: Cost-effective crowdsourced entity resolution: a partial-order approach. In: SIGMOD

  14. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM TIST 2(3), 27 (2011)

  15. Charikar, M., Guruswami, V., Wirth, A.: Clustering with qualitative information. J. Comput. Syst. Sci. 71(3), 360–383 (2005)

  16. Chawla, S., Makarychev, K., Schramm, T., Yaroslavtsev, G.: Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 219–228. ACM (2015)

  17. Chu, X., Ilyas, I.F., Koutris, P.: Distributed data deduplication. In: PVLDB (2016)

  18. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: EMNLP-CoNLL (2007)

  19. Cunningham, P., Delany, S.J.: k-nearest neighbour classifiers. Multiple Classif. Syst. 34(8), 1–17 (2007)

  20. Das, R., Zaheer, M., Dyer, C.: Gaussian LDA for topic models with word embeddings. In: ACL (2015)

  21. Das, S., GC, P.S., Doan, A., Naughton, J.F., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V., Park, Y.: Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In: SIGMOD (2017)

  22. Demaine, E.D., Emanuel, D., Fiat, A., Immorlica, N.: Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361(2–3), 172–187 (2006)

  23. Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 1–13. Springer (2003)

  24. Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: WWW (2012)

  25. Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. In: VLDB (2018)

  26. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  27. Eshel, Y., Cohen, N., Radinsky, K., Markovitch, S., Yamada, I., Levy, O.: Named entity disambiguation for noisy text. In: CoNLL (2017)

  28. Fellegi, I., Sunter, A.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  29. Francis-Landau, M., Durrett, G., Klein, D.: Capturing semantic similarity for entity linking with convolutional neural networks. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1256–1261 (2016)

  30. Gabrilovich, E., Markovitch, S. et al.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI (2007)

  31. Gentile, A.L., Zhang, Z., Xia, L., Iria, J.: Graph-based semantic relatedness for named entity disambiguation. In: International Conference on Software, Services and Semantic Technologies (2009)

  32. Gokhale, C., Das, S., Doan, A., Naughton, J.F., Rampalli, N., Shavlik, J., Zhu, X.: Corleone: Hands-off crowdsourcing for entity matching. In: SIGMOD (2014)

  33. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D. et al.: Approximate string joins in a database (almost) for free. In: VLDB (2001)

  34. Hakimov, S., Oto, S.A., Dogdu, E.: Named entity recognition and disambiguation using linked data and graph-based centrality scoring. In: International workshop on semantic web information management. ACM (2012)

  35. Hao, S., Tang, N., Li, G., Feng, J.: Discovering mis-categorized entities. In: ICDE (2018)

  36. Hao, S., Xu, Y., Tang, N., Li, G., Feng, J.: Cleaning your wrong Google Scholar entries. In: ICDE demo (2018)

  37. Hazman, M., El-Beltagy, S.R., Rafea, A.: A survey of ontology learning approaches. Database 7, 6 (2011)

  38. He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recognit. Lett. 24(9–10), 1641–1650 (2003)

  39. Hochba, D.S.: Approximation algorithms for NP-hard problems. ACM Sigact News 28(2), 40–52 (1997)

  40. Hu, Z., Huang, P., Deng, Y., Gao, Y., Xing, E.: Entity hierarchy embedding. In: ACL (2015)

  41. Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. In: PVLDB (2014)

  42. Karpinski, M., Schudy, W.: Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems. In: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 313–322. ACM (2009)

  43. Kolb, L., Thor, A., Rahm, E.: Dedoop: efficient deduplication with Hadoop. In: PVLDB (2012)

  44. Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: Program Committee Workshop on Management of Uncertain Data, p. 3 (2008)

  45. Liaw, A., Wiener, M., et al.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)

  46. Lippmann, R.P.: An introduction to computing with neural nets. IEEE ASSP Mag. 4(2), 4–22 (1987)

  47. McAuley, J., Targett, C., Shi, Q., Van Den Hengel, A.: Image-based recommendations on styles and substitutes. In: SIGIR (2015)

  48. Phua, C., Alahakoon, D., Lee, V.: Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explor. Newslett. 6(1), 50–59 (2004)

  49. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, Amsterdam (2014)

  50. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: ACM Sigmod Record, vol. 29, pp. 427–438. ACM (2000)

  51. Rish, I. et al.: An empirical study of the Naive Bayes classifier. In: IJCAI workshop (2001)

  52. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD (2002)

  53. Singh, R., Meduri, V.V., Elmagarmid, A.K., Madden, S., Papotti, P., Quiané-Ruiz, J., Solar-Lezama, A., Tang, N.: Synthesizing entity matching rules by examples. In: PVLDB (2017)

  54. Singla, P., Domingos, P.: Entity resolution with Markov logic. In: ICDM (2006)

  55. Steinwart, I., Hush, D., Scovel, C.: A classification framework for anomaly detection. J. Mach. Learn. Res. 6(Feb), 211–232 (2005)

  56. Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., Wang, X.: Modeling mention, context and entity with neural networks for entity disambiguation. In: IJCAI (2015)

  57. Swamy, C.: Correlation clustering: maximizing agreements via semidefinite programming. In: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 526–527. Society for Industrial and Applied Mathematics (2004)

  58. Vesdapunt, N., Bellare, K., Dalvi, N.N.: Crowdsourcing algorithms for entity resolution. In: Proceedings of VLDB Endow (2015)

  59. Vilalta, R., Ma, S.: Predicting rare events in temporal domains. In: ICDM (2002)

  60. Wang, J., Kraska, T., Franklin, M.J., Feng, J.: Crowder: crowdsourcing entity resolution. In: PVLDB (2012)

  61. Wang, J., Li, G., Kraska, T., Franklin, M. J., Feng, J.: Leveraging transitive relations for crowdsourced joins. In: SIGMOD (2013)

  62. Wang, J., Li, G., Yu, J. X., Feng, J.: Entity matching: How similar is similar. In: PVLDB (2011)

  63. Weiss, G. M., Hirsh, H.: Learning to predict rare events in event sequences. In: KDD (1998)

  64. Yamada, I., Shindo, H., Takeda, H., Takefuji, Y.: Joint learning of the embedding of words and entities for named entity disambiguation. In: The SIGNLL Conference on Computational Natural Language Learning (2016)

  65. Zimek, A., Campello, R.J., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions a position paper. ACM SIGKDD Explor. Newslett. 15(1), 11–22 (2014)

Acknowledgements

This work was supported by NSF of China (Grant Nos. 61902017, 61925205, 61632016, 61521002, 61661166012), Huawei, TAL Education Group, China Postdoctoral Science Foundation (2019M650468) and China Scholarship Council. Ning Wang’s work was partially supported by the National Key R&D Program of China (2018YFC0809800), the National Basic Research Program of China (973 Program) (Grant No. 2015CB358700), and the Fundamental Research Funds for the Central Universities (Grant No. 2019RC015).

Author information

Corresponding author

Correspondence to Guoliang Li.

About this article

Cite this article

Hao, S., Tang, N., Li, G. et al. Mis-categorized entities detection. The VLDB Journal 30, 515–536 (2021). https://doi.org/10.1007/s00778-021-00653-w

  • DOI: https://doi.org/10.1007/s00778-021-00653-w
