Understanding a bag of words by conceptual labeling with prior weights

Abstract

In many natural language processing tasks, e.g., text classification and information extraction, the weighted bag-of-words model is widely used to represent the semantics of text, where the importance of each word is quantified by its weight. However, it is still difficult for machines to understand a weighted bag of words (WBoW) without explicit explanations, which seriously limits its use in downstream tasks. To help machines better understand a WBoW, we introduce the task of conceptual labeling, which aims to generate the minimum number of concepts as labels that explicitly represent and explain the semantics of a WBoW. Specifically, we first propose three principles for label generation and model each principle as an objective function; satisfying all three principles simultaneously is then formulated as a multi-objective optimization problem. In our framework, a taxonomy (the Microsoft Concept Graph) provides high-quality candidate concepts, and a corresponding search algorithm is proposed to derive the optimal solution, i.e., a small set of proper concepts to serve as labels. Furthermore, two pruning strategies are proposed to reduce the search space and improve performance. Our experimental results show that the proposed method generates proper labels for WBoWs. In addition, applying the generated labels to text classification yields an increase in performance, which further demonstrates the effectiveness of our conceptual labeling framework.
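
To make the framework concrete, below is a minimal sketch in Python of how a label set for a WBoW might be scored and searched for. The abstract does not spell out the three principles or their objective functions, so the coverage, specificity, and minimality terms, the toy is-a taxonomy, and the trade-off weights alpha, beta, and gamma are all illustrative assumptions; the brute-force enumeration likewise stands in for the paper's actual search algorithm and pruning strategies.

```python
# A minimal sketch of conceptual labeling for a weighted bag of words (WBoW).
# Everything below is illustrative: the toy is-a taxonomy, the three scoring
# terms, and the trade-off weights alpha/beta/gamma are assumptions standing
# in for the paper's actual objective functions.

from itertools import combinations

# Toy is-a taxonomy mapping a concept to the instances it covers; the paper
# uses the Microsoft Concept Graph in this role.
TAXONOMY = {
    "politician": {"obama", "merkel", "trump"},
    "person":     {"obama", "merkel", "trump", "einstein", "picasso"},
    "flower":     {"rose", "tulip"},
    "plant":      {"rose", "tulip", "cactus", "oak"},
}

def coverage(labels, wbow):
    """Fraction of the total word weight covered by the chosen concepts."""
    covered = set().union(*(TAXONOMY[c] for c in labels))
    return sum(w for word, w in wbow.items() if word in covered) / sum(wbow.values())

def specificity(labels):
    """Prefer narrow concepts: penalize concepts that cover many instances."""
    return -sum(len(TAXONOMY[c]) for c in labels)

def score(labels, wbow, alpha=2.0, beta=0.05, gamma=0.1):
    """Scalarize three illustrative principles: cover the weighted words,
    stay specific, and use as few labels as possible (minimality)."""
    return alpha * coverage(labels, wbow) + beta * specificity(labels) - gamma * len(labels)

def label_wbow(wbow, max_labels=3):
    """Exhaustive search over small label sets. The paper instead uses a
    dedicated search algorithm plus two pruning strategies to cut this space."""
    candidates = [c for c, inst in TAXONOMY.items() if inst & wbow.keys()]
    best, best_score = None, float("-inf")
    for k in range(1, max_labels + 1):
        for labels in combinations(candidates, k):
            s = score(labels, wbow)
            if s > best_score:
                best, best_score = labels, s
    return best

# A WBoW mixing politicians and a flower; weights encode word importance.
print(label_wbow({"obama": 0.5, "merkel": 0.3, "rose": 0.2}))
# -> ('politician', 'flower') under the illustrative weights above
```

In practice, the candidate concepts drawn from a taxonomy as large as the Microsoft Concept Graph make exhaustive enumeration infeasible, which is why a dedicated search algorithm with pruning is needed.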

Notes

  1. In this paper, we only consider the case where all the words in a bag are entity mentions (e.g., Obama, notebook, rose), because entities are core components in most text analysis tasks.

  2. https://concept.research.microsoft.com/

  3. For simplicity, the words in WBoWs are also referred to as instances.

  4. https://dumps.wikipedia.org/

  5. Note that the noise instances are required to have smaller weights than the non-noise instances.

  6. http://qwone.com/~jason/20Newsgroups/

  7. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

  8. https://docs.microsoft.com/en-us/azure/cognitive-services/entitylinking/home

  9. https://code.google.com/archive/p/word2vec/

Funding

This work was supported by the National Key R&D Program of China (No. 2017YFC1201203), the National NSF of China (No. U1636207), and the Shanghai Science and Technology Innovation Action Plan (No. 19511120400).

Author information

Corresponding author

Correspondence to Yanghua Xiao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, H., Yang, D., Xiao, Y. et al. Understanding a bag of words by conceptual labeling with prior weights. World Wide Web 23, 2429–2447 (2020). https://doi.org/10.1007/s11280-020-00806-x
