Understanding a bag of words by conceptual labeling with prior weights

Abstract

In many natural language processing tasks, e.g., text classification and information extraction, the weighted bag-of-words model is widely used to represent the semantics of text, where the importance of each word is quantified by its weight. However, it is still difficult for machines to understand a weighted bag of words (WBoW) without explicit explanations, which seriously limits its use in downstream tasks. To help machines better understand a WBoW, we introduce the task of conceptual labeling, which aims to generate the minimum number of concepts as labels that explicitly represent and explain the semantics of a WBoW. Specifically, we first propose three principles for label generation and model each principle as an objective function; satisfying all three principles simultaneously is then formulated as a multi-objective optimization problem. In our framework, a taxonomy (the Microsoft Concept Graph) provides high-quality candidate concepts, and a corresponding search algorithm is proposed to derive the optimal solution, i.e., a small set of proper concepts to serve as labels. Furthermore, two pruning strategies are proposed to reduce the search space and improve performance. Our experimental results show that the proposed method generates proper labels for WBoWs. In addition, applying the generated labels to text classification yields an increase in performance, which further demonstrates the effectiveness of our conceptual labeling framework.
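
To make the framework concrete, below is a minimal sketch in Python of how a label set for a WBoW might be scored and searched for. The abstract does not spell out the three principles or their objective functions, so the coverage, specificity, and minimality terms, the toy is-a taxonomy, and the trade-off weights alpha, beta, and gamma are all illustrative assumptions; the brute-force enumeration likewise stands in for the paper's actual search algorithm and pruning strategies.

```python
# A minimal sketch of conceptual labeling for a weighted bag of words (WBoW).
# Everything below is illustrative: the toy is-a taxonomy, the three scoring
# terms, and the trade-off weights alpha/beta/gamma are assumptions standing
# in for the paper's actual objective functions.

from itertools import combinations

# Toy is-a taxonomy mapping a concept to the instances it covers; the paper
# uses the Microsoft Concept Graph in this role.
TAXONOMY = {
    "politician": {"obama", "merkel", "trump"},
    "person":     {"obama", "merkel", "trump", "einstein", "picasso"},
    "flower":     {"rose", "tulip"},
    "plant":      {"rose", "tulip", "cactus", "oak"},
}

def coverage(labels, wbow):
    """Fraction of the total word weight covered by the chosen concepts."""
    covered = set().union(*(TAXONOMY[c] for c in labels))
    return sum(w for word, w in wbow.items() if word in covered) / sum(wbow.values())

def specificity(labels):
    """Prefer narrow concepts: penalize concepts that cover many instances."""
    return -sum(len(TAXONOMY[c]) for c in labels)

def score(labels, wbow, alpha=2.0, beta=0.05, gamma=0.1):
    """Scalarize three illustrative principles: cover the weighted words,
    stay specific, and use as few labels as possible (minimality)."""
    return alpha * coverage(labels, wbow) + beta * specificity(labels) - gamma * len(labels)

def label_wbow(wbow, max_labels=3):
    """Exhaustive search over small label sets. The paper instead uses a
    dedicated search algorithm plus two pruning strategies to cut this space."""
    candidates = [c for c, inst in TAXONOMY.items() if inst & wbow.keys()]
    best, best_score = None, float("-inf")
    for k in range(1, max_labels + 1):
        for labels in combinations(candidates, k):
            s = score(labels, wbow)
            if s > best_score:
                best, best_score = labels, s
    return best

# A WBoW mixing politicians and a flower; weights encode word importance.
print(label_wbow({"obama": 0.5, "merkel": 0.3, "rose": 0.2}))
# -> ('politician', 'flower') under the illustrative weights above
```

In practice, the candidate concepts drawn from a taxonomy as large as the Microsoft Concept Graph make exhaustive enumeration infeasible, which is why a dedicated search algorithm with pruning is needed.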

Notes

  1. In this paper, we only consider the case where all the words in a bag are entity mentions (e.g., Obama, notebook, rose), because entities are core components in most text analysis tasks.

  2. https://concept.research.microsoft.com/

  3. For simplicity, the words in WBoWs are also referred to as instances.

  4. https://dumps.wikipedia.org/

  5. Note that the noise instances are required to have smaller weights than the non-noise instances.

  6. http://qwone.com/~jason/20Newsgroups/

  7. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

  8. https://docs.microsoft.com/en-us/azure/cognitive-services/entitylinking/home

  9. https://code.google.com/archive/p/word2vec/

Funding

This work was supported by the National Key R&D Program of China (No. 2017YFC1201203), the National NSF of China (No. U1636207), and the Shanghai Science and Technology Innovation Action Plan (No. 19511120400).

Author information

Corresponding author

Correspondence to Yanghua Xiao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, H., Yang, D., Xiao, Y. et al. Understanding a bag of words by conceptual labeling with prior weights. World Wide Web 23, 2429–2447 (2020). https://doi.org/10.1007/s11280-020-00806-x
