Explaining a bag of words with hierarchical conceptual labels

Jiang, Haiyun; Xiao, Yanghua; Wang, Wei

doi:10.1007/s11280-019-00752-3

Published: 12 February 2020

Volume 23, pages 1693–1713 (2020)
World Wide Web

Abstract

In natural language processing and information retrieval tasks, the bag-of-words model is widely used to represent the semantics of texts. However, without an explicit semantic explanation, it is difficult for machines to sufficiently understand a bag of words or the text it represents, which limits the power of the bag-of-words model in many scenarios. In this paper, we introduce the task of hierarchical conceptual labeling (HCL), which aims to generate a hierarchy of conceptual labels that explicitly explains the semantics of a bag of words. The candidate labels are selected from a large-scale knowledge base, namely the Microsoft Concept Graph (MCG). To this end, we first propose a denoising algorithm that filters out the noise in a bag of words in advance. The hierarchical conceptual labels are then generated for the clean bag of words using a hierarchical clustering algorithm, namely Bayesian rose trees. We conduct extensive experiments and show that (1) the proposed denoising algorithm effectively removes noise words from a bag of words, and (2) the algorithm based on Bayesian rose trees generates hierarchical conceptual labels for a bag of words with high accuracy.
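As a rough illustration of the two-stage idea described in the abstract (denoise first, then attach conceptual labels), the following is a minimal sketch under strong assumptions. It is not the authors' algorithm: the isA table `TOY_ISA`, the functions `denoise` and `label`, and the `min_support` threshold are all hypothetical, and the labeling step is a flat, single-level grouping by shared concept rather than Bayesian rose trees.

```python
from typing import Dict, List, Set

# Hypothetical toy isA table standing in for a knowledge base such as MCG.
TOY_ISA: Dict[str, Set[str]] = {
    "apple": {"fruit", "company"},
    "banana": {"fruit"},
    "pear": {"fruit"},
    "google": {"company"},
    "microsoft": {"company"},
    "tuesday": {"weekday"},
}


def denoise(bag: List[str], isa: Dict[str, Set[str]], min_support: int = 2) -> List[str]:
    """Keep a word only if one of its concepts covers at least
    `min_support` words of the bag (the word itself included)."""
    kept = []
    for word in bag:
        for concept in isa.get(word, set()):
            support = sum(1 for other in bag if concept in isa.get(other, set()))
            if support >= min_support:
                kept.append(word)
                break
    return kept


def label(bag: List[str], isa: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Group words under every concept that covers at least two of them.
    This is a flat stand-in for the hierarchical labeling step."""
    groups: Dict[str, List[str]] = {}
    for word in bag:
        for concept in isa.get(word, set()):
            groups.setdefault(concept, []).append(word)
    return {c: sorted(ws) for c, ws in groups.items() if len(ws) >= 2}


bag = ["apple", "banana", "pear", "google", "microsoft", "tuesday"]
clean = denoise(bag, TOY_ISA)
labels = label(clean, TOY_ISA)
```

In this toy run, "tuesday" shares no concept with any other word and is dropped as noise, while "apple" appears under both remaining labels because it is ambiguous in the toy table.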

Notes

In this paper, we consider the BoWs where all the words are entities in MCG.
Similarly, we can also fix m,n (l,n) and analyze the effect of l (m) on the performance.
https://snap.stanford.edu/data/web-flickr.html
https://dumps.wikimedia.org/
To save space, we only select BoWs with small sizes.

Funding

This work was supported by the Shanghai Science and Technology Innovation Action Plan (No. 19511120400) and the National Natural Science Foundation of China (NSFC, No. 61732004).

Author information

Authors and Affiliations

Fudan University, Shanghai, China
Haiyun Jiang, Yanghua Xiao & Wei Wang
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
Yanghua Xiao

Corresponding author

Correspondence to Yanghua Xiao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, H., Xiao, Y. & Wang, W. Explaining a bag of words with hierarchical conceptual labels. World Wide Web 23, 1693–1713 (2020). https://doi.org/10.1007/s11280-019-00752-3

Received: 13 March 2019
Revised: 07 August 2019
Accepted: 14 October 2019
Published: 12 February 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s11280-019-00752-3
