Effective Seed-Guided Topic Labeling for Dataless Hierarchical Short Text Classification

Yang, Yi; Wang, Hongan; Zhu, Jiaqi; Shi, Wandong; Guo, Wenli; Zhang, Jiawen

doi:10.1007/978-3-030-74296-6_21

Yi Yang (ORCID: orcid.org/0000-0002-8133-6678),
Hongan Wang,
Jiaqi Zhu (ORCID: orcid.org/0000-0002-4261-6749),
Wandong Shi,
Wenli Guo &
Jiawen Zhang

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12706)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Hierarchical text classification, which aims to classify texts into a given category hierarchy, has broad application prospects on the Internet. Supervised methods require a large amount of labeled data and are thus costly. For this reason, dataless hierarchical text classification, which requires only a few relevant seed words for each given category, has attracted increasing attention from researchers in recent years. However, existing approaches mainly focus on long texts and do not consider the characteristics of short texts, so they are unsuitable in many scenarios. In this paper, we tackle dataless hierarchical short text classification for the first time and propose a novel model named Hierarchical Seeded Biterm Topic Model (HierSeedBTM), which leverages seed words in the Biterm Topic Model (BTM) to guide hierarchical topic labeling. Specifically, our model introduces an iterative distribution propagation mechanism among topic models at different levels to incorporate hierarchical structure information. Experiments on two public datasets show that the proposed model is more effective than state-of-the-art dataless hierarchical text classification methods designed for long texts.
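
To make the terminology concrete, the following is a minimal sketch, not the HierSeedBTM model itself. It shows what a biterm is (in BTM, an unordered pair of words co-occurring in the same short text) and how seed words for a two-level hierarchy could give a crude top-down label. The hierarchy, the seed words, and the function names are all hypothetical, and the counting heuristic merely stands in for the topic inference and distribution propagation that the abstract describes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical two-level hierarchy with illustrative seed words
# (assumption: not taken from the paper).
SEEDS = {
    "sports": {"soccer": {"goal", "league"}, "tennis": {"serve", "racket"}},
    "tech": {"phones": {"android", "battery"}, "ai": {"neural", "model"}},
}


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def seed_scores(tokens, seeds_by_label):
    """Count, per label, the biterms containing at least one of its seed words."""
    scores = Counter()
    for w1, w2 in extract_biterms(tokens):
        for label, seeds in seeds_by_label.items():
            if w1 in seeds or w2 in seeds:
                scores[label] += 1
    return scores


def label_top_down(tokens):
    """Choose a parent using pooled child seeds, then a child under it.

    This is a simple counting heuristic, not topic-model inference.
    Returns None when no seed word occurs in the text.
    """
    parent_seeds = {
        parent: set().union(*children.values())
        for parent, children in SEEDS.items()
    }
    parent_scores = seed_scores(tokens, parent_seeds)
    if not parent_scores:
        return None
    parent = max(parent_scores, key=parent_scores.get)
    child_scores = seed_scores(tokens, SEEDS[parent])
    child = max(child_scores, key=child_scores.get)
    return parent, child
```

For example, `label_top_down(["late", "goal", "wins", "league"])` yields `("sports", "soccer")` under these illustrative seeds. Short texts that contain no seed word receive no label here, which is exactly the sparsity problem that motivates modeling biterms over the whole corpus rather than matching seeds per document.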

This work is supported by the National Key Research and Development Program of China (2018YFC0116703), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC02060500), and Zhejiang Lab (2020NF0AC02).

Author information

Authors and Affiliations

SKLCS, Institute of Software, Chinese Academy of Sciences, Beijing, China
Yi Yang, Hongan Wang, Jiaqi Zhu, Wandong Shi, Wenli Guo & Jiawen Zhang
Zhejiang Lab, Hangzhou, Zhejiang, China
Jiaqi Zhu
University of Chinese Academy of Sciences, Beijing, China
Yi Yang, Hongan Wang, Jiaqi Zhu, Wandong Shi & Jiawen Zhang

Corresponding author

Correspondence to Jiaqi Zhu.

Editor information

Editors and Affiliations

Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy
Marco Brambilla
E2S UPPA, LIUPPA, Université de Pau et des Pays de l’Adour, Anglet, France
Richard Chbeir
Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands
Flavius Frasincar
Inria Saclay-Île-de-France, Institut Polytechnique de Paris, Palaiseau, France
Ioana Manolescu

About this paper

Cite this paper

Yang, Y., Wang, H., Zhu, J., Shi, W., Guo, W., Zhang, J. (2021). Effective Seed-Guided Topic Labeling for Dataless Hierarchical Short Text Classification. In: Brambilla, M., Chbeir, R., Frasincar, F., Manolescu, I. (eds) Web Engineering. ICWE 2021. Lecture Notes in Computer Science, vol 12706. Springer, Cham. https://doi.org/10.1007/978-3-030-74296-6_21

DOI: https://doi.org/10.1007/978-3-030-74296-6_21
Published: 11 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74295-9
Online ISBN: 978-3-030-74296-6
eBook Packages: Computer Science, Computer Science (R0)
