Open-categorical text classification based on multi-LDA models

Fu, Ruiji; Qin, Bing; Liu, Ting

doi:10.1007/s00500-014-1374-x

Focus
Published: 31 July 2014

Volume 19, pages 29–38 (2015)
Soft Computing

Ruiji Fu¹, Bing Qin¹ & Ting Liu¹

827 Accesses
16 Citations

Abstract

We present a new and realistic problem, open-categorical text classification, which requires us to classify documents without a categorization system known beforehand. To solve this problem, we propose a novel approach to construct the categorization system and classify documents based on multi-latent Dirichlet allocation (LDA) models. We cluster topics and extract topical keywords to help category annotation. Subsequently, the LDA models are applied to predict the categories of documents comprehensively. Our result, a macro-averaged F1 measure of 84.02%, outperforms the state-of-the-art supervised and semi-supervised text classification methods.
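
The macro-averaged F1 measure reported in the abstract gives every category equal weight when averaging per-category F1 scores. A minimal sketch of that computation follows, assuming single-label documents and averaging over every category that appears in either the gold or the predicted labels. The function name `macro_f1` and the toy labels are illustrative and are not taken from the article.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1 for single-label classification.

    Assumption: the average runs over all categories seen in either
    the gold or the predicted labels; other conventions exist.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for category g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the right answer but was missed

    scores = []
    for c in set(gold) | set(predicted):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)

    # Unweighted mean: rare categories count as much as frequent ones.
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["tea", "tea", "wine", "wine"]
    predicted = ["tea", "wine", "wine", "wine"]
    print(round(macro_f1(gold, predicted), 4))
```

In the toy example, "tea" has F1 = 2/3 and "wine" has F1 = 4/5, so the macro average is 11/15. Because the mean is unweighted, a system that does well only on frequent categories is penalised for poor results on rare ones.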

Notes

http://gibbslda.sourceforge.net/.
http://www.wechat.com/en/.
http://www.ltp-cloud.com/.
The p value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. We use the significance testing method proposed by Zhang et al. (2004).
Table 4 The performance of document classification
These categories are constructed using our proposed semi-automatic approach based on multi-LDA models. In total, we obtain 83 categories.
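
The note on p values above can be illustrated with a generic paired bootstrap test on per-document correctness. This is only a sketch under stated assumptions: the function name, the 0/1 correctness inputs, and the centring-based resampling scheme are illustrative choices and are not taken from the article or from Zhang et al. (2004).

```python
import random


def paired_bootstrap_p_value(correct_a, correct_b, n_resamples=2000, seed=0):
    """One-sided p value for "system A is more accurate than system B".

    correct_a, correct_b: lists of 0/1 flags, one per test document,
    marking whether each system classified that document correctly.
    (Hypothetical interface, chosen for illustration.)
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n  # observed accuracy difference, A minus B

    # Centre the differences so the resampled data obey the null
    # hypothesis of no difference between the two systems.
    centred = [d - observed for d in diffs]

    at_least_as_extreme = 0
    for _ in range(n_resamples):
        sample_mean = sum(centred[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    system_a = [1] * 80 + [0] * 20
    system_b = [1] * 50 + [0] * 50
    print(paired_bootstrap_p_value(system_a, system_b))
```

The returned value estimates how often a difference at least as large as the observed one would arise if the two systems were in fact equally accurate, which matches the definition of the p value given in the note.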

References

Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2003a) Hierarchical topic models and the nested Chinese restaurant process. In: NIPS, vol 16
Blei DM, Ng AY, Jordan MI (2003b) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blei DM, McAuliffe JD (2007) Supervised topic models. NIPS 7:121–128
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of COLT, pp 92–100
Brown PF, Desouza PV, Mercer RL, Della Pietra VJ, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22(2):249–254
Carlson A, Betteridge J, Wang RC, Hruschka Jr ER, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining, pp 101–110
Che W, Li Z, Liu T (2010) LTP: a Chinese language technology platform. In: Coling 2010: demonstrations, pp 13–16
Cheng SJ, Huang QC, Liu JF, Tang XL (2013) A novel inductive semi-supervised SVM with graph-based self-training. In: Intelligent science and intelligent data engineering. Springer, Berlin Heidelberg, pp 82–89
Collins M, Singer Y (1999) Unsupervised models for named entity classification. In: Proceedings of EMNLP, pp 100–110
Danesh A, Moshiri B, Fatemi O (2007) Improve text classification accuracy based on classifier fusion methods. In: 10th international conference on information fusion, pp 1–6
Donghui C, Zhijing L (2010) A new text categorization method based on HMM and SVM. In: 2nd international conference on computer engineering and technology (ICCET), vol 7, pp 383–386
Fu JH, Lee SL (2012) A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents. Expert Syst Appl 39(3):3127–3134
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 50–57
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European conference on machine learning (Chemnitz, DE), pp 137–142
Johnson DE, Oles FJ, Zhang T, Goetz T (2002) A decision-tree-based symbolic rule induction system for text categorization. IBM Syst J 41(3):428–437
Kim S-B, Rim H-C, Yook DS, Lim H-S (2002) Effective methods for improving naive Bayes text classifiers. LNAI 2417:414–423
Li CH, Park SC (2009) An efficient document classification model using an improved back propagation neural network and singular value decomposition. Expert Syst Appl 36(2):3208–3215
Lin Y (2002) Support vector machines and the Bayes rule in classification. Data Min Knowl Discov 6:259–275
Mao X-L, Ming Z-Y, Chua T-S, Li S, Yan H, Li X (2012) SSHLDA: a semi-supervised hierarchical topic model. In: Proceedings of EMNLP-CoNLL, pp 800–809
McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proceedings of NAACL, pp 152–159
Ng HT, Goh WB, Low KL (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, Philadelphia PA, pp 67–73
Petinot Y, McKeown K, Thadani K (2011) A hierarchical model of web summaries. Proc ACL HLT Short Pap Vol 2:670–675
Pham DT, Dimov SS, Nguyen CD (2005) Selection of K in K-means clustering. Proc Inst Mech Eng Part C J Mech Eng Sci 219(1):103–109
Qin Y-P, Wang X-K (2009) Study on multi-label text classification based on SVM. In: Sixth international conference on fuzzy systems and knowledge discovery, pp 300–304
Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Trappey AJC, Hsu F-C, Trappey CV, Lin C-I (2006) Development of a patent document classification and search platform using a back-propagation network. Expert Syst Appl 31(4):755–765
Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the ACL, pp 384–394
Ueffing N (2006) Self-training for machine translation. In: Proceedings of NIPS workshop on machine learning for multilingual information access
Vateekul P, Kubat M (2009) Fast induction of multiple decision trees in text categorization from large scale, imbalanced, and multi-label data. In: IEEE international conference on data mining workshops, pp 320–325
Yang Y (1999) An evaluation of statistical approaches to text categorization. Inf Retr 1(1–2):69–90
Zhang Y, Vogel S, Waibel A (2004) Interpreting BLEU/NIST scores: how much improvement do we need to have a better system? In: Proceedings of the 2004 international conference on language resources and evaluation, pp 2051–2054

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) via Grants 61133012 and 61273321 and the National 863 Leading Technology Research Project via Grant 2012AA011102. Special thanks to Jianfei Guo and Xiaocheng Feng for their help in the experiments.

Author information

Authors and Affiliations

Harbin Institute of Technology, 6th Floor, No.29, Jiaohua Street, Nangang District, Harbin, 150001, People’s Republic of China
Ruiji Fu, Bing Qin & Ting Liu

Corresponding author

Correspondence to Ting Liu.

Additional information

Communicated by L. Xie.

Appendix: the categorization system of WeChat subscription accounts

(See footnote 5 under Notes.)

- finance and economics
1. banking institutions
2. business
3. financing
4. insurance
5. marketing
6. realty
7. start-ups
- shopping
8. automobile
9. commodity
10. decoration
11. discount shopping
12. dresses
13. electronic products
14. luxuries
15. online shopping
16. purchasing agents
17. sports equipment
18. wholesale
19. health care
20. maternal and infant
21. nourishing of life
22. dating
- communication platform
23. friends making
24. job hunting
- education
25. art schools
26. business administration
27. driving schools
28. foreign language training
29. training for study abroad
30. tutoring
- military affairs
31. military affairs
- science and technology
32. IT
33. mobile internet applications
- media
34. news media
35. print media
36. TV and radio
37. we-media
38. cosmetic surgery
39. hairdressing
40. skin protection
- food and drink
41. green food
42. restaurants
43. tea
44. western-style pastry
45. wine
- services for life
46. air tickets booking
47. campus
48. car rental
49. community
50. design
51. emotion
52. environmental protection
53. express delivery
54. homemaking
55. hot lines
56. hotel booking
57. law works
58. life assistants
59. lotteries
60. public good
61. recharging
62. tourism
63. weddings
- culture
64. art
65. culture
66. originality
67. popularization of science
68. reading
- entertainment
69. adult entertainment
70. caricatures
71. entertainment stars
72. entertainment venues
73. fashion
74. games
75. image show
76. jokes
77. movies
78. music
79. pets
- sports
80. sports clubs
81. sports news
- others
82. brand
83. government

About this article

Cite this article

Fu, R., Qin, B. & Liu, T. Open-categorical text classification based on multi-LDA models. Soft Comput 19, 29–38 (2015). https://doi.org/10.1007/s00500-014-1374-x

Published: 31 July 2014
Issue Date: January 2015
DOI: https://doi.org/10.1007/s00500-014-1374-x
