Abstract
We present a new and realistic problem, open-categorical text classification, which requires classifying documents when the categorization system is not known beforehand. To solve this problem, we propose a novel approach that constructs the categorization system and classifies documents based on multiple latent Dirichlet allocation (LDA) models. We cluster topics and extract topical keywords to help category annotation. Subsequently, the LDA models are applied to predict the categories of documents comprehensively. Our result, a macro-averaged F1 measure of 84.02%, outperforms the state-of-the-art supervised and semi-supervised text classification methods.
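For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro-averaged F1 measure can be computed: F1 is calculated per category and then averaged with equal weight, so small categories count as much as large ones. The sketch assumes single-label classification; the function name `macro_f1` and the toy labels (borrowed from the appendix category names) are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1: the unweighted mean of per-category F1 scores.

    Assumes each document has exactly one gold and one predicted category.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for category g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true category but was missed

    categories = set(gold) | set(predicted)
    scores = []
    for c in categories:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels for illustration only (not data from the paper)
gold = ["tea", "wine", "tea", "music"]
predicted = ["tea", "tea", "tea", "music"]
print(round(macro_f1(gold, predicted), 4))  # 0.6
```

In the toy example, "tea" scores 0.8, "music" scores 1.0, and "wine" scores 0, giving a mean of 0.6; the single missed "wine" document lowers the macro average far more than it would lower plain accuracy.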
Notes
The p value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. We use the significance testing method proposed by Zhang et al. (2004).
These categories are constructed using our proposed semi-automatic approach based on multi-LDA models. In total, we obtain 83 categories.
References
Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2003a) Hierarchical topic models and the nested Chinese restaurant process. In: NIPS, vol 16
Blei DM, Ng AY, Jordan MI (2003b) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Blei DM, McAuliffe JD (2007) Supervised topic models. NIPS 7:121–128
Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of COLT, pp 92–100
Brown PF, Desouza PV, Mercer RL, Della Pietra VJ, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22(2):249–254
Carlson A, Betteridge J, Wang RC, Hruschka Jr ER, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining, pp 101–110
Che W, Li Z, Liu T (2010) LTP: a Chinese language technology platform. In: Coling 2010: demonstrations, pp 13–16
Cheng SJ, Huang QC, Liu JF, Tang XL (2013) A novel inductive semi-supervised SVM with graph-based self-training. In: Intelligent science and intelligent data engineering. Springer, Berlin Heidelberg, pp 82–89
Collins M, Singer Y (1999) Unsupervised models for named entity classification. In: Proceedings of EMNLP, pp 100–110
Danesh A, Moshiri B, Fatemi O (2007) Improve text classification accuracy based on classifier fusion methods. 10th international conference on information fusion, pp 1–6
Donghui C, Zhijing L (2010) A new text categorization method based on HMM and SVM. In: 2nd international conference on computer engineering and technology (ICCET), vol 7, pp 383–386
Fu JH, Lee SL (2012) A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents. Expert Syst Appl 39(3):3127–3134
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 50–57
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European conference on machine learning (Chemnitz, DE), pp 137–142
Johnson DE, Oles FJ, Zhang T, Goetz T (2002) A decision-tree-based symbolic rule induction system for text categorization. IBM Syst J 41(3):428–437
Kim S-B, Rim H-C, Yook DS, Lim H-S (2002) Effective methods for improving naive Bayes text classifiers. LNAI 2417:414–423
Li CH, Park SC (2009) An efficient document classification model using an improved back propagation neural network and singular value decomposition. Expert Syst Appl 36(2):3208–3215
Lin Y (2002) Support vector machines and the Bayes rule in classification. Data Min Knowl Discov 6:259–275
Mao X-L, Ming Z-Y, Chua T-S, Li S, Yan H, Li X (2012) SSHLDA: a semi-supervised hierarchical topic model. In: Proceedings of EMNLP-CoNLL, pp 800–809
McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proceedings of NAACL, pp 152–159
Ng HT, Goh WB, Low KL (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, Philadelphia PA, pp 67–73
Petinot Y, McKeown K, Thadani K (2011) A hierarchical model of web summaries. Proc ACL HLT Short Pap Vol 2:670–675
Pham DT, Dimov SS, Nguyen CD (2005) Selection of K in K-means clustering. Proc Inst Mech Eng Part C J Mech Eng Sci 219(1):103–109
Qin Y-P, Wang X-K (2009) Study on multi-label text classification based on SVM. Sixth international conference on fuzzy systems and knowledge discovery, pp 300–304
Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Trappey AJC, Hsu F-C, Trappey CV, Lin C-I (2006) Development of a patent document classification and search platform using a back-propagation network. Expert Syst Appl 31(4):755–765
Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the ACL, pp 384–394
Ueffing N (2006) Self-training for machine translation. In: Proceedings of NIPS workshop on machine learning for multilingual information access
Vateekul P, Kubat M (2009) Fast induction of multiple decision trees in text categorization from large scale, imbalanced, and multi-label data. IEEE International Conference on Data Mining Workshops, pp 320–325
Yang Y (1999) An evaluation of statistical approaches to text categorization. Inf Retr 1(1–2):69–90
Zhang Y, Vogel S, Waibel A (2004) Interpreting BLEU/NIST scores: how much improvement do we need to have a better system? In: Proceedings of the 2004 international conference on language resources and evaluation, pp 2051–2054
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) via Grants 61133012 and 61273321 and the National 863 Leading Technology Research Project via Grant 2012AA011102. Special thanks to Jianfei Guo and Xiaocheng Feng for their help in the experiments.
Additional information
Communicated by L. Xie.
Appendix: the categorization system of WeChat subscription accounts
finance and economics
1. banking institutions
2. business
3. financing
4. insurance
5. marketing
6. realty
7. start-ups

shopping
8. automobile
9. commodity
10. decoration
11. discount shopping
12. dresses
13. electronic products
14. luxuries
15. online shopping
16. purchasing agents
17. sports equipment
18. wholesale
19. health care
20. maternal and infant
21. nourishing of life
22. dating

communication platform
23. friends making
24. job hunting

education
25. art schools
26. business administration
27. driving schools
28. foreign language training
29. training for study abroad
30. tutoring

military affairs
31. military affairs

science and technology
32. IT
33. mobile internet applications

media
34. news media
35. print media
36. TV and radio
37. we-media
38. cosmetic surgery
39. hairdressing
40. skin protection

food and drink
41. green food
42. restaurants
43. tea
44. western-style pastry
45. wine

services for life
46. air tickets booking
47. campus
48. car rental
49. community
50. design
51. emotion
52. environmental protection
53. express delivery
54. homemaking
55. hot lines
56. hotel booking
57. law works
58. life assistants
59. lotteries
60. public good
61. recharging
62. tourism
63. weddings

culture
64. art
65. culture
66. originality
67. popularization of science
68. reading

entertainment
69. adult entertainment
70. caricatures
71. entertainment stars
72. entertainment venues
73. fashion
74. games
75. image show
76. jokes
77. movies
78. music
79. pets

sports
80. sports clubs
81. sports news

others
82. brand
83. government
Cite this article
Fu, R., Qin, B. & Liu, T. Open-categorical text classification based on multi-LDA models. Soft Comput 19, 29–38 (2015). https://doi.org/10.1007/s00500-014-1374-x
DOI: https://doi.org/10.1007/s00500-014-1374-x