DOI: 10.1145/2996890.3007856
short-paper

Multi-level topical text categorization with wikipedia

Published: 06 December 2016

Abstract

This paper introduces an automatic categorical-marking model for text categorization. Traditional classification algorithms generally rely on labeled training sets, which require substantial manual work to tag documents beforehand. In addition, because text is ambiguous and fuzzy, the results of traditional text categorization algorithms may be neither clear nor rich in content. This paper presents an unsupervised, training-set-free, hierarchical categorization model called Folk-Topical Text Categorization (FTTC). FTTC applies a topic model to abstract documents into topical words and uses Wikipedia's crowd-sourcing and collective control to extend hierarchical classifications. The results are not restricted to predefined categories; they include categories abstracted to deeper semantic levels, which greatly facilitates traditional text categorization applications. For a document, its topical words are obtained using a popular topic model, Latent Dirichlet Allocation (LDA). The topical words are then used to build and trace through the category trees of Wikipedia. After filtering, the final classifications reflect the diverse, content-rich information in the text and cover its different aspects. Experimental results on different kinds of datasets show that our model improves on traditional models in classification accuracy, flexibility, and intelligibility.
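The abstract does not give implementation details, so the following is only a minimal sketch of the category-tracing step, under stated assumptions. It takes weighted topical words as given (for example, word probabilities from an LDA topic), walks upward through a tiny in-memory category graph that stands in for Wikipedia's, and filters the result per level. The names `ARTICLE_CATEGORIES`, `PARENT_CATEGORIES`, `trace_categories`, `filter_categories`, and the parameters `max_level`, `decay`, and `min_score` are all hypothetical, and the decay-based scoring is an illustrative choice, not the scoring used by FTTC.

```python
from collections import defaultdict

# Hypothetical toy data standing in for Wikipedia's category graph.
# Maps a topical word (treated as an article title) to its article's categories.
ARTICLE_CATEGORIES = {
    "backpropagation": ["Machine learning"],
    "gradient": ["Mathematical optimization"],
    "neuron": ["Neuroscience"],
}
# Maps each category to its parent categories.
PARENT_CATEGORIES = {
    "Machine learning": ["Artificial intelligence"],
    "Mathematical optimization": ["Applied mathematics"],
    "Neuroscience": ["Biology"],
    "Artificial intelligence": ["Computer science"],
    "Applied mathematics": ["Mathematics"],
}


def trace_categories(topical_words, max_level=2, decay=0.5):
    """Score the categories reachable from weighted topical words.

    topical_words: dict of word -> weight (assumed to come from a topic model).
    Returns a dict of level -> {category: accumulated score}, where level 0
    holds the article's own categories and higher levels hold ancestors.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for word, weight in topical_words.items():
        frontier = ARTICLE_CATEGORIES.get(word, [])
        seen = set()  # guards against cycles in the category graph
        for level in range(max_level + 1):
            next_frontier = []
            for category in frontier:
                if category in seen:
                    continue
                seen.add(category)
                # Assumption: a word's influence halves with each level climbed.
                scores[level][category] += weight * (decay ** level)
                next_frontier.extend(PARENT_CATEGORIES.get(category, []))
            frontier = next_frontier
    return {level: dict(cats) for level, cats in scores.items()}


def filter_categories(level_scores, min_score=0.1):
    """Keep categories scoring at least min_score, best first within a level."""
    result = {}
    for level, cats in level_scores.items():
        kept = [(c, s) for c, s in cats.items() if s >= min_score]
        result[level] = sorted(kept, key=lambda cs: (-cs[1], cs[0]))
    return result


if __name__ == "__main__":
    words = {"backpropagation": 0.4, "gradient": 0.3, "neuron": 0.2}
    levels = filter_categories(trace_categories(words), min_score=0.12)
    for level in sorted(levels):
        print(level, [name for name, _ in levels[level]])
```

Grouping the output by level mirrors the multi-level idea in the abstract: level 0 gives specific categories, and higher levels give progressively broader ones. A real system would replace the two dictionaries with lookups against Wikipedia's category data.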

Published In

ACM Other conferences
UCC '16: Proceedings of the 9th International Conference on Utility and Cloud Computing
December 2016
549 pages
ISBN:9781450346160
DOI:10.1145/2996890
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. latent dirichlet allocation
  2. text categorization
  3. topic extraction
  4. topic model
  5. wikipedia

Qualifiers

  • Short-paper

Conference

UCC '16

Acceptance Rates

Overall Acceptance Rate 38 of 125 submissions, 30%

Article Metrics

  • Downloads (Last 12 months): 6
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 23 Feb 2025

Cited By

  • (2023) Text classification using embeddings: a survey. Knowledge and Information Systems 65:7 (2761-2803). DOI: 10.1007/s10115-023-01856-z. Online publication date: 26-Mar-2023.
  • (2022) Accurate Context Extraction from Unstructured Text Based on Deep Learning. 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) (309-314). DOI: 10.1109/WI-IAT55865.2022.00052. Online publication date: Nov-2022.
  • (2021) Research on an Two-Channel ACNN-LSTM Model for Financial Text Sentiment Analysis. 2021 IEEE International Conference on Progress in Informatics and Computing (PIC) (200-205). DOI: 10.1109/PIC53636.2021.9687020. Online publication date: 17-Dec-2021.
  • (2020) An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents. 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) (821-826). DOI: 10.1109/ICTAI50040.2020.00130. Online publication date: Nov-2020.
