short-paper

Multi-level topical text categorization with Wikipedia

Authors:

Cheng Wang

UCC '16: Proceedings of the 9th International Conference on Utility and Cloud Computing

Pages 343 - 352

https://doi.org/10.1145/2996890.3007856

Published: 06 December 2016

Abstract

This paper introduces an automatic categorical-marking model for text categorization. Traditional classification algorithms generally rely on a labeled training set and require substantial manual work to tag documents beforehand. Moreover, because text is often ambiguous and fuzzy, the results of traditional text categorization algorithms may be neither clear nor rich in content. This paper presents an unsupervised, training-set-free, hierarchical categorization model called Folk-Topical Text Categorization (FTTC). FTTC applies a topic model to abstract documents into topical words and makes use of Wikipedia's crowd-sourcing and collective control to extend hierarchical classifications. The results are not restricted to predefined categories; they also contain categories abstracted to deeper semantic levels, which greatly facilitates traditional text categorization applications. For a document, its topical words are obtained using a popular topic model called Latent Dirichlet Allocation (LDA). The topical words are then used to build and traverse the category trees of Wikipedia. After filtering, the final classifications reflect the diverse, content-rich information in the text and cover its different aspects. Experimental results on different kinds of datasets show that our model improves on traditional models in classification accuracy, flexibility, and intelligibility.
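
The category-tree step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how FTTC builds, traverses, or filters the trees, so the upward breadth-first walk, the depth limit, and the "reached by at least two topical words" filter below are all assumptions. The topical words are taken as given (in FTTC they would come from LDA), and PARENTS is a hypothetical, hand-made excerpt of a category graph rather than real Wikipedia data.

```python
from collections import Counter, deque

# Hypothetical toy category graph: node -> parent categories.
# These entries are invented for illustration, not taken from Wikipedia.
PARENTS = {
    "neural network": ["Machine learning"],
    "gradient": ["Mathematical optimization"],
    "classifier": ["Machine learning"],
    "Machine learning": ["Artificial intelligence"],
    "Mathematical optimization": ["Applied mathematics"],
    "Artificial intelligence": ["Computer science"],
    "Applied mathematics": ["Mathematics"],
}


def trace_categories(topical_words, parents, max_depth=2):
    """Walk upward from each topical word and count, per category,
    how many topical words reach it within max_depth levels."""
    support = Counter()
    for word in topical_words:
        seen = set()
        queue = deque([(word, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # assumed cut-off: do not climb any higher
            for parent in parents.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
        support.update(seen)  # each word votes at most once per category
    return support


def filter_categories(support, min_support=2):
    """Assumed filter: keep categories reached by enough topical words."""
    return sorted(c for c, n in support.items() if n >= min_support)


# Topical words assumed to have been produced by a topic model such as LDA.
topical_words = ["neural network", "gradient", "classifier"]
support = trace_categories(topical_words, PARENTS, max_depth=2)
categories = filter_categories(support, min_support=2)
print(categories)
```

In this toy graph, raising max_depth lets the walk reach more abstract categories, which loosely mirrors the abstract's point that results can extend to deeper semantic levels instead of a fixed label set.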

Cited By

L. da Costa, I. Oliveira, and R. Fileto. Text classification using embeddings: a survey. Knowledge and Information Systems, 65(7):2761--2803, 2023. Online publication date: 26-Mar-2023. https://doi.org/10.1007/s10115-023-01856-z

M. Mallek, R. Guetari, S. Fournier, W. Chaari, and B. Espinasse. Accurate Context Extraction from Unstructured Text Based on Deep Learning. In 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), pages 309--314, 2022. Online publication date: Nov-2022. https://doi.org/10.1109/WI-IAT55865.2022.00052

H. Shi, L. You, M. Ren, and X. Li. Research on an Two-Channel ACNN-LSTM Model for Financial Text Sentiment Analysis. In 2021 IEEE International Conference on Progress in Informatics and Computing (PIC), pages 200--205, 2021. Online publication date: 17-Dec-2021. https://doi.org/10.1109/PIC53636.2021.9687020

M. Mallek, S. Fournier, R. Guetari, B. Espinasse, and W. Chaari. An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pages 821--826, 2020. Online publication date: Nov-2020. https://doi.org/10.1109/ICTAI50040.2020.00130

Published In

UCC '16: Proceedings of the 9th International Conference on Utility and Cloud Computing

December 2016

549 pages

ISBN: 9781450346160

DOI: 10.1145/2996890

General Chairs: Changjun Jiang (Tongji University, China), Omer Rana (Cardiff University, UK), Nick Antonopoulos (University of Derby, UK)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

UCC '16

UCC '16: 9th International Conference on Utility and Cloud Computing

December 6 - 9, 2016

Shanghai, China

Acceptance Rates

Overall Acceptance Rate 38 of 125 submissions, 30%
