Representation models for text classification: a comparative analysis over three web document types

Authors: George Giannakopoulos, Petra Mavridi, Georgios Paliouras, George Papadakis, Konstantinos Tserpes

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

Article No.: 13, Pages 1 - 12

https://doi.org/10.1145/2254129.2254148

Published: 13 June 2012

Abstract

Text classification is a popular task in Web research, with applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually contains less noise than the user-generated content of blog posts and other social media.

In this paper, we present a preliminary study of a tripartite categorization of Web documents, based on inherent document characteristics. We argue that each category calls for different classification settings with respect to the representation model. We verify this claim experimentally by showing that topic classification yields very different results for each document type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: the n-gram graphs. This model goes beyond the established bag-of-words model by representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then used to position documents in a vector space and classify them. Accuracy increases thanks to the contextual information encapsulated in the edges of the n-gram graphs; efficiency, in turn, improves because the feature space is reduced to a limited set of dimensions that depends on the number of classes rather than the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents.
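The pipeline described in the abstract (document graph, class graph, similarity-based features) can be illustrated with a minimal Python sketch. The abstract does not give the details, so everything concrete here is an assumption made for illustration: character trigrams, a co-occurrence window of three positions, class graphs built by averaging edge weights, and the two similarity measures. All function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a mapping from edges to weights.

    Assumption: an edge links two n-grams whose start positions are at most
    `window` characters apart; its weight counts how often that happens.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into one class graph.

    Assumption: edge weights are averaged over all documents of the class,
    with a missing edge counting as weight zero.
    """
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return {edge: weight / len(graphs) for edge, weight in total.items()}


def containment_similarity(doc_graph, class_graph):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)


def value_similarity(doc_graph, class_graph):
    """Like containment, but each shared edge is scored by weight agreement."""
    if not doc_graph or not class_graph:
        return 0.0
    score = 0.0
    for edge, weight in doc_graph.items():
        if edge in class_graph:
            class_weight = class_graph[edge]
            score += min(weight, class_weight) / max(weight, class_weight)
    return score / max(len(doc_graph), len(class_graph))


def to_feature_vector(text, class_graphs):
    """Map a document to two similarity features per class."""
    doc_graph = ngram_graph(text)
    features = []
    for label in sorted(class_graphs):
        features.append(containment_similarity(doc_graph, class_graphs[label]))
        features.append(value_similarity(doc_graph, class_graphs[label]))
    return features


# Toy training data, invented purely for this example.
class_graphs = {
    "sports": merge_graphs([ngram_graph("the team won the match"),
                            ngram_graph("the match was won")]),
    "tech": merge_graphs([ngram_graph("the new phone has a fast chip")]),
}
vector = to_feature_vector("the team won", class_graphs)
print(len(vector))  # two features per class
```

With two classes the feature vector has four dimensions regardless of how many distinct n-grams appear in the corpus, which is the efficiency argument the abstract makes. The resulting vectors could then be passed to any standard classifier.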

