Graph vs. bag representation models for the topic classification of web documents

Papadakis, George; Giannakopoulos, George; Paliouras, Georgios

doi:10.1007/s11280-015-0365-x

Published: 12 August 2015

World Wide Web, Volume 19, pages 887–920 (2016)


Abstract

Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
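
The graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (which the notes say is provided in Java); the n-gram length, the co-occurrence window, the averaging used to merge document graphs, and the edge-overlap similarity named `containment` here are all assumptions chosen for brevity, and the toy corpus is invented. The point it shows is that each document ends up with one similarity value per class, so the feature space grows with the number of classes rather than with the vocabulary.

```python
from collections import Counter


def ngram_graph(text, n=3, window=2):
    """Build a character n-gram graph as a mapping from edges to weights.

    Nodes are character n-grams. Two n-grams are linked when they start
    within `window` positions of each other, and the edge weight counts
    how often that happens. `n` and `window` are illustrative defaults.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into a class graph by averaging edge weights
    (an assumed merging rule, used here only for illustration)."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return {edge: weight / len(graphs) for edge, weight in total.items()}


def containment(doc_graph, class_graph):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)


# Hypothetical toy corpus: two classes with two training documents each.
training = {
    "sports": ["the team won the match", "a late goal won the game"],
    "finance": ["the bank raised interest rates", "stock markets fell sharply"],
}
class_graphs = {
    label: merge_graphs([ngram_graph(doc) for doc in docs])
    for label, docs in training.items()
}


def features(text):
    """One similarity per class, in alphabetical order of the class labels."""
    graph = ngram_graph(text)
    return [containment(graph, class_graphs[label]) for label in sorted(class_graphs)]
```

Calling `features("the team won")` returns a two-element vector (finance, sports) that could then be passed to any standard classifier; with more classes the vector simply gains one entry per class.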

Notes

http://wordnet.princeton.edu
https://twitter.com
An alternative approach to forming a class vector is to extract the centroid from the vectors of the individual documents it comprises [29].
Example borrowed from [29].
The implementation of this procedure in Java is provided publicly through the “Text Representation Models” project of Sourceforge.net at: http://sourceforge.net/projects/textmodels.
It is worth stressing that these three types do not correspond to document genres; instead, the aim is to explain the difference in the quality of Web documents and the resulting impact on text classification (TC).
http://www.facebook.com
http://www.youtube.com
http://sourceforge.net/projects/jinsect
http://www.cs.waikato.ac.nz/ml/weka
http://trec.nist.gov/data/reuters/reuters.html
http://www.blogpulse.com/www2006-workshop/datashare-instructions.txt
A hashtag in Twitter consists of the symbol #, followed by a series of concatenated words and/or alphanumerics (e.g., #worldcup2014).
The nominal features are also useful for powerful classification algorithms that are inherently crafted for this kind of evidence, such as C4.5. However, preliminary experiments demonstrated that such algorithms do not scale well to the large search space of bag models. Hence, we do not consider them in our analysis.
A “dependency triple” is a language-dependent feature comprising two words that are semantically connected with one of the syntactic relators that are supported by the corresponding parser. For example, subj(Y,X) denotes a feature consisting of a noun Y that is connected with a verb X through the relator “subject”.
Topic Detection is similar to Topic Classification, but differs in that it involves many more classes, which are also so rare that an unlabelled document is likely to belong to none of them [37].
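
The centroid-based class vector mentioned in the notes above can be sketched as follows. This is a minimal illustration in Python, not the procedure used in the paper: the document vectors are invented, and assigning a document to the class whose centroid has the highest cosine similarity is an assumed decision rule added only to make the example complete.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length document vectors."""
    if not vectors:
        raise ValueError("need at least one document vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_class(doc_vector, class_vectors):
    """Label of the class centroid most similar to the document (assumed rule)."""
    return max(class_vectors, key=lambda label: cosine(doc_vector, class_vectors[label]))
```

For example, with hypothetical term-weight vectors, `centroid([[1, 0, 2], [3, 0, 0]])` averages the two documents of a class into a single class vector.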

References

Amini, M.R., Usunier, N., Goutte, C.: Learning from multiple partially observed views - an application to multilingual text categorization. In: NIPS, pp. 28–36 (2009)
Batista, F., Ribeiro, R.: Sentiment analysis and topic classification based on binary maximum entropy classifiers. Proc. Leng. Nat. 50, 77–84 (2013)
Berry, M.W., Kogan, J.: Text Mining: Applications and Theory. Wiley, Chichester (2010)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Choudhary, B., Bhattacharyya, P.: Text clustering using semantics. World Wide Web Conference (2002)
Choudhary, B., Bhattacharyya, P.: Text clustering using universal networking language representation. World Wide Web Conference (2002)
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., Mahoney, M.W.: Feature selection methods for text classification. In: KDD, pp. 230–239 (2007)
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
D’hondt, E., Verberne, S., Koster, C.H.A., Boves, L.: Text representations for patent classification. Comput. Linguist. 39(3), 755–775 (2013)
Dumais, S., Chen, H.: Hierarchical classification of web content. In: SIGIR, pp. 256–263 (2000)
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: EMNLP, pp. 1277–1287 (2010)
Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: Liblinear: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Figueiredo, F., Belém, F., Pinto, H., Almeida, J.M., Gonçalves, M.A., Fernandes, D., de Moura, E.S., Cristo, M.: Evidence of quality of textual features on the web 2.0. In: CIKM, pp. 909–918 (2009)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
Garcia Esparza, S., O’Mahony, M., Smyth, B.: Towards tagging and categorization for micro-blogs. In: AICS (2010)
Genc, Y., Sakamoto, Y., Nickerson, J.V.: Discovering context: Classifying tweets through a semantic transform based on Wikipedia, pp. 484–492 (2011)
Giannakopoulos, G., Karkaletsis, V., Vouros, G.A., Stamatopoulos, P.: Summarization system evaluation revisited: N-gram graphs. TSLP 5(3) (2008)
Giannakopoulos, G., Palpanas, T.: Content and type as orthogonal modeling features: a study on user interest awareness in entity subscription services. Int. J. Adv. Netw. Serv. 3(2) (2010)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Hong, L., Davison, B.: Empirical study of topic modeling in twitter. In: SOMA, pp. 80–88 (2010)
Irani, D., Webb, S., Pu, C., Li, K.: Study of trend-stuffing on twitter through text classification. In: CEAS, pp. 40–49 (2010)
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: ECML, pp. 137–142 (1998)
Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: PACLING, pp. 255–264 (2003)
Khorsi, A.: An overview of content-based spam filtering techniques. Informatica 31, 269–277 (2007)
Kinsella, S., Passant, A., Breslin, J.G.: Topic classification in social media using metadata from hyperlinked objects. In: ECIR, pp. 201–206 (2011)
Kinsella, S., Wang, M., Breslin, J.G., Hayes, C.: Improving categorisation in social media using hyperlinks to structured data sources. In: ESWC (2), pp. 390–404 (2011)
Li, Z., Zhou, D., Juan, Y.F., Han, J.: Keyword extraction for social snippets. In: WWW, pp. 1143–1144 (2010)
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
Manning, C., Raghavan, P., Schuetze, H.: Introduction to information retrieval, vol. 1. Cambridge University Press (2008)
Meng, W., Lanfen, L., Jing, W., Penghua, Y., Jiaolong, L., Fei, X.: Improving short text classification using public search engines. In: Integrated Uncertainty in Knowledge Modelling and Decision Making, pp. 157–166 (2013)
Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: LREC, pp. 1320–1326 (2010)
Peng, F., Schuurmans, D.: Combining naive bayes and n-gram language models for text classification. Advances in Information Retrieval, pp. 547–547 (2003)
Phan, X.H., Nguyen, M.L., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: WWW, pp. 91–100 (2008)
Rosa, H., Batista, F., Carvalho, J.P.: Twitter topic fuzzy fingerprints. In: IEEE International Conference on Fuzzy Systems, pp. 776–783 (2014)
Salton, G.: The Smart Retrieval System – Experiments in Automatic Document Processing, p. 556. Prentice-Hall (1971)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Sebastiani, F.: Text categorization. In: Encyclopedia of Database Technologies and Applications, pp. 683–687 (2005)
Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in twitter to improve information filtering. In: SIGIR, pp. 841–842 (2010)
Stamatatos, E.: Ensemble-based author identification using character n-grams. In: Proceedings of the 3rd International Workshop on Text-based Information Retrieval, pp. 41–46 (2006)
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. J. Law Policy 21(2), 421–439 (2013)
Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000)
Sun, X., Wang, H., Yu, Y.: Towards effective short text deep classification. In: SIGIR, pp. 1143–1144 (2011)
Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, p. 560. Morgan Kaufmann, San Francisco (2005)
Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In: WSDM, pp. 177–186 (2011)
Yang, S., Kolcz, A., Schlaikjer, A., Gupta, P.: Large-scale high-precision topic modeling on twitter. In: KDD, pp. 1907–1916 (2014)
Zelikovitz, S., Hirsh, H.: Transductive lsi for short text classification problems. In: FLAIRS, pp. 556–561 (2004)

Author information

Authors and Affiliations

Department of Informatics and Telecommunications, University of Athens, Panepistimiopolis, Ilissia, 15784, Athens, Greece
George Papadakis
National Center for Scientific Research “Demokritos”, Patriarchou Grigoriou 27, Agia Paraskevi, 15310, Attica, Greece
George Giannakopoulos & Georgios Paliouras

Corresponding author

Correspondence to George Papadakis.

About this article

Cite this article

Papadakis, G., Giannakopoulos, G. & Paliouras, G. Graph vs. bag representation models for the topic classification of web documents. World Wide Web 19, 887–920 (2016). https://doi.org/10.1007/s11280-015-0365-x

Received: 23 December 2014
Revised: 19 May 2015
Accepted: 20 July 2015
Published: 12 August 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s11280-015-0365-x
