DOI:10.1145/2254129.2254148
research-article

Representation models for text classification: a comparative analysis over three web document types

Published: 13 June 2012

Abstract

Text classification is a popular task in Web research, with applications ranging from spam filtering to sentiment analysis. It is typically addressed by extracting patterns of co-occurring words or characters from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually contains less noise than the user-generated content of blog posts and other social media.
In this paper, we present a preliminary study of a tripartite categorization of Web documents based on inherent document characteristics. We claim that each category calls for different classification settings with respect to the representation model, and we verify this claim experimentally by showing that topic classification yields very different results for each document type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: n-gram graphs. This model goes beyond the established bag-of-words model by representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then used to position documents in a vector space and classify them. Accuracy increases because of the contextual information encapsulated in the edges of the n-gram graphs; efficiency, in turn, is boosted by reducing the feature space to a limited set of dimensions that depends on the number of classes rather than on the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents.
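
The pipeline summarized in the abstract (document graph, class graph, similarity-based features) can be illustrated with a minimal Python sketch. It is a simplified illustration and not the authors' implementation: the function names, the window size, the averaging used to merge graphs, and the single containment-style similarity are all assumptions made here for brevity.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter mapping edges to weights.

    An edge (a, b) links n-gram a to each n-gram b that starts within
    `window` positions after it; the weight counts such co-occurrences.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[(gram, neighbour)] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into one class graph by averaging edge weights.

    Averaging is an assumption of this sketch, chosen for simplicity.
    """
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return {edge: weight / len(graphs) for edge, weight in total.items()}


def containment_similarity(doc_graph, class_graph):
    """Fraction of the document's edges that also appear in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)


def to_feature_vector(text, class_graphs):
    """Represent a document by one similarity value per class graph."""
    doc_graph = ngram_graph(text)
    return [containment_similarity(doc_graph, class_graphs[label])
            for label in sorted(class_graphs)]


# Toy training data (hypothetical): one class graph per topic.
training = {
    "sports": ["the match ended in a draw", "the team won the final match"],
    "tech": ["the new phone has a fast chip", "the laptop ships with more memory"],
}
class_graphs = {label: merge_graphs([ngram_graph(doc) for doc in docs])
                for label, docs in training.items()}

features = to_feature_vector("the match was a draw", class_graphs)
print(len(features))  # one dimension per class, independent of vocabulary size
```

The resulting vector has one entry per class, which is the dimensionality reduction the abstract refers to; any standard classifier could then be trained on such vectors.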

Cited By

  • (2024) Extracting Features from Text Flows based on Semantic Similarity for Text Classification: an Approach Inspired by Audio Analysis. Journal of the Brazilian Computer Society, 30:1 (297-314). DOI:10.5753/jbcs.2024.3759. Online publication date: 25-Sep-2024
  • (2023) A Study of Text Representations for Hate Speech Detection. Computational Linguistics and Intelligent Text Processing (424-437). DOI:10.1007/978-3-031-24340-0_32. Online publication date: 26-Feb-2023
  • (2021) An Explainable Approach Based on Emotion and Sentiment Features for Detecting People with Mental Disorders on Social Networks. Applied Sciences, 11:22 (10932). DOI:10.3390/app112210932. Online publication date: 19-Nov-2021

    Published In

    WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
    June 2012
    571 pages
    ISBN:9781450309158
    DOI:10.1145/2254129
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Sponsors

    • UCV: University of Craiova
    • WNRI: Western Norway Research Institute

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. n-gram graphs
    2. text classification
    3. web document types

    Qualifiers

    • Research-article

    Conference

    WIMS '12
    Sponsor:
    • UCV
    • WNRI

    Acceptance Rates

    Overall Acceptance Rate 140 of 278 submissions, 50%

    Cited By

    • (2021) Designing a knowledge management system for Naval Materials Failures. MATEC Web of Conferences, 349 (03006). DOI:10.1051/matecconf/202134903006. Online publication date: 15-Nov-2021
    • (2021) Analysis of Changing Trends in Textual Data Representation. Recent Trends in Image Processing and Pattern Recognition (237-251). DOI:10.1007/978-981-16-0507-9_21. Online publication date: 26-Feb-2021
    • (2020) Text Mining in Big Data Analytics. Big Data and Cognitive Computing, 4:1 (1). DOI:10.3390/bdcc4010001. Online publication date: 16-Jan-2020
    • (2020) Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI. ACM SIGMOD Record, 48:4 (30-36). DOI:10.1145/3385658.3385664. Online publication date: 25-Feb-2020
    • (2020) Comparative Analysis and Enhancement of Sentiment Intensity Based Tools. 2020 14th International Conference on Open Source Systems and Technologies (ICOSST) (1-6). DOI:10.1109/ICOSST51357.2020.9333119. Online publication date: 16-Dec-2020
    • (2019) GeoSensor. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (2259-2266). DOI:10.1145/3297280.3297504. Online publication date: 8-Apr-2019
    • (2019) Exploring the Influence of News Articles on Bitcoin Price with Machine Learning. 2019 IEEE Symposium on Computers and Communications (ISCC) (1-6). DOI:10.1109/ISCC47284.2019.8969596. Online publication date: Jun-2019
