Abstract
The categorization of natural language texts is a well-established research field in computational and quantitative linguistics (Joachims 2002). In most cases, the vector space model is applied as a bag-of-words approach: lexical features are extracted from input texts to train a categorization model that assigns, for example, authorship or topic categories. In parallel, there have been efforts to categorize texts by structural rather than lexical features. More specifically, quantitative text characteristics have been computed to derive a kind of structural text signature that nevertheless allows reliable text categorization (Kelih & Grzybek 2005; Pieper 1975). This “bag of features” approach is regaining attention for the categorization of websites and other document types whose structure is far more complex than a simple tree. Here we present a novel approach to structural classifiers that systematically computes structural signatures of documents. In summary, we present a text categorization algorithm that, in the absence of any lexical features, nevertheless classifies remarkably well even when the classes are thematically defined.
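The idea of a structural signature can be illustrated with a minimal sketch. It is not the algorithm of this chapter: the five features (paragraph count, mean and standard deviation of sentences per paragraph, mean and standard deviation of sentence length in tokens), the regular-expression segmentation, and the nearest-centroid classifier are all assumptions chosen for illustration, and the function names are hypothetical. The only point is that no word identity enters the feature vector.

```python
import math
import re
from statistics import mean, pstdev


def structural_signature(text):
    """Map a text to a vector of structural quantities only (no word identities).

    Assumed features: paragraph count, mean/std of sentences per paragraph,
    mean/std of sentence length in tokens. Segmentation is deliberately naive.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return [0.0, 0.0, 0.0, 0.0, 0.0]
    sents_per_par = []  # number of sentences in each paragraph
    sent_lens = []      # number of whitespace-separated tokens in each sentence
    for p in paragraphs:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", p.strip()) if s]
        sents_per_par.append(len(sentences))
        sent_lens.extend(len(s.split()) for s in sentences)
    return [
        float(len(paragraphs)),
        mean(sents_per_par),
        pstdev(sents_per_par),
        mean(sent_lens),
        pstdev(sent_lens),
    ]


def train(labelled_texts):
    """Compute one centroid signature per class from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(structural_signature(text))
    return {
        label: [mean(column) for column in zip(*signatures)]
        for label, signatures in by_label.items()
    }


def classify(model, text):
    """Assign the class whose centroid is closest in Euclidean distance."""
    signature = structural_signature(text)
    return min(model, key=lambda label: math.dist(signature, model[label]))


# Toy usage with made-up texts and labels.
model = train([
    ("A b. C d.\n\nE f. G h.", "short"),
    ("one two three four five six seven eight nine ten.", "long"),
])
predicted = classify(model, "X y. Z w.\n\nQ r.")
```

In practice the features would need scaling before distances are compared, since raw counts and lengths live on different ranges; the sketch omits this to stay short.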
References
ALTMANN, G. (1988): Wiederholungen in Texten. Brockmeyer, Bochum.
BIBER, D. (1995): Dimensions of Register Variation: A Cross-Linguistic Comparison. University Press, Cambridge.
BOCK, H.H. (1974): Automatische Klassifikation. Theoretische und praktische Methoden zur Gruppierung und Strukturierung von Daten (Cluster-Analyse). Vandenhoeck & Ruprecht, Göttingen.
GLEIM, R.; MEHLER, A.; DEHMER, M.; PUSTYLNIKOV, O. (2007): Isles Through the Category Forest — Utilising the Wikipedia Category System for Corpus Building in Machine Learning. In: WEBIST ’07, WIA(2). Barcelona, Spain, 142-149.
JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston/Dordrecht/London.
KELIH, E.; GRZYBEK, P. (2005): Satzlänge: Definitionen, Häufigkeiten, Modelle (Am Beispiel slowenischer Prosatexte). In: LDV-Forum 20(2), 31-51.
MEHLER, A.; GEIBEL, P.; GLEIM, R.; HEROLD, S.; JAIN, B.; PUSTYLNIKOV, O. (2006): Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae. In: OTT’06.
MEHLER, A.; GEIBEL, P.; PUSTYLNIKOV, O.; HEROLD, S. (2007): Structural Classifiers of Text Types. To appear in: LDV Forum.
PIEPER, U. (1975): Differenzierung von Texten nach Numerischen Kriterien. In: Folia Linguistica VII, 61-113.
RIEGER, B. (1989): Unscharfe Semantik: Die empirische Analyse, quantitative Beschreibung, formale Repräsentation und prozedurale Modellierung vager Wortbedeutungen in Texten. Peter Lang, Frankfurt a. M.
SÜDDEUTSCHER VERLAG (2004). Süddeutsche Zeitung 1994-2003. 10 Jahre auf DVD. München.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pustylnikov, O., Mehler, A. (2008). Structural Differentiae of Text Types – A Quantitative Model. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_77
DOI: https://doi.org/10.1007/978-3-540-78246-9_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)