Genre as noise: noise in genre

Stubbe, Andrea; Ringlstetter, Christoph; Schulz, Klaus U.

doi:10.1007/s10032-007-0060-2

Abstract

Given a specific information need, documents of the wrong genre can be considered as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method to build specialized and distinctive classifiers that also work for very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test whether the error rate of a document represents a useful feature for genre classification.
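The error-rate idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the tokenizer, and the function names `error_rate` and `genre_error_rates` are assumptions chosen for illustration. The sketch counts the fraction of tokens that appear in a small error dictionary and averages that rate per genre, giving a single numeric value that could be passed to a classifier as one feature.

```python
import re
from statistics import mean
from typing import Dict, List

# Hypothetical error dictionary (misspelling -> intended word).
# These toy entries are assumptions, not taken from the paper.
ERROR_DICT: Dict[str, str] = {
    "recieve": "receive",
    "teh": "the",
    "definately": "definitely",
}

# Simplistic tokenizer: runs of ASCII letters only (an assumption).
TOKEN_RE = re.compile(r"[a-z]+")


def error_rate(text: str, error_dict: Dict[str, str] = ERROR_DICT) -> float:
    """Return the fraction of tokens listed in the error dictionary."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in error_dict)
    return hits / len(tokens)


def genre_error_rates(corpus: Dict[str, List[str]]) -> Dict[str, float]:
    """Average the per-document error rate for each genre label."""
    return {
        genre: mean(error_rate(doc) for doc in docs)
        for genre, docs in corpus.items()
        if docs  # skip genres without documents
    }


if __name__ == "__main__":
    toy_corpus = {
        "forum": ["Teh cat will recieve it", "definately not teh same"],
        "news": ["The council will receive the report"],
    }
    print(genre_error_rates(toy_corpus))
```

A real error dictionary would be far larger, and the per-document value would be only one feature among many; whether it actually improves classification is the question the paper sets out to test.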

Author information

Authors and Affiliations

CIS, University of Munich, Oettingenstr 67, 80538, Munich, Germany
Andrea Stubbe & Klaus U. Schulz
AICML, Department of Computing Science, University of Alberta, Edmonton, Canada, T6G 2E8
Christoph Ringlstetter

Corresponding author

Correspondence to Christoph Ringlstetter.

About this article

Cite this article

Stubbe, A., Ringlstetter, C. & Schulz, K.U. Genre as noise: noise in genre. IJDAR 10, 199–209 (2007). https://doi.org/10.1007/s10032-007-0060-2

Received: 20 March 2007
Revised: 16 July 2007
Accepted: 27 August 2007
Published: 30 November 2007
Issue Date: December 2007
DOI: https://doi.org/10.1007/s10032-007-0060-2
