Abstract

Given a specific information need, documents of the wrong genre can be regarded as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since documents of certain genres often contain many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method for building specialized and distinctive classifiers that also work with very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test whether the error rate of a document is a useful feature for genre classification.
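
The idea of turning a document's error rate into a classification feature can be illustrated with a minimal sketch. It assumes an error dictionary is simply a mapping from misspelled tokens to their intended forms. The entries, the tokenizer, and the names `ERROR_DICT`, `error_rate`, and `errors_per_thousand` are hypothetical and not taken from the paper, whose actual dictionaries and procedure are not described in this abstract.

```python
import re

# Hypothetical error dictionary: misspelled token -> intended form.
# These entries are illustrative only; they are not from the paper.
ERROR_DICT = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
}


def error_rate(text, error_dict=ERROR_DICT):
    """Return the fraction of tokens that are listed in the error dictionary."""
    # Simplistic tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in error_dict)
    return hits / len(tokens)


def errors_per_thousand(text, error_dict=ERROR_DICT):
    """Scale the error rate to errors per 1000 tokens for use as a numeric feature."""
    return 1000.0 * error_rate(text, error_dict)
```

A value such as `errors_per_thousand(document_text)` could then be appended to a document's feature vector alongside other genre features. Whether it actually improves classification is the empirical question the paper addresses.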

Author information

Corresponding author

Correspondence to Christoph Ringlstetter.

About this article

Cite this article

Stubbe, A., Ringlstetter, C. & Schulz, K.U. Genre as noise: noise in genre. IJDAR 10, 199–209 (2007). https://doi.org/10.1007/s10032-007-0060-2

  • DOI: https://doi.org/10.1007/s10032-007-0060-2
