Can Statistical Tests Be Used for Feature Selection in Diachronic Text Classification?

  • Conference paper
Statistical Language and Speech Processing (SLSP 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7978)

Abstract

Despite the large number of diachronic studies in various languages, the methodology for investigating language change has not evolved much in the last fifty years. Following trends in other fields, in this paper we argue for the adoption of a machine learning approach in diachronic studies, which could offer a more efficient analysis of a large number of features and easier comparison of results across genres, languages and language varieties. We suggest using statistical tests as an initial feature selection step in an approach that takes the F-measure of the classification algorithms as a measure of the extent of diachronic change. Furthermore, we compare classification performance after feature selection by statistical tests with performance after feature selection by the CfsSubsetEval attribute selection algorithm. The experiments were conducted on the British part of the ‘Brown family’ of corpora, the largest existing diachronic corpora of 20th century written English, using 23 stylistic features. The results demonstrated that using statistical tests for feature selection can significantly increase the accuracy of the classification algorithms.
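The pipeline outlined in the abstract (a statistical test to filter features, then a classifier scored by F-measure) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' setup: the abstract does not name the statistical test or the classifiers, so Welch's t-statistic and a nearest-centroid classifier are stand-ins, the data are synthetic, and every function name, the threshold of 2.0, and the 70/30 split are hypothetical.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (stand-in test)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_features(X, y, threshold=2.0):
    """Indices of features whose |t| between periods 0 and 1 exceeds threshold."""
    selected = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        if abs(welch_t(a, b)) > threshold:
            selected.append(j)
    return selected


def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Assign each test text to the period with the closest feature centroid."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    preds = []
    for row in X_test:
        dists = {
            label: sum((row[j] - c) ** 2 for j, c in zip(features, cen))
            for label, cen in centroids.items()
        }
        preds.append(min(dists, key=dists.get))
    return preds


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def demo(seed=0):
    """Synthetic data: feature 0 shifts between periods, features 1-3 are noise."""
    rng = random.Random(seed)
    X_train, y_train, X_test, y_test = [], [], [], []
    for period in (0, 1):
        rows = [
            [rng.gauss(4.0 * period, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
            for _ in range(100)
        ]
        X_train += rows[:70]
        y_train += [period] * 70
        X_test += rows[70:]
        y_test += [period] * 30
    # Select features on the training split only, to avoid leaking test data.
    selected = select_features(X_train, y_train)
    preds = nearest_centroid_predict(X_train, y_train, X_test, selected)
    return selected, f_measure(y_test, preds)


if __name__ == "__main__":
    selected, f = demo()
    print("selected features:", selected)
    print("F-measure:", round(f, 3))
```

In this reading, a higher F-measure means the two periods are easier to tell apart on the selected features, which is how the abstract proposes to quantify the extent of diachronic change.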

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Štajner, S., Evans, R. (2013). Can Statistical Tests Be Used for Feature Selection in Diachronic Text Classification?. In: Dediu, AH., Martín-Vide, C., Mitkov, R., Truthe, B. (eds) Statistical Language and Speech Processing. SLSP 2013. Lecture Notes in Computer Science, vol 7978. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39593-2_24

  • DOI: https://doi.org/10.1007/978-3-642-39593-2_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39592-5

  • Online ISBN: 978-3-642-39593-2

  • eBook Packages: Computer Science (R0)