Building Corpora for Stylometric Research

  • Conference paper
Text, Speech, and Dialogue (TSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Abstract

Authorship recognition, machine translation detection, pedophile identification, and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, by contrast, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts valuable meta-information.

The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and to gradually extend the existing ones.
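
To make the four kinds of meta-information concrete, the following minimal sketch shows one way a collected document could be represented and written out as word-per-line (WPL) text, the format mentioned in the notes. It is an illustration only and not the ACB implementation: the `Document` class, the `to_word_per_line` function, the `<doc>` wrapper with its attribute names, and the whitespace tokenisation are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional
from xml.sax.saxutils import quoteattr


@dataclass
class Document:
    """Hypothetical record holding the meta-information listed in the abstract."""
    author: Optional[str]       # author information (None if not detected)
    published: Optional[date]   # publication time
    heading: Optional[str]      # document heading
    text: str                   # body text between the detected document borders


def to_word_per_line(doc: Document) -> List[str]:
    """Return the document as WPL lines wrapped in an assumed <doc> element.

    Tokens are produced by plain whitespace splitting, which is a
    simplification; a real pipeline would use a proper tokeniser.
    """
    attrs = []
    if doc.author is not None:
        attrs.append("author=" + quoteattr(doc.author))
    if doc.published is not None:
        attrs.append("published=" + quoteattr(doc.published.isoformat()))
    if doc.heading is not None:
        attrs.append("heading=" + quoteattr(doc.heading))
    opening = "<doc" + "".join(" " + a for a in attrs) + ">"
    # One token per line between the opening and closing document borders.
    return [opening, *doc.text.split(), "</doc>"]


# Usage with made-up sample values.
sample = Document("Jan Novak", date(2016, 9, 12), "Ukazka", "Toto je text .")
print("\n".join(to_word_per_line(sample)))
```

Keeping the meta-information as attributes on the document border keeps each token on its own line, so the text stays easy to process while the author and date remain available for stylometry experiments.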

Notes

  1. Word-per-line (WPL) text, as defined at the University of Stuttgart in the 1990s.

  2. Data are available at http://nlp.fi.muni.cz/projekty/acb/preview.

  3. We experimentally set the size limit to 20 after analysing 5 different data sources.

  4. A more detailed description is given in the following subsections.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2015071 and by the national COST-CZ project LD15066.

Author information

Corresponding author

Correspondence to Jan Švec.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Švec, J., Rygl, J. (2016). Building Corpora for Stylometric Research. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_3

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science, Computer Science (R0)
