Abstract
Authorship recognition, machine translation detection, pedophile identification, and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts valuable meta-information.
The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
Notes
1. Word-per-line (WPL) text, as defined at the University of Stuttgart in the 1990s.
2. Data are available at http://nlp.fi.muni.cz/projekty/acb/preview.
3. We experimentally set the size limit to 20 after analysing 5 different data sources.
4. A more detailed description is given in the following subsections.
Acknowledgments
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2015071 and by the national COST-CZ project LD15066.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Švec, J., Rygl, J. (2016). Building Corpora for Stylometric Research. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_3
DOI: https://doi.org/10.1007/978-3-319-45510-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5
eBook Packages: Computer Science, Computer Science (R0)