Text Punctuation: An Inter-annotator Agreement Study

Boháč, Marek; Rott, Michal; Kovář, Vojtěch

doi:10.1007/978-3-319-64206-2_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Spoken language is hard to annotate accurately. One of the most ambiguous tasks is inserting punctuation marks into a transcription of spoken language. The punctuation used often depends on how the annotators understand the content of the transcription. This understanding may differ, because spoken language often lacks the clear structure inherent to written language, owing to the spontaneity of utterances or to skipping between ideas.

We therefore suspect that inserting commas into a spoken language transcription is a highly ambiguous task with low inter-annotator agreement (IAA). Low IAA also means that using Gold Truth (GT) annotations to evaluate automatic algorithms is questionable, as already discussed in [7, 8].
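To make the notion of agreement on comma placement concrete, the following Python sketch computes Cohen's kappa over comma slots (the gaps between consecutive words). The choice of kappa as the IAA measure, the whitespace tokenization, and the function names are assumptions made for illustration; the abstract does not state which measure the study uses.

```python
from itertools import combinations


def comma_slots(text):
    """Return one boolean per inter-word gap: True if a comma follows the word."""
    tokens = text.split()
    # The position after the last word is not treated as a comma slot.
    return [tok.endswith(",") for tok in tokens[:-1]]


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of binary labels."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of slots where annotator A placed a comma
    p_b = sum(b) / n  # share of slots where annotator B placed a comma
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both annotators used the same single label everywhere
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (needs at least two annotators)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

All annotators must label the same word sequence, so that their slot lists have equal length and line up position by position.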

In this paper we analyze the IAA within a group of annotators and propose methods to increase it. We also propose and evaluate a reformulation of classical GT annotations for cases where multiple annotations are available.
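The abstract does not describe the proposed reformulation, so the Python sketch below shows only one possible way to evaluate a system against several annotations at once; it is an assumption for illustration and not the authors' method. A slot counts as required when every annotator placed a comma there, as forbidden when none did, and as optional otherwise. The function name and the three-way split are hypothetical.

```python
def relaxed_scores(system, annotations):
    """Precision and recall of `system` against several annotators.

    Assumed scheme (not taken from the paper): a slot is *required* if every
    annotator placed a comma there, *forbidden* if none did, and *optional*
    otherwise. Optional slots are ignored by both measures.
    """
    if any(len(ann) != len(system) for ann in annotations):
        raise ValueError("all sequences must have the same length")
    tp = fp = fn = 0
    # zip(*annotations) yields, for each slot, the votes of all annotators.
    for predicted, votes in zip(system, zip(*annotations)):
        if all(votes):            # required slot
            if predicted:
                tp += 1
            else:
                fn += 1
        elif not any(votes):      # forbidden slot
            if predicted:
                fp += 1
        # optional slot: annotators disagree, so the system is not penalized
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

With three annotators who agree only on the first and last slot, a system output of `[True, True, False, True]` is scored on those two slots alone: one correct comma and one comma in a forbidden slot.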

Notes

1. All data used are accessible at http://nlp.ite.tul.cz/punctuation.

References

1. Boháč, M., Blavka, K., Kuchařová, M., Škodová, S.: Post-processing of the recognized speech for web presentation of large audio archive. In: 2012 35th International Conference on Telecommunications and Signal Processing (TSP), pp. 441–445, July 2012
2. Boháč, M., Nouza, J., Blavka, K.: Investigation on most frequent errors in large-scale speech recognition applications. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS, vol. 7499, pp. 520–527. Springer, Heidelberg (2012). doi:10.1007/978-3-642-32790-2_63
3. Kolář, J., Švec, J., Psutka, J.: Automatic punctuation annotation in Czech broadcast news speech. In: 9th Conference Speech and Computer (2004)
4. Kovář, V.: Partial grammar checking for Czech using the SET parser. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2014. LNCS (LNAI), vol. 8655, pp. 308–314. Springer, Cham (2014). doi:10.1007/978-3-319-10816-2_38
5. Kovář, V., Horák, A., Jakubíček, M.: Syntactic analysis as pattern matching: the SET parsing system. In: Proceedings of 4th Language and Technology Conference, Wydawnictwo Poznańskie, Poznań, Poland, pp. 978–983 (2009)
6. Kovář, V., Machura, J., Zemková, K., Rott, M.: Evaluation and improvements in punctuation detection for Czech. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS, vol. 9924, pp. 287–294. Springer, Cham (2016). doi:10.1007/978-3-319-45510-5_33
7. Kovář, V.: Evaluating natural language processing tasks with low inter-annotator agreement: the case of corpus applications. In: Recent Advances in Slavonic Natural Language Processing, RASLAN 2016, pp. 127–134 (2016)
8. Kovář, V., Jakubíček, M., Horák, A.: On evaluation of natural language processing tasks - is gold standard evaluation methodology a good solution? In: Proceedings of the ICAART 2016, vol. 2, pp. 540–545. SCITEPRESS (2016)
9. Mihajlik, P., Fegyó, T., Németh, B., Tüske, Z., Trón, V.: Towards automatic transcription of large spoken archives in agglutinating languages – Hungarian ASR for the MALACH Project. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS, vol. 4629, pp. 342–349. Springer, Heidelberg (2007). doi:10.1007/978-3-540-74628-7_45
10. Nouza, J., Červa, P., Ždánský, J., et al.: Speech-to-text technology to transcribe and disclose 100,000+ hours of bilingual documents from historical Czech and Czechoslovak radio archive. In: INTERSPEECH 2014, pp. 964–968 (2014)
11. Petkevič, V.: Kontrola české gramatiky (český grammar checker). Studie z aplikované lingvistiky-Studies in Applied Linguistics 5(2), 48–66 (2014)

Acknowledgment

We are very grateful to the students who did the annotation work. This work was supported by the Student’s Grant Scheme at the Technical University of Liberec (SGS 2016), by the Ministry of Education of CR within the LINDAT-Clarin project LM2015071, and by the Grant Agency of CR within the project 15-13277S.

Author information

Authors and Affiliations

Institute of Information Technology and Electronics, Technical University of Liberec, Studentská 2, Liberec, Czech Republic
Marek Boháč & Michal Rott
Natural Language Processing Centre, Masaryk University, Botanická 68a, Brno, Czech Republic
Vojtěch Kovář

Corresponding author

Correspondence to Michal Rott.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek

Cite this paper

Boháč, M., Rott, M., Kovář, V. (2017). Text Punctuation: An Inter-annotator Agreement Study. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_14

DOI: https://doi.org/10.1007/978-3-319-64206-2_14
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)