Detecting Commas in Slovak Legal Texts

Sabo, Róbert; Beňuš, Štefan

doi:10.1007/978-3-319-10816-2_8


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

This paper reports on initial experiments with automatic comma recovery in Slovak legal texts. To decide whether to insert a comma between two words, we propose using two probabilities: that of the bigram formed by the two words without a comma, and that of the trigram formed by the same words with a comma between them. Both probabilities come from a language model trained on sentences in which commas are labeled as separate words; in the training database, one sentence corresponds to one line. The thresholds on the bigram and trigram probabilities were determined experimentally to achieve the best balance of precision and recall. The advantage of the proposed method is its high precision (95%) at a relatively satisfactory recall (49%). Precision is particularly important for judges, the potential users of an automatic speech recognition (ASR) system with an automatic comma insertion function.
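The decision rule described in the abstract can be illustrated with a small sketch. The abstract does not state the exact rule, the smoothing, or the threshold values, so everything below is an assumption made for illustration: the counts are unsmoothed maximum-likelihood estimates rather than a model built with a language modeling toolkit, a comma is inserted only when the bigram probability without the comma is at or below `max_bigram` and the trigram probability with the comma is at or above `min_trigram`, and the function names, the threshold values, and the English toy corpus are all hypothetical.

```python
from collections import Counter

COMMA = ","

def tokenize(line):
    """Split one sentence (one line) into tokens; each comma becomes its own token."""
    return line.replace(",", " , ").split()

def train_counts(lines):
    """Collect unigram, bigram and trigram counts from one-sentence-per-line text."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        toks = tokenize(line)
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    return uni, bi, tri

def p_bigram(w1, w2, uni, bi):
    """Unsmoothed estimate of P(w2 | w1), i.e. the two words with no comma."""
    return bi[(w1, w2)] / uni[w1] if uni[w1] else 0.0

def p_trigram_with_comma(w1, w2, bi, tri):
    """Unsmoothed estimate of P(w2 | w1 ,), i.e. the two words with a comma."""
    context = bi[(w1, COMMA)]
    return tri[(w1, COMMA, w2)] / context if context else 0.0

def insert_commas(words, uni, bi, tri, max_bigram=0.1, min_trigram=0.9):
    """Insert a comma between two words only if both thresholds agree.

    The thresholds are placeholders; the paper tunes its own experimentally.
    """
    out = []
    for w1, w2 in zip(words, words[1:]):
        out.append(w1)
        if (p_bigram(w1, w2, uni, bi) <= max_bigram
                and p_trigram_with_comma(w1, w2, bi, tri) >= min_trigram):
            out.append(COMMA)
    out.extend(words[-1:])
    return out

def comma_positions(tokens):
    """Return the set of word counts after which a comma occurs."""
    positions, n_words = set(), 0
    for tok in tokens:
        if tok == COMMA:
            positions.add(n_words)
        else:
            n_words += 1
    return positions

def precision_recall(predicted, gold):
    """Precision and recall of predicted comma positions against gold ones."""
    p, g = comma_positions(predicted), comma_positions(gold)
    tp = len(p & g)
    return (tp / len(p) if p else 0.0, tp / len(g) if g else 0.0)

# Toy training data (hypothetical, in English for readability).
corpus = [
    "the court ruled , that the claim is valid",
    "the witness said , that the claim is false",
    "the court ruled in favour",
]
uni, bi, tri = train_counts(corpus)

gold = tokenize("the court ruled , that the claim is valid , he said")
words = [t for t in gold if t != COMMA]
predicted = insert_commas(words, uni, bi, tri)
print(" ".join(predicted))  # the court ruled , that the claim is valid he said
print(precision_recall(predicted, gold))  # (1.0, 0.5)
```

Requiring both conditions makes the rule conservative: a comma is proposed only when the model has rarely or never seen the two words adjacent and has consistently seen them separated by a comma. In this toy run that yields one correct comma and one missed comma, the same kind of trade-off (high precision, lower recall) that the abstract reports.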



Author information

Authors and Affiliations

Institute of Informatics of Slovak Academy of Sciences, Bratislava, Slovakia
Róbert Sabo & Štefan Beňuš
Constantine the Philosopher University in Nitra, Nitra, Slovakia
Štefan Beňuš


Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 6a, 60200, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala


About this paper

Cite this paper

Sabo, R., Beňuš, Š. (2014). Detecting Commas in Slovak Legal Texts. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_8


DOI: https://doi.org/10.1007/978-3-319-10816-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
