Detecting Commas in Slovak Legal Texts

  • Conference paper
Text, Speech and Dialogue (TSD 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Abstract

This paper reports on initial experiments with automatic comma recovery in legal texts. To decide whether to insert a comma between two words, we propose using two probabilities: that of the bigram of the two words without a comma, and that of the trigram of the same words with the comma. Both probabilities come from a language model trained on sentences in which commas are labeled as separate words. In the training database, one sentence corresponds to one line. The thresholds on the bigram and trigram probabilities were determined experimentally to achieve the best balance of precision and recall. The advantage of the proposed method is its high precision (95%) at a relatively satisfactory recall (49%). For judges, as potential users of an ASR system with an automatic comma insertion function, precision is particularly important.
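The abstract does not spell out the exact decision rule, so the following is only a minimal sketch of one plausible reading. It assumes that the comma is the middle token of the trigram, that a comma is inserted when the trigram log-probability is at or above one threshold while the bigram log-probability is at or below another, and that scores are log10 values as an n-gram toolkit might report them. The names `to_training_line`, `insert_commas`, `toy_logprob`, and the threshold values are hypothetical and do not come from the paper.

```python
import re
from typing import Callable, Dict, List, Sequence, Tuple

COMMA = ","
NGram = Tuple[str, ...]


def to_training_line(sentence: str) -> str:
    """Return one training line in which every comma is its own token."""
    spaced = re.sub(r",", " , ", sentence)
    return " ".join(spaced.split())


def insert_commas(
    words: Sequence[str],
    logprob: Callable[[NGram], float],
    bigram_max: float,
    trigram_min: float,
) -> List[str]:
    """Insert a comma between adjacent words when both thresholds agree.

    Assumed rule (not taken from the paper): insert when the trigram
    (w1, ",", w2) scores at least `trigram_min` and the comma-free
    bigram (w1, w2) scores at most `bigram_max`.
    """
    out: List[str] = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 == len(words):
            break  # no following word, so no comma slot to test
        nxt = words[i + 1]
        with_comma = logprob((word, COMMA, nxt))
        without_comma = logprob((word, nxt))
        if with_comma >= trigram_min and without_comma <= bigram_max:
            out.append(COMMA)
    return out


# Toy log10 scores standing in for a trained language model (made-up values).
TOY_SCORES: Dict[NGram, float] = {
    ("rozhodol", COMMA, "že"): -0.5,
    ("rozhodol", "že"): -4.0,
    ("súd", "rozhodol"): -1.0,
}


def toy_logprob(ngram: NGram) -> float:
    """Look up a toy score; unseen n-grams get a very low log-probability."""
    return TOY_SCORES.get(ngram, -99.0)


if __name__ == "__main__":
    print(to_training_line("Súd rozhodol, že žaloba je dôvodná."))
    words = ["súd", "rozhodol", "že", "žaloba"]
    print(insert_commas(words, toy_logprob, bigram_max=-3.0, trigram_min=-2.0))
```

In a sketch like this, raising `trigram_min` or lowering `bigram_max` makes insertion more conservative, which is the direction that favors precision over recall.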

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sabo, R., Beňuš, Š. (2014). Detecting Commas in Slovak Legal Texts. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_8

  • DOI: https://doi.org/10.1007/978-3-319-10816-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10815-5

  • Online ISBN: 978-3-319-10816-2

  • eBook Packages: Computer Science, Computer Science (R0)
