Word-Order Analysis Based Upon Treebank Data

Kuboň, Vladislav; Lopatková, Markéta

doi:10.1007/978-3-319-27060-9_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9413)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

The paper describes an experiment that attempts to quantify the word-order properties of three Indo-European languages (Czech, English and Farsi). The investigation is driven by the effort to find an objective way to compare natural languages with respect to their degree of word-order freedom. Unlike similar studies, which concentrate on either a purely linguistic or a purely statistical approach, our experiment tries to combine both: the observations are verified against large samples of sentences from available treebanks, and, at the same time, we exploit the ability of our tools to analyze selected important phenomena (e.g., the differences between the word order of a main clause and that of a subordinate clause) more deeply.

The quantitative results of our research are collected from the syntactically annotated treebanks available for all three languages. Thanks to the HamleDT project, all of the treebanks can be searched in a uniform way by means of the universal query tool PML-TQ. Demonstrating the research potential of language resources that are unified to a certain extent is also a secondary goal of this paper.
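
To make the idea of quantifying word order from dependency annotation more concrete, the following is a minimal sketch in Python. It is not the authors' method and does not use PML-TQ or the HamleDT data. It assumes a hypothetical in-memory token format and the hypothetical dependency labels "Pred", "Sb" and "Obj" (modelled loosely on Prague-style annotation), and it simply counts the relative order of subject, verb and object in clauses that contain all three.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not the HamleDT or PML-TQ data model."""
    idx: int     # 1-based position in the sentence
    form: str    # surface word form
    head: int    # idx of the governing token, 0 for the root
    deprel: str  # assumed labels: "Pred" (verb), "Sb" (subject), "Obj" (object)


def clause_orders(tokens, verb="Pred", subj="Sb", obj="Obj"):
    """Return one order string (e.g. 'SVO') per verb that directly
    governs both a subject and an object."""
    orders = []
    for v in tokens:
        if v.deprel != verb:
            continue
        subjects = [t for t in tokens if t.head == v.idx and t.deprel == subj]
        objects = [t for t in tokens if t.head == v.idx and t.deprel == obj]
        if not subjects or not objects:
            continue  # skip clauses without an overt subject or object
        # Sort the three elements by surface position and read off the tags.
        items = sorted([(subjects[0].idx, "S"), (v.idx, "V"), (objects[0].idx, "O")])
        orders.append("".join(tag for _, tag in items))
    return orders


def order_distribution(treebank):
    """Relative frequency of each observed order over a list of sentences."""
    counts = Counter()
    for sentence in treebank:
        counts.update(clause_orders(sentence))
    total = sum(counts.values())
    return {order: n / total for order, n in counts.items()} if total else {}


# Toy sentences, invented for illustration only.
s1 = [Token(1, "Peter", 2, "Sb"), Token(2, "reads", 0, "Pred"), Token(3, "books", 2, "Obj")]
s2 = [Token(1, "Books", 2, "Obj"), Token(2, "reads", 0, "Pred"), Token(3, "Peter", 2, "Sb")]
s3 = [Token(1, "Peter", 2, "Sb"), Token(2, "sleeps", 0, "Pred")]
s4 = [Token(1, "Mary", 2, "Sb"), Token(2, "likes", 0, "Pred"), Token(3, "tea", 2, "Obj")]

distribution = order_distribution([s1, s2, s3, s4])
```

On real treebank data, a flatter distribution over the six possible orders would suggest greater word-order freedom. A count of this kind ignores coordination, analytical verb forms and clause type, all of which would need separate treatment.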

Notes

1. https://lindat.mff.cuni.cz/
2. http://ufal.mff.cuni.cz/hamledt
3. http://ufal.mff.cuni.cz/pdt3.0
4. https://www.cis.upenn.edu/~treebank
5. http://dadegan.ir/en/perdt
6. https://lindat.mff.cuni.cz/services/pmltq/#!/treebanks
7. https://lindat.mff.cuni.cz/services/pmltq/pdt30/
8. https://lindat.mff.cuni.cz/services/pmltq/hamledt_en/
9. The number of subject-less subordinate clauses is unduly high for the same reasons as for main clauses: the annotation scheme for coordination and analytical verb forms.
10. https://lindat.mff.cuni.cz/services/pmltq/hamledt_fa/

References

Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0 (2013)
Dryer, M.S., Haspelmath, M.: The World Atlas of Language Structures Online. Harcourt, Brace and company, Leipzig (2005–2013). http://wals.info, Accessed on 28 June 2015
Futrell, R., Mahowald, K., Gibson, E.: Quantifying word order freedom in dependency corpora. In: Proceedings of the International Conference on Dependency Linguistics (Depling 2015), Uppsala University, Uppsala, Sweden (2015)
Holan, T., Kuboň, V., Oliva, K., Plátek, M.: On complexity of word order. Les grammaires de dépendance - Traitement automatique des langues (TAL) 41(1), 273–300 (2000)
Kuboň, V., Lopatková, M., Plátek, M.: On formalization of word order properties. In: Gelbukh, A. (ed.) CICLing 2012, Part I. LNCS, vol. 7181, pp. 130–141. Springer, Heidelberg (2012)
Lopatková, M., Homola, P., Klyueva, N.: Annotation of sentence structure: capturing the relationship between clauses in Czech sentences. Lang. Res. Eval. 46(1), 25–36 (2012)
Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330 (1993)
Oepen, S., Netter, K., Klein, J.: TSNLP - Test suites for natural language processing. CSLI Lecture Notes (1998)
Pajas, P., Štěpánek, J.: System for querying syntactically annotated corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, pp. 33–36. Association for Computational Linguistics, Suntec, Singapore, August 2009
Rosa, R., Žabokrtský, Z.: \(KL_{cpos^3}\) - a Language Similarity Measure for Delexicalized Parser Transfer (2015)
Rosa, R., Žabokrtský, Z.: MSTParser Model interpolation for multi-source delexicalized transfer. In: Proceedings of the 14th International Conference on Parsing Technologies, pp. 71–75. Association for Computational Linguistics, Stroudsburg (2015)
Sapir, E.: Language: An Introduction to the Study of Speech. Harcourt Brace and Company, New York (1921). http://www.gutenberg.org/files/12629/12629-h/12629-h.htm
Saussure, F.: Course in General Linguistics. Open Court, La Salle (1983). (prepared by C. Bally and A. Sechehaye, translated by R. Harris)
Skalička, V.: Vývoj jazyka. Soubor statí. Státní pedagogické nakladatelství, Praha (1960)
Čermák, F.: Jazyk a jazykověda. Pražská imaginace, Praha (1994)
Zeman, D., Dušek, O., Mareček, D., Popel, M., Ramasamy, L., Štěpánek, J., Žabokrtský, Z., Hajič, J.: HamleDT: harmonized multi-language dependency treebank. Lang. Res. Eval. 48(4), 601–637 (2014)
Zeman, D., Resnik, P.: Cross-language parser adaptation between related languages. In: IJCNLP 2008 Workshop on NLP for Less Privileged Languages, pp. 35–42. Asian Federation of Natural Language Processing, Hyderabad (2008)

Acknowledgments

This work used language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

Author information

Authors and Affiliations

Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University in Prague, Prague, Czech Republic
Vladislav Kuboň & Markéta Lopatková

Authors

Vladislav Kuboň
Markéta Lopatková

Corresponding authors

Correspondence to Vladislav Kuboň or Markéta Lopatková.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
Grigori Sidorov
Facultad de ciencias, Universidad Autónoma Nacional, México, Distrito Federal, Mexico
Sofía N. Galicia-Haro

About this paper

Cite this paper

Kuboň, V., Lopatková, M. (2015). Word-Order Analysis Based Upon Treebank Data. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_4

DOI: https://doi.org/10.1007/978-3-319-27060-9_4
Published: 30 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27059-3
Online ISBN: 978-3-319-27060-9
eBook Packages: Computer Science, Computer Science (R0)