Abstract
This paper describes experiments in applying statistical classification algorithms to the detection of converbs, rare word forms found in historical texts in New Indo-Aryan languages. The digitized texts were first manually tagged with the help of a custom-made tool called IA Tagger, which enables semi-automatic tagging of the texts. One of the features of the system is the generation of statistical data on occurrences of words and phrases in various contexts, which helps perform historical linguistic analysis at the levels of morphosyntax, semantics and pragmatics. The experiments carried out on data annotated with IA Tagger involved the training of multi-class and binary POS classifiers.
Notes
1. This paper is part of a research project funded by the Polish National Centre for Science, Grant 2013/10/M/HS2/00553.
2. Optical recognition of Rajasthani texts was supported by a Hindi OCR program [11].
3. We accept here Haspelmath’s [9, 3] definition of the converb: “a nonfinite verb form whose main function is to mark adverbial subordination”.
4. The following is a brief characterization of an absolute construction: “The head noun and its participle form a special type of a subordinate clause which could express an event contemporary with or anterior to that in main clause.” [4].
References
Aissen, J.: Differential object marking: iconicity vs. economy. Nat. Lang. Linguist. Theory 21, 435–483 (2003)
Bhanavat, N., Kamal, L.: Rajasthani gadya: vikas aur prakash. Shriram Mehra and Company, Agra (1997–1998)
Bickel, B.: Capturing particulars and universals in clause linkage: a multivariate analysis. In: Bril, I. (ed.) Clause Linking and Clause Hierarchy: Syntax and Pragmatics, No. 121 in Studies in Language Companion Series, pp. 51–102. John Benjamins, Amsterdam (2010). https://doi.org/10.5167/uzh-48989
Bubeník, V.: A historical syntax of late Middle Indo-Aryan (Apabhramśa). Amsterdam studies in the theory and history of linguistic science: Current issues in linguistic theory. John Benjamins, Amsterdam (1998). https://books.google.pl/books?id=abJjAAAAMAAJ
Daumé III, H.: Notes on CG and LM-BFGS optimization of logistic regression, August 2004
Davison, A.: Syntactic and semantic indeterminacy resolved: a mostly pragmatic analysis for the Hindi conjunctive participle. In: Peter, C. (ed.) Radical pragmatics, pp. 101–128. Academic Press, New York (1981)
Dixon, R.M.: Ergativity. Cambridge Studies in Linguistics. Cambridge University Press, Cambridge (1994). https://books.google.pl/books?id=fKfSAu6v5LYC
Hardie, A.: Automated part-of-speech analysis of Urdu: conceptual and technical issues. Contemporary Issues in Nepalese Linguistics, pp. 48–72 (2005)
Haspelmath, M.: The converb as a cross-linguistically valid category. In: Haspelmath, M., König, E. (eds.) Converbs in cross-linguistic perspective: structure and meaning of adverbial verb forms - adverbial participles, gerunds, pp. 1–55. No. 13 in Empirical approaches to language typology, Mouton de Gruyter, Berlin (1995)
Hellwig, O.: A stochastic lexical and POS tagger for Sanskrit. In: Huet, G., Kulkarni, A., Scharf, P. (eds.) ISCLS 2007-2008. LNCS (LNAI), vol. 5402, pp. 266–277. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00155-0_11
Hellwig, O.: ind.senz - OCR software for Hindi, Marathi, Tamil, and Sanskrit (2015). http://www.indsenz.com
Kachru, Y.: On the syntax, semantics and pragmatics of the conjunctive participle in Hindi-Urdu. Stud. Linguist. Sci. 11(2), 35–49 (1981)
Khokhlova, L.: Ergativity attrition in the history of western New Indo-Aryan languages (Punjabi, Gujarati and Rajasthani). The Yearbook of South Asian Languages and Linguistics, pp. 159–184 (2001)
Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. In: Advances in Neural Information Processing Systems, pp. 905–912 (2009)
Loper, E., Bird, S.: NLTK: The natural language toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP 2002, vol. 1, pp. 63–70. Association for Computational Linguistics, Stroudsburg, PA, USA (2002). https://doi.org/10.3115/1118108.1118117
Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop, pp. 249–252 (1999)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
Peterson, J.: The Nepali converbs: a holistic approach. In: Singh, R., Dasgupta, P. (eds.) The Yearbook of South Asian Languages and Linguistics (2002), pp. 93–134. Walter de Gruyter, Berlin (2002)
Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 16 April 1996
Tou, J.T.: Information systems. In: von Brauer, W. (ed.) GI 1973. LNCS, vol. 1, pp. 489–507. Springer, Heidelberg (1973). https://doi.org/10.1007/3-540-06473-7_52
Stroński, K., Tokaj, J.: The diachrony of cosubordination - lessons from Indo-Aryan. In: Proceedings of the 31st South Asian Languages Analysis Roundtable (SALA-31), pp. 59–62 (2015). Extended abstract. http://ucrel.lancs.ac.uk/sala-31/doc/ABSTRACTBOOK-maincontent.pdf
Subbārāo, K.: South Asian languages. A Syntactic Typology. Cambridge University Press, New York (2012). https://books.google.pl/books?id=ZCfiGYvpLOQC
Tikkanen, B.: The Sanskrit gerund: a synchronic, diachronic, and typological analysis. Studia Orientalia, Finnish Oriental Society (1987). https://books.google.pl/books?id=XTkqAQAAIAAJ
Tikkanen, B.: Burushaski converbs in their South and Central Asian areal context. In: Haspelmath, M., König, E. (eds.) Converbs in cross-linguistic perspective: structure and meaning of adverbial verb forms - adverbial participles, gerunds, pp. 487–528. No. 13 in Empirical approaches to language typology, Mouton de Gruyter, Berlin (1995)
Tokaj, J.: A comparative study of participles, converbs and absolute constructions in Hindi and medieval Rajasthani. Lingua Posnaniensis, pp. 105–120 (2016)
Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL, pp. 252–259 (2003)
Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong, pp. 63–70, October 2000
Van Valin, R.D., LaPolla, R.J.: Syntax: Structure, Meaning, and Function. Cambridge University Press, Cambridge (1997)
Van Valin, R.D.: A synopsis of role and reference grammar. Advances in Role and Reference Grammar, pp. 1–164 (1993)
Van Valin, R.D.: Exploring the Syntax-Semantics Interface. Cambridge University Press, Cambridge (2005)
Wallace, W.D.: Object-marking in the history of Nepali: a case of syntactic diffusion. Stud. Linguist. Sci. 11(2), 107–128 (1981)
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Jaworski, R., Jassem, K., Stroński, K. (2018). Binary Classification Algorithms for the Detection of Sparse Word Forms in New Indo-Aryan Languages. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science(), vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_10
DOI: https://doi.org/10.1007/978-3-319-93782-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science, Computer Science (R0)