Design and Development of Media-Corpus of the Kazakh Language

  • Conference paper
Computational Collective Intelligence (ICCCI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10449)

Abstract

The aim of this work was the design and development of a media-corpus of the Kazakh language. The media-corpus is hosted by al-Farabi Kazakh National University and serves linguists as an empirical basis for research on contemporary written Kazakh. The information system for the media-corpus is built on a component-based software architecture. To automate the collection, storage, and analysis of media texts in Kazakh, four components of the information system were designed and developed. The text files are saved in XML format. The analysis stage covers text normalization, stop-word removal, the addition of metadata, and morphological analysis. The morphological analyzer takes plain text as input and outputs XML, which is convenient for further processing because it is easily converted to JSON. The XML format is defined with an XML Schema Definition (XSD). Because the XSD fixes the document structure, the data can be converted into other formats, which simplifies data exchange between systems. For cases of incomplete morphological markup and homonymy, a special interface for manual markup was developed.
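
The analysis steps named in the abstract (normalization, stop-word removal, metadata, XML output, conversion to JSON) can be illustrated with a minimal sketch. This is not the authors' implementation: the element and attribute names (`text`, `w`, `source`, `lemma`, `pos`), the three-word stop list, and the function names are all assumptions made for illustration, and the morphological analysis itself is replaced by placeholder values.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical stop-word list; the corpus's real list is not given in the abstract.
STOP_WORDS = {"және", "мен", "бұл"}


def normalize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def to_xml(text: str, source: str) -> ET.Element:
    """Build a <text> element carrying metadata and one <w> element per kept token."""
    root = ET.Element("text", {"source": source})  # 'source' is assumed metadata
    for token in normalize(text):
        if token in STOP_WORDS:
            continue
        # Placeholder markup: a real analyzer would supply the lemma and tags.
        word = ET.SubElement(root, "w", {"lemma": token, "pos": "UNKNOWN"})
        word.text = token
    return root


def xml_to_json(root: ET.Element) -> str:
    """Convert the XML tree into an equivalent JSON string."""
    data = {
        "source": root.get("source"),
        "words": [{"form": w.text, **w.attrib} for w in root.findall("w")],
    }
    return json.dumps(data, ensure_ascii=False)


doc = to_xml("Бұл қала және ауыл", source="example-news-site")
xml_string = ET.tostring(doc, encoding="unicode")
json_string = xml_to_json(doc)
```

In this sketch each word keeps its annotations as XML attributes, so the JSON conversion is a direct mapping of attributes to keys. A production system would additionally validate the XML against its XSD before conversion.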

Acknowledgments

This work was supported in part by a grant from the Foundation of the Ministry of Education and Science of the Republic of Kazakhstan, “Development of intellectual high-performance information-analytical search system of processing of semi-structured data” (2015–2017).

Author information

Corresponding author

Correspondence to Madina Mansurova.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Mansurova, M., Madiyeva, G., Aubakirov, S., Yermekov, Z., Alimzhanov, Y. (2017). Design and Development of Media-Corpus of the Kazakh Language. In: Nguyen, N., Papadopoulos, G., Jędrzejowicz, P., Trawiński, B., Vossen, G. (eds) Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science, vol 10449. Springer, Cham. https://doi.org/10.1007/978-3-319-67077-5_49

  • DOI: https://doi.org/10.1007/978-3-319-67077-5_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67076-8

  • Online ISBN: 978-3-319-67077-5

  • eBook Packages: Computer Science, Computer Science (R0)
