Language Technology for Digital Linguistics: Turning the Linguistic Survey of India into a Rich Source of Linguistic Information

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Abstract

We present our work on turning the linguistic material in Grierson’s classic Linguistic Survey of India (LSI) from a printed, discursive textual description into a formally structured digital language resource: a database suitable for a broad array of linguistic investigations of the languages of South Asia. In doing so, we develop state-of-the-art language technology for automatically extracting the relevant grammatical information from the text of the LSI, as well as interactive linguistic information visualization tools for better analysis and comparison of languages based on their structural and functional features.

Notes

  1. In linguistic works, South Asia is defined as the seven countries Pakistan, India, Nepal, Bhutan, Bangladesh, Sri Lanka, and the Maldives, plus some immediately adjacent areas (e.g., Tibet).

  2. wals.info.

  3. apics.org.

  4. sails.clld.org.

  5. phoible.org.

  6. https://spraakbanken.gu.se/eng/korp-info.

  7. For instance, location data come mainly from the Glottolog: http://glottolog.org.

  8. http://dsal.uchicago.edu/books/lsi/ (page images, no text search available).

  9. http://www.geonames.org/.

  10. A Tibeto-Burman language spoken in southern Tedim township, Chin State, Burma.

  11. sails.clld.org.

  12. http://clld.org/.

Acknowledgments

The work presented here was funded by the Swedish Research Council as part of the project “South Asia as a linguistic area? Exploring big-data methods in areal and genetic linguistics” (2015–2019, contract no. 421-2014-969), and by the University of Gothenburg as part of its funding of the Språkbanken language technology and digital humanities infrastructure.

Author information

Corresponding author

Correspondence to Shafqat Mumtaz Virk.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Borin, L., Virk, S.M., Saxena, A. (2018). Language Technology for Digital Linguistics: Turning the Linguistic Survey of India into a Rich Source of Linguistic Information. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_42

  • DOI: https://doi.org/10.1007/978-3-319-77113-7_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77112-0

  • Online ISBN: 978-3-319-77113-7

  • eBook Packages: Computer Science, Computer Science (R0)
