Language Technology for Digital Linguistics: Turning the Linguistic Survey of India into a Rich Source of Linguistic Information

Borin, Lars; Virk, Shafqat Mumtaz; Saxena, Anju

doi:10.1007/978-3-319-77113-7_42

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

We present our work aiming at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) from a printed discursive textual description into a formally structured digital language resource, a database suitable for a broad array of linguistic investigations of the languages of South Asia. While doing so, we develop state-of-the-art language technology for automatically extracting the relevant grammatical information from the text of the LSI, and interactive linguistic information visualization tools for better analysis and comparisons of languages based on their structural and functional features.

Notes

1. In linguistic works, South Asia is defined as the seven countries Pakistan, India, Nepal, Bhutan, Bangladesh, Sri Lanka, and the Maldives, plus some immediately adjacent areas (e.g., Tibet).
2. wals.info.
3. apics.org.
4. sails.clld.org.
5. phoible.org.
6. https://spraakbanken.gu.se/eng/korp-info.
7. For instance, location data come mainly from Glottolog: http://glottolog.org.
8. http://dsal.uchicago.edu/books/lsi/ (page images; no text search available).
9. http://www.geonames.org/.
10. A Tibeto-Burman language spoken in southern Tedim township, Chin State, Burma.
11. sails.clld.org.
12. http://clld.org/.

References

Borin, L., Forsberg, M., Roxendal, J.: Korp – the corpus infrastructure of Språkbanken. In: Proceedings of LREC 2012, pp. 474–478. ELRA, Istanbul (2012). http://www.lrec-conf.org/proceedings/lrec2012/pdf/248_Paper.pdf
Broadwell, P.M., Tangherlini, T.R.: TrollFinder: geo-semantic exploration of a very large corpus of Danish folklore. In: The Third Workshop on Computational Models of Narrative, pp. 50–57. ELRA, Istanbul (2012)
Chuang, J., Ramage, D., Manning, C.D., Heer, J.: Interpretation and trust: designing model-driven visualizations for text analysis. In: ACM Human Factors in Computing Systems (CHI) (2012). http://vis.stanford.edu/papers/designing-model-driven-vis
Dryer, M.S., Haspelmath, M. (eds.): WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig (2013). http://wals.info/
Ebert, K.: South Asia as a linguistic area. In: Brown, K. (ed.) Encyclopedia of Languages and Linguistics, 2nd edn. Elsevier, Oxford (2006)
Evert, S., Hardie, A.: Twenty-first century corpus workbench: updating a query architecture for the new millennium. In: Proceedings of the Corpus Linguistics 2011 Conference. University of Birmingham, Birmingham (2011)
Grierson, G.A.: A Linguistic Survey of India, vol. I-XI. Government of India, Central Publication Branch, Calcutta (1903–1927)
Hammarström, H., Forkel, R., Haspelmath, M., Bank, S.: Glottolog 2.7. Max Planck Institute for the Science of Human History, Jena (2016). http://glottolog.org
Havre, S., Hetzler, B., Nowell, L.: ThemeRiver: visualizing theme changes over time. In: IEEE Symposium on Information Visualization 2000 (InfoVis 2000), pp. 115–123. IEEE, Salt Lake City (2000)
Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of ACL 2003, pp. 423–430. ACL, Sapporo (2003). http://dx.doi.org/10.3115/1075096.1075150
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL System Demonstrations, pp. 55–60. ACL, Portland (2014). http://www.aclweb.org/anthology/P/P14/P14-5010
Masica, C.P.: Defining a Linguistic Area: South Asia. Chicago University Press, Chicago (1976)
Michaelis, S.M., Maurer, P., Haspelmath, M., Huber, M. (eds.): APiCS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig (2013). http://apics-online.info/
Recasens, M., Marneffe, M.C.D., Potts, C.: The life and death of discourse entities: identifying singleton mentions. In: Proceedings of NAACL-HLT 2013. ACL, Atlanta (2013)
Schilit, B.N., Kolak, O.: Exploring a digital library through key ideas. In: Proceedings of JCDL 2008, pp. 177–186. ACM, Pittsburgh (2008)
Smith, D.A.: Detecting and browsing events in unstructured text. In: SIGIR 2002. ACM, Tampere (2002)
Sun, G.D., Wu, Y.C., Liang, R.H., Liu, S.X.: A survey of visual analytics techniques and applications: state-of-the-art research and future challenges. J. Comput. Sci. Technol. 28(5), 852–867 (2013). http://dx.doi.org/10.1007/s11390-013-1383-8
Versley, Y., Moschitti, A., Poesio, M., Yang, X.: Coreference systems based on kernels methods. In: Proceedings of COLING 2008. ACL, Manchester (2008)

Acknowledgments

The work presented here was funded by the Swedish Research Council as part of the project South Asia as a linguistic area? Exploring big-data methods in areal and genetic linguistics (2015–2019, contract no. 421-2014-969), and by the University of Gothenburg as part of its funding of the Språkbanken language technology and digital humanities infrastructure.

Author information

Authors and Affiliations

Språkbanken, University of Gothenburg, Gothenburg, Sweden
Lars Borin & Shafqat Mumtaz Virk
Linguistics and Philology, Uppsala University, Uppsala, Sweden
Anju Saxena

Corresponding author

Correspondence to Shafqat Mumtaz Virk.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Borin, L., Virk, S.M., Saxena, A. (2018). Language Technology for Digital Linguistics: Turning the Linguistic Survey of India into a Rich Source of Linguistic Information. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_42

DOI: https://doi.org/10.1007/978-3-319-77113-7_42
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77112-0
Online ISBN: 978-3-319-77113-7
eBook Packages: Computer Science, Computer Science (R0)
