Language Geometry Using Random Indexing

Joshi, Aditya; Halseth, Johan T.; Kanerva, Pentti

doi:10.1007/978-3-319-52289-0_21

Conference paper
First Online: 26 January 2017

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10106)

Abstract

Random Indexing is a simple implementation of Random Projections with a wide range of applications. It can solve a variety of problems with good accuracy without introducing much complexity. Here we demonstrate its use for identifying the language of text samples, based on a novel method of encoding letter N-grams into high-dimensional Language Vectors. Further, we show that the method is easily implemented and requires little computational power and space. As proof of the method’s statistical validity, we show its success in a language-recognition task. On a difficult data set of 21,000 short sentences from 21 different languages, we achieve 97.4% accuracy, comparable to state-of-the-art methods.
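
The abstract does not spell out the encoding, so the following is only a minimal sketch of how letter N-grams might be folded into a single high-dimensional vector per language and then compared. It assumes random bipolar (+1/-1) letter vectors, a cyclic shift to mark letter position, element-wise multiplication to combine the letters of an N-gram, summation over all N-grams, and cosine similarity for the final comparison. The dimensionality, the N-gram size, the alphabet, the toy training strings and every function name are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random

DIM = 2000  # assumed dimensionality, chosen small so the sketch runs quickly
N = 3       # assumed N-gram size
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed alphabet: a-z plus space


def make_letter_vectors(alphabet, dim=DIM, seed=0):
    """Assign each letter a fixed random vector of +1/-1 entries."""
    rng = random.Random(seed)
    return {ch: [rng.choice((-1, 1)) for _ in range(dim)] for ch in alphabet}


def rotate(vec, k):
    """Cyclically shift a vector by k positions (a simple permutation)."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else vec


def encode_text(text, letters, n=N):
    """Sum the vectors of all letter n-grams in the text."""
    dim = len(next(iter(letters.values())))
    total = [0] * dim
    chars = [c for c in text.lower() if c in letters]
    for i in range(len(chars) - n + 1):
        gram = [1] * dim
        for pos, ch in enumerate(chars[i:i + n]):
            # Earlier letters are shifted more, so letter order matters.
            shifted = rotate(letters[ch], n - 1 - pos)
            gram = [g * s for g, s in zip(gram, shifted)]
        total = [t + g for t, g in zip(total, gram)]
    return total


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(sample, language_vectors, letters):
    """Return the language whose vector is closest to the sample's vector."""
    query = encode_text(sample, letters)
    return max(language_vectors,
               key=lambda lang: cosine(query, language_vectors[lang]))


# Toy training strings, made up for this sketch (not the paper's data set).
letters = make_letter_vectors(ALPHABET)
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und dann schlaeft der hund",
}
language_vectors = {lang: encode_text(text, letters)
                    for lang, text in training.items()}

print(classify("the lazy dog", language_vectors, letters))
```

Under these assumptions, each language is summarized by one vector of fixed size regardless of how much text was seen, and classifying a sample costs one encoding pass plus one similarity computation per language, which is in line with the abstract's claim of low computational and storage cost.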

Acknowledgments

We thank Professor Bruno Olshausen for providing the setting for this work in his class on Neural Computation, and two anonymous reviewers for their comments that helped us improve the paper. Pentti Kanerva’s work was supported by Systems On Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

Author information

Authors and Affiliations

Department of Mathematics, University of California–Berkeley, Berkeley, USA
Aditya Joshi
Department of Computer Science, University of California–Berkeley, Berkeley, USA
Johan T. Halseth
Redwood Center for Theoretical Neuroscience, Berkeley, USA
Pentti Kanerva

Corresponding author

Correspondence to Aditya Joshi.

Editor information

Editors and Affiliations

San Francisco State University, San Francisco, California, USA
Jose Acacio de Barros
University of Oxford, Oxford, United Kingdom
Bob Coecke
City University of London, London, United Kingdom
Emmanuel Pothos

About this paper

Cite this paper

Joshi, A., Halseth, J.T., Kanerva, P. (2017). Language Geometry Using Random Indexing. In: de Barros, J., Coecke, B., Pothos, E. (eds) Quantum Interaction. QI 2016. Lecture Notes in Computer Science, vol 10106. Springer, Cham. https://doi.org/10.1007/978-3-319-52289-0_21

DOI: https://doi.org/10.1007/978-3-319-52289-0_21
Published: 26 January 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52288-3
Online ISBN: 978-3-319-52289-0
eBook Packages: Computer Science, Computer Science (R0)