Abstract
Random Indexing is a simple implementation of Random Projections with a wide range of applications. It can solve a variety of problems with good accuracy without introducing much complexity. Here we demonstrate its use for identifying the language of text samples, based on a novel method of encoding letter N-grams into high-dimensional Language Vectors. Further, we show that the method is easily implemented and requires little computational power and space. As proof of the method’s statistical validity, we show its success in a language-recognition task. On a difficult data set of 21,000 short sentences from 21 different languages, we achieve 97.4% accuracy, comparable to state-of-the-art methods.
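The abstract does not spell out the encoding, so the following is only a minimal sketch of one common way to build language vectors from letter N-grams. It assumes random bipolar (+1/-1) letter vectors, position marking by cyclic rotation, binding by elementwise multiplication, summation of N-gram vectors into a text or language vector, and cosine similarity for comparison. The dimensionality, the N-gram size, all function names, and the toy training strings are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random

DIM = 2000  # assumed dimensionality, kept small so the sketch runs quickly
N = 3       # assumed N-gram size (trigrams)
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def make_letter_vectors(alphabet=ALPHABET, dim=DIM, seed=0):
    """Assign every letter a fixed random bipolar vector."""
    rng = random.Random(seed)
    return {ch: [rng.choice((-1, 1)) for _ in range(dim)] for ch in alphabet}


def rotate(vec, k):
    """Cyclically shift a vector k places to the right."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else list(vec)


def encode_ngram(ngram, letters):
    """Bind the letters of one N-gram: rotate each by its distance from the
    end of the N-gram, then multiply the rotated vectors elementwise."""
    dim = len(next(iter(letters.values())))
    out = [1] * dim
    n = len(ngram)
    for i, ch in enumerate(ngram):
        shifted = rotate(letters[ch], n - 1 - i)
        out = [a * b for a, b in zip(out, shifted)]
    return out


def encode_text(text, letters, n=N):
    """Sum the vectors of all N-grams in the text (unknown letters are skipped)."""
    dim = len(next(iter(letters.values())))
    total = [0] * dim
    cache = {}
    text = text.lower()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if any(ch not in letters for ch in gram):
            continue
        if gram not in cache:
            cache[gram] = encode_ngram(gram, letters)
        total = [t + g for t, g in zip(total, cache[gram])]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def classify(text, language_vectors, letters):
    """Return the language whose vector is closest to the text vector."""
    tv = encode_text(text, letters)
    return max(language_vectors, key=lambda lang: cosine(tv, language_vectors[lang]))


# Toy usage: two tiny invented "corpora" stand in for real training data.
letters = make_letter_vectors()
language_vectors = {
    "english": encode_text(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
        letters),
    "spanish": encode_text(
        "el rapido zorro marron salta sobre el perro perezoso y el gato se sento",
        letters),
}
guess_en = classify("the dog sat on the mat", language_vectors, letters)
guess_es = classify("el gato salta sobre el perro", language_vectors, letters)
```

In this sketch each language is stored as a single vector of fixed size, and classifying a new text needs only one encoding pass and one similarity computation per language, which is in line with the abstract's claim of low computational and storage cost.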
Acknowledgments
We thank Professor Bruno Olshausen for providing the setting for this work in his class on Neural Computation, and two anonymous reviewers for their comments that helped us improve the paper. Pentti Kanerva’s work was supported by Systems On Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Joshi, A., Halseth, J.T., Kanerva, P. (2017). Language Geometry Using Random Indexing. In: de Barros, J., Coecke, B., Pothos, E. (eds) Quantum Interaction. QI 2016. Lecture Notes in Computer Science, vol 10106. Springer, Cham. https://doi.org/10.1007/978-3-319-52289-0_21
DOI: https://doi.org/10.1007/978-3-319-52289-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52288-3
Online ISBN: 978-3-319-52289-0
eBook Packages: Computer Science, Computer Science (R0)