Abstract
In this paper, we analyze the behaviour of Singular Value Decomposition (SVD) in a number of word similarity extraction tasks, namely the acquisition of translation equivalents from comparable corpora. Special attention is paid to two aspects: computational efficiency and extraction quality. The main objective of the paper is to describe several experiments comparing SVD-based methods with other strategies. The results lead us to conclude that, for this task, SVD makes extraction less computationally efficient and much less precise than other, more basic models.
Notes
DepPattern is a linguistic toolkit released under the GPL licence and available at: http://gramatica.usc.es/pln/tools/deppattern.html.
Acknowledgments
This work has been supported by the Galician Government (project references PGIDIT07PXIB204015PR and 2008/101) and by the Natural Language Engineering Department at the University of Leipzig.
Cite this article
Gamallo, P., Bordag, S. Is singular value decomposition useful for word similarity extraction?. Lang Resources & Evaluation 45, 95–119 (2011). https://doi.org/10.1007/s10579-010-9129-5
DOI: https://doi.org/10.1007/s10579-010-9129-5