Is singular value decomposition useful for word similarity extraction?

  • Original Paper
Language Resources and Evaluation

Abstract

In this paper, we analyze the behaviour of Singular Value Decomposition (SVD) in a number of word similarity extraction tasks, namely the acquisition of translation equivalents from comparable corpora. Special attention is paid to two aspects: computational efficiency and extraction quality. The main objective of the paper is to describe several experiments comparing SVD-based methods to other strategies. The results lead us to conclude that SVD makes the extraction less computationally efficient and much less precise than other, more basic models for the task of extracting translation equivalents from comparable corpora.
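As a rough illustration of the kind of comparison the abstract describes, the sketch below computes word similarity twice: once with cosine similarity over raw co-occurrence vectors, and once after reducing the same vectors with a truncated SVD. It is a minimal sketch, not the authors' experimental setup: the toy matrix, the word list, the choice of k, and the helper names `cosine_matrix` and `reduce_with_svd` are all assumptions made for illustration, and NumPy stands in for whatever SVD implementation the experiments actually used.

```python
import numpy as np


def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = m / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def reduce_with_svd(m, k):
    """Represent each row of m by its coordinates on the top-k singular directions."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :k] * s[:k]


# Toy word-by-context co-occurrence counts (invented values, illustration only).
words = ["car", "automobile", "banana", "fruit"]
counts = np.array([
    [4.0, 3.0, 0.0, 0.0, 1.0],
    [3.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 4.0, 0.0],
])

# Baseline: similarity computed directly on the sparse count vectors.
raw_sim = cosine_matrix(counts)

# SVD-based variant: similarity computed in a 2-dimensional reduced space.
svd_sim = cosine_matrix(reduce_with_svd(counts, k=2))
```

The extra decomposition step is where the additional computational cost of the SVD-based variant comes from; whether the reduced space helps or hurts extraction quality is an empirical question that a toy matrix like this one cannot settle.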

Notes

  1. DepPattern is a linguistic toolkit, with GPL licence, which is available at: http://gramatica.usc.es/pln/tools/deppattern.html.

  2. http://tedlab.mit.edu/~dr/svdlibc/.

Acknowledgments

This work has been supported by the Galician Government (project references PGIDIT07PXIB204015PR and 2008/101) and by the Natural Language Engineering Department at the University of Leipzig.

Author information

Corresponding author

Correspondence to Pablo Gamallo.

About this article

Cite this article

Gamallo, P., Bordag, S. Is singular value decomposition useful for word similarity extraction? Lang Resources & Evaluation 45, 95–119 (2011). https://doi.org/10.1007/s10579-010-9129-5

  • DOI: https://doi.org/10.1007/s10579-010-9129-5
