Abstract
When acquiring synonyms from large corpora, it is important to consider not only surface information, such as the contexts in which words occur, but also the latent semantics of the words. This paper describes how to use PLSI, a latent semantic model, to acquire synonyms automatically from large corpora. PLSI is shown to outperform conventional methods such as tf·idf and LSI, making it applicable to automatic thesaurus construction. Several PLSI techniques are also shown to be effective: (1) using Skew Divergence as a distance/similarity measure; (2) removing low-frequency words; and (3) running PLSI multiple times and integrating the results.
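As an illustration of technique (1), the following is a minimal sketch of Skew Divergence between two discrete distributions. It assumes the commonly used definition KL(p || αq + (1 − α)p) and assumes that each word is represented by a probability distribution over latent classes; the abstract states neither the formula nor the representation, so both are assumptions here. The function names and the default α = 0.99 are illustrative choices, not values taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_divergence(p, q, alpha=0.99):
    """Skew divergence, assumed here to be KL(p || alpha*q + (1 - alpha)*p).

    Mixing a little of p into q keeps the second argument positive
    wherever p is positive, so the result stays finite even when q
    assigns zero probability to some outcome.
    """
    mixed = [alpha * qi + (1 - alpha) * pi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixed)

# Hypothetical latent-class distributions for two words.
word_a = [0.5, 0.5, 0.0]
word_b = [0.0, 0.5, 0.5]
distance = skew_divergence(word_a, word_b)
```

Plain KL divergence would be infinite for `word_a` and `word_b` because `word_b` gives zero probability to the first class; the smoothing in the second argument is what keeps the measure finite. Note that the measure is asymmetric, so the argument order matters.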
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hagiwara, M., Ogawa, Y., Toyama, K. (2005). PLSI Utilization for Automatic Thesaurus Construction. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_30
DOI: https://doi.org/10.1007/11562214_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science