ABSTRACT
In this paper, we propose a new approach to unsupervised phrase detection based on sentence segmentation. Unlike existing approaches, which examine only word-based statistics, the proposed method detects phrases by considering the most likely segmentation of each sentence. We develop a Bayesian model that simultaneously estimates phrase boundaries and the grammatical role of each phrase, and that can be trained in an unsupervised manner using Gibbs sampling. Experimental results show that phrase detection with the proposed model recognizes about 30 times more phrases than a popular existing method at the same precision, owing to its successful detection of infrequent phrases.
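To make the idea concrete, the following is a minimal, illustrative sketch of Gibbs sampling over phrase boundaries, one of the techniques the abstract names. It is not the paper's actual model: the grammatical roles of phrases are omitted, and the phrase model here is a simple Dirichlet-process-style cache with a geometric length prior (all hyperparameter choices are assumptions). Each Gibbs step resamples one binary boundary variable between adjacent words, deciding whether the two neighboring spans merge into one phrase or stay split.

```python
import random
from collections import Counter

class PhraseSegmenter:
    """Toy collapsed Gibbs sampler for unsupervised phrase segmentation.

    A phrase's probability is (count + alpha * G0) / (total + alpha),
    where the base distribution G0 is geometric in length and uniform
    over the vocabulary. Illustrative sketch only; the paper's model
    additionally infers a grammatical role for each phrase.
    """

    def __init__(self, sentences, alpha=1.0, p_stop=0.5, seed=0):
        self.sentences = [s.split() for s in sentences]
        self.alpha = alpha
        self.p_stop = p_stop  # geometric prior: P(phrase ends here)
        self.vocab = {w for s in self.sentences for w in s}
        self.rng = random.Random(seed)
        # boundaries[j][i] == True: phrase break after word i of sentence j
        self.boundaries = [[self.rng.random() < 0.5 for _ in range(len(s) - 1)]
                           for s in self.sentences]
        self.counts, self.total = Counter(), 0
        for j in range(len(self.sentences)):
            for ph in self.segments(j):
                self._add(ph)

    def segments(self, j):
        """Current segmentation of sentence j as a list of phrase tuples."""
        words, out, start = self.sentences[j], [], 0
        for i, brk in enumerate(self.boundaries[j] + [True]):
            if brk:
                out.append(tuple(words[start:i + 1]))
                start = i + 1
        return out

    def g0(self, phrase):
        n = len(phrase)
        return (self.p_stop * (1 - self.p_stop) ** (n - 1)
                * (1.0 / len(self.vocab)) ** n)

    def prob(self, phrase):
        return ((self.counts[phrase] + self.alpha * self.g0(phrase))
                / (self.total + self.alpha))

    def _add(self, ph):
        self.counts[ph] += 1
        self.total += 1

    def _remove(self, ph):
        self.counts[ph] -= 1
        self.total -= 1

    def resample_boundary(self, j, i):
        """Gibbs step: resample the break after word i of sentence j."""
        words, b = self.sentences[j], self.boundaries[j]
        # locate the span of words affected by this boundary
        lo = i
        while lo > 0 and not b[lo - 1]:
            lo -= 1
        hi = i + 1
        while hi < len(words) - 1 and not b[hi]:
            hi += 1
        left = tuple(words[lo:i + 1])
        right = tuple(words[i + 1:hi + 1])
        merged = left + right
        # remove the phrase(s) currently covering the span from the counts
        if b[i]:
            self._remove(left)
            self._remove(right)
        else:
            self._remove(merged)
        # posterior odds of split vs merge under the cache model
        p_merge = self.prob(merged)
        p_split = self.prob(left)
        self._add(left)                 # left is in the cache when scoring right
        p_split *= self.prob(right)
        self._remove(left)
        b[i] = self.rng.random() < p_split / (p_split + p_merge)
        if b[i]:
            self._add(left)
            self._add(right)
        else:
            self._add(merged)
```

A full training sweep just calls `resample_boundary` for every boundary position in the corpus; after enough sweeps, multiword spans that recur are increasingly segmented as single phrases, including infrequent ones that pure co-occurrence thresholds would miss.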