Abstract
Previous work has suggested that the uncertainty of the tokens following a sequence helps determine whether a given position is a context boundary. This property of language has been applied to unsupervised text segmentation and term extraction. In this paper, we verify this property at a fundamental level. We performed an experiment using a web search engine to clarify the extent to which the assumption holds. The verification was carried out on Chinese and Japanese.
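As a rough illustration of the quantity discussed in the abstract, the sketch below estimates the entropy of the character that follows a given prefix. It is not the paper's procedure: the paper obtains its statistics from a web search engine, whereas this sketch assumes a small in-memory string as the corpus. The function names `branching_entropy` and `entropy_profile` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def branching_entropy(corpus: str, prefix: str) -> float:
    """Entropy (in bits) of the character that follows `prefix` in `corpus`.

    Assumption: counts come from a local string, not from search-engine hits.
    """
    # Count each character that appears immediately after an occurrence of prefix.
    followers = Counter(
        corpus[i + len(prefix)]
        for i in range(len(corpus) - len(prefix))
        if corpus.startswith(prefix, i)
    )
    total = sum(followers.values())
    if total == 0:
        # The prefix never occurs with a following character.
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in followers.values())


def entropy_profile(corpus: str, text: str) -> list[float]:
    """Entropy of the next character after each successive prefix of `text`."""
    return [branching_entropy(corpus, text[:n]) for n in range(1, len(text) + 1)]


if __name__ == "__main__":
    toy_corpus = "abcabdabe"
    # Under the assumption in the abstract, a rise in entropy along the
    # profile would mark a candidate context boundary.
    print(entropy_profile(toy_corpus, "ab"))
```

In this toy corpus, "a" is always followed by "b", while "ab" is followed by three different characters, so the uncertainty increases after "ab". A rise of this kind is what the boundary assumption treats as evidence that a unit has ended.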
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tanaka-Ishii, K. (2005). Entropy as an Indicator of Context Boundaries: An Experiment Using a Web Search Engine. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_9
DOI: https://doi.org/10.1007/11562214_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)