Abstract
We empirically test Harris’s hypothesis that morpheme and word boundaries can be detected from changes in the complexity of phoneme sequences. We reformulate the hypothesis from an information-theoretic viewpoint and test it on a corpus. The hypothesis holds for morphemes, with an F-score of about 80%, in both English and Chinese. For word boundaries, however, English and Chinese give contrary results, which reflects a difference in the nature of the two languages.
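The abstract does not say which complexity measure is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the measure is the entropy of the next symbol given the preceding n symbols (the branching entropy named in the Jin and Tanaka-Ishii reference), and it hypothesizes a boundary wherever that entropy rises from one position to the next. Plain characters stand in for phonemes, and the function names `successor_entropy` and `rising_entropy_boundaries` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(corpus, n=1):
    """Map each length-n context to the entropy (in bits) of the symbol that follows it."""
    successors = defaultdict(Counter)
    for seq in corpus:
        for i in range(n, len(seq)):
            successors[tuple(seq[i - n:i])][seq[i]] += 1

    table = {}
    for context, counts in successors.items():
        total = sum(counts.values())
        table[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return table


def rising_entropy_boundaries(seq, table, n=1):
    """Return the positions where successor entropy is higher than at the previous position."""
    cuts = []
    previous = None
    for i in range(n, len(seq)):
        # Assumption of this sketch: contexts never seen in the corpus count as zero entropy.
        h = table.get(tuple(seq[i - n:i]), 0.0)
        if previous is not None and h > previous:
            cuts.append(i)
        previous = h
    return cuts


# Toy corpus: characters stand in for phonemes.
toy_corpus = ["thedog", "thecat", "thepig", "adog", "acat", "apig"]
toy_table = successor_entropy(toy_corpus, n=1)
print(rising_entropy_boundaries("thedog", toy_table, n=1))  # [3], i.e. "the|dog"
```

On the same toy corpus the sequence "thecat" also receives a spurious cut before the final "t", because "a" has many different successors. This is the kind of error that an F-score evaluation against gold-standard boundaries would measure.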
References
Harris, Z.S.: From phoneme to morpheme. Language, 190–222 (1955)
Imai, K.: Dictionary of Chomsky. Taishukan (1986) (in Japanese)
Martinet, A.: Éléments de linguistique générale. Colin (1960)
Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: COLING/ACL (2006)
Huang, H., Powers, D.: Chinese word segmentation based on contextual entropy. In: Pacific Asia Conference on Language, Information and Computation (2003)
Frantzi, T., Ananiadou, S.: Extracting nested collocations. In: 16th COLING, pp. 41–46 (1996)
Tanaka-Ishii, K., Nakagawa, H.: A multilingual usage consultation tool based on internet searching -More than a search engine, less than QA. In: WWW Conference, pp. 363–371 (2005)
Tanaka-Ishii, K.: Entropy as an indicator of context boundaries —an experiment using a web search engine. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS, vol. 3651, pp. 93–105. Springer, Heidelberg (2005)
Carnegie Mellon University: CMU Pronouncing Dictionary, version 0.6 (2006) (visited 2006), http://www.speech.cs.cmu.edu/cgi-bin/cmudict
SIL: PC-KIMMO version 2, a morphological parser (1995), http://www.sil.org/pckimmo/
ICL: People’s Daily corpus, Beijing University (1999), http://www.icl.pku.edu.cn/icl_res/
NJStar Software Corp: NJStar, Chinese word processing software (2006), http://www.njstar.com
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tanaka-Ishii, K., Jin, Z. (2006). From Phoneme to Morpheme: Another Verification Using a Corpus. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_25
DOI: https://doi.org/10.1007/11940098_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science, Computer Science (R0)