Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

Ikeda, Kazushi; Yanagihara, Tadashi; Matsumoto, Kazunori; Takishima, Yasuhiro

doi:10.1007/978-3-642-10439-8_41

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5866)

Abstract

In this paper, we propose an algorithm that reduces the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences. Reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. The conventional solution is to register peculiar expressions in the morphological dictionaries manually, which is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. To make the correct replacement, a substitution rule is selected based on three criteria: its appearance frequency in the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3%, twice the reduction rate of conventional methods, while maintaining the same segmentation accuracy as those methods.

Author information

Authors and Affiliations

KDDI R&D Laboratories, Inc., 2-1-15 Ohara Fujimino, Saitama, 356-8502, Japan
Kazushi Ikeda, Tadashi Yanagihara, Kazunori Matsumoto & Yasuhiro Takishima

Editor information

Editors and Affiliations

Clayton School of Information Technology, Monash University, 3800, Clayton, VIC, Australia
Ann Nicholson
School of Computer Science and Information Technology, RMIT University, 3001, Melbourne, VIC, Australia
Xiaodong Li

Cite this paper

Ikeda, K., Yanagihara, T., Matsumoto, K., Takishima, Y. (2009). Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents. In: Nicholson, A., Li, X. (eds) AI 2009: Advances in Artificial Intelligence. AI 2009. Lecture Notes in Computer Science, vol 5866. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10439-8_41

DOI: https://doi.org/10.1007/978-3-642-10439-8_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10438-1
Online ISBN: 978-3-642-10439-8
eBook Packages: Computer Science, Computer Science (R0)
