Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary

Ikegami, Yukino; Tsuruta, Setsuo

doi:10.1007/s11042-013-1805-1

Published: 11 January 2014

Volume 74, pages 3933–3946 (2015)
Multimedia Tools and Applications

Yukino Ikegami¹ &
Setsuo Tsuruta¹

Abstract

The rapid growth of globalization requires handling a large number of multilingual documents in which Japanese input coexists with English and other languages that use the Roman alphabet. Conventional Japanese input methods require users to switch the input mode between Japanese and the Latin alphabet. As a current solution, there are modeless Japanese input methods that switch the input mode automatically. However, these need training on a large amount of text data to improve their performance. This paper proposes a hybrid modeless Japanese input method that uses a non-Japanese word dictionary and n-gram character sequence features to decide whether or not to convert the input and switch to Kana input. The dictionary is used to decrease false positives on non-Japanese words, and it is compiled from text data available on the Web. The n-gram based discriminative model is learned by a Support Vector Machine from a balanced corpus that contains texts from various domains. The evaluation of our method showed that its accuracy, measured by F-measure for the prediction of non-Kana characters, improves by 7.7 % compared with an n-gram-only method. In addition, a real user test showed that the average value for input time fell on the "agree" side for our method, against the "disagree" side for the conventional Japanese input method that requires switching the input mode.
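
The two-stage decision described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the dictionary entries, the bigram weights, the bias, and the function names are all hypothetical, and a hand-set linear score stands in for the SVM-trained model.

```python
from typing import Dict, List, Set

# Hypothetical non-Japanese word dictionary (assumption for illustration).
# The paper compiles its dictionary from text data available on the Web.
NON_JAPANESE_WORDS: Set[str] = {"the", "input", "method", "software"}

# Hypothetical weights over character bigrams (assumption for illustration).
# Positive values favour converting to Kana, negative values favour keeping
# the Latin text. In the paper the model is learned by an SVM; these numbers
# are made up to keep the sketch self-contained.
BIGRAM_WEIGHTS: Dict[str, float] = {
    "ka": 1.0, "ni": 0.8, "ho": 0.7, "ih": 0.6, "go": 0.5, "on": 0.3, "ng": 0.2,
    "th": -1.2, "ck": -1.0, "ft": -0.9, "ly": -0.8,
}
BIAS = -0.5


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every substring of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def should_convert_to_kana(token: str) -> bool:
    """Decide whether a typed Latin-letter token should be converted to Kana."""
    word = token.lower()
    # Stage 1: a dictionary hit means the token is treated as non-Japanese,
    # which is how the dictionary reduces false positives.
    if word in NON_JAPANESE_WORDS:
        return False
    # Stage 2: linear decision over character bigram features.
    score = BIAS + sum(BIGRAM_WEIGHTS.get(g, 0.0) for g in char_ngrams(word, 2))
    return score > 0.0
```

With these toy values, `should_convert_to_kana("nihongo")` takes the classifier path, while `should_convert_to_kana("input")` is rejected by the dictionary before any n-gram scoring happens.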

Notes

Japanese has numerous Kanji characters with multiple readings, which gives rise to a large number of homographs. Unlike English homographs, which differ in meaning, Japanese homographs can have totally different or partially synonymous meanings.
http://punto.yandex.ru/
http://www.keyboard-ninja.com/
http://office.microsoft.com/ja-jp/support/HA101867251.aspx
An n-gram is simply a substring of length n. Character n-grams have been applied to language identification in many fields, including language modeling [1], frequency profile matching [3], spam detection [6] and compression [27].
http://www.csie.ntu.edu.tw/%7Ecjlin/liblinear
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
http://code.google.com/p/mozc/
http://www.wordfrequency.info/
http://dumps.wikimedia.org/jawiki/20130216/jawiki-20130216-all-titles-in-ns0.gz
http://sourceforge.jp/projects/unidic/
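
The n-gram definition given in the notes above (a substring of length n) can be made concrete with a short sketch. The function names `char_ngrams` and `ngram_profile` are hypothetical and chosen only for this illustration; the notes do not prescribe any particular implementation.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all character n-grams (substrings of length n) of text."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text: str, max_n: int = 3) -> Counter:
    """Count every character n-gram of length 1..max_n in text."""
    profile: Counter = Counter()
    for n in range(1, max_n + 1):
        profile.update(char_ngrams(text, n))
    return profile
```

For example, the bigrams of "kana" are "ka", "an" and "na", and a frequency profile of this kind is the sort of feature representation that a classifier can consume.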

References

Beesley KR (1988) Language identifier: a computer program for automatic natural-language identification of on-line text. In: Proceedings of the 29th ATA annual conference. pp 47–54
Bellandi V, Ceravolo P, Damiani E, Frati F, Maggesi J (2012) Towards a Collaborative Innovation Catalyst. In: Proceedings of SITIS 2012. IEEE Computer Society, pp 637–643
Cavnar WB, Trenkle JM (1994) N-gram-based text categorization. In: Proceedings of SDAIR’94. pp 161–175
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm
Chen Z, Lee K (2000) A new statistical approach to Chinese Pinyin input. In: Proceedings of the 38th annual meeting on association for computational linguistics. pp 241–247
Damiani E, di Vimercati SDC, Paraboschi S, Samarati P (2004) An open digest-based technique for spam detection. In: ISCA PDCS 2004. pp 559–564
Davies M (2009) The 385+ million word corpus of contemporary American English (1990–2008+): design, architecture, and linguistic insights. Int J Corpus Linguis 14(2):159–190
Dumais S (1998) Using SVMs for text categorization. IEEE Intell Syst 13(4):21–23
Ehara Y, Tanaka-Ishii K (2008) Multilingual text entry using automatic language detection. In: Proceedings of international joint conference on natural language processing. pp 441–448
Fan RE, Chang KW, Hsieh C-J, Wang X-R, Lin C-J (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Grudin JT (1983) Error patterns in novice and skilled transcription typing. In: Cognitive aspects of skilled typewriting. Springer-Verlag, pp 121–143
Hakkani-Tür DZ, Oflazer K, Tür G (2002) Statistical morphological disambiguation for agglutinative languages. Comput Humanit 36(4):381–410
Ikegami Y, Sakurai Y, Tsuruta S (2012) Modeless Japanese input method using multiple character sequence features. In: Proceedings of eighth international conference on signal image technology and internet based systems. IEEE Computer Society, pp 613–618
Internet.com K.K. (Japan) (2009) Roma to Kana input users are 90 %, direct Kana input users are 10 % - survey about typing - (in Japanese), http://japan.internet.com/research/20090611/1.html. Accessed 3 July 2013
Japanese Ministry of Internal Affairs and Communications (2009) Utilization situation of Internet (in Japanese). http://www.soumu.go.jp/johotsusintokei/whitepaper/ja/h24/html/nc.243120.html. Accessed 10 October 2013
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 217–226
Kasahara S, Komachi M, Nagata M, Matsumoto Y (2011) Error correcting Romaji-kana conversion for Japanese language education. In: Proceedings of the workshop on advances in text input methods. pp 38–42
Kerkhofs R, Dijkstra T, Chwilla DJ, de Bruijn ER (2006) Testing a model for bilingual semantic priming with interlingual homographs: RT and N400 effects. In: Brain research, vol 1068. Elsevier, pp 170–813
Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the EMNLP-2004. pp 230–237
Maekawa K (2008) Balanced corpus of contemporary written Japanese. In: Proceedings of the 6th workshop on asian language resources. pp 101–102
Neubig G, Duh K (2013) How much is said in a tweet? A multilingual, information-theoretic perspective. In: Proceedings of the AAAI’13 spring symposium on analyzing microtext. Stanford
Neubig G, Nakata Y, Mori S (2011) Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 2. pp 529–533
Pouliquen B, Steinberger R, Ignat C (2006) Automatic annotation of multilingual text collections with a conceptual thesaurus. arXiv preprint cs/0609059
Roeber H, Bacus J, Tomasi C (2003) Typing in thin air: the canesta projection keyboard - a new method of interaction with electronic devices. In: Proceedings of CHI extended abstracts. pp 712–713
Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th international conference on machine learning. ACM, pp 807–814
Suzumegano F, Amano J, Maruyama Y, Hayakawa E, Namiki M, Takahashi N (1995) The evaluation environment for a Kana to Kanji transliteration system and an evaluation of the modeless input method. In: IPSJ SIG technical report, vol 1995-HI-42. pp 9–16
Teahan WJ (2000) Text classification and segmentation using minimum cross-entropy. In: Proceedings of RIAO’00. pp 943–961
Zheng Y, Liu C, Ding X (2001) Single-character type identification. In: Electronic imaging 2002, International society for optics and photonics. pp 49–56

Acknowledgments

This work was supported by a Grant-in-Aid for Scientific Research from the government of Japan (KAKENHI 24700214). The authors are also thankful to Ms. Yukiko Yamamoto, a student of Tokyo Denki University, for her assistance, including English translation and checking.

Author information

Authors and Affiliations

Tokyo Denki University, 2-1200, MuzaiGakuendai, Inzai-shi, Chiba, Japan
Yukino Ikegami & Setsuo Tsuruta

Authors

Yukino Ikegami
Setsuo Tsuruta

Corresponding author

Correspondence to Yukino Ikegami.

About this article

Cite this article

Ikegami, Y., Tsuruta, S. Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary. Multimed Tools Appl 74, 3933–3946 (2015). https://doi.org/10.1007/s11042-013-1805-1

Published: 11 January 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s11042-013-1805-1
