Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary

Multimedia Tools and Applications

Abstract

The rapid growth of globalization requires handling a large number of multilingual documents, in which Japanese input co-exists with English and other languages that use the Roman alphabet. Conventional Japanese input methods require users to switch the input mode between Japanese and the Latin alphabet. As a current solution, modeless Japanese input methods switch the input mode automatically. However, they need training on a large amount of text data to improve their performance. This paper proposes a hybrid modeless Japanese input method that uses a non-Japanese word dictionary and n-gram character sequence features to decide whether or not to convert the input and switch to Kana input. The non-Japanese word dictionary is used to reduce false positives on non-Japanese words. The dictionary is built from text data available on the Web. The n-gram based discriminative model is learned by a Support Vector Machine from a balanced corpus that contains texts from various domains. The evaluation shows that the F-measure of our method for predicting non-Kana characters improves by 7.7 % compared with an n-gram-only method. In addition, a real user test showed that the average score for input time was on the agree side for our method, versus the disagree side for the conventional Japanese input method that requires switching the input mode.
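
The hybrid decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word list, the n-gram weights, and the names `should_convert_to_kana`, `linear_score`, `NON_JAPANESE_WORDS` and `TOY_WEIGHTS` are hypothetical, and a hand-written linear scorer stands in for the trained Support Vector Machine. It only shows the control flow: the dictionary vetoes conversion first, and the n-gram classifier decides the remaining cases.

```python
from typing import Dict, Iterable, List, Set


def char_ngrams(text: str, n_values: Iterable[int] = (1, 2, 3)) -> List[str]:
    """Return every character n-gram of the requested lengths."""
    grams: List[str] = []
    for n in n_values:
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def linear_score(text: str, weights: Dict[str, float], bias: float = 0.0) -> float:
    """Sum the weights of the n-grams found in text (stand-in for a linear SVM)."""
    return bias + sum(weights.get(gram, 0.0) for gram in char_ngrams(text))


def should_convert_to_kana(
    token: str,
    non_japanese_words: Set[str],
    weights: Dict[str, float],
    bias: float = 0.0,
) -> bool:
    """Decide whether a Roman-alphabet token should be converted to Kana."""
    lowered = token.lower()
    # Step 1: dictionary veto. Known non-Japanese words are never converted,
    # which is how the dictionary reduces false positives.
    if lowered in non_japanese_words:
        return False
    # Step 2: n-gram classifier. A positive score means "looks like romanized Japanese".
    return linear_score(lowered, weights, bias) > 0.0


# Hypothetical toy data, for illustration only.
NON_JAPANESE_WORDS = {"take", "made", "the"}
TOY_WEIGHTS = {"ka": 1.0, "na": 0.5, "ke": 1.0, "th": -2.0, "ing": -2.0}

if __name__ == "__main__":
    for word in ("sakana", "take", "thing"):
        print(word, should_convert_to_kana(word, NON_JAPANESE_WORDS, TOY_WEIGHTS))
```

With these toy weights, "take" would be converted by the n-gram scorer alone because it contains the Japanese-looking bigram "ke"; the dictionary lookup is what blocks it.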

Notes

  1. Japanese has numerous Kanji characters that have multiple readings, which gives rise to a large number of homographs. Unlike English homographs, which differ in meaning, the meanings of Japanese homographs can be totally different, or partially synonymous.

  2. http://punto.yandex.ru/

  3. http://www.keyboard-ninja.com/

  4. http://office.microsoft.com/ja-jp/support/HA101867251.aspx

  5. An n-gram is simply a substring of length n. Character n-grams have been applied to language identification in many fields, including language modeling [1], frequency profile matching [3], spam detection [6] and compression [27].

  6. http://www.csie.ntu.edu.tw/%7Ecjlin/liblinear

  7. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html

  8. http://code.google.com/p/mozc/

  9. http://www.wordfrequency.info/

  10. http://dumps.wikimedia.org/jawiki/20130216/jawiki-20130216-all-titles-in-ns0.gz

  11. http://sourceforge.jp/projects/unidic/
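
As an illustration of note 5, the following minimal Python sketch counts the character n-grams of a string. The function name `ngram_counts` is hypothetical and the snippet is not taken from the paper; it only makes the definition "a substring of length n" concrete.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every contiguous substring of length n in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L has L - n + 1 n-grams (none if it is shorter than n).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    print(ngram_counts("kana", 2))
    print(ngram_counts("banana", 2))
```

Counts like these are one common way to turn a character sequence into features for a classifier.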

References

  1. Beesley KR (1988) Language identifier: a computer program for automatic natural-language identification of on-line text. In: Proceedings of the 29th ATA annual conference. pp 47–54

  2. Bellandi V, Ceravolo P, Damiani E, Frati F, Maggesi J (2012) Towards a Collaborative Innovation Catalyst. In: Proceedings of SITIS 2012. IEEE Computer Society, pp 637–643

  3. Cavnar WB, Trenkle JM (1994) N-gram-based text categorization. In: Proceedings of SDAIR’94. pp 161–175

  4. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm

  5. Chen Z, Lee K (2000) A new statistical approach to Chinese Pinyin input. In: Proceedings of the 38th annual meeting on association for computational linguistics. pp 241–247

  6. Damiani E, di Vimercati SDC, Paraboschi S, Samarati P (2004) An open digest-based technique for spam detection. In: ISCA PDCS 2004. pp 559–564

  7. Davies M (2009) The 385+ million word corpus of contemporary American English (1990–2008+): design, architecture, and linguistic insights. Int J Corpus Linguist 14(2):159–190

  8. Dumais S (1998) Using SVMs for text categorization. IEEE Intell Syst 13(4):21–23

  9. Ehara Y, Tanaka-Ishii K (2008) Multilingual text entry using automatic language detection. In: Proceedings of international joint conference on natural language processing. pp 441–448

  10. Fan RE, Chang KW, Hsieh C-J, Wang X-R, Lin C-J (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  11. Grudin JT (1983) Error patterns in novice and skilled transcription typing. In: Cognitive aspects of skilled typewriting. Springer-Verlag, pp 121–143

  12. Hakkani-Tür DZ, Oflazer K, Tür G (2002) Statistical morphological disambiguation for agglutinative languages. Comput Humanit 36(4):381–410

  13. Ikegami Y, Sakurai Y, Tsuruta S (2012) Modeless Japanese input method using multiple character sequence features. In: Proceedings of eighth international conference on signal image technology and internet based systems. IEEE Computer Society, pp 613–618

  14. Internet.com K.K. (Japan) (2009) Roma to Kana input users are 90 %, direct Kana input users are 10 % - survey about typing - (in Japanese), http://japan.internet.com/research/20090611/1.html. Accessed 3 July 2013

  15. Japanese Ministry of Internal Affairs and Communications (2009) Utilization situation of Internet (in Japanese). http://www.soumu.go.jp/johotsusintokei/whitepaper/ja/h24/html/nc.243120.html. Accessed 10 October 2013

  16. Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 217–226

  17. Kasahara S, Komachi M, Nagata M, Matsumoto Y (2011) Error correcting Romaji-kana conversion for Japanese language education. In: Proceedings of the workshop on advances in text input methods. pp 38–42

  18. Kerkhofs R, Dijkstra T, Chwilla DJ, de Bruijn ER (2006) Testing a model for bilingual semantic priming with interlingual homographs: RT and N400 effects. In: Brain research, vol 1068. Elsevier, pp 170–183

  19. Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the EMNLP-2004. pp 230–237

  20. Maekawa K (2008) Balanced corpus of contemporary written Japanese. In: Proceedings of the 6th workshop on Asian language resources. pp 101–102

  21. Neubig G, Duh K (2013) How much is said in a tweet? A multilingual, information-theoretic perspective. In: Proceedings of the AAAI’13 spring symposium on analyzing microtext. Stanford

  22. Neubig G, Nakata Y, Mori S (2011) Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 2. pp 529–533

  23. Pouliquen B, Steinberger R, Ignat C (2006) Automatic annotation of multilingual text collections with a conceptual thesaurus. arXiv preprint cs/0609059

  24. Roeber H, Bacus J, Tomasi C (2003) Typing in thin air: the Canesta projection keyboard - a new method of interaction with electronic devices. In: Proceedings of CHI extended abstracts. pp 712–713

  25. Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th international conference on machine learning. ACM, pp 807–814

  26. Suzumegano F, Amano J, Maruyama Y, Hayakawa E, Namiki M, Takahashi N (1995) The evaluation environment for a Kana to Kanji transliteration system and an evaluation of the modeless input method. In: IPSJ SIG technical report, vol 1995-HI-42. pp 9–16

  27. Teahan WJ (2000) Text classification and segmentation using minimum cross-entropy. In: Proceedings of RIAO’00. pp 943–961

  28. Zheng Y, Liu C, Ding X (2001) Single-character type identification. In: Electronic imaging 2002, International society for optics and photonics. pp 49–56

Acknowledgments

This work was supported by a Grant-in-Aid for Scientific Research from the government of Japan (KAKENHI 24700214). The authors are also thankful to Ms. Yukiko Yamamoto, a student at Tokyo Denki University, for her help with tasks such as English translation and checking.

Author information

Corresponding author

Correspondence to Yukino Ikegami.

About this article

Cite this article

Ikegami, Y., Tsuruta, S. Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary. Multimed Tools Appl 74, 3933–3946 (2015). https://doi.org/10.1007/s11042-013-1805-1

  • DOI: https://doi.org/10.1007/s11042-013-1805-1
