Abstract
The rapid growth of globalization requires handling a large number of multilingual documents in which Japanese text coexists with English and other languages that use the Latin alphabet. Conventional Japanese input methods require users to switch the input mode between Japanese and the Latin alphabet. Modeless Japanese input methods, which switch the input mode automatically, have been proposed as a solution. However, they need to be trained on a large amount of text data to improve their performance. This paper proposes a hybrid modeless Japanese input method that uses a non-Japanese word dictionary and n-gram character sequence features to decide whether to convert the input and switch to Kana input. The non-Japanese word dictionary is intended to reduce false positives on non-Japanese words and is built from text data available on the Web. The n-gram based discriminative model is learned by a Support Vector Machine from a balanced corpus that contains texts from various domains. The evaluation shows that the F-measure of our method for predicting non-Kana characters improves by 7.7 % compared with an n-gram-only method. In addition, a test with real users shows that the average rating of input time was on the "agree" side for our method, whereas it was on the "disagree" side for the conventional Japanese input method that requires switching the input mode.
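
The hybrid decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The word list, the n-gram weights, the bias, and all function names are hypothetical. A hand-set linear scorer stands in for the SVM that the paper trains on a balanced corpus.

```python
from collections import Counter

# Hypothetical stand-in for the Web-derived non-Japanese word dictionary.
NON_JAPANESE_WORDS = {"internet", "kayak"}

# Hypothetical character n-gram weights. In the paper, a model of this kind
# is learned by an SVM; these values are set by hand for illustration only.
# Positive weights favour Kana conversion, negative weights oppose it.
NGRAM_WEIGHTS = {
    "ka": 0.8, "ku": 0.6, "shi": 0.9, "tsu": 1.0,
    "th": -1.2, "ck": -1.0, "ing": -0.9, "x": -0.7,
}
BIAS = 0.0


def char_ngrams(token, n_max=3):
    """Count all character n-grams of length 1..n_max in the token."""
    token = token.lower()
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(token) - n + 1):
            grams[token[i:i + n]] += 1
    return grams


def ngram_score(token):
    """Linear decision value over n-gram counts (stand-in for the SVM)."""
    grams = char_ngrams(token)
    return BIAS + sum(NGRAM_WEIGHTS.get(g, 0.0) * c for g, c in grams.items())


def should_convert_to_kana(token):
    """Dictionary check first, then the n-gram classifier."""
    if token.lower() in NON_JAPANESE_WORDS:
        # A dictionary hit vetoes conversion, avoiding a false positive.
        return False
    return ngram_score(token) > 0.0
```

With these illustrative weights, `kayak` receives a positive n-gram score because it contains `ka`, but the dictionary lookup keeps it in Latin characters. This is the kind of false positive the dictionary is meant to prevent.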


Notes
Japanese has numerous Kanji characters that have multiple readings, which gives rise to a large number of homographs. Unlike English homographs, which differ in meaning, Japanese homographs can be either entirely different in meaning or partially synonymous.
Acknowledgments
This work was supported by a Grant-in-Aid for Scientific Research from the government of Japan (KAKENHI 24700214). The authors also thank Ms. Yukiko Yamamoto, a student at Tokyo Denki University, for her assistance, including English translation and checking.
Cite this article
Ikegami, Y., Tsuruta, S. Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary. Multimed Tools Appl 74, 3933–3946 (2015). https://doi.org/10.1007/s11042-013-1805-1
DOI: https://doi.org/10.1007/s11042-013-1805-1