Handling Orthographic Varieties in Japanese IR: Fusion of Word-, N-Gram-, and Yomi-Based Indices Across Different Document Collections

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Abstract

Orthographic varieties are common in Japanese and pose a serious problem for Japanese information retrieval (IR): a system may miss documents that contain a variant form of the search term. We propose two strategies for handling orthographic varieties: pronunciation-based (yomi-based) indexing and “Fuzzy Querying”, which compares katakana terms by edit distance. Both strategies were integrated into our multiple-index and fusion system [1] and tested on two test collections, newspaper articles (Mainichi Shimbun ’98) and scientific abstracts (NTCIR-1), to compare their performance across text genres.
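The abstract does not say how the edit-distance comparison of katakana terms is implemented, so the following is only a minimal sketch of the general idea, assuming the standard Levenshtein distance over Unicode characters. The function names `edit_distance` and `fuzzy_match` and the `max_distance` threshold are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query: str, terms: list[str], max_distance: int = 1) -> list[str]:
    """Return the terms within max_distance edits of the query.
    The threshold is an illustrative assumption, not a value from the paper."""
    return [t for t in terms if edit_distance(query, t) <= max_distance]


# Two common katakana spellings of "computer" differ by one long-vowel mark.
print(edit_distance("コンピュータ", "コンピューター"))
# Two spellings of "violin" differ in how the initial "v" sound is written.
print(edit_distance("ヴァイオリン", "バイオリン"))
print(fuzzy_match("コンピュータ", ["コンピューター", "コンピュータ", "カメラ"]))
```

With a threshold of 1 the sketch would retrieve the long-vowel variant of "computer" but not the alternative spelling of "violin", which suggests why the choice of threshold matters for this kind of matching.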

References

  1. Womser-Hacker, C.: An Information Retrieval Prototype for Research and Teaching. In: Eibl, M., Wolff, C., Womser-Hacker, C. (eds.) To appear in Designing Information Systems. Festschrift für Jürgen Krause. Konstanz: Universitätsverlag [Schriften zur Informationswissenschaft] (2005)

  2. Halpern, J.: Lexicon-Based Orthographic Disambiguation in CJK Intelligent Information Retrieval. In: Proceedings of the 19th Conference on Computational Linguistics, COLING 2002, Taipei, Taiwan, August 24–September 1 (2002)

  3. Halpern, J.: The Challenges of Intelligent Japanese Searching. In: Working paper. The CJK Dictionary Institute, Saitama (2000), www.cjk.org/cjk/joa/joapaper.htm (revised 2003)

  4. Kummer, N., Womser-Hacker, C., Kando, N.: Handling Orthographic Varieties in Japanese Information Retrieval: Fusion of Word-, N-gram-, and Yomi-Based Indices across Different Document Collections. NII Technical Report (2005)

  5. Gospodnetić, O., Hatcher, E.: Lucene in Action. Manning, Canada (2004)

  6. Yoshioka, M., Kuriyama, K., Kando, N.: Analysis of the Usage of Japanese Segmented Texts in NTCIR Workshop 2. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 291–296. National Institute of Informatics, Tokyo (2002)

  7. Ozawa, T., Yamamoto, M., Umemura, K., Church, K.W.: Japanese Word Segmentation Using Similarity Measure for IR. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 89–96 (1999)

  8. Jones, G.J.F., Sakai, T., Kajiura, M., Sumita, K.: Experiments in Japanese Text Retrieval and Routing Using the NEAT System. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 197–205 (1998)

  9. Sakai, T., Shibazaki, Y., Suzuki, M., Kajiura, M., Manabe, T., Sumita, K.: Cross-Language Information Retrieval for NTCIR at Toshiba. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 137–144 (1999)

  10. Vines, P., Wilkinson, R.: Experiments with Japanese Text Retrieval Using mg. In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, August 30–September 1, pp. 97–100 (1999)

  11. Chow, K.C.W., Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing for Different IR Models. In: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, pp. 49–54 (2000)

  12. Luk, R.W.P., Wong, K.-F., Kwok, K.-L.: Hybrid Term Indexing: An Evaluation. In: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, pp. 130–136. National Institute of Informatics, Tokyo (2001)

  13. Savoy, J.: Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In: Proceedings of the Fourth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, pp. 178–185 (2004)

  14. Kummer, N., Womser-Hacker, C., Kando, N.: Re-Examination of Japanese Indexing: Fusion of Word-, N-gram- and Yomi-Based Indices. In: Proceedings of the 11th Annual Meeting of The Association for Natural Language Processing, March 14–18, pp. 221–224. University of Kagawa, Kagawa Prefecture (2005)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kummer, N., Womser-Hacker, C., Kando, N. (2005). Handling Orthographic Varieties in Japanese IR: Fusion of Word-, N-Gram-, and Yomi-Based Indices Across Different Document Collections. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_65

  • DOI: https://doi.org/10.1007/11562382_65

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29186-2

  • Online ISBN: 978-3-540-32001-2

  • eBook Packages: Computer Science, Computer Science (R0)
