DOI: 10.1145/355214.355235
On the use of words and n-grams for Chinese information retrieval

Published: 1 November 2000

ABSTRACT

In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carry out more experiments on different ways of segmenting documents and queries and of combining words with n-grams. Our experiments show that a combination of the longest-matching algorithm with single characters is the best choice.
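
The indexing units named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how words and single characters are combined, so `words_plus_characters` is only one plausible reading, and the function names, the `max_len` limit, and the toy dictionary are all assumptions made for illustration.

```python
def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward longest matching (assumed variant): at each position,
    take the longest dictionary entry; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def char_ngrams(text, n=2):
    """Overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def words_plus_characters(text, dictionary, max_len=4):
    """Index terms: each segmented word, followed by the single characters
    of every multi-character word (one assumed way to combine the two)."""
    terms = []
    for word in longest_match_segment(text, dictionary, max_len):
        terms.append(word)
        if len(word) > 1:
            terms.extend(word)  # iterating a string yields its characters
    return terms


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"信息", "检索", "系统", "信息检索"}
print(longest_match_segment("信息检索系统", toy_dictionary))
# ['信息检索', '系统']
print(char_ngrams("信息检索"))
# ['信息', '息检', '检索']
print(words_plus_characters("信息检索系统", toy_dictionary))
# ['信息检索', '信', '息', '检', '索', '系统', '系', '统']
```

Adding single characters alongside the longest-matched words lets a query for a shorter word such as 检索 still share terms with a document indexed under the longer word 信息检索, which is one intuition for why such a combination might help retrieval.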

Published in

IRAL '00: Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages
November 2000
220 pages
ISBN: 1581133006
DOI: 10.1145/355214
Chairmen: Kam-Fai Wong, Dik L. Lee, Jong-Hyeok Lee

Copyright © 2000 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
