Chinese word segmentation and its effect on information retrieval

https://doi.org/10.1016/S0306-4573(02)00079-1

Abstract

A set of experiments was carried out at the Division of Information Studies, Nanyang Technological University, Singapore, to study the effect of Chinese word segmentation on information retrieval (IR). Four automatic character-based segmentation approaches and a manual word segmentation approach were first applied to obtain the word segments for indexing and to evaluate the segmentation accuracy of the automatic approaches. The IR experiments study the influence on IR effectiveness of both the document segmentation approach and the method used for query segmentation. Traditional data recall and precision measures were used to gauge IR effectiveness. A number of queries were then selected for detailed analysis to explore further the influence of word segmentation on IR.

The findings reveal that the segmentation approach has an effect on IR effectiveness. Better IR results are obtained by using the same method for query and document processing, as this increases the probability of a query-document match. The recognition of a higher number of 2-character words generally contributes to the improvement of IR effectiveness. However, manual segmentation does not always work better than character-based segmentation, because of the existence of words longer than two characters. No evidence is found that ambiguous words resulting from the segmentation process significantly affect IR.

Introduction

Research interest in Chinese information retrieval (CIR) has increased as a result of the rapid growth of online Chinese literature. Typically, an IR system determines the relevant documents according to the frequency of occurrence of the words of a query within the documents and the corpus (Nie, Brisebois, & Ren, 1996). For English and other western languages, the identification of distinct words in the documents is trivial. However, this is much more difficult for Chinese, as it is for a number of other non-English languages, since Chinese text appears as a string of ideographic characters without any obvious boundary between words except for punctuation marks at the end of each sentence and occasional commas within sentences.

Chinese text information processing therefore requires an essential segmentation process to break up the text into smaller linguistic units or segments, normally words (Nie et al., 1996; Wu & Tseng, 1993, 1995). These segments are subsequently used to create the index for query and retrieval operations. Numerous segmentation approaches have been proposed for CIR. As the review of the related literature will show, these approaches can be broadly divided into character-based and word-based approaches. Within these two basic groups there are many alternatives, such as single-character or multiple-character segmentation, the use of a dictionary or of statistics, or the introduction of linguistic knowledge for segmentation.
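As a rough illustration of the single-character and multiple-character alternatives mentioned above, the following Python sketch produces overlapping character n-grams from a short string. The function name `char_ngrams` and the sample sentence are hypothetical, and the sketch is not taken from the paper; it assumes that punctuation has already been stripped and that multi-character segments overlap.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a punctuation-free string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(text) < n:
        # Too short for a full n-gram: keep the fragment as a single segment.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical example sentence ("China's economic development").
sentence = "中国经济发展"
unigrams = char_ngrams(sentence, 1)  # single-character segmentation
bigrams = char_ngrams(sentence, 2)   # 2-character (bigram) segmentation
print(unigrams)
print(bigrams)
```

In this toy case the bigram output contains the three genuine words 中国, 经济 and 发展 together with two segments (国经, 济发) that straddle word boundaries, which is the kind of noise an index built this way has to tolerate.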

When applied to the information retrieval (IR) problem, the existing CIR literature has consistently shown that IR results using single-character indexing are significantly worse than those using other segmentation techniques (Tong, Zai, Milic-Frayling, & Evans, 1996). However, no firm conclusions or agreement on the relative performance of multi-character approaches and word-based approaches have so far been reported. Some researchers obtained better results using bigram (2-character) methods, while others obtained better results using word-based approaches (Wilkinson, 1997). Accordingly, some researchers believe that a better segmentation approach will yield superior IR results (Nie et al., 1996), while others have found no direct relationship between the segmentation approach and IR results in their experiments (Kwok, 1997a, 1997b). It is also evident that no systematic study has been carried out to investigate this relationship.

Thus, this research aims to systematically investigate the relationship between segmentation accuracy and its effect on CIR. Manual segmentation along with four types of automatic character-based segmentation approaches were used to process and index a test corpus comprising one month’s economic news from the online Chinese People’s Daily newspaper. A number of accuracy measures were defined and computed to gauge the quality of the four automatic segmentation approaches in comparison with the segmented words arising from manual segmentation, which was taken to represent the ideal segmentation case.
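This excerpt does not state which accuracy measures the study defined. Purely as an assumption, the Python sketch below shows one common way to score an automatic segmentation against a manual reference: each word is mapped to its character span, and word-level precision and recall are computed over the spans. The names `to_spans` and `segmentation_scores` and the sample data are hypothetical, and the sketch applies only to non-overlapping segmentations.

```python
def to_spans(words: list[str]) -> set[tuple[int, int]]:
    """Map a non-overlapping segmentation to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(auto: list[str], manual: list[str]) -> tuple[float, float]:
    """Return (precision, recall) of `auto`, treating `manual` as the reference."""
    auto_spans = to_spans(auto)
    manual_spans = to_spans(manual)
    correct = len(auto_spans & manual_spans)
    precision = correct / len(auto_spans) if auto_spans else 0.0
    recall = correct / len(manual_spans) if manual_spans else 0.0
    return precision, recall


# Hypothetical data: the automatic output splits 经济 into two single characters.
manual = ["中国", "经济", "发展"]
auto = ["中国", "经", "济", "发展"]
precision, recall = segmentation_scores(auto, manual)
print(precision, recall)
```

Here two of the four automatic segments match the three reference words, so precision is 2/4 and recall is 2/3.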

Using these approaches, five different indices were created for the IR experiments. A total of 20 queries were used. As with the individual documents of the corpus, the queries were segmented prior to the matching and retrieval process. They were segmented using both the manual approach and the corresponding automatic approach, so that two main groups of IR experiments were conducted. This allowed the effect on IR effectiveness of the relationship between query segmentation and document segmentation to be assessed. The traditional IR effectiveness measures of data recall and data precision were computed and contrasted across segmentation approaches. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. To probe further, a number of queries were identified and examined in detail to reveal the causes of the retrieval results, thereby providing a more thorough understanding of them. A set of conclusions was derived from these experimental results.
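The two effectiveness measures named above can be illustrated with the standard set-based definitions: recall is the fraction of relevant documents that were retrieved, and precision is the fraction of retrieved documents that are relevant. The Python sketch below uses those textbook definitions; the function name `recall_precision` and the document identifiers are hypothetical, and the excerpt does not say whether the study computed the measures at a fixed cutoff or averaged them in some other way.

```python
def recall_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for one query from sets of document identifiers."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical judgments for a single query.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d7"}
recall, precision = recall_precision(retrieved_docs, relevant_docs)
print(recall, precision)
```

Two of the three relevant documents are retrieved, giving a recall of 2/3, and two of the four retrieved documents are relevant, giving a precision of 0.5.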

The rest of the paper is organised as follows. Following a review of related literature, the methodology used for the study is presented. This describes the automatic approaches that were used to create the index and query segmentation for the IR experiments. A number of accuracy measures were defined and computed for these different approaches. The setting of the IR experiments and results from the experiments are subsequently reported using the measures of data recall and precision. A number of queries were used and analysed in detail to aid the explanation of the results arising from different document segmentation approaches, query segmentation approaches, and the effect of the existence of ambiguous word segments. The paper concludes with a summary of the pertinent findings and suggestions for future work.

Section snippets

Chinese segmentation for IR

The basic approaches of Chinese segmentation can be roughly divided into two groups, namely, character-based approaches and word-based approaches as shown in Fig. 1.
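To make the word-based group concrete, the Python sketch below implements forward maximum matching, a well-known dictionary-based technique. It is offered only as an illustration: this excerpt does not say which word-based methods the paper covers, and the function name `forward_max_match`, the `max_len` limit and the toy lexicon are all hypothetical.

```python
def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedily take the longest lexicon word at each position, scanning left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real system would use a full dictionary.
toy_lexicon = {"中国", "经济", "发展", "经济发展"}
words = forward_max_match("中国经济发展", toy_lexicon)
print(words)
```

With this lexicon the greedy match prefers the 4-character entry 经济发展 over the shorter words 经济 and 发展, which shows how a dictionary-based segmenter can produce words longer than two characters.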

Segmentation experiments

The following sections report on the research to systematically investigate the relationship between segmentation accuracy and its effect on CIR.

Test corpus and query formulation

The same set of documents used for the segmentation experiments was used for the IR experiments. Based on the overall economic subject content of these documents, a set of queries was first proposed by four native Chinese speakers from the People's Republic of China (PRC) who were also graduate research students at Nanyang Technological University (NTU). A total of 41 queries were initially elicited. Queries were expressed either as complete sentences or as phrases. A final set of 20 queries was

Analysis and discussion of results

In this research, an attempt was made to carry out an in-depth analysis of a number of results, with the aim of investigating how segmentation results account for differences in IR performance. The analysis was restricted to cases where the results demonstrated a significant difference according to the paired t-test on average IR performance. At the same time, a new set of paired t-tests was conducted on individual queries to identify queries that exhibited significant

Conclusions and future work

Based on this research, it can be concluded that the segmentation approaches used for document and query processing do have an influence on the IR results, although there is no direct relationship between segmentation accuracy and IR results. The following important observations can be drawn from the research:

  • The IR performance is affected by both document segmentation and query segmentation. If one of them is kept constant and the other is varied using different approaches,

References (55)

  • Allan, J., et al. (1996). INQUERY at TREC-5. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Allan, J., et al. (1997). INQUERY does battle with TREC-6. Available:...
  • Beaulieu, M. et al. (1996). Okapi at TREC-5. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Chen, A., Jiang, H., & Gey, F. (2000). English–Chinese cross-language IR using bilingual dictionaries. Available:...
  • City University. (1996). Chinese results. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Claritech Corporation. (1996). Chinese results. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Cornell University. (1996). Chinese results. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • CUNY. (1996). Chinese results. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Dai, Y.B. (1997). Developing a new statistical method for Chinese text segmentation. First year report of Master of...
  • Foo, S. B., et al. (1998). An integrated bigram approach with single-character word list for Chinese word segmentation. TEXT Technology.
  • Franz, M., McCarley, J. S., & Zhu, W.-J. (2000). English–Chinese information retrieval at IBM. Available:...
  • Fuller, M., et al. (1997). MDS TREC6 report. Available: http://trec.nist.gov/pubs/trec6/t6_proceedings.html,...
  • Gao, J., Xun, E., Zhou, M., Huang, H., Nie, J.-Y., Zhang, J.-Y., & Su, Y. (2000). TREC-9 CLIR experiments at MSRCN....
  • George Mason University. (1996). Chinese results. Available: http://trec.nist.gov/pubs/trec5/t5_proceedings.html,...
  • Harman, D. (1993). Overview of the second text retrieval conference. Available:...
  • He, J., Xu, J., Chen, A., Meggs, J., & Gey, F. C. (1996). Berkeley Chinese information retrieval at TREC-5: Technical...
  • Huang, X. J., & Robertson, S. E. (1997). Okapi Chinese text retrieval experiments at TREC-6. Available:...
  • Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In Proceedings of SIGIR’93 (pp....
  • Information Technology Institute. (1997). Chinese results. Available:...
  • Jie, C. Y., et al. (1991). The design and realization Chinese automatic segmenting system CASS. Journal of Chinese Information Processing.
  • Kwok, K. L. (1997a). Comparing representations in Chinese information retrieval. Available:...
  • Kwok, K. L. (1997b). Lexicon effects on Chinese information retrieval. Available:...
  • Kwok, K. L., & Grunfeld, L. (1996). TREC-5 English and Chinese retrieval experiments using PIRCS. Available:...
  • Lane, D. (1997). MG pages. Available:...
  • Leong, M. K., & Zhou, H. (1997). Preliminary qualitative analysis of segmented vs bigram indexing in Chinese....
  • Leung, C. H., et al. (1996). Parallel Chinese word segmentation algorithm based on maximum matching. Neural, Parallel and Science Computations.
  • Lim, H. K. (1999). Chinese text retrieval system. Thesis of Master of Applied Science, School of Applied Science,...