Chinese word segmentation and its effect on information retrieval
Introduction
Research interest in Chinese information retrieval (CIR) has increased as a result of the rapid growth of online Chinese literature. Typically, an IR system determines the relevant documents according to the frequency of occurrence of the query words within the documents and the corpus (Nie, Brisebois, & Ren, 1996). For English and other western languages, the identification of distinct words in the documents is trivial. However, this is much more difficult for Chinese, as for a number of other languages written without word delimiters, since Chinese text appears as a string of ideographic characters without any obvious boundary between words except for punctuation marks at the end of each sentence and occasional commas within sentences.
Chinese text processing therefore requires a segmentation step that breaks the text into smaller linguistic units or segments, normally words (Nie et al., 1996; Wu & Tseng, 1993, 1995). These segments are subsequently used to create the index for query and retrieval operations. Numerous segmentation approaches have been proposed for CIR. As the review of the related literature will show, these approaches can be broadly divided into character-based and word-based approaches. Within these two basic groups there are many alternatives, such as single-character or multiple-character segmentation, the use of a dictionary or statistics, or the introduction of linguistic knowledge for segmentation.
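To make the single-character and multiple-character alternatives concrete, the following minimal sketch produces overlapping character n-grams from a run of Chinese text. The function name `char_ngrams` and the sample string are illustrative assumptions and are not taken from the study.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a punctuation-free text run."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(text) < n:
        # A run shorter than n is kept whole so that no text is dropped.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Illustrative string only; it is not drawn from the test corpus.
sentence = "中国经济发展"
unigrams = char_ngrams(sentence, 1)  # single-character segments
bigrams = char_ngrams(sentence, 2)   # overlapping 2-character segments
print(unigrams)
print(bigrams)
```

With n = 1 every character becomes an index term, while n = 2 yields overlapping bigrams, some of which are real words and some of which straddle word boundaries.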
When applied to the information retrieval (IR) problem, existing CIR studies have consistently shown that the IR results obtained using single-character indexing are significantly worse than those obtained using other segmentation techniques (Tong, Zai, Milic-Frayling, & Evans, 1996). However, no firm conclusions or agreement on the relative performance of multi-character approaches and word-based approaches have so far been reported. Some researchers obtained better results using bigram (2-character) methods while others obtained better results using word-based approaches (Wilkinson, 1997). Opinion is therefore divided: some researchers believe that a better segmentation approach will yield superior IR results (Nie et al., 1996), while others have not found any direct relationship between the segmentation approach and IR results in their experiments (Kwok, 1997a, 1997b). It is also evident that no systematic study has been carried out to investigate this relationship.
Thus, this research aims to systematically investigate the relationship between segmentation accuracy and CIR effectiveness. Manual segmentation and four types of automatic character-based segmentation approaches were used to process and index a test corpus comprising one month’s economic news from the online Chinese People’s Daily newspaper. A number of accuracy measures were defined and computed to gauge the quality of the four automatic segmentation approaches in comparison with the segmented words arising from manual segmentation, which was taken to represent the ideal segmentation case.
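The accuracy measures themselves are not defined in this excerpt. As one plausible illustration only, the sketch below scores an automatic segmentation against a manual one by comparing character spans. It assumes both segmentations are non-overlapping and cover the same text; the names `to_spans` and `segmentation_scores` are hypothetical.

```python
def to_spans(segments: list[str]) -> set[tuple[int, int]]:
    """Convert a list of consecutive segments into (start, end) character spans."""
    spans = set()
    start = 0
    for seg in segments:
        end = start + len(seg)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(auto: list[str], manual: list[str]) -> tuple[float, float]:
    """Return (precision, recall) of automatic segments against manual segments.

    Assumption: a segment counts as correct only if its exact span also
    appears in the manual segmentation. This is not necessarily the
    measure defined in the paper.
    """
    if "".join(auto) != "".join(manual):
        raise ValueError("both segmentations must cover the same text")
    auto_spans = to_spans(auto)
    manual_spans = to_spans(manual)
    correct = len(auto_spans & manual_spans)
    precision = correct / len(auto_spans) if auto_spans else 0.0
    recall = correct / len(manual_spans) if manual_spans else 0.0
    return precision, recall


# Illustrative data only.
manual = ["中国", "经济", "发展"]
auto = ["中国", "经", "济", "发展"]
precision, recall = segmentation_scores(auto, manual)
print(precision, recall)
```

Comparing spans rather than strings avoids crediting a segment that happens to match a manual word at a different position in the text.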
Using these approaches, a total of five different indices were created for the IR experiments. A total of 20 queries were used in the experiments. As with the segmentation carried out on the individual documents of the corpus, the queries were segmented prior to the matching and retrieval process. They were segmented using both the manual and the corresponding automatic approaches, so that two main groups of IR experiments were conducted. This allowed the effect of the relationship between query segmentation and document segmentation on IR effectiveness to be assessed. The traditional IR effectiveness measures of data recall and data precision were computed and contrasted across segmentation approaches. Statistical analysis was applied to explore the correlation between segmentation accuracy and IR effectiveness. To probe further, a number of queries were identified and examined in detail to reveal the cause and effect of the retrieval results, thereby providing a more thorough understanding of the retrieved results. From these experimental results, it became possible to derive a set of conclusions.
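For readers unfamiliar with the two effectiveness measures, the sketch below computes set-based precision and recall for a single query. The document identifiers and the function name `precision_recall` are hypothetical, and the excerpt does not specify whether the study computed the measures at fixed cut-offs or over full result lists.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents in the corpus
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Averaging such per-query values over the query set for each index would give figures that can be compared across segmentation approaches.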
The rest of the paper is organised as follows. Following a review of related literature, the methodology used for the study is presented. This describes the automatic approaches that were used to create the index and query segmentation for the IR experiments. A number of accuracy measures were defined and computed for these different approaches. The setting of the IR experiments and results from the experiments are subsequently reported using the measures of data recall and precision. A number of queries were used and analysed in detail to aid the explanation of the results arising from different document segmentation approaches, query segmentation approaches, and the effect of the existence of ambiguous word segments. The paper concludes with a summary of the pertinent findings and suggestions for future work.
Section snippets
Chinese segmentation for IR
The basic approaches of Chinese segmentation can be roughly divided into two groups, namely, character-based approaches and word-based approaches as shown in Fig. 1.
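As an illustration of the word-based group, the following minimal sketch implements dictionary-driven forward maximum matching, a commonly described word-based technique. The excerpt does not state that this algorithm was used in the study; the dictionary contents, the `max_len` parameter, and the function name `forward_max_match` are assumptions made for the example.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Segment text greedily, taking the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single-character
    segments so that the whole text is always covered.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical toy dictionary; a real system would use a large lexicon.
toy_dictionary = {"中国", "经济", "发展", "经济发展"}
print(forward_max_match("中国经济发展", toy_dictionary))
```

Because the match is greedy, the output depends heavily on dictionary coverage, which is one reason dictionary-based and character-based methods can produce different index terms for the same text.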
Segmentation experiments
The following sections report on the research to systematically investigate the relationship between segmentation accuracy and its effect on CIR.
Test corpus and query formulation
The same set of documents used for the segmentation experiments was used for the IR experiments. Based on the overall economic subject content of these documents, a set of queries was first proposed by four native Chinese speakers from the People's Republic of China (PRC) who were also graduate research students at Nanyang Technological University (NTU). A total of 41 queries were initially elicited. Queries were expressed either as complete sentences or phrases. A final set of 20 queries was
Analysis and discussion of results
In this research, an attempt was made to carry out an in-depth analysis of a number of results with the aim of investigating how segmentation results account for differences in IR performance. The analysis was restricted to cases where the results demonstrated a significant difference according to the paired t-test on average IR performance. At the same time, a new set of paired t-tests was conducted on individual queries to identify queries that exhibited significant
Conclusions and future work
Based on the work in this research, it can be concluded that the segmentation approaches used for document and query processing do influence the IR results, although there is no direct relationship between segmentation accuracy and IR results. The following important observations can be drawn from the research:
- The IR performance is affected by both document segmentation and query segmentation. If one of them is kept constant and the other is varied using different approaches,