Abstract
Much research in recent years has been devoted to meta-search and multilingual search, with the aim of improving search performance and broadening search scope. Because most existing web search algorithms were originally developed for English web documents, their efficiency and performance when applied to documents in other languages are open to question. In this work, we take Chinese web search and Chinese web documents as our case study. We identify and discuss potential issues and problems in applying well-known English-based algorithms to Chinese web documents. Our qualitative and exploratory quantitative analysis leads us to conclude that these algorithms and techniques cannot be used directly to develop an efficient Chinese search engine.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yu, L., Li, K.F., Manning, E.G. (2006). The Adaptability of English Based Web Search Algorithms to Chinese Search Engines. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_36
DOI: https://doi.org/10.1007/11610113_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)