Abstract
In the probabilistic language modeling approach to information retrieval, a model is estimated for each document individually. However, as Web pages become more complex, a single page may contain several blocks that discuss different topics. Consequently, the performance of a statistical model for a Web document tends to be degraded by this mixture of topics. In this paper, we argue that segmenting a Web page into several relatively independent blocks assists language modeling, and we propose a Block-based Language Modeling (BLM) approach. Unlike the standard method, BLM divides the modeling process into two parts: the probability of a query occurring in a block, and the probability of a block occurring in a Web page. Given a query, pages with more relevant blocks then tend to be retrieved. Experimental results show that, when a unigram model is used, our approach outperforms the original language modeling approach for Web search in most cases.
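The two-part decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimators, so everything specific here is an assumption made for illustration: the page score is taken as a sum over blocks of P(query | block) × P(block | page), P(block | page) is approximated by the block's share of the page's tokens, and the unigram block model is smoothed against a collection model with Jelinek-Mercer interpolation. The function names and the parameter `lam` are hypothetical and do not come from the paper.

```python
from collections import Counter


def unigram_prob(term, counts, total, coll_counts, coll_total, lam=0.5):
    """Smoothed P(term | block).

    Jelinek-Mercer interpolation with a collection model is an assumed
    smoothing choice, not necessarily the one used in the paper.
    """
    p_block = counts[term] / total if total else 0.0
    p_coll = coll_counts[term] / coll_total if coll_total else 0.0
    return (1 - lam) * p_block + lam * p_coll


def score_page(query, blocks, coll_counts, coll_total, lam=0.5):
    """Score a page as sum over blocks of P(query | block) * P(block | page).

    P(block | page) is approximated by the block's share of the page's
    tokens; this is an illustrative assumption.
    """
    page_len = sum(len(block) for block in blocks)
    if page_len == 0:
        return 0.0
    score = 0.0
    for block in blocks:
        counts = Counter(block)
        p_block_given_page = len(block) / page_len
        # Unigram model: query terms are treated as independent.
        p_query_given_block = 1.0
        for term in query:
            p_query_given_block *= unigram_prob(
                term, counts, len(block), coll_counts, coll_total, lam
            )
        score += p_query_given_block * p_block_given_page
    return score


# Hypothetical toy data: each page is a list of blocks, each block a token list.
page_a = [["jaguar", "car", "engine"], ["weather", "rain", "sun"]]
page_b = [["weather", "rain", "sun"], ["stock", "market", "news"]]
collection = Counter(tok for page in (page_a, page_b) for blk in page for tok in blk)
collection_size = sum(collection.values())

query = ["jaguar", "car"]
score_a = score_page(query, page_a, collection, collection_size)
score_b = score_page(query, page_b, collection, collection_size)
```

In this toy setup, the page containing a block that is focused on the query terms receives the higher score, which mirrors the abstract's statement that pages with more relevant blocks tend to be retrieved. Because each block is modeled separately, the off-topic block in the same page does not dilute the term frequencies of the relevant block.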
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, S., Huang, S., Xue, GR., Yu, Y. (2005). Block-Based Language Modeling Approach Towards Web Search. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_18
DOI: https://doi.org/10.1007/978-3-540-31849-1_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)