Abstract
When a query is submitted to a search engine, the search engine returns a dynamically generated result page that contains the number of hits (i.e., the number of matching results) for the query. The hit number is useful in many applications, such as obtaining document frequencies of terms, estimating the sizes of search engines, and generating search engine summaries. In this paper, we propose a novel technique for automatically identifying the hit number for any search engine and any query. The technique consists of three steps: first, segment each result page into a set of blocks; second, identify the block(s) that contain the hit number using a machine learning approach; and finally, extract the hit number from the identified block(s) by comparing the patterns in multiple blocks from the same search engine. Experimental results indicate that this technique is highly accurate.
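The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the line-based segmentation, the cue-word heuristic (a stand-in for the learned classifier), the rule of picking the numeric slot whose value changes between queries, and all function names and sample pages are assumptions made for illustration only.

```python
import re

# A run of digits that may contain thousands separators, e.g. "1,230,000".
NUMBER = re.compile(r"\d[\d,]*")

# Hypothetical cue words; the paper uses a trained classifier instead.
CUE_WORDS = ("results", "of", "about", "found", "matches", "hits")


def segment_blocks(page_text):
    """Step 1 (simplified): treat every non-empty line as one block."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def looks_like_hit_block(block):
    """Step 2 (heuristic stand-in): short block with a number and a cue word."""
    words = block.lower().split()
    has_cue = any(word in CUE_WORDS for word in words)
    return len(words) <= 15 and NUMBER.search(block) is not None and has_cue


def extract_hit_numbers(blocks):
    """Step 3: compare blocks from different queries on the same engine.

    Numbers that stay constant (such as "1 - 10") are treated as template
    text; the last numeric slot whose value varies is taken as the hit number.
    """
    if not blocks:
        return None
    numbers = [[int(m.replace(",", "")) for m in NUMBER.findall(b)] for b in blocks]
    slots = min(len(n) for n in numbers)
    varying = [i for i in range(slots) if len({n[i] for n in numbers}) > 1]
    if not varying:
        return None
    slot = varying[-1]
    return [n[slot] for n in numbers]


# Made-up result pages from the same (hypothetical) engine for two queries.
page_a = """Web Images News
Results 1 - 10 of about 1,230,000 for java
Java Tutorial - Learn Java in 10 days
"""
page_b = """Web Images News
Results 1 - 10 of about 45,600 for haskell
Haskell Language
"""

hit_blocks = []
for page in (page_a, page_b):
    candidates = [b for b in segment_blocks(page) if looks_like_hit_block(b)]
    hit_blocks.append(candidates[0])

hits = extract_hit_numbers(hit_blocks)
print(hits)
```

Comparing several pages is what lets the sketch tell the hit number apart from the constant range numbers in the same block; a single page would not be enough for this rule.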
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ling, Y., Meng, X., Meng, W. (2006). Automated Extraction of Hit Numbers from Search Result Pages. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_7
DOI: https://doi.org/10.1007/11775300_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)