DOI: 10.1145/1008992.1009070
Article

Block-based web search

Published: 25 July 2004

Abstract

Web pages often cover multiple topics and vary widely in length, two factors that significantly degrade web search performance. In this paper, we explore the use of page segmentation algorithms to partition web pages into blocks and investigate how block-level evidence can improve retrieval performance in the web context. Because of the special characteristics of web pages, different page segmentation methods affect web search performance differently. We compare four types of methods: fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method that integrates both semantic and fixed-length properties. We perform experiments on block-level query expansion and retrieval. Among the four approaches, the combined method achieves the best performance for web search. Our experimental results also show that such semantic partitioning of web pages deals effectively with the problems of multiple drifting topics and mixed lengths, and thus has great potential to boost the performance of current web search engines.
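As a rough illustration of the simplest of the four approaches, the sketch below splits a page into fixed-length, overlapping word windows and scores the page by its best-matching block. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the window and overlap sizes, and the toy term-overlap score are all hypothetical, since the abstract does not specify them.

```python
from collections import Counter


def fixed_length_blocks(text, window=200, overlap=100):
    """Split text into overlapping windows of `window` words.

    The window and overlap defaults are illustrative assumptions,
    not values taken from the paper.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    words = text.split()
    if not words:
        return []
    step = window - overlap
    blocks = []
    for start in range(0, len(words), step):
        blocks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last window already reaches the end of the page
    return blocks


def term_overlap_score(query, passage):
    """Toy relevance score: query-term occurrences per word of passage.

    A stand-in for a real ranking function; chosen only for brevity.
    """
    terms = set(query.lower().split())
    words = passage.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[t] for t in terms) / len(words)


def best_block_score(query, page_text, window=200, overlap=100):
    """Score a page by its highest-scoring block (block-level evidence)."""
    blocks = fixed_length_blocks(page_text, window, overlap)
    return max((term_overlap_score(query, b) for b in blocks), default=0.0)


if __name__ == "__main__":
    # A page whose relevant topic is buried between unrelated material.
    page = " ".join(["news"] * 30 + ["jaguar", "speed"] * 5 + ["sports"] * 30)
    print(term_overlap_score("jaguar speed", page))
    print(best_block_score("jaguar speed", page, window=10, overlap=5))
```

With this toy score, a short on-topic block inside a long multi-topic page scores higher than the page as a whole, which is the kind of effect block-level evidence is meant to capture.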


Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. vision-based page segmentation
  2. page segmentation
  3. passage retrieval
  4. query expansion
  5. web information retrieval

Qualifiers

  • Article

Conference

SIGIR04
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2022) Online learning agents for cost-sensitive topical data acquisition from the web. Intelligent Data Analysis 26:3 (695-722). DOI: 10.3233/IDA-205107. Online publication date: 18-Apr-2022
  • (2022) Customizable Tabular Access to Web Data Records for Convenient Low-vision Screen Magnifier Interaction. ACM Transactions on Accessible Computing 15:2 (1-22). DOI: 10.1145/3517044. Online publication date: 19-May-2022
  • (2022) Multi-CPR. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (3046-3056). DOI: 10.1145/3477495.3531736. Online publication date: 6-Jul-2022
  • (2021) Glean. Proceedings of the VLDB Endowment 14:6 (997-1005). DOI: 10.14778/3447689.3447703. Online publication date: 12-Apr-2021
  • (2021) Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (326-334). DOI: 10.1145/3486622.3493938. Online publication date: 14-Dec-2021
  • (2021) Semantic table-of-contents for efficient web screen reading. Proceedings of the 36th Annual ACM Symposium on Applied Computing (1941-1949). DOI: 10.1145/3412841.3442066. Online publication date: 22-Mar-2021
  • (2021) Postal address extraction from the web: a comprehensive survey. Artificial Intelligence Review. DOI: 10.1007/s10462-021-09983-1. Online publication date: 14-Mar-2021
  • (2021) Unsupervised Recognition of the Logical Structure of Business Documents Based on Spatial Relationships. Computer Analysis of Images and Patterns (57-72). DOI: 10.1007/978-3-030-89131-2_6. Online publication date: 31-Oct-2021
  • (2020) TableView: Enabling Efficient Access to Web Data Records for Screen-Magnifier Users. Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility (1-12). DOI: 10.1145/3373625.3417030. Online publication date: 26-Oct-2020
  • (2020) Boilerplate Removal using a Neural Sequence Labeling Model. Companion Proceedings of the Web Conference 2020 (226-229). DOI: 10.1145/3366424.3383547. Online publication date: 20-Apr-2020
