poster

As we may perceive: finding the boundaries of compound documents on the web

Author:
Pavel Dmitriev

Cornell University, Ithaca, NY, USA


WWW '08: Proceedings of the 17th international conference on World Wide Web, April 2008, Pages 1029–1030, https://doi.org/10.1145/1367497.1367640

Published: 21 April 2008

ABSTRACT

This paper considers the problem of identifying on the Web compound documents (cDocs) -- groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several html pages, or a set of pages describing specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications including web and intranet search, user navigation, automated collection generation, and information extraction.

Several heuristic approaches have been proposed in the past to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs and, in this new setting, compares the local machine-learned clustering algorithm from [4] to a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.
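As a rough illustration of what grouping pages into cDocs can look like, the sketch below treats hyperlinks as edges carrying a "same cDoc" score and merges pages joined by high-scoring edges. This is not the algorithm of [4], [3], or [6]; the function name `cluster_pages`, the score values, the example URLs, and the 0.5 threshold are all assumptions made for the example. In a learned setting, the scores would presumably come from a trained model rather than being written by hand.

```python
from collections import defaultdict


def cluster_pages(pages, edge_scores, threshold=0.5):
    """Group pages into candidate cDocs (illustrative sketch only).

    pages: list of page identifiers (e.g. URL paths).
    edge_scores: dict mapping (page_a, page_b) -> score in [0, 1]; a
        hypothetical stand-in for a learned "same cDoc" estimate.
        Every page named in an edge must also appear in `pages`.
    threshold: hypothetical cutoff; lower-scoring edges are ignored.
    """
    # Union-find: every page starts out as its own group.
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the group root, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the groups of any two pages joined by a confident edge.
    for (a, b), score in edge_scores.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each group and return them in sorted order.
    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up site: a three-page news article and a two-page camera listing.
pages = [
    "/news/story-p1", "/news/story-p2", "/news/story-p3",
    "/camera/specs", "/camera/reviews",
]
scores = {
    ("/news/story-p1", "/news/story-p2"): 0.9,
    ("/news/story-p2", "/news/story-p3"): 0.8,
    ("/news/story-p3", "/camera/specs"): 0.1,  # weak cross-topic link
    ("/camera/specs", "/camera/reviews"): 0.7,
}
clusters = cluster_pages(pages, scores)
```

Here the weak link between the news article and the camera pages falls below the cutoff, so the two groups stay separate. Because each merge decision looks at a single scored edge, the sketch is local in spirit; a global approach would instead reason over the structure of the whole site graph.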

References

[1] Eiron, N., McCurley, K. S. Untangling compound documents on the web. In Proceedings of Hypertext'2003.
[2] Dmitriev, P. As We May Perceive: Finding the Boundaries of Compound Documents on the Web. Ph.D. Dissertation, Cornell University, January 2008.
[3] Dmitriev, P., Lagoze, C. Mining Generalized Graph Patterns based on User Examples. In Proceedings of ICDM'2006.
[4] Dmitriev, P., Lagoze, C., Suchkov, B. As We May Perceive: Inferring Logical Documents from Hypertext. In Proceedings of Hypertext'2005.
[5] Li, W.-S., Kolak, O., Vu, Q., Takano, H. Defining logical domains in a Web Site. In Proceedings of Hypertext'2000.
[6] McCallum, A., Wellner, B. Toward Conditional Models of identity uncertainty with application to proper noun coreference. In Proceedings of IJCAI-IIWeb'2003.

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN:9781605580852
DOI:10.1145/1367497
General Chairs:
Jinpeng Huai, Beihang University, China
Robin Chen, AT&T Labs, USA
Hsiao-Wuen Hon, Microsoft Research Asia, China
Yunhao Liu, HK University of Science and Technology, Hong Kong
Program Chairs:
Wei-Ying Ma, Microsoft Research Asia, China
Andrew Tomkins, Yahoo! Research, USA
Xiaodong Zhang, The Ohio State University, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
compound documents
machine learning
www
Qualifiers
- poster