ABSTRACT
This paper considers the problem of identifying compound documents (cDocs) on the Web, that is, groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several HTML pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction.
In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine-learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs and, in this new setting, compares the local machine-learned clustering algorithm from [4] to a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.
- [1] Eiron, N., McCurley, K. S. Untangling compound documents on the web. In Proceedings of Hypertext'2003.
- [2] Dmitriev, P. As We May Perceive: Finding the Boundaries of Compound Documents on the Web. Ph.D. Dissertation, Cornell University, January 2008.
- [3] Dmitriev, P., Lagoze, C. Mining Generalized Graph Patterns based on User Examples. In Proceedings of ICDM'2006.
- [4] Dmitriev, P., Lagoze, C., Suchkov, B. As We May Perceive: Inferring Logical Documents from Hypertext. In Proceedings of Hypertext'2005.
- [5] Li, W.-S., Kolak, O., Vu, Q., Takano, H. Defining logical domains in a Web Site. In Proceedings of Hypertext'2000.
- [6] McCallum, A., Wellner, B. Toward Conditional Models of identity uncertainty with application to proper noun coreference. In Proceedings of IJCAI-IIWeb'2003.