Using Web structure and summarisation techniques for Web content mining

https://doi.org/10.1016/j.ipm.2004.08.003

Abstract

The dynamic nature and size of the Internet can make relevant information difficult to find. Most users express their information need to search engines via short queries, and they often have to sift manually through the search results in the order set by the search engine’s relevance ranking, which makes relevance judgement time-consuming. In this paper, we describe a novel representation technique that uses the Web structure together with summarisation techniques to better represent knowledge in actual Web Documents. We name the proposed technique Semantic Virtual Document (SVD). We discuss how the proposed SVD can be used together with a suitable clustering algorithm to achieve automatic content-based categorisation of similar Web Documents. The auto-categorisation facility and a “Tree-like” Graphical User Interface (GUI) for post-retrieval document browsing enhance the relevance judgement process for Internet users. Furthermore, we introduce how our cluster-biased automatic query expansion technique can be used to overcome the ambiguity of short queries typically given by users. We outline our experimental design to evaluate the effectiveness of the proposed SVD for representation and present a prototype called iSEARCH (Intelligent SEarch And Review of Cluster Hierarchy) for Web content mining. Our results confirm, quantify and extend previous research using Web structure and summarisation techniques, introducing novel techniques for knowledge representation to enhance Web content mining.

Introduction

The rapid growth of the Internet has led to the development of Internet2. Web surfers view information retrieved from the Internet as rich and relevant. Because of the enormous amount of information on the Internet, users typically rely on search engines to help them discover relevant information. The Graphic, Visualisation and Usability (GVU) Centre’s tenth WWW user survey, conducted in October 1998, showed that about 85% of people use search engines to locate information (GVU, 1998). However, the dynamic nature and size of the Internet can result in searches that are incomplete or outdated, or that return a large number of documents. In addition, users of search engines typically have little or no training in how best to use them, and they tend not to use the advanced search features that many search engines now offer. Researchers have developed many different techniques to address this challenging problem of locating relevant Web information effectively and efficiently. Examples of such techniques include meta-searching, post-retrieval analysis and enhanced visualisation of search results (Chen et al., 2001; Hearst and Pedersen, 1996; Zamir and Etzioni, 1999).

The main objective of this research is to investigate how the Web structure together with summarisation techniques can be used for Web content mining to address the challenging problem of locating relevant Web information effectively and efficiently with the help of search engine technologies. In other words, the following novel techniques will be exploited:

  • A method to better represent knowledge in actual Web Documents.

  • Content-based automatic clustering of Web Documents.

  • Intuitive GUI for visualising and browsing the clustering results.

  • Term selection in pseudo-relevance feedback to overcome the ambiguity of short queries.

The motivation for the research detailed in this paper is twofold. It emanates both from a need to fill a research gap in content-based knowledge representation of actual Web Documents and from a need to give users a better means of discovering relevant information on the Internet effectively and efficiently.

Web mining is the use of data mining techniques to automatically discover and extract information from Web Documents and services (Etzioni, 1996). It can be classified into three categories: Web content mining, Web structure mining and Web usage mining (Kosala & Blokeel, 2000). Web content mining refers to the discovery of useful information from Web contents. It encompasses resource discovery from the Web (Chakrabarti et al., 1999; Cho et al., 1998), document categorisation and clustering (Kohonen et al., 2000; Zamir and Etzioni, 1999), and information extraction from Web pages (Chang et al., 2003; Tolle and Chen, 2000). Many search engines are available on the Internet, each having its own characteristics and employing different algorithms to index, rank and present Web Documents. Users typically use search engines to help them discover relevant information or to achieve a certain level of Web content mining. However, current search engines have the following major limitations:

  • Users are presented with either too few or too many search results, ordered by relevance ranking, and have to sift through them manually one by one (Tombros & Sanderson, 1998).

  • Users typically use short keywords as the query (Spink & Xu, 2000) that may not fully describe their interest as they may have only a vague idea of what information is needed. Another recent survey conducted by NEC Research Institute shows that about 70% of Web users typically use only a single keyword or search term (Butler, 2000).

  • The search results have low precision because many of them are irrelevant. This makes the relevant information difficult to find (Kosala & Blokeel, 2000).

  • The search results have low recall because search engines cannot index all the information available on the Web, such as dynamically generated Web Documents. This makes relevant but unindexed information difficult to find (Kosala & Blokeel, 2000).
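The precision and recall limitations listed above follow the standard information retrieval definitions. They can be illustrated with a minimal sketch; the function name and the document identifiers below are hypothetical and are not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the engine returns five pages, two of them relevant.
# Two further relevant pages (d8, d9) were never indexed, so they cannot
# be retrieved and only lower the recall.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d4", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this toy case the irrelevant results (d1, d3, d5) reduce precision, while the unindexed documents reduce recall, which mirrors the last two limitations in the list.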


To overcome the first limitation listed above, Web content mining techniques have been applied to improve the searching experience in general (Chen et al., 2001; Zamir and Etzioni, 1999). On the other hand, we believe that the various search engines are also actively researching new techniques (Google Search Engine, n.d.; Olsen, 2002) to overcome the last limitation listed above.

The multi-stage process of search, starting with a general query and then getting more specific, has been investigated and is well documented in non-Web search (Marchionini, 1995). However, little work has been done to incorporate such an idea into Web search; the most relevant work we found in the literature is reported in Chang and Hsu (1999) and Crimmins and Smeaton (1999). Furthermore, categorisation and clustering techniques have also been investigated as post-retrieval document browsing techniques, where search results are classified into categories so that the user can browse and navigate through the set of retrieved documents more easily. NorthernLight Search Engine (online) is an example of a search engine that categorises retrieved Web pages into predefined search categories called “Custom Search Folders”. Another approach is to categorise Web pages on the fly without resorting to predefined categories. For instance, SONIA (Sahami, Yusufali, & Baldonado, 1998) is a meta-search engine that clusters search results, extracts keywords to describe each cluster and allows the user to expand the search within a cluster. Scatter/Gather (Cutting et al., 1992; Cutting et al., 1993; Hearst and Pedersen, 1996) is another example of a system that allows users to refine their search iteratively by clustering documents interactively and browsing the results.

Most Web Documents available on the Internet are written in Hyper Text Markup Language (HTML), which allows an author to organise the presentation of a document’s content by means of special tags that are interpreted by Web browsers. Web Documents can contain both multimedia information and connections to other documents through hyperlinks. A hyperlink is often created on the principle that links connect documents that are similar. Hyperlinks are increasingly being used to improve the ability to organise, search and analyse the Web (Brin and Page, 1998; Yang et al., 2002). Previous research has shown that extended anchortext, rather than document full-text (Glover, Tsioutsiouliklis, Lawrence, Pennock, & Flake, 2002), and query-biased summarisation (White, Jose, & Ruthven, 2003) are more effective in representing Web Documents. Moreover, results have shown that query expansion using document summaries can be considerably more effective than using full-document expansion (Lam-Adesina & Jones, 2001).
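To make the notion of extended anchortext more concrete, the following is a minimal sketch that collects the text of each hyperlink together with a few surrounding words. It uses only the Python standard library; the class and function names, the window size and the sample page are assumptions made for illustration and do not reproduce the method of Glover et al. (2002) or of this paper.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect visible words and the word span covered by each hyperlink."""

    def __init__(self):
        super().__init__()
        self.words = []      # all text words, in document order
        self.anchors = []    # (href, start_index, end_index) per link
        self._open = None    # (href, start_index) of the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def extended_anchortext(html, window=3):
    """Map each href to its anchor text plus `window` words on each side.

    The window size is an illustrative assumption. If a page links to the
    same href more than once, this sketch keeps only the last occurrence.
    """
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    result = {}
    for href, start, end in parser.anchors:
        lo = max(0, start - window)
        hi = min(len(parser.words), end + window)
        result[href] = " ".join(parser.words[lo:hi])
    return result


# Hypothetical page fragment containing one hyperlink.
page = ('<p>For background on ranking see the <a href="http://example.org/pr">'
        'PageRank paper</a> by Brin and Page, which uses link analysis.</p>')
print(extended_anchortext(page))
```

The words around a link often describe the target page in the linking author's own terms, which is why such a window can serve as a compact description of the target document.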

In this paper, we describe how the proposed Semantic Virtual Document (SVD) can be applied to better represent knowledge in actual Web Documents. We also present content-based automatic clustering of Web Documents using the Hierarchical Agglomerative Clustering (HAC) algorithm and a “Tree-like” GUI for post-retrieval document browsing to enhance the relevance judgement process. Furthermore, we introduce how our cluster-biased automatic query expansion technique can be used to overcome the ambiguity of short queries typically given by users. Finally, we outline our experimental design to evaluate the effectiveness of the proposed SVD via a prototype system, iSEARCH.

The remainder of this paper is organised as follows: Section 2 discusses the proposed SVD for knowledge representation. In Section 3, we present the well-known HAC algorithm and the proposed SVDs to automatically reorganise the results returned by search engines. Section 4 presents our “Tree-like” visual interface for browsing the document clustering results while cluster-biased automatic query expansion will be presented in Section 5. In Section 6, we will outline the design and results of our experiments to validate the proposed methods via iSEARCH. Finally, Section 7 provides conclusions and future work.

Section snippets

Semantic Virtual Documents

We will discuss and present our techniques for knowledge representation of actual Web Documents using the proposed SVDs, which contain context-dependent summaries that are highly descriptive of the actual Web Documents’ contents. Each SVD not only makes use of extended anchortext instead of document full-text (Glover et al., 2002) and query-biased summarisation technique (Tombros and Sanderson, 1998; White et al., 2003) but also incorporates our novel anchortext-biased summarisation technique in

Clustering of Web Documents

In order to automatically reorganise the results returned by search engines, we will discuss and present content-based automatic clustering of Web Documents using the HAC technique and the proposed SVD. We will also describe our data structure together with our fast implementation technique of HAC to speed up the automatic document clustering process. In addition, we will also illustrate how to represent the computer-generated clusters with descriptive textual summaries.
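As a rough illustration of agglomerative clustering over short document representations, the sketch below repeatedly merges the two most similar clusters until no pair is similar enough. It is a naive implementation with centroid-style linkage, cosine similarity over raw term counts and a fixed stopping threshold; these choices, the function names and the sample texts are all assumptions made for illustration, and the sketch does not reproduce the paper's data structure or its fast HAC implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hac(texts, threshold=0.4):
    """Naive hierarchical agglomerative clustering (centroid-style linkage).

    Each cluster is (member_indices, summed_term_vector). The most similar
    pair of clusters is merged until no pair exceeds `threshold`.
    Returns the clusters as sorted lists of document indices.
    """
    clusters = [([i], Counter(t.lower().split())) for i, t in enumerate(texts)]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:  # no remaining pair is similar enough
            break
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in pair] + [merged]
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical short summaries standing in for document representations
# returned for the ambiguous query "jaguar".
docs = [
    "jaguar car dealer price",
    "jaguar car engine review",
    "jaguar cat habitat rainforest",
    "jaguar cat prey rainforest",
]
print(hac(docs, threshold=0.4))
```

With this threshold the two car-related summaries and the two animal-related summaries end up in separate clusters; lowering the threshold merges everything into one cluster. The pairwise search makes each merge quadratic in the number of clusters, which is the kind of cost a faster implementation would aim to reduce.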

“Tree-like” GUI: a visual interface for browsing

Most Web search engines are text based. They display search results from user queries as long lists of pointers, with or without summaries of the retrieved pages. Proposals for visualising the output of an information retrieval system were presented as early as the 1960s (Sammon, 1969). Scatter/Gather (Cutting et al., 1992; Cutting et al., 1993; Hearst and Pedersen, 1996) and Vivisimo (Vivisimo, 2000; http://vivisimo.com) are examples of visual presentations of search results that allows

Pseudo-relevance feedback

To address the problem of word mismatch (Furnas, Landauer, Gomez, & Dumais, 1987) and short queries (Butler, 2000) typically used by search engine users, researchers have shown that query expansion using document summaries can be considerably more effective than using full-document expansion (Lam-Adesina & Jones, 2001). We use the vector space model instead of the probabilistic model for both term weighting and pseudo-relevance feedback in our system. As mentioned earlier, our system can

Experimental results

In order to demonstrate the feasibility and effectiveness of the proposed SVD in Web content mining, we have developed a prototype system called iSEARCH and conducted several experiments with online Web Documents. We first compared SVD with hypertext knowledge representation where only the actual Web Documents were used. Next, we also considered the effectiveness of only query-biased summaries created for the actual Web Documents as another form of knowledge representation technique.

For our

Conclusions and future work

We introduced a novel technique, SVD, for the representation of actual Web Documents. In addition, we presented a prototype system that combines SVD representation, HAC clustering, a “Tree-like” GUI and cluster-biased automatic query expansion to enhance the relevance judgement process.

Experimental results have shown that SVD representation resulted in faster and more accurate document clustering. Furthermore, term suggestion based on the proposed cluster-biased automatic query

References (47)

  • Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the 7th...
  • Crimmins, F., et al. (1999). TetraFusion: Information discovery on the Internet. IEEE Intelligent Systems.
  • Cutting, D. R., Karger, D. R., & Pedersen, J. O. (1993). Constant interaction-time scatter/gather browsing of very...
  • Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to...
  • Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine. Communications of the ACM.
  • Flake, G. W., Lawrence, S., & Giles, C. L. (2000). Efficient identification of Web communities. In Proceedings of the...
  • Frauenfelder, M. (2001). A smarter Web. Technology Review.
  • Furnas, G. W., et al. (1987). The vocabulary problem in human–system communication. Communications of the ACM.
  • Gallagher, D. F. (2002). The Web’s missing links. Technology Review.
  • Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., & Flake, G. W. (2002). Using Web structure for...
  • Google Search Engine (n.d.). [WWW page]. URL...
  • Graphic, Visualisation and Usability (1998). [WWW page]. WWW user surveys. URL...
  • Hearst, M. A., & Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In...