Mining the Web for relations

https://doi.org/10.1016/S1389-1286(00)00085-2

Abstract

The Web is a vast source of information. However, due to the disparate authorship of Web pages, this information is buried in its amorphous and chaotic structure. At the same time, with the pervasiveness of Web access, an increasing number of users are relying on Web search engines for interesting information. We are interested in identifying how pieces of information are related as they are presented on the Web. One such problem is that of studying patterns of occurrences of related phrases in Web documents and identifying relationships between these phrases. We call these the duality problems of the Web. Duality problems arise in trying to define and identify two sets of inter-related concepts, and are solved by iteratively refining mutually dependent coarse definitions of these concepts. In this paper we define and formalize the general duality problem of relations on the Web. The duality of patterns and relationships is important because it allows us to define the rules of patterns and relationships iteratively through the multitude of their occurrences. Our solution includes Web crawling to iteratively refine the definitions of patterns and relations. As an example, we solve the problem of identifying acronyms and their expansions through patterns of occurrences of (acronym, expansion) pairs as they appear in Web pages.

Introduction

The World Wide Web is a vast source of information. However, the Web consists of an ever-growing set of pages authored by people with vastly differing cultures, interests, and educational levels, while the goal of each individual Web page author is to furnish information. Web crawlers visit these pages and index them to serve search engines. As crawlers analyze Web pages, they could look for and learn interesting pieces of information that would otherwise remain buried in those pages. For instance, a crawler could analyze link information to identify how many pages point to a given Web page, and how many pages that page points to. Based upon this information, the crawler can identify pages that are authorities on certain topics and pages that are starting points (hubs) for such authorities [3,6]. This technique can be extended to identify communities on the Web, which consist of pages that point to each other in particular ways [9].

Links are only one type of relation connecting entities of the Web (Web pages in this case). There could be other kinds of relationships of a semantic nature between entities. Identifying these relationships, and the patterns of occurrences of these relationships, can help surface valuable information buried in the Web. This will be increasingly the case as access to the Web becomes pervasive and as end users rely on the Web to look for information with an expectation of reliability. For instance, one may be interested in searching the Web to find the author of a particular book, or to find all books written by a particular author [2]. Such information is typically not served well by today's search engines.

The vision of a semantic Web [1] includes collaborations that extend to machines that are capable of analyzing all the data on the Web for content, links, and transactions. We are not anywhere near realizing this vision yet, but we do have some loose structure in the form of text, structure, and links in HTML. We would like to exploit what is available to find interesting information.

The rest of this paper is organized as follows. Section 2 discusses duality problems in the World Wide Web. Section 3 formalizes the duality problem of patterns and relations. Section 4 extends this to higher-level duality problems. In Section 5 we solve a 2-level duality problem of finding acronyms and their expansions in detail. Section 6 discusses the issues in mining over structures and links. Section 7 further generalizes the duality problem and formulates how to discover new relationships. In Section 8 we discuss related research work and in Section 9 we draw conclusions and discuss work in progress and future work.

Section snippets

Duality problems in the World Wide Web

Duality problems arise in trying to identify two sets of inter-related concepts. Consider the problem of extracting a relation of books, (author, title) pairs, from the Web [2]. Intuitively, the problem can be solved as follows (a code sketch follows the list).

  1. Begin with a small seed set of (author, title) pairs.
  2. Find all occurrences of those pairs on the Web.
  3. Identify patterns for the citations of the books from these occurrences.
  4. Search the Web for these patterns to recognize new (author, title) pairs.
  5. Iterate from step 2 with the expanded set of pairs.
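To make the loop concrete, here is a minimal, self-contained Python sketch run over a tiny in-memory "Web" of strings rather than a real crawl. The corpus, the seed pair, and the prefix/middle pattern heuristic are illustrative assumptions, not the authors' implementation.

```python
# Bootstrapping (author, title) pairs from repeated occurrences: seed pairs
# yield patterns, patterns yield new pairs, and the process iterates.
import re

corpus = [  # stand-in for crawled Web pages (made-up examples)
    "Recommended reading: Isaac Asimov's classic <i>Foundation</i>.",
    "Recommended reading: Frank Herbert's classic <i>Dune</i>.",
    "A review of <i>Neuromancer</i> by William Gibson appeared last week.",
]
seed_pairs = {("Isaac Asimov", "Foundation")}

def find_patterns(pairs, docs):
    """Steps 2-3: locate occurrences of known pairs and keep the text just
    before the author and between author and title as a crude pattern."""
    patterns = set()
    for author, title in pairs:
        for doc in docs:
            a, t = doc.find(author), doc.find(title)
            if a != -1 and t != -1 and a < t:
                prefix = doc[max(0, a - 10):a]      # context before the author
                middle = doc[a + len(author):t]     # context between author and title
                patterns.add((prefix, middle))
    return patterns

def find_pairs(patterns, docs):
    """Step 4: turn each pattern into a regular expression and extract new pairs."""
    pairs = set()
    for prefix, middle in patterns:
        regex = re.escape(prefix) + r"([A-Z][\w. ]+?)" + re.escape(middle) + r"([A-Z][\w ]+?)<"
        for doc in docs:
            for author, title in re.findall(regex, doc):
                pairs.add((author.strip(), title.strip()))
    return pairs

# Step 5: iterate until no new pairs are discovered.
pairs = set(seed_pairs)
while True:
    new_pairs = find_pairs(find_patterns(pairs, corpus), corpus)
    if new_pairs <= pairs:
        break
    pairs |= new_pairs
print(pairs)   # recovers ("Frank Herbert", "Dune") from the shared citation pattern
```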

Duality of relations and patterns on the Web

Let W be a large database of documents such as the Web. Let R={ri|i=1,…,n} and P={pj|j=1,…,m} be sets of relations and patterns, respectively. A relation is a pair of inter-related concepts, such as an (acronym, expansion) pair. A pattern is the way in which relations are marked up in Web pages. Each relation ri occurs in W at least once, with one or more patterns pj; each pattern pj signifies one or more relations ri.

We iteratively identify the two sets R and P, starting with seed sets R0 and P0.
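A reconstruction of the single-level update, inferred from the 2-level formula in the next section (here f stands for extracting new relations from the current snapshot via known patterns, and g for inducing new patterns from known relations), would be:

```latex
R_i = R_{i-1} \cup f(P_{i-1}, W_i), \qquad P_i = P_{i-1} \cup g(R_{i-1}, W_i)
```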

Higher level duality problems

It is possible to define higher-level dualities when the mutually recursive relation between R and P is through another set, say S. Here an approximation to R in a particular iteration may depend on an approximation to P from a previous iteration, which in turn may depend on an approximation to S from a previous iteration. The approximation of S itself may come from R. Thus we can define a 2-level duality as:

Ri = Ri−1 ∪ f(Pi−1, Wi)
Pi = Pi−1 ∪ g(Si−1, Wi)
Si = Si−1 ∪ h(Ri−1, Wi)
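Read as code, the 2-level update is a short monotone loop. The sketch below is a generic illustration only: f, g, h and the snapshots Wi are placeholders for the mining functions and crawled pages, not the paper's implementation.

```python
# Generic 2-level duality iteration. Each set grows monotonically, and every
# update uses the *previous* approximation of the set it depends on, exactly
# as in the recurrences above.
def two_level_duality(R0, P0, S0, snapshots, f, g, h):
    """snapshots: iterable of Web snapshots W_1, W_2, ...; f, g, h: mining functions."""
    R, P, S = set(R0), set(P0), set(S0)
    for Wi in snapshots:
        # right-hand sides are evaluated before assignment, so R, P, S here
        # are still the previous iteration's values
        R, P, S = R | f(P, Wi), P | g(S, Wi), S | h(R, Wi)
    return R, P, S
```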


Solving a 2-level duality problem: mining the Web for acronyms

Here we describe an experiment in solving a 2-level duality problem: we apply the duality to identify acronyms and their expansions. We call the occurrences of (acronym, expansion) relations AE-pairs. An acronym comes from the space of words defined by the regular expression [A-Za-z0-9][A-Za-z0-9]*. An expansion is a string of words that stands for an acronym. An acronym formation rule specifies how an acronym is formed from its expansion. The acronym identification problem
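As a rough illustration of the definitions above, the sketch below checks candidate AE-pairs against the acronym regular expression and against one possible formation rule (first letter of each expansion word). The specific rule and the example pairs are assumptions made for illustration; the paper derives formation rules from the data rather than fixing one in advance.

```python
# Acronym regex from the definition above, plus one candidate formation rule.
import re

ACRONYM = re.compile(r"^[A-Za-z0-9][A-Za-z0-9]*$")

def first_letter_rule(acronym, expansion):
    """One possible formation rule: the acronym is the concatenation of the
    initial letters of the expansion's words (case-insensitive)."""
    initials = "".join(word[0] for word in expansion.split())
    return bool(ACRONYM.match(acronym)) and initials.lower() == acronym.lower()

# Made-up candidate AE-pairs, e.g. as extracted by a text pattern such as
# "<expansion> (<acronym>)".
candidates = [
    ("IBM", "International Business Machines"),  # satisfies the first-letter rule
    ("WWW", "World Wide Web"),                   # satisfies the rule
    ("XML", "Extensible Markup Language"),       # does not; needs a different rule
]

for acro, exp in candidates:
    print(acro, exp, first_letter_rule(acro, exp))
```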

Mining over structures, duality-links, and metadata

In the acronym-expansion problem we saw two kinds of patterns: text patterns and HTML structure patterns. In the text patterns we described the occurrences of acronyms and their expansions in terms of tuples of regular expressions representing character strings. In HTML structure patterns, we described patterns using element/attribute names and values and parent/child/sibling relationships. For generalized relationships between two entities we need a description language that describes how two
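For flavor, here is a small sketch of what matching one HTML structure pattern could look like: an acronym in a <dt> element whose sibling <dd> carries the expansion. The dt/dd pattern itself is an assumed example, chosen only to show element names and sibling relationships in action; it is not a pattern reported in the paper.

```python
# Match a structural pattern: <dt>acronym</dt> followed by a sibling
# <dd>expansion</dd>, using only the standard-library HTML parser.
from html.parser import HTMLParser

class DtDdPattern(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []          # candidate (acronym, expansion) pairs
        self.current_tag = None
        self.last_dt = None      # text of the most recent <dt>

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag == "dt":
            self.last_dt = text
        elif self.current_tag == "dd" and self.last_dt:
            self.pairs.append((self.last_dt, text))
            self.last_dt = None

parser = DtDdPattern()
parser.feed("<dl><dt>HTTP</dt><dd>HyperText Transfer Protocol</dd>"
            "<dt>URI</dt><dd>Uniform Resource Identifier</dd></dl>")
print(parser.pairs)   # [('HTTP', 'HyperText Transfer Protocol'), ('URI', ...)]
```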

Proposal for discovering new relations

So far in our discussion we have kept the notion of a relation fixed in any instance of mining. For example, the relationship between acronyms and their expansions, or between acronyms and their formation rules, is fixed. These two relationships together define the overall relationship of the acronym problem that we are solving. We mine Web pages iteratively to find entities that substantiate that relationship. However, if we were to treat relations themselves as variables that can be mined and

Related work

Bibliometrics [10] studies the world of authorships and citations through measurement. Bibliometric coupling measures similarity of two technical papers based upon their common citations. Co-citation strength is a measure of the number of times two papers are cited together. Statistical techniques are used to compute these and other related measures [11]. In typical bibliometric situations the citations and authorships are explicit and do not have to be learned or derived as in our system.
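Both measures reduce to simple set operations over a citation table. The toy example below (made-up papers and references) is only meant to make the two definitions concrete.

```python
# Bibliometric coupling and co-citation strength over a toy citation table.
from itertools import combinations

cites = {                       # paper -> set of papers it cites (made-up data)
    "paperA": {"X", "Y", "Z"},
    "paperB": {"X", "Y"},
    "paperC": {"Y", "Z"},
}

def coupling(p, q):
    """Bibliometric coupling: number of references the two papers share."""
    return len(cites[p] & cites[q])

def cocitation(x, y):
    """Co-citation strength: number of papers that cite both x and y."""
    return sum(1 for refs in cites.values() if {x, y} <= refs)

for p, q in combinations(cites, 2):
    print(p, q, "coupling =", coupling(p, q))
print("co-citation of X and Y =", cocitation("X", "Y"))
```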

HITS

Conclusions and future work

In this paper we studied the duality problem of how entities are related on the Web. Given that the Web is a great source of information where the information itself is buried under the visual markups, texts, and links of the Web pages, discovering relationships between entities is an interesting problem. The repeated occurrences of loosely defined structures and relationships help us define these entities with increased confidence. In this paper we formalized the iterative process of mining


References (20)

  • T. Berners-Lee, Weaving the Web, Harpers, San Francisco, CA,...
  • S. Brin, Extracting patterns and relations from the World Wide Web, in: Proc. WebDB '98, Valencia,...
  • S. Chakrabarti, M. van de Berg and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery,...
  • M. Collins and Y. Singer, Unsupervised models for named entity classification, in: EMNLP 99,...
  • Extensible Markup Language (XML) 1.0, W3C Recommendation, T. Bray, J. Paoli and C.M. Sperberg-McQueen (Eds.), Feb....
  • D. Gibson, J. Kleinberg and P. Raghavan, Inferring Web communities from link topology, in: HyperText ’98, Pittsburgh,...
  • T. Kistler and H. Marais, WebL: a programming language for the Web, in: Proc 7th World Wide Web Conference '98 (WWW7),...
  • J. Kleinberg, Authoritative sources in a hyperlinked environment, in: Proc. 9th ACM–SIAM Symposium on Discrete...
  • R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Trawling the Web for emerging cyber-communities, in: Proc. 8th...
  • R. Larson, Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace,...
There are more references available in the full text version of this article.

Cited by (19)

  • The spatial aggregation of rural e-commerce in China: An empirical investigation into Taobao Villages

    2020, Journal of Rural Studies
Citation Excerpt:

    Web crawler technology is an effective tool to amass all visible information that is published on the Internet (Thelwall, 2001). This program scours websites, inspects the retrieved webpages and outputs the requested information (Sundaresan and Yi, 2000; Korfiatis et al., 2006). In this study, a web crawling program written in Python was used to assess the Baidu Enterprise Credit System.

  • An improvement to e-commerce recommendation using product network analysis

    2014, Proceedings - Pacific Asia Conference on Information Systems, PACIS 2014
  • Building social relationship ontology model based on fuzzy set

    2012, International Journal of Digital Content Technology and its Applications

Neel Sundaresan is a research manager of the eMerging Internet Technologies Department at the IBM Almaden Research Center. He has been with IBM since December 1995 and has pioneered several XML and internet related research projects. He was one of the chief architects of the Grand Central Station project at IBM Research for building XML-based search engines. He received his Ph.D. in computer science in 1995. He has done research and advanced technology work in the area of compilers and programming languages, parallel and distributed systems and algorithms, information theory, data mining and semi-structured data, speech synthesis, agent systems, and internet tools and technologies. He has over 30 research publications and has given several invited and refereed talks and tutorials at national and international conferences. He has been a member of the W3C standards effort.

Jeonghee Yi is a Ph.D. candidate in computer science at the University of California, Los Angeles. She has been a researcher at the IBM Almaden Research Center, San Jose, California, since July 1998. Her current research interests include data mining, Web mining, internet technologies, semi-structured data, and database systems. She received BS and MS degrees in computer science from Ewha Woman's University, Korea, in 1986 and 1988, respectively, and an MS degree in computer science from the University of California, Los Angeles, in 1994. The work described here was partially supported through an IBM Graduate Fellowship.
