Mining the Web for relations
Introduction
The World Wide Web is a vast source of information. It consists of an ever-growing set of pages authored by people with vastly differing cultures, interests, and educational levels, each aiming to furnish information. Web crawlers visit these pages and index them to serve search engines. As crawlers analyze Web pages, they can look for and learn interesting pieces of information that remain buried in those pages. For instance, a crawler could analyze link information to identify how many pages point to a given Web page, and how many pages that page points to. Based upon this information, the crawler can identify pages that are authorities on certain topics and pages that are starting points (hubs) for such authorities [3,6]. This technique can be extended to identify communities on the Web, which consist of pages that point to each other in particular ways [9].
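The hub/authority idea referenced above can be made concrete. The following is a minimal sketch, not the authors' code, of the mutually reinforcing scoring in the spirit of Kleinberg's HITS [6]; the link representation (a dict from page to the set of pages it points to) is an assumption made for illustration.

```python
def hits(links, iterations=20):
    """links: dict mapping each page to the set of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph: h1 and h2 point to both authorities, h3 only to a2.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a2"}}
hub, auth = hits(links)
```

On this toy graph, a2 receives a higher authority score than a1 (more hubs point to it), and h1 and h2 score higher as hubs than h3.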
Links are only one type of relation linking entities (Web pages in this case) on the Web. There can be other kinds of relationships, of a semantic nature, between entities. Identifying these relationships, and the patterns in which they occur, can surface valuable information buried in the Web. This will be increasingly the case as access to the Web becomes pervasive and end users rely on the Web to look for information with an expectation of reliability. For instance, one may be interested in searching the Web to find the author of a particular book, or to find all books written by a particular author [2]. Such information is typically not easily served by today's search engines.
The vision of a semantic Web [1] includes collaborations that extend to machines capable of analyzing all the data on the Web for content, links, and transactions. We are not anywhere near realizing this vision yet, but we do have some loose structure in HTML, in the form of text, markup, and links. We would like to exploit what is available to find interesting information.
The rest of this paper is organized as follows. Section 2 discusses duality problems in the World Wide Web. Section 3 formalizes the duality problem of patterns and relations. Section 4 extends this to higher-level duality problems. In Section 5 we solve a 2-level duality problem of finding acronyms and their expansions in detail. Section 6 discusses the issues in mining over structures and links. Section 7 further generalizes the duality problem and formulates how to discover new relationships. In Section 8 we discuss related research work, and in Section 9 we draw conclusions and discuss work in progress and future work.
Section snippets
Duality problems in the World Wide Web
Duality problems arise in trying to identify two sets of inter-related concepts. Consider the problem of extracting a relation of books, i.e., (author, title) pairs, from the Web [2]. Intuitively, the problem can be solved as follows.
1. Begin with a small seed set of (author, title) pairs.
2. Find all occurrences of those pairs on the Web.
3. Identify patterns for the citations of the books from these occurrences.
4. Search the Web for these patterns to recognize new (author, title) pairs.
5. Repeat from step 2 with the enlarged set of pairs.
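The steps above can be sketched in code. The following is a hedged, minimal Python sketch, not the authors' implementation; representing a pattern as just the text between the author and the title is a deliberate simplification of Brin's approach [2].

```python
import re

def bootstrap(corpus, seed_pairs, rounds=3):
    """Iteratively grow a set of (author, title) pairs from a seed set."""
    pairs = set(seed_pairs)
    for _ in range(rounds):
        # Steps 2-3: find occurrences of known pairs, derive textual patterns
        # (here, simply the text between author and title).
        patterns = set()
        for author, title in pairs:
            for doc in corpus:
                i, j = doc.find(author), doc.find(title)
                if i != -1 and j != -1 and i < j:
                    patterns.add(doc[i + len(author):j])
        # Step 4: search the corpus with the patterns for new pairs.
        for doc in corpus:
            for mid in patterns:
                expr = (r"([A-Z][\w. ]+?)" + re.escape(mid) +
                        r"([A-Z][\w' ]+?)[.,]")
                for m in re.finditer(expr, doc):
                    pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs

corpus = ["Herman Melville wrote Moby Dick.",
          "Jane Austen wrote Emma."]
pairs = bootstrap(corpus, {("Herman Melville", "Moby Dick")})
```

From the single seed pair, the sketch learns the pattern " wrote " and uses it to recognize the second pair in the toy corpus.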
Duality of relations and patterns on the Web
Let D be a large database of documents, such as the Web. Let R={ri | i=1,…,n} and P={pj | j=1,…,m} be sets of relations and patterns, respectively. A relation is a pair of interrelated concepts, such as an (acronym, expansion) pair. A pattern is the way in which relations are marked up in Web pages. Each relation ri occurs at least once with one (or possibly more) pattern(s) pj; conversely, each pattern pj signifies at least one (or more) relation(s) ri.
We iteratively identify two sets R and P, starting with R0 and P0
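The iteration from R0 and P0 can be expressed generically. The following is a sketch under the assumption that the problem-specific logic is supplied as two mappings (patterns from relations, relations from patterns); the helper names and the toy integer example are illustrative only.

```python
def solve_duality(r0, patterns_of, relations_of):
    """Alternate between deriving patterns from relations and relations
    from patterns, starting from seed relations r0, until no new
    relations are found (a fixed point)."""
    relations = set(r0)
    while True:
        patterns = patterns_of(relations)                    # P_k from R_k
        new_relations = relations | relations_of(patterns)   # R_{k+1} from P_k
        if new_relations == relations:                       # fixed point
            return relations, patterns
        relations = new_relations

# Toy usage: "relations" are integers, "patterns" are their successors.
rels, pats = solve_duality(
    {0},
    patterns_of=lambda rel: {r + 1 for r in rel if r < 3},
    relations_of=lambda pat: set(pat),
)
```

Starting from {0}, the loop grows the relation set step by step and terminates once an iteration adds nothing new.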
Higher level duality problems
It is possible to define higher-level dualities when the mutually recursive relation between R and P is through another set, say S. Here an approximation to R in a particular iteration may depend on an approximation to P in a previous iteration, which in turn may depend on an approximation to S in a previous iteration. The approximation of S itself may come from R. Thus we can define a 2-level duality as:
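The displayed definition itself is not preserved in this snippet. One reconstruction consistent with the description above, writing f, g, h for the (assumed) mappings between successive approximations, is:

```latex
R_k = f(P_{k-1}), \qquad P_k = g(S_{k-1}), \qquad S_k = h(R_k),
```

so that an approximation to R depends on the previous approximation to P, which depends on the previous approximation to S, which in turn is derived from R.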
The figure on the right side depicts
Solving a 2-level duality problem: mining the Web for acronyms
Here we describe an experiment we ran in solving the 2-level duality problem. We apply the duality to identify acronyms and their expansions. We call the occurrences of (acronym, expansion) relations AE-pairs. An acronym comes from the space of words defined by the regular expression [A-Za-z0-9][A-Za-z0-9]*. An expansion is a string of words that an acronym stands for. An acronym formation rule is a rule which specifies how an acronym is formed from its expansion. The acronym identification problem
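To make the notions concrete, here is a minimal sketch, not the authors' algorithm, of one simple acronym formation rule: the acronym consists of the initial letters of the expansion's words, and the acronym appears in parentheses after its expansion. The function name and the parenthesis convention are assumptions for illustration.

```python
import re

def find_ae_pairs(text):
    """Find AE-pairs of the form 'Expansion Words (EW)' in plain text."""
    pairs = []
    # An acronym drawn from the paper's character space [A-Za-z0-9]+,
    # here required to appear in parentheses.
    for m in re.finditer(r"\(([A-Za-z0-9]+)\)", text):
        acronym = m.group(1)
        words = text[:m.start()].split()
        k = len(acronym)
        if len(words) >= k:
            candidate = words[-k:]
            # Formation rule: i-th acronym letter == initial of i-th word.
            if all(w[0].upper() == c.upper()
                   for w, c in zip(candidate, acronym)):
                pairs.append((acronym, " ".join(candidate)))
    return pairs

pairs = find_ae_pairs("The World Wide Web (WWW) is vast.")
```

This single rule already misses pairs like "Extensible Markup Language (XML)", which is exactly why the paper treats formation rules themselves as a set to be mined iteratively.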
Mining over structures, duality-links, and metadata
In the acronym-expansion formation we saw two kinds of patterns: text patterns and HTML structure patterns. In the text patterns we described the occurrence of acronyms and their expansions in terms of tuples of regular expressions representing character strings. In HTML structure patterns, we described patterns using element/attribute names and values and parent/child/sibling relationships. For generalized relationships between two entities we need a description language that describes how two
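As an illustration of the HTML structure patterns described above, the following sketch (an assumed example, not from the paper) expresses a pattern via element names and sibling relationships: an AE-pair marked up as a definition list, with the acronym in a `<dt>` and its expansion in the sibling `<dd>`.

```python
from html.parser import HTMLParser

class DlPairExtractor(HTMLParser):
    """Extract (acronym, expansion) pairs from <dt>/<dd> sibling elements."""
    def __init__(self):
        super().__init__()
        self.pairs, self._tag, self._dt = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "dt":
            self._dt = text                      # remember the acronym
        elif self._tag == "dd" and self._dt:
            self.pairs.append((self._dt, text))  # sibling <dd>: expansion
            self._dt = None

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

p = DlPairExtractor()
p.feed("<dl><dt>XML</dt><dd>Extensible Markup Language</dd></dl>")
```

Note that this structural pattern finds the XML pair that the initial-letters text rule cannot, which is the motivation for combining text patterns and HTML structure patterns.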
Proposal for discovering new relations
So far in our discussion we have kept the notion of relation fixed in any instance of mining. For example, the relationship between acronyms and their expansions, or between acronyms and their formation rules, is fixed. These two relationships together define the overall relationship of the acronym problem that we are solving. We mine Web pages iteratively to find entities that substantiate that relationship. However, if we were to treat relations themselves as variables that can be mined and
Related work
Bibliometrics [10] studies the world of authorships and citations through measurement. Bibliometric coupling measures similarity of two technical papers based upon their common citations. Co-citation strength is a measure of the number of times two papers are cited together. Statistical techniques are used to compute these and other related measures [11]. In typical bibliometric situations the citations and authorships are explicit and do not have to be learned or derived as in our system.
HITS
Conclusions and future work
In this paper we studied the duality problem of how entities are related on the Web. Given that the Web is a great source of information where the information itself is buried under the visual markups, texts, and links of the Web pages, discovering relationships between entities is an interesting problem. The repeated occurrences of loosely defined structures and relationships help us define these entities with increased confidence. In this paper we formalized the iterative process of mining
References (20)
- T. Berners-Lee, Weaving the Web, Harpers, San Francisco, CA,...
- S. Brin, Extracting patterns and relations from the World Wide Web, in: Proc. WebDB '98, Valencia,...
- S. Chakrabarti, M. van den Berg and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery,...
- M. Collins and Y. Singer, Unsupervised models for named entity classification, in: EMNLP 99,...
- Extensible Markup Language (XML) 1.0, W3C Recommendation, T. Bray, J. Paoli and C.M. Sperberg-McQueen (Eds.), Feb....
- D. Gibson, J. Kleinberg and P. Raghavan, Inferring Web communities from link topology, in: HyperText ’98, Pittsburgh,...
- T. Kistler and H. Marais, WebL: a programming language for the Web, in: Proc 7th World Wide Web Conference '98 (WWW7),...
- J. Kleinberg, Authoritative sources in a hyperlinked environment, in: Proc. 9th ACM–SIAM Symposium on Discrete...
- R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Trawling the Web for emerging cyber-communities, in: Proc. 8th...
- R. Larson, Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace,...
Neel Sundaresan is a research manager of the eMerging Internet Technologies Department at the IBM Almaden Research Center. He has been with IBM since December 1995 and has pioneered several XML and internet related research projects. He was one of the chief architects of the Grand Central Station project at IBM Research for building XML-based search engines. He received his Ph.D. in computer science in 1995. He has done research and advanced technology work in the area of compilers and programming languages, parallel and distributed systems and algorithms, information theory, data mining and semi-structured data, speech synthesis, agent systems, and internet tools and technologies. He has over 30 research publications and has given several invited and refereed talks and tutorials at national and international conferences. He has been a member of the W3C standards effort.
Jeonghee Yi is a Ph.D. candidate in computer science at the University of California, Los Angeles. She has been a researcher at the IBM Almaden Research Center, San Jose, California, since July 1998. Her current research interests include data mining, Web mining, internet technologies, semi-structured data, and database systems. She received BS and MS degrees in Computer Science from Ewha Womans University, Korea, in 1986 and 1988, respectively, and an MS degree in computer science from the University of California, Los Angeles, in 1994. The work described here was partially supported through an IBM Graduate Fellowship.