Information extraction for search engines using fast heuristic techniques

https://doi.org/10.1016/j.datak.2009.10.002

Abstract

We study the structured records of web pages and the problems associated with the extraction and alignment of these structured records. Current automatic wrappers are complicated because they rely on visual cues to locate the relevant data region and on complicated algorithms to check the similarity of data records. In this paper, we develop a non-visual automatic wrapper which questions the need for complex visual based wrappers in data extraction. The novel techniques in our wrapper are (1) filtering rules to detect and filter out irrelevant data records, (2) a tree matching algorithm using frequency measures to increase the speed of data extraction, (3) an algorithm that calculates the number and size of the components of data records to detect the correct data region, (4) a data alignment algorithm which is able to align iterative (repetitive HTML command tags) and disjunctive (optional) data items and (5) a data merging and partitioning method to solve the imperfect segmentation problem (the problem of correctly identifying the atomic entities in data items). Results show that our wrapper is as robust as, and in many cases outperforms, state of the art wrappers such as ViNT and DEPTA. Our wrapper could have significant speed advantages when processing large volumes of web site data, which could be helpful in meta search engine development.

Introduction

The extraction of relevant data from a target source is called Information Extraction. The target source can be a natural language source or structured records (data records) which usually contain important information. Therefore, there is a need to develop wrappers to extract these structured records. Wrappers developed recently are mostly fully automated; they could have significant speed advantages when processing large volumes of web site data and could therefore be helpful in meta search engine development [56], [34] and in comparing and evaluating shopping lists [52].

A wrapper needs to go through a few stages to complete its operation. These include collecting and labeling training pages, creating and generalizing extraction rules from the set of labeled pages, extracting the relevant data and producing output in a suitable format for further processing [10]. Research on wrapper design focuses mostly on the intermediate operation, that is, the extraction of relevant data from a set of sample pages, although some works provide a solution for collecting training data, such as a crawler, and others provide output in XML format or relational database format for further integration of data. Generally, the labeling phase specifies the output of the extraction task and this requires the involvement of a user [2], [9], [13]. However, some contemporary systems do not require such labeling; instead, the labeling and annotation of data are usually carried out after the generation of extraction rules [3], [11], [7], [33], [52]. Data extracted can be aligned and used for further processing [33], [52]. Data alignment is optional, as it may not be needed by a user in some cases. Once data records are properly aligned and tabulated, they are labeled so that they can be easily distinguished [30], [50].
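To make this life cycle concrete, the sketch below is a minimal Python skeleton of the stages described above; the function names and signatures are our own hypothetical illustration, not code from the paper.

```python
# Hypothetical skeleton of the wrapper life cycle described above; all
# names here are our own illustration, not the paper's.
from typing import Callable

def collect_pages(urls: list[str]) -> list[str]:
    """Stage 1: gather sample pages (and, in supervised systems, label
    them), e.g. with a crawler."""
    raise NotImplementedError

def induce_rules(pages: list[str]) -> Callable[[str], list[dict]]:
    """Stage 2: create and generalize extraction rules from the sample
    pages."""
    raise NotImplementedError

def extract(rules: Callable[[str], list[dict]], pages: list[str]) -> list[dict]:
    """Stage 3: apply the rules and collect the extracted data records,
    ready for export to XML or a relational database."""
    return [record for page in pages for record in rules(page)]
```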

In general, wrappers can be classified in a number of ways. The simplest way is to classify wrappers based on their data extraction ability [26]. Some wrappers are designed to extract data records at record level, which only involves the extraction of relevant data records without any data alignment [6], [7], [25], [27], [37], while other wrappers are developed to extract data records at data unit level, also known as data alignment. These wrappers split the extracted data records into smaller attributes called data items and rearrange them in tabular form [30], [33], [52]. Data items have several atomic entities which can be separated further into smaller components, known as data units [26]. For example, a data record "book" may contain smaller attributes such as author, title, price and ISBN.
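The two granularities can be contrasted using the "book" example above; the following sketch (our illustration, with made-up values) shows record-level output as opaque strings versus data-unit-level output aligned into fields.

```python
# Our illustration of the two extraction granularities; all titles,
# prices and ISBNs below are made-up example values.

# Record-level extraction returns whole data records without alignment.
records = [
    "Web Data Mining | Bing Liu | $79.99 | ISBN 978-0-000-00000-1",
    "Data Mining | Jiawei Han | $59.99 | ISBN 978-0-000-00000-2",
]

# Data-unit-level extraction (data alignment) splits each record into
# data items and arranges them in tabular form.
aligned = [
    {"title": "Web Data Mining", "author": "Bing Liu",
     "price": "$79.99", "isbn": "978-0-000-00000-1"},
    {"title": "Data Mining", "author": "Jiawei Han",
     "price": "$59.99", "isbn": "978-0-000-00000-2"},
]

for row in aligned:
    print(row["title"], "-", row["author"], "-", row["price"])
```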

The other way to distinguish wrappers is to look at their principles of operation to determine whether they are manual, semisupervised, supervised or automated wrappers. Manual wrappers are the earliest wrappers developed, and their operation requires knowledgeable, trained users familiar with the underlying structure of the web pages [31], [36], [5], [45], [24]. By observing the HTML page, the users find the particular patterns of data records and are able to hand code the wrapper based on these patterns. However, this method soon became impractical: virtually every type of page needs its own wrapper, and the wrappers need maintenance and updates should their target web pages change their layout. Thus, manual wrappers are not easily scalable to large web sites.

Supervised wrappers require the user to label the HTML pages, and the wrapper will automatically extract the information based on the labeled instances [2], [9], [13], [18], [29], [39], [41]. However, the involvement of a user is still needed for the operation of the wrapper, and the labeling of web pages is time consuming.

Similar to supervised wrappers, semisupervised wrappers require the user to label the HTML pages [3], [11], [12], [54]. However, once labeling is carried out, the wrapper will automatically derive the set of extraction rules for extracting data from other, similar HTML pages. Semisupervised wrappers still require human intervention; thus, they are labor intensive and not suitable for large-scale web comparisons.

To overcome this drawback, fully automatic wrappers were developed [46], [7], [8], [37], [48], [33], [30], [52], [55], [57], [25], [26], [27], [38]. The advantage of automatic wrappers is that they are able to extract relevant data from different web pages, provided that the data records in the sample pages are similar in structure. Automatic wrappers work by checking the pattern and structure of data records; typical examples are the detection of repetitive sequences of HTML tags (Mining Data Region (MDR) wrapper [7]), the space occupied by data records (Visual Segmentation based Data Records (VSDR) wrapper [37]), the use of semantic properties in data records (Ontology Assisted Data Extraction (ODE) [50] and the works of [17], [40]), the grouping of data records into clusters [22], [38], the extraction of data tables [14] and news articles [28], [32]. Automatic wrappers are able to cover a larger domain, but they are not as accurate as their manual counterparts because they are designed based on a number of assumptions. As early automatic wrappers use HTML tags to determine the structure of data records, the accuracy and precision of these methods are affected by ambiguities in the HTML language, and further degraded if the information in the web page is not uniformly presented [6], [7], [30], [46]. To overcome these limitations, recent wrappers use additional visual cues such as font, color, style, size, relative object positions within the web page, and pictures, for example [8], [25], [26], [27], [37], [48], [49]. Intuitively, the argument for using visual cues seems appealing, and one that seems adequately supported by comparative results against existing state of the art non-visual wrappers [7], [55], [57], [30]. However, using additional visual cues increases the complexity and decreases the speed of the wrapper. For more information on wrappers, readers are encouraged to refer to the surveys by Laender et al. [1] and Chang et al. [10].
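As a rough illustration of the tag-based approach, the sketch below flattens a page into its sequence of start tags and searches for literal contiguous repeats of a fixed-width tag pattern. This is our simplification of the idea behind wrappers such as MDR [7], not the actual MDR algorithm, which compares adjacent DOM subtrees using edit distance.

```python
# Simplified illustration (not the actual MDR algorithm): data records
# often show up as repeated runs of the same HTML tag sequence.
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Flatten a page into the sequence of its start tags."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def repeated_runs(tags, width, min_repeats=3):
    """Return (position, pattern, count) for every place where a pattern
    of `width` consecutive tags repeats at least `min_repeats` times."""
    runs, i = [], 0
    while i + width <= len(tags):
        pattern = tags[i:i + width]
        j = i + width
        count = 1
        while tags[j:j + width] == pattern:
            count += 1
            j += width
        if count >= min_repeats:
            runs.append((i, pattern, count))
            i = j  # skip past the detected run
        else:
            i += 1
    return runs

page = "<table>" + "<tr><td>a</td><td>b</td></tr>" * 4 + "</table>"
collector = TagCollector()
collector.feed(page)
print(repeated_runs(collector.tags, width=3))
# [(1, ['tr', 'td', 'td'], 4)]
```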

In this paper, we focus on developing an automated non-visual wrapper for the extraction of data records at record and data unit levels, particularly from search engine results pages. Our aim is to improve on the performance of current non-visual wrappers and to demonstrate that our wrapper, Wrapper Incorporating Set of Heuristic Techniques (WISH), performs as well as, and in many cases better than, the current state of the art automatic visual wrappers. Our results question how effectively visual information is currently being used and suggest that there is at least a need to rethink the way in which it is used.

This paper is divided into several sections. Section 2 reviews the existing problems in current state of the art wrappers and our proposed solutions to these problems. Section 3 describes the work relevant to our research. In Section 4 we discuss our proposed methodology in detail. Section 5 discusses the results of our experimental tests while Section 6 summarizes our work. It is worth noting that data labeling is outside the scope of this paper.


Problem formulation and proposed solutions

A review of previous work shows that the problems generally related to wrapper design are:

1. Current non-visual wrappers that rely on the HTML Document Object Model (DOM) Tree are unable to locate and extract the relevant data region (a group of data records [52]) from search engine results pages [6], [7], [30]. A DOM Tree is the underlying code of an HTML page that has a "tree"-like structure rendered by an internet browser. Visual based wrappers [25], [26], [27], [33], [37] are developed to overcome this
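For illustration, a DOM-like tree can be built with the Python standard library alone. The sketch below (our own illustration, not WISH's actual region detection algorithm) builds such a tree and then nominates the element with the most direct children as a crude non-visual candidate for the data region, in the spirit of counting the number and size of record components.

```python
# Our stdlib-only illustration: build a DOM-like tree and use a simple
# structural heuristic (most direct children) to nominate a data region.
# This is NOT the paper's algorithm, only a sketch of the general idea.
# Void tags such as <br> are ignored here for simplicity.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Build a simple DOM-like tree of Node objects."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

def most_children(node):
    """Return the element with the most direct children: a crude
    stand-in for 'many similar records grouped under one parent'."""
    best = node
    for child in node.children:
        candidate = most_children(child)
        if len(candidate.children) > len(best.children):
            best = candidate
    return best

builder = TreeBuilder()
builder.feed("<html><body><div><ul>"
             + "<li>record</li>" * 5
             + "</ul></div></body></html>")
region = most_children(builder.root)
print(region.tag, len(region.children))  # ul 5
```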

Related work

In this section, we describe the current work related to ours, which is data extraction at record and data unit levels.

Overview of WISH

In this section we discuss the requirements and the assumptions made for WISH. For WISH to work successfully, the sample pages used for data extraction should be obtained from a search engine query and each of these sample pages must contain at least three data records. WISH, however, does not require the HTML page to be converted to XHTML format, as the parser can recognize the HTML format.

The three main components of WISH are shown in Fig. 9. The first component involves parsing the HTML page

Preparation of datasets

The datasets used in this study are taken from web pages that contain search engine results. These datasets are divided into seven groups: a Training Set of 50 web pages, Dataset 1 with 150 web pages, Dataset 2 with 119, Dataset 3 with 50, Dataset 4 with 51, Dataset 5 with 80, and Dataset 6 with 100. The data distribution for each of the datasets

Conclusions

In this study we propose a non-visual wrapper, WISH, which is able to extract data records from structured web pages and align them using a regular template. Our results show that our wrapper is able to obtain results as good as, and in most cases better than, the current state of the art visual wrappers such as ViNT and DEPTA. Our approach uses a set of filtering methods based on the DOM Tree structure of data records and a more accurate algorithm to calculate the space occupied by the data


References (58)

  • Arvind Arasu et al., Extracting structured data from web pages.
  • Bing Liu et al., Mining data records in web pages.
  • Bing Liu, Yanhong Zhai, NET – a system for extracting web data from flat and nested data records, in: Web Information...
  • Brad Adelberg, NoDoSE – a tool for semi-automatically extracting structured and semistructured data from text documents,...
  • Chia-Hui Chang et al., A survey of web information extraction systems, IEEE T. Knowl. Data Eng. (2006).
  • Chia-Hui Chang et al., IEPAD: information extraction based on pattern discovery.
  • Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997).
  • David Sankoff, Joseph Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence...
  • D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Conceptual-model-based data extraction from...
  • Dayne Freitag, Information extraction from HTML: application of a general learning approach, in: Proceedings of the...
  • E. Tanaka et al., The tree-to-tree editing problem, Int. J. Pattern Recognit. Artif. Intell. (1988).
  • Gabriel Valiente, An efficient bottom-up distance between trees, in: Proc. Eighth International Symposium String...
  • Gautam Das, Rudolf Fleischer, Leszek Gasieniec, Dimitris Gunopulos, Juha Karkkainen, Episode matching, in: Proceedings...
  • Gengxin Miao, Junichi Tatemura, Wang-Pin Hsiung, Arsany Sawires, Louise E. Moser, Extracting data records from the web...
  • Gonzalo Navarro, A guided tour to approximate string matching, ACM Comput. Surv. (2001).
  • Gustavo O. Arocena, Alberto O. Mendelzon, WebOQL: restructuring documents, databases and webs, in: Proceedings of the...
  • Hongkun Zhao, Weiyi Meng, Clement Yu, Automatic extraction of dynamic record sections from search engine result pages,...
  • Hongkun Zhao et al., Mining templates from search result records of search engines.
  • Hongkun Zhao et al., Fully automatic wrapper generation for search engines.


Jer Lang Hong received the B.Sc. in Computer Science from the University of Nottingham in 2005 and is currently a Ph.D. student at Monash University. His research interests include Information Extraction and Automated Data Extraction. He is also an author and co-author of several ACM/IEEE conference papers.

Eu-Gene Siew received the B.Comm. degree from Adelaide University, Australia, in 1995 and obtained his M.BusSys. and Ph.D. from Monash University, Australia, in 1999 and 2005, respectively. He worked in a teaching institution in Penang before joining the School of Information Technology in 2007. He has published and presented numerous papers in journals and international conferences and has also written a book chapter. He is a member of the Malaysian Small Medium Enterprise and the Malaysian National Computer Confederation. Additionally, he is a member of the Centre for Research in Intelligent Systems (CRIS) and the Centre for Decision Support and Enterprise Systems Research (CDSESR). His main research interest lies in the application of intelligent techniques. His other research interests include data mining and information extraction from web pages.

    Simon Egerton completed his Ph.D. at the University of Essex in the United Kingdom. He returned to academia after a number of years in industry and is currently a lecturer in the School of Information Technology at Monash University Malaysia where he is head of the Intelligent Systems research group. His research interests cover a broad range of intelligent systems, but he likes to specialise in robotics and smart device ecologies.
