Information extraction for search engines using fast heuristic techniques
Introduction
The extraction of relevant data from a target source is called Information Extraction. The target source can be a natural language source or structured records (data records), which usually contain important information; wrappers are therefore needed to extract these structured records. Recently developed wrappers are mostly fully automated, and they can offer significant speed advantages when processing large volumes of web site data, making them useful in meta search engine development [56], [34] and in comparing and evaluating shopping lists [52].
A wrapper goes through several stages to complete its operation: collecting and labeling training pages, creating and generalizing extraction rules from the set of labeled pages, extracting the relevant data, and outputting them in a suitable format for further processing [10]. Research on wrapper design focuses mostly on the intermediate operation, that is, the extraction of relevant data from a set of sample pages, although some works provide a solution for collecting training data, such as a crawler, and others produce output in XML or relational database format for further data integration. Generally, the labeling phase specifies the output of the extraction task, and this requires the involvement of a user [2], [9], [13]. However, some contemporary systems do not require such labeling; instead, the labeling and annotation of data are carried out after the generation of extraction rules [3], [11], [7], [33], [52]. Extracted data can be aligned and used for further processing [33], [52]. Data alignment is optional, as it may not be needed by a user in some cases. Once data records are properly aligned and tabulated, they are labeled so that they can be easily distinguished [30], [50].
In general, wrappers can be classified in a number of ways. The simplest is to classify wrappers based on their data extraction ability [26]. Some wrappers are designed to extract data records at record level, which involves only the extraction of relevant data records without any data alignment [6], [7], [25], [27], [37], while other wrappers extract data records at data unit level, which is also known as data alignment. These wrappers split the extracted data records into smaller attributes called data items and rearrange them in tabular form [30], [33], [52]. Data items may contain several atomic entities that can be separated further into smaller components, known as data units [26]. For example, a data record "book" may contain smaller attributes such as author, title, price and ISBN.
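The record-level vs. data-unit-level distinction can be illustrated with a small sketch. This is illustrative only; the record strings, the "|" delimiter, and the column names are hypothetical examples, not taken from the paper:

```python
# Hypothetical "book" records as they might come out of record-level
# extraction: each record is one undivided string.
records = [
    "Gusfield | Algorithms on Strings, Trees, and Sequences | $74.99 | 978-0521585194",
    "Liu | Web Data Mining | $59.99 | 978-3642194597",
]

# Data-unit-level extraction (data alignment): split each record into its
# data items and arrange them under common column labels in tabular form.
columns = ["author", "title", "price", "isbn"]
table = [
    dict(zip(columns, (item.strip() for item in record.split("|"))))
    for record in records
]

print(table[0]["author"])  # Gusfield
print(table[1]["price"])   # $59.99
```

A downstream application can then query individual attributes (e.g. compare prices across sites), which is not possible when records are left unaligned.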
Another way to distinguish wrappers is to look at their principles of operation, which determine whether they are manual, semisupervised, supervised or automated. Manual wrappers are the earliest wrappers developed, and they require knowledgeable, trained users familiar with the underlying structure of the web pages [31], [36], [5], [45], [24]. By observing the HTML page, the users find the particular patterns of the data records and hand-code the wrapper based on these patterns. However, this method soon became impractical: virtually every type of page needs its own wrapper, and the wrappers need maintenance and updates should their target web pages change layout. Thus, manual wrappers do not scale easily to large web sites.
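A hand-coded wrapper of this kind can be sketched in a few lines. The page markup and the regular expression below are hypothetical examples, not from any cited system:

```python
import re

# A developer inspects the target HTML, spots the record pattern, and
# hard-codes it. This mirrors the manual-wrapper approach described above.
PATTERN = re.compile(
    r'<li class="result"><a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a></li>'
)

page = (
    '<ul>'
    '<li class="result"><a href="/a">First hit</a></li>'
    '<li class="result"><a href="/b">Second hit</a></li>'
    '</ul>'
)

hits = [m.groupdict() for m in PATTERN.finditer(page)]
print(hits)  # [{'url': '/a', 'title': 'First hit'}, {'url': '/b', 'title': 'Second hit'}]

# Any layout change (e.g. the class renamed from "result" to "item")
# silently breaks extraction, which is why manual wrappers do not scale.
```

The brittleness is visible directly: the wrapper encodes one site's layout, so every site, and every redesign of that site, demands new hand-written code.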
A supervised wrapper requires the user to label the HTML pages, and the wrapper then automatically extracts the information based on the labeled instances [2], [9], [13], [18], [29], [39], [41]. However, the involvement of a user is still needed for the operation of the wrapper, and the labeling of web pages is time consuming.
Similar to supervised wrappers, semisupervised wrappers require the user to label the HTML pages [3], [11], [12], [54]. However, once labeling is carried out, the wrapper automatically derives the set of extraction rules for extracting data from other, similar HTML pages. The semisupervised wrapper still requires human intervention; thus, it is labor intensive and not suitable for large-scale web comparisons.
To overcome this drawback, fully automatic wrappers were developed [46], [7], [8], [37], [48], [33], [30], [52], [55], [57], [25], [26], [27], [38]. The advantage of automatic wrappers is that they are able to extract relevant data from different web pages, provided that the data records in the sample pages are similar in structure. Automatic wrappers work by checking the pattern and structure of data records; typical examples are the detection of repetitive sequences of HTML tags (the Mining Data Region (MDR) wrapper [7]), the space occupied by data records (the Visual Segmentation based Data Records (VSDR) wrapper [37]), the use of semantic properties in data records (Ontology Assisted Data Extraction (ODE) [50] and the works of [17], [40]), the grouping of data records into clusters [22], [38], and the extraction of data tables [14] and news articles [28], [32]. Automatic wrappers cover a larger domain, but they are not as accurate as their manual counterparts because they are designed around a number of assumptions. As early automatic wrappers use HTML tags to determine the structure of data records, their accuracy and precision are affected by ambiguities in the HTML language, especially when the information in the web page is not uniformly presented [6], [7], [30], [46]. To overcome these limitations, recent wrappers use additional visual cues such as font, color, style, size, and the relative positions of objects and pictures within the web page [8], [25], [26], [27], [37], [48], [49]. Intuitively, the argument for using visual cues seems appealing, and it appears adequately supported by comparative results against existing state-of-the-art non-visual wrappers [7], [55], [57], [30]. However, using additional visual cues increases the complexity and decreases the speed of the wrapper. For more information on wrappers, readers are encouraged to refer to the surveys by Laender et al. [1] and Chang et al. [10].
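The repetitive-tag-sequence heuristic behind MDR-style wrappers can be sketched as follows. This is a simplified illustration that assumes exact equality between sibling tag sequences, whereas real wrappers such as MDR use approximate tree matching; the function and data are hypothetical:

```python
def find_data_regions(sibling_tag_seqs, min_records=3):
    """Group consecutive siblings whose start-tag sequences match.

    A run of at least `min_records` identical sequences is reported as a
    candidate data region, returned as (start_index, end_index) pairs.
    """
    regions, start = [], 0
    while start < len(sibling_tag_seqs):
        end = start
        while (end + 1 < len(sibling_tag_seqs)
               and sibling_tag_seqs[end + 1] == sibling_tag_seqs[start]):
            end += 1
        if end - start + 1 >= min_records:
            regions.append((start, end))
        start = end + 1
    return regions

# Tag sequences of the direct children of some container node:
siblings = [
    ("div",),              # page header
    ("div", "a", "span"),  # record 1
    ("div", "a", "span"),  # record 2
    ("div", "a", "span"),  # record 3
    ("div", "p"),          # footer
]
print(find_data_regions(siblings))  # [(1, 3)]
```

The three structurally identical siblings are flagged as a data region, while the header and footer, which occur only once, are ignored.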
In this paper, we focus on developing an automated non-visual wrapper for the extraction of data records at record and data unit levels, particularly from search engine results pages. Our aim is to improve on current non-visual wrapper performance and to demonstrate that our wrapper, the Wrapper Incorporating Set of Heuristic Techniques (WISH), performs as well as, and in many cases better than, the current state-of-the-art automatic visual wrappers. Our results question how effectively visual information is currently being used and suggest that there is at least a need to rethink the way in which it is applied.
This paper is divided into several sections. Section 2 reviews the existing problems in current state-of-the-art wrappers and our proposed solutions to these problems. Section 3 describes the work relevant to our research. In Section 4 we discuss our proposed methodology in detail. Section 5 discusses the results of our experimental tests, while Section 6 summarizes our work. It is worth noting that data labeling is outside the scope of this paper.
Problem formulation and proposed solutions
A review of previous work shows that the problems generally related to wrapper design are:
1. Current non-visual wrappers that rely on the HTML Document Object Model (DOM) Tree are unable to locate and extract the relevant data region (a group of data records [52]) from search engine results pages [6], [7], [30]. A DOM Tree is the underlying code of an HTML page that has a tree-like structure rendered by an internet browser. Visual based wrappers [25], [26], [27], [33], [37] were developed to overcome this
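The DOM Tree view that such wrappers operate on can be illustrated with a minimal sketch using Python's standard html.parser. The Node class and builder below are illustrative only, not the DOM API of any browser or of the paper's system:

```python
from html.parser import HTMLParser


class Node:
    """One element of a toy DOM tree: a tag with parent and children."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class DomBuilder(HTMLParser):
    """Build a nested Node tree from start/end tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent  # climb back up


builder = DomBuilder()
builder.feed(
    "<html><body><div><p>record</p></div><div><p>record</p></div></body></html>"
)
body = builder.root.children[0].children[0]
print([child.tag for child in body.children])  # ['div', 'div']
```

A non-visual wrapper inspects exactly this kind of tree, for example looking for sibling subtrees with repeating structure, such as the two `div` children of `body` here.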
Related work
In this section, we describe the current work related to ours, which is data extraction at record and data unit levels.
Overview of WISH
In this section we discuss the requirements and the assumptions made for WISH. For WISH to work successfully, the sample pages used for data extraction should be obtained from a search engine query, and each of these sample pages must contain at least three data records. WISH, however, does not require the HTML page to be converted to XHTML format, as its parser can recognize the HTML format directly.
The three main components of WISH are shown in Fig. 9. The first component involves parsing the HTML page
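The point about not requiring XHTML conversion can be illustrated with Python's standard html.parser, which similarly tolerates unclosed tags and mixed-case markup. This is an analogy only; WISH's actual parser is not described by this sketch:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Collect the start tags seen while parsing, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)  # tag names arrive normalized to lowercase


# Malformed, non-XHTML input: unclosed <li> elements and mixed-case tags.
counter = TagCounter()
counter.feed("<UL><li>first<li>second<P>trailing text")
print(counter.tags)  # ['ul', 'li', 'li', 'p']
```

A lenient parser of this kind lets a wrapper consume raw search-result pages as served, without a separate tidying or XHTML-conversion pass.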
Preparation of datasets
The datasets used in this study are taken from web pages that contain search engine results. They are divided into seven groups: the Training Set (50 web pages), Dataset 1 (150 web pages), Dataset 2 (119 web pages), Dataset 3 (50 web pages), Dataset 4 (51 web pages), Dataset 5 (80 web pages), and Dataset 6 (100 web pages). The data distribution for each of the datasets
Conclusions
In this study we propose WISH, a non-visual wrapper that is able to extract data records from structured web pages and align them using a regular template. Our results show that our wrapper obtains results as good as, and in most cases better than, current state-of-the-art visual wrappers such as ViNT and DEPTA. Our approach uses a set of filtering methods based on the DOM Tree structure of data records and a more accurate algorithm to calculate the space occupied by the data
Jer Lang Hong received the B.Sc. Computer Science degree from University of Nottingham in 2005 and is currently a Ph.D. student in Monash University. His research interests include Information Extraction, and Automated Data Extraction. He is also an author and co author of several ACM/IEEE conference papers.
References (58)
- et al., DEByE – Data extraction by example, Data Knowl. Eng. (2002)
- et al., Building intelligent web applications using lightweight wrappers, Data Knowl. Eng. (2001)
- et al., Olera: semisupervised Web-data extraction with visual support, IEEE Intell. Syst. (2004)
- et al., Generating finite-state transducers for semi-structured data extraction from the Web, Inf. Syst. (1998)
- et al., Automatic hidden-web table interpretation, conceptualization, and semantic annotation, Data Knowl. Eng. (2009)
- et al., Integration of association rules and ontologies for semantic query expansion, Data Knowl. Eng. (2007)
- et al., Grammars have exceptions, Inf. Syst. (1998)
- et al., A brief survey of web data extraction tools, SIGMOD Rec. (2002)
- Andrew Hogue, David Karger, Thresher: automating the unwrapping of semantic content from the World Wide Web, in:...
- et al., The longest common subsequence problem revisited, Algorithmica (1987)
- Extracting structured data from web pages
- Mining data records in web pages
- A survey of web information extraction systems, IEEE T. Knowl. Data Eng.
- IEPAD: information extraction based on pattern discovery
- Algorithms on strings, trees, and sequences: computer science and computational biology
- The tree-to-tree editing problem, Int. J. Pattern Recognit. Artif. Intell.
- A guided tour to approximate string matching, ACM Comput. Surv.
- Mining templates from search result records of search engines
- Fully automatic wrapper generation for search engines
Eu-Gene Siew received the B.Comm. degree from Adelaide University, Australia in 1995 and obtained his M.BusSys. and Ph.D. from Monash University, Australia in 1999, and 2005 respectively. He worked in a teaching institution in Penang before joining the School of Information Technology in 2007. He has published and presented numerous papers in journals, and international conferences and also written a book chapter. He is a member of the Malaysian Small Medium Enterprise and Malaysian National Computer Confederation. Additionally, he is a member of the Centre for Research in Intelligent Systems (CRIS) and the Centre for Decision Support and Enterprise Systems Research (CDSESR). His main research interest lies in the area of the application of intelligent techniques. His other research interests also include data mining and information extraction from web pages.
Simon Egerton completed his Ph.D. at the University of Essex in the United Kingdom. He returned to academia after a number of years in industry and is currently a lecturer in the School of Information Technology at Monash University Malaysia where he is head of the Intelligent Systems research group. His research interests cover a broad range of intelligent systems, but he likes to specialise in robotics and smart device ecologies.