Web-based closed-domain data extraction on online advertisements
Highlights
► We have developed a tool, ADEx, which automatically extracts data from online ads.
► ADEx applies ML approaches to identify/populate ads of various domains into a DB.
► ADEx classifies ad domains, tags keywords, and determines valid attribute values.
► ADEx is superior in performance compared with existing data extraction approaches.
Introduction
The web is a perfect publication forum for advertisements (ads for short), since ads websites allow sellers to post ads for potential buyers worldwide, who can freely access archived and newly created ads anytime and anywhere, a service that no traditional publication medium can provide. According to a report from eMarketer.com, online advertising surpassed newspaper marketing in 2010 and the margin widened in 2011, which indicates that online ads are popular and proliferating. Even though (online) information access has gone through evolutionary changes, most of the tools employed for accessing ads information still rely on an underlying database (DB) to maintain ads records. Existing ads information providers, however, employ their own ads structures and formats for information processing. As a result, web users are forced to visit individual websites and use the different searching tools provided by each of them to look for ads of interest. A unified framework that integrates online ads available from various sources into a single underlying DB should facilitate the process of querying, question answering, and performing various data mining tasks on ads data. Creating an underlying DB from multiple sources, however, is a non-trivial task due to the diversity in the formats and contents of ads in various domains, a problem that we address and solve in this paper.
We introduce ADEx, a machine learning-based tool that automatically extracts ads available at various sources and populates them into a unified DB. During the extraction process, ADEx employs effective supervised learning approaches to (i) categorize ads according to their domains, (ii) label non-stop keywords (i.e., keywords that are not stop words) in classified ads based on their types, and (iii) populate the tagged keywords as attribute values in an underlying DB. In step (ii), each essential keyword is labeled as either (a) a unique identifier of a product/service P in an ad, (b) a property of P, or (c) a qualitative value associated with P, which facilitates the process of identifying valid attribute values in each ad. To populate ads regardless of their structures/formats and the ads domains to which they belong, ADEx analyzes online ads, filters out irrelevant data, and extracts relevant data using easy-to-implement algorithms to generate a single, unified source of information on ads data.
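The three steps above can be sketched as a small data flow. This is a minimal illustration under loud assumptions, not the authors' implementation: ADEx uses trained supervised models, whereas the sketch below swaps in hand-written lookup tables so that it stays self-contained. The names `DOMAIN_CUES`, `TYPE_LEXICON`, `STOP_WORDS`, the table `ad_keywords`, and all functions are hypothetical.

```python
import re
import sqlite3

# Hypothetical stop-word list; the paper's actual list is not given here.
STOP_WORDS = {"a", "an", "the", "for", "with", "and", "in", "of"}

# Assumption: toy lookup tables stand in for the trained domain classifier
# and keyword tagger described in the text.
DOMAIN_CUES = {
    "cars": {"sedan", "mileage", "honda", "engine"},
    "housing": {"apartment", "bedroom", "rent", "lease"},
}
TYPE_LEXICON = {
    "honda": "identifier", "sedan": "identifier", "apartment": "identifier",
    "mileage": "property", "color": "property", "bedroom": "property",
    "low": "value", "blue": "value", "two": "value",
}


def non_stop_keywords(ad_text):
    """Lowercase the ad, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", ad_text.lower())
    return [w for w in words if w not in STOP_WORDS]


def classify_domain(keywords):
    """Step (i): pick the domain whose cue words overlap the ad the most."""
    kw = set(keywords)
    best = max(DOMAIN_CUES, key=lambda d: len(DOMAIN_CUES[d] & kw))
    return best if DOMAIN_CUES[best] & kw else "unknown"


def tag_keywords(keywords):
    """Step (ii): label each keyword by type; untyped keywords are dropped."""
    return [(w, TYPE_LEXICON[w]) for w in keywords if w in TYPE_LEXICON]


def create_schema(conn):
    """One table for all domains, keyed by ad and keyword type."""
    conn.execute(
        "CREATE TABLE ad_keywords (ad_id INTEGER, domain TEXT, "
        "keyword TEXT, kind TEXT)"
    )


def extract_ad(conn, ad_id, ad_text):
    """Run steps (i)-(iii) on one ad and return the predicted domain."""
    keywords = non_stop_keywords(ad_text)
    domain = classify_domain(keywords)
    rows = [(ad_id, domain, w, kind) for w, kind in tag_keywords(keywords)]
    # Step (iii): populate the tagged keywords into the unified DB.
    conn.executemany("INSERT INTO ad_keywords VALUES (?, ?, ?, ?)", rows)
    return domain


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    print(extract_ad(conn, 1, "Blue Honda sedan with low mileage"))
    print(conn.execute("SELECT keyword, kind FROM ad_keywords").fetchall())
```

The single `ad_keywords` table is a design choice made only for this sketch: because keywords are stored by type rather than by domain-specific attribute name, ads from different domains can share one schema.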
ADEx advances the current data extraction techniques. Unlike most of the existing information extraction approaches, it extracts data from un-/semi-/fully-structured ads data sources without altering its design. ADEx generalizes the data extraction process by labeling keywords in ads based on their types, as opposed to relying on domain-specific vocabularies or ontologies to identify keywords associated with DB attributes, which are applicable only to their respective domains. Empirical studies (see Section 4) have verified that ADEx is highly effective in automating the process of populating a DB with online ads from multiple sources.
ADEx is unique, since it (i) provides a unified tagging framework using its own type definitions on ads, (ii) develops an elegant set of empirically verified features to distinguish essential from useless data in ads, and (iii) introduces an optimal and effective approach which combines the tagging and extraction mechanisms (based on support vector machines and decision trees, respectively) to accurately populate the underlying DB. Moreover, ADEx either outperforms or is comparable to the current state-of-the-art data extraction tools.
The remainder of this paper is organized as follows. In Section 2, we discuss previous work on data extraction, a task performed by ADEx. In Section 3, we detail the design of ADEx. In Section 4, we present the empirical study conducted to verify the effectiveness of ADEx in populating ads and compare its performance with other state-of-the-art data extraction approaches. In Section 5, we conclude and discuss future work.
Related work
In this section, we discuss recently proposed methodologies for extracting unstructured or (semi-)structured data from online data sources that are closely related to the design of ADEx.
A number of supervised learning approaches [14], [15], [22] have been developed for extracting data from online sources. The machine learning approach in [14] extracts labeled attributes from web form interfaces. It matches a form element, which is an identified value type in our case, with its corresponding
ADEx
In this section, we present the overall process of ADEx, as shown in Fig. 1. We describe the three major, automated, consecutive tasks performed by ADEx during the process of extracting online ads data to populate a DB: (i) classifying online ads into their respective domains (as detailed in Section 3.1), (ii) tagging keywords in classified ads according to their types (as presented in Section 3.2), and (iii) extracting the tagged keywords of different types in ads and populating them as
Experimental results
In this section, we assess the overall performance of ADEx. In Sections 4.1 and 4.2, we introduce the dataset and metrics employed for performance evaluation, respectively. In Section 4.3, we determine the ideal number of keywords, i.e., the size of the vocabulary, for capturing the content of ads in their respective domains for classification. Thereafter, we evaluate the effectiveness of each major task of ADEx, which includes classifying ads (in Section 4.4), tagging
Conclusions
With the rise in popularity of online advertising, more web users turn to online sources to locate advertisements (ads for short) of interest. Since these web sources are independently operated and structured using different ads formats on various ads domains tailored for specific information processing, web users are required to access ads archived at different sites by employing a wide variety of searching tools individually. The existence of a unified database (DB), which integrates ads in
References (22)
- B. Allison, An improved hierarchical Bayesian model of language for document classification, in: Proceedings of COLING,...
- H. Chieu, H. Ng, Named entity recognition with a maximum entropy approach, in: Proceedings of Conference on Natural...
- W. Cohen, Fast and effective rule induction, in: Proceedings of ICML, 1995, pp....
- E. Cortez, A. da Silva, M. Goncalves, E. de Moura, ONDUX: on-demand unsupervised learning for information extraction,...
- et al., Automatic wrappers for large scale web extraction, VLDB Endowment (2011)
- M. Hall, E. Frank, Combining Naive Bayes and decision tables, in: Proceedings of Florida Artificial Intelligence...
- R. Khare, Y. An, An empirical study on using hidden Markov model for search interface segmentation, in: Proceedings of...
- Y. Liu, Y. Zheng, One-against-all multi-class SVM classification using reliability measures, in: Proceedings of IJCNN,...
- et al., Introduction to Information Retrieval (2008)
- et al., Foundations of Statistical Natural Language Processing (2003)