Elsevier

Information Systems

Volume 38, Issue 2, April 2013, Pages 183-197

Web-based closed-domain data extraction on online advertisements

https://doi.org/10.1016/j.is.2012.07.006

Abstract

Taking advantage of the popularity of the web, online marketplaces such as Ebay(.com), advertisement (ads for short) websites such as Craigslist(.org), and commercial websites such as Carmax(.com) allow users to post ads on a variety of products and services. Instead of browsing through numerous websites to locate ads of interest, web users would benefit from the existence of a single, fully integrated database (DB) with ads in multiple domains, such as Cars-for-Sale and Job-Postings, populated from various online sources so that ads of interest could be retrieved at a centralized site. Since existing ads websites impose their own structures and formats for storing and accessing ads, generating a uniform, integrated ads repository is not a trivial task. The challenges include (i) identifying ads domains, (ii) dealing with the diversity in structures of ads in various ads domains, and (iii) analyzing data with different meanings in each ads domain. To handle these problems, we introduce ADEx, a tool that relies on various machine learning approaches to automate the process of extracting (un-/semi-/fully- structured) data from online ads to create ads records archived in an underlying DB through domain classification, keyword tagging, and identification of valid attribute values. Experimental results generated using a dataset of 18,000 online ads originating from Craigslist, Ebay, and KSL(.com) show that ADEx is superior in performance compared with existing text classification, keyword labeling, and data extraction approaches. Further evaluations verify that ADEx either outperforms or performs at least as well as current state-of-the-art information extractors in mapping data from unstructured or (semi-)structured sources into DB records.

Highlights

► We have developed a tool, ADEx, which automatically extracts data from online ads.
► ADEx applies ML approaches to identify/populate ads of various domains into a DB.
► ADEx classifies ad domains, tags keywords, and determines valid attribute values.
► ADEx is superior in performance compared with existing data extraction approaches.

Introduction

The web is a perfect publication forum for advertisements (ads for short), since ads websites allow sellers to post ads for potential buyers worldwide, who can freely access archived and newly created ads anytime and anywhere, a convenience that no traditional publication medium can provide. According to a report from eMarketer(.com), online advertising surpassed newspaper marketing in 2010 and the margin widened in 2011, which indicates that online ads are popular and proliferating. Even though (online) information access has gone through evolutionary changes, most of the tools employed for accessing ads information still rely on an underlying database (DB) to maintain ads records. We recognize that existing ads information providers employ their own ads structures and formats for information processing. As a result, web users are forced to access ads archived at individual websites using a variety of searching tools provided by each website to look for ads of interest. A unified framework that integrates online ads available from various sources into a single underlying DB should facilitate the process of querying, question answering, and performing various data mining tasks on ads data. Creating an underlying DB from multiple sources, however, is a non-trivial task due to the diversity in the formats and contents of ads in various domains, a problem that we address and solve in this paper.

We introduce ADEx, a machine learning-based tool that automatically extracts ads available at various sources and populates a unified DB with them. During the extraction process, ADEx employs effective supervised learning approaches to (i) categorize ads according to their domains, (ii) label non-stop keywords in classified ads based on their types, such that each essential keyword is either (a) a unique identifier of a product/service P in an ad, (b) a property of P, or (c) a qualitative value associated with P, which facilitates the process of identifying valid attribute values in each ad, and (iii) populate the tagged keywords as attribute values in an underlying DB. To populate various ads regardless of their structures/formats and the ads domains to which they belong, ADEx analyzes and filters the data in online ads, extracting relevant data and discarding irrelevant data with easy-to-implement algorithms, to generate a single, unified source of information on ads data.
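The three stages described above can be pictured as a small pipeline. The following is a minimal sketch in Python, not ADEx's implementation: the keyword-overlap classifier and the dictionary-based tagger are hypothetical stand-ins for the supervised learners the paper relies on, and every name (`DOMAIN_VOCAB`, `TYPE_LEXICON`, `STOP_WORDS`, `classify_domain`, `tag_keywords`, `to_record`, `extract`) as well as the toy vocabularies are assumptions made purely for illustration.

```python
import re

# Hypothetical toy vocabularies; a real system would learn these from labeled ads.
DOMAIN_VOCAB = {
    "Cars-for-Sale": {"honda", "toyota", "sedan", "mileage", "engine"},
    "Job-Postings": {"salary", "hiring", "resume", "engineer", "position"},
}

# Hypothetical lexicon assigning each keyword one of the three keyword types
# named in the text: identifier of P, property of P, or qualitative value of P.
TYPE_LEXICON = {
    "honda": "identifier",
    "accord": "identifier",
    "blue": "property",
    "automatic": "property",
    "excellent": "qualitative",
}

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "in", "for", "with", "and", "of", "is"}


def tokenize(ad):
    """Lowercase the ad and keep only non-stop keywords."""
    return [t for t in re.findall(r"[a-z0-9]+", ad.lower()) if t not in STOP_WORDS]


def classify_domain(tokens):
    """Stage (i): pick the domain whose vocabulary overlaps the ad the most."""
    present = set(tokens)
    return max(DOMAIN_VOCAB, key=lambda d: len(DOMAIN_VOCAB[d] & present))


def tag_keywords(tokens):
    """Stage (ii): label each known keyword with its type; drop the rest."""
    return [(t, TYPE_LEXICON[t]) for t in tokens if t in TYPE_LEXICON]


def to_record(domain, tagged):
    """Stage (iii): group tagged keywords into a DB-style record."""
    record = {"domain": domain}
    for keyword, keyword_type in tagged:
        record.setdefault(keyword_type, []).append(keyword)
    return record


def extract(ad):
    """Run the three stages in sequence on one ad."""
    tokens = tokenize(ad)
    return to_record(classify_domain(tokens), tag_keywords(tokens))


if __name__ == "__main__":
    print(extract("Blue Honda Accord sedan in excellent condition with automatic transmission"))
```

The point of the sketch is the ordering of the stages: the domain is decided first, keywords are then typed independently of any domain-specific ontology, and only typed keywords reach the record. Swapping the stand-in functions for trained classifiers would leave that structure unchanged.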

ADEx advances the current data extraction techniques. Unlike most of the existing information extraction approaches, it extracts data from un-/semi-/fully- structured ads data sources without altering its design. ADEx generalizes the data extraction process by labeling keywords in ads based on their types, as opposed to relying on domain-specific vocabularies or ontologies, which are applicable only to their respective domains, to identify keywords associated with DB attributes. Empirical studies (see Section 4) have verified that ADEx is highly effective in automating the process of populating a DB with online ads from multiple sources.

ADEx is unique, since it (i) provides a unified tagging framework using its own type definitions on ads, (ii) develops an elegant set of empirically verified features to distinguish essential from useless data in ads, and (iii) introduces an optimal and effective approach which combines the tagging and extraction mechanisms (based on support vector machines and decision trees, respectively) to accurately populate the underlying DB. Moreover, ADEx either outperforms or is comparable with the current state-of-the-art data extraction tools.

The remainder of this paper is organized as follows. In Section 2, we discuss previous work on data extraction, a task performed by ADEx. In Section 3, we detail the design of ADEx. In Section 4, we present the empirical study conducted for verifying the effectiveness of ADEx in populating ads and compare its performance with other state-of-the-art data extraction approaches. In Section 5, we conclude and discuss future work.

Section snippets

Related work

In this section, we discuss recently proposed methodologies for extracting unstructured or (semi-)structured data from online data sources that are closely related to the design of ADEx.

A number of supervised learning approaches [14], [15], [22] have been developed for extracting data from online sources. The machine learning approach in [14] extracts labeled attributes from web form interfaces. It matches a form element, which is an identified value type in our case, with its corresponding

ADEx

In this section, we present the overall process of ADEx, as shown in Fig. 1. We describe the three major, automated, consecutive tasks performed by ADEx during the process of extracting online ads data to populate a DB: (i) classifying online ads into their respective domains (as detailed in Section 3.1), (ii) tagging keywords in classified ads according to their types (as presented in Section 3.2), and (iii) extracting the tagged keywords of different types in ads and populating them as

Experimental results

In this section, we assess the overall performance of ADEx. In Sections 4.1 and 4.2, we introduce the dataset and metrics employed for performance evaluation, respectively. In Section 4.3, we determine the ideal number of keywords, i.e., the size of the vocabulary, for capturing the content of ads in their respective domains for classification. Hereafter, we evaluate the effectiveness of each major task of ADEx, which includes classifying ads (in Section 4.4), tagging

Conclusions

With the rise in popularity of online advertising, more web users turn to online sources to locate advertisements (ads for short) of interest. Since these web sources are independently operated and structured using different ads formats on various ads domains tailored for specific information processing, web users are required to access ads archived at different sites by employing a wide variety of searching tools individually. The existence of a unified database (DB), which integrates ads in

References (22)

  • B. Allison, An improved hierarchical Bayesian model of language for document classification, in: Proceedings of COLING,...
  • H. Chieu, H. Ng, Named entity recognition with a maximum entropy approach, in: Proceedings of Conference on Natural...
  • W. Cohen, Fast and effective rule induction, in: Proceedings of ICML, 1995, pp....
  • E. Cortez, A. da Silva, M. Goncalves, E. de Moura, ONDUX: on-demand unsupervised learning for information extraction,...
  • N. Dalvi et al., Automatic wrappers for large scale web extraction, VLDB Endowment (2011)
  • M. Hall, E. Frank, Combining Naive Bayes and decision tables, in: Proceedings of Florida Artificial Intelligence...
  • R. Khare, Y. An, An empirical study on using hidden Markov model for search interface segmentation, in: Proceedings of...
  • Y. Liu, Y. Zheng, One-against-all multi-class SVM classification using reliability measures, in: Proceedings of IJCNN,...
  • C. Manning et al., Introduction to Information Retrieval (2008)
  • C. Manning et al., Foundations of Statistical Natural Language Processing (2003)
  • G. Miao, J. Tatemura, W. Hsiung, A. Sawires, L. Moser, Extracting data records from the web using tag path clustering,...