DOI: 10.1145/2490257.2490297
research-article

DEiXTo: a web data extraction suite

Published: 19 September 2013

Abstract

Web data extraction (or web scraping) is the process of collecting unstructured or semi-structured information from the World Wide Web at varying levels of automation. It is an important and practical approach to web reuse, and it can also support the transition from the web to the semantic web by providing the structured data that the latter requires. In this paper we present DEiXTo, a web data extraction suite whose features are aimed at designing and deploying well-engineered extraction tasks. We focus on the core pattern matching algorithm and the overall architecture, which allows custom-made solutions to be programmed for hard extraction tasks. DEiXTo consists of both freeware and open source components.
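
As a rough illustration of what turning semi-structured web content into structured data can look like, the following minimal Python sketch extracts records from a small HTML fragment using only the standard library. It is not DEiXTo's pattern matching algorithm, which the abstract does not describe in detail. The markup, the class names (`item`, `title`, `price`) and the helper names `RecordExtractor` and `extract_records` are all assumptions made up for this example.

```python
from html.parser import HTMLParser


class RecordExtractor(HTMLParser):
    """Build one dict per <li class="item">, keyed by the class of each inner <span>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled, or None outside an item
        self._field = None    # field name of the <span> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._current = {}
        elif tag == "span" and self._current is not None:
            self._field = attrs.get("class")

    def handle_data(self, data):
        # Only text inside a <span> of a matching item is kept.
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_records(html):
    """Return the list of records found in the given HTML string."""
    parser = RecordExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page fragment: two data items and one element to be ignored.
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="title">Book A</span> <span class="price">10</span></li>
  <li class="item"><span class="title">Book B</span> <span class="price">12</span></li>
  <li class="ad">Sponsored</li>
</ul>
"""

records = extract_records(SAMPLE_HTML)
```

Here the "pattern" is hard-coded in the parser callbacks. A suite like the one described would let the pattern be designed separately from the code that runs it, which is what makes extraction tasks reusable across pages with the same layout.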

Published In

BCI '13: Proceedings of the 6th Balkan Conference in Informatics
September 2013
293 pages
ISBN:9781450318518
DOI:10.1145/2490257
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • University of Macedonia
  • Aristotle University of Thessaloniki
  • The University of Sheffield
  • Greek Com Soc: Greek Computer Society
  • SEERC: South-East European Research Centre
  • Alexander TEI of Thessaloniki

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. pattern matching
  2. web data extraction
  3. web scraping

Qualifiers

  • Research-article

Funding Sources

  • Operational Program Education and Lifelong Learning (International Hellenic University, Thessaloniki, Greece)

Conference

BCI '13
Sponsor:
  • The University of Sheffield
  • Greek Com Soc
  • SEERC
BCI '13: Balkan Conference in Informatics
September 19 - 21, 2013
Thessaloniki, Greece

Acceptance Rates

BCI '13 Paper Acceptance Rate 41 of 103 submissions, 40%;
Overall Acceptance Rate 97 of 250 submissions, 39%

Cited By

  • (2022) On extracting data from tables that are encoded using HTML. Knowledge-Based Systems 190:C. DOI: 10.1016/j.knosys.2019.105157. Online publication date: 22-Apr-2022
  • (2022) OntoDynS: expediting personalization and diversification in semantic search by facilitating cognitive human interaction through ontology bagging and dynamic ontology alignment. Journal of Ambient Intelligence and Humanized Computing 14:7 (8667-8691). DOI: 10.1007/s12652-021-03624-9. Online publication date: 28-Jan-2022
  • (2020) Design of a Sentiment Lexicon for the Greek Food and Beverage Sector. Operational Research in Agriculture and Tourism (49-66). DOI: 10.1007/978-3-030-38766-2_3. Online publication date: 6-May-2020
  • (2018) Learning patterns for discovering domain-oriented opinion words. Knowledge and Information Systems 55:1 (45-77). DOI: 10.1007/s10115-017-1072-y. Online publication date: 1-Apr-2018
  • (2015) Developing a framework that utilizes intelligent agents to extract multi-lingual web news. 2015 2nd World Symposium on Web Applications and Networking (WSWAN) (1-5). DOI: 10.1109/WSWAN.2015.7210335. Online publication date: Mar-2015
  • (2015) Crawling images with web browser support. 2015 IEEE 13th International Scientific Conference on Informatics (286-289). DOI: 10.1109/Informatics.2015.7377848. Online publication date: Nov-2015
  • (2015) SSIE: An Automatic Data Extractor for Sports Management in Athletics Modality. 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (144-151). DOI: 10.1109/CIT/IUCC/DASC/PICOM.2015.23. Online publication date: Oct-2015
  • (2014) Ducky. Proceedings of the 18th International Database Engineering & Applications Symposium (342-347). DOI: 10.1145/2628194.2628244. Online publication date: 7-Jul-2014
  • (2014) Enhanced OAI-PMH services for metadata sharing in heterogeneous environments. Library Review 63:6/7 (465-489). DOI: 10.1108/LR-05-2014-0051. Online publication date: 26-Aug-2014
  • (2014) Web data extraction, applications and techniques. Knowledge-Based Systems 70:C (301-323). DOI: 10.1016/j.knosys.2014.07.007. Online publication date: 1-Nov-2014