demonstration

OXPath: little language, little memory, great value

Authors:
Andrew Jon Sellers, University of Oxford, Oxford, United Kingdom
Tim Furche, University of Oxford, Oxford, United Kingdom
Georg Gottlob, University of Oxford, Oxford, United Kingdom
Giovanni Grasso, University of Oxford, Oxford, United Kingdom
Christian Schallhart, University of Oxford, Oxford, United Kingdom

WWW '11: Proceedings of the 20th international conference companion on World wide web, March 2011, Pages 261–264. https://doi.org/10.1145/1963192.1963304

Published: 28 March 2011

ABSTRACT

Data about everything is readily available on the web, but it is often accessible only through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.
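
The constant-buffer idea can be illustrated with a minimal Python sketch. This is not OXPath and not the authors' algorithm: it only shows, under stated assumptions, how a non-recursive "follow the next link and extract" query can run while keeping a fixed number of pages in memory. The site contents (`FAKE_SITE`), the class `BoundedPageBuffer`, and the function `extract_jobs` are all hypothetical names introduced for illustration, and the XPath subset used is that of Python's standard `xml.etree.ElementTree`.

```python
from collections import deque
from typing import Iterator
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a browser: maps a page id to well-formed XHTML.
FAKE_SITE = {
    "page1": "<html><body><a class='next' href='page2'/>"
             "<span class='job'>Editor</span></body></html>",
    "page2": "<html><body><a class='next' href='page3'/>"
             "<span class='job'>Analyst</span></body></html>",
    "page3": "<html><body><span class='job'>Engineer</span></body></html>",
}


class BoundedPageBuffer:
    """Keeps at most `capacity` parsed pages; older pages are evicted."""

    def __init__(self, capacity: int) -> None:
        self.pages: deque = deque(maxlen=capacity)
        self.peak = 0  # largest number of pages held at any one time

    def load(self, url: str) -> ET.Element:
        page = ET.fromstring(FAKE_SITE[url])
        self.pages.append(page)  # evicts the oldest page when full
        self.peak = max(self.peak, len(self.pages))
        return page


def extract_jobs(start_url: str, buffer: BoundedPageBuffer) -> Iterator[str]:
    """Follow 'next' links and yield job titles, one page at a time."""
    url = start_url
    while url is not None:
        page = buffer.load(url)
        # Emit results immediately so nothing from this page must be retained.
        for span in page.findall(".//span[@class='job']"):
            yield span.text
        nxt = page.find(".//a[@class='next']")
        url = nxt.get("href") if nxt is not None else None


if __name__ == "__main__":
    buf = BoundedPageBuffer(capacity=1)
    for job in extract_jobs("page1", buf):
        print(job)
    print("peak pages buffered:", buf.peak)
```

Because each result is yielded as soon as it is found, the number of pages held does not grow with the length of the navigation chain; in this sketch the bound is simply the `capacity` chosen by the caller.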

Index Terms

1. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Published in
WWW '11: Proceedings of the 20th international conference companion on World wide web
March 2011
552 pages
ISBN: 9781450306379
DOI: 10.1145/1963192
General Chairs: S. Sadagopan (IIIT-Bangalore, India), Krithi Ramamritham (IIT-Bombay, India), Arun Kumar (IBM Research, India), M. P. Ravindra (Infosys E & R, India)
Program Chairs: Elisa Bertino (Purdue University, USA), Ravi Kumar (Yahoo! Research, USA)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ajax
web automation
web extraction
xpath
Qualifiers
- demonstration
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%