ABSTRACT
Data about everything is readily available on the web, but often accessible only through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.
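The page-buffering claim can be made concrete with a small simulation. The sketch below is not OXPath's evaluation algorithm; it is a simplified illustration that assumes a non-recursive query amounts to a fixed number of navigation steps evaluated depth-first, with each page released as soon as all of its continuations are done. All names (`PageBuffer`, `links`, `evaluate`, `run`) and the synthetic site are hypothetical. Under these assumptions, the peak number of buffered pages depends only on the number of steps in the query, not on how many pages are visited.

```python
from typing import Iterator, List


class PageBuffer:
    """Counts how many simulated pages are held in memory at once."""

    def __init__(self) -> None:
        self.live = 0          # pages currently buffered
        self.peak = 0          # largest value `live` ever reached
        self.total_loaded = 0  # pages visited over the whole run

    def load(self, url: str) -> str:
        self.live += 1
        self.total_loaded += 1
        self.peak = max(self.peak, self.live)
        return url

    def release(self) -> None:
        self.live -= 1


def links(page: str, fanout: int) -> List[str]:
    """Hypothetical stand-in for the pages reachable from `page` by one action."""
    return [f"{page}/{i}" for i in range(fanout)]


def evaluate(page: str, steps_left: int, fanout: int, buf: PageBuffer) -> Iterator[str]:
    """Depth-first evaluation of a query with a fixed number of navigation steps."""
    current = buf.load(page)
    try:
        if steps_left == 0:
            yield current  # "extract" from the final page
        else:
            for target in links(current, fanout):
                yield from evaluate(target, steps_left - 1, fanout, buf)
    finally:
        buf.release()  # drop the page once all of its continuations are done


def run(steps: int, fanout: int) -> PageBuffer:
    """Evaluate the simulated query to completion and return the buffer statistics."""
    buf = PageBuffer()
    for _ in evaluate("root", steps, fanout, buf):
        pass
    return buf


if __name__ == "__main__":
    for fanout in (2, 5):
        stats = run(steps=3, fanout=fanout)
        print(f"fanout={fanout}: visited={stats.total_loaded}, peak buffered={stats.peak}")
```

With three navigation steps, raising the fan-out from 2 to 5 increases the number of visited pages roughly tenfold in this simulation, while the peak number of buffered pages stays at four (one per step plus the start page).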
Index Terms
- OXPath: little language, little memory, great value