DataRover: An Automated System for Extracting Product Information From Online Catalogs

Ahmed, Syed Toufeeq; Vadrevu, Srinivas; Davulcu, Hasan

doi:10.1007/3-540-33880-2_1

Part of the book series: Studies in Computational Intelligence (SCI, volume 23)

Abstract

The growing number of e-commerce Web sites makes it increasingly difficult to organize and search product information across multiple sites. The problem is exacerbated by the different presentation templates that Web sites use to display product information and by the different ways in which they store that information in their catalogs. This paper describes the DataRover system, which automatically crawls online catalogs and extracts all of their products. DataRover is based on pattern mining algorithms and domain-specific heuristics that exploit navigational and presentation regularities to identify taxonomy, list-of-product, and single-product segments within an online catalog. It then uses the inferred patterns to extract data from all such segments and to automatically transform the online catalog into a database of categorized products. We also provide experimental results that demonstrate the efficacy of DataRover.
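
To make the idea of a "presentation regularity" concrete, the following is a minimal sketch of how a list-of-product segment might be spotted: sibling elements that share the same tag structure are treated as candidate product records. This is not DataRover's algorithm, which the abstract does not specify. The names `Node`, `TreeBuilder`, `find_record_lists`, the `min_repeats` threshold, and the sample HTML are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One HTML element; items holds child Nodes and text in document order."""

    def __init__(self, tag):
        self.tag = tag
        self.items = []

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def signature(self):
        # Structure only: tag names and nesting, ignoring text and attributes.
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def texts(self):
        out = []
        for item in self.items:
            out.extend(item.texts() if isinstance(item, Node) else [item])
        return out


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].items.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].items.append(text)


def find_record_lists(html, min_repeats=3):
    """Return (parent_tag, records) for parents with repeated child structure."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = []

    def visit(node):
        kids = node.children
        # Only count children that have inner structure of their own.
        counts = Counter(k.signature() for k in kids if k.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats:
                records = [k.texts() for k in kids if k.signature() == sig]
                results.append((node.tag, records))
        for k in kids:
            visit(k)

    visit(builder.root)
    return results


SAMPLE_HTML = """
<html><body>
<ul class="nav"><li>Home</li><li>About</li></ul>
<div id="products">
  <div class="item"><h3>Red Lamp</h3><span>$19.99</span></div>
  <div class="item"><h3>Blue Chair</h3><span>$49.00</span></div>
  <div class="item"><h3>Oak Desk</h3><span>$129.50</span></div>
</div>
</body></html>
"""

for parent_tag, records in find_record_lists(SAMPLE_HTML):
    print(parent_tag, records)
```

On the sample page, the navigation list is skipped because its items have no inner structure, while the three structurally identical `div` blocks are reported as one repeated segment with their name and price text. A real catalog would need more tolerant matching (optional fields, nested tables, noisy markup) than this exact-signature comparison provides.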

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, 85287, USA
Syed Toufeeq Ahmed, Srinivas Vadrevu & Hasan Davulcu

Editor information

Editors and Affiliations

Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last
Institute of Computer Sciences, Technical University of Lodz, ul. Wolczanska 215, 93-1005, Lodz, Poland
Piotr S. Szczepaniak
Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warsaw, Poland
Piotr S. Szczepaniak
Department of Software Engineering, ORT Braude College, POB. 78, 21982, Karmiel, Israel
Zeev Volkovich
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., ENB 118, Tampa, FL, 33620, USA
Abraham Kandel

About this chapter

Cite this chapter

Ahmed, S.T., Vadrevu, S., Davulcu, H. (2006). DataRover: An Automated System for Extracting Product Information From Online Catalogs. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_1

DOI: https://doi.org/10.1007/3-540-33880-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)