RecipeCrawler: Collecting Recipe Data from WWW Incrementally

  • Conference paper
Advances in Web-Age Information Management (WAIM 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Abstract

The World Wide Web is the largest data repository in human history, and many efforts have been made to use it as a data source. In this paper we focus on building a robust system that collects structured recipe data from the Web incrementally, which we believe is a critical step towards practical, continuous, and reliable web data extraction systems, and therefore towards using the Web as a data source for various database applications. We advocate an incremental approach for two reasons: (1) it is impractical to crawl all the recipe pages from relevant web sites, because the Web is highly dynamic; (2) it is almost impossible to induce a general wrapper for future extraction from the initial batch of recipe web pages. We describe such a system, called RecipeCrawler, which incrementally collects recipe data from the Web. We consider the general issues in building an incremental data extraction system and apply the resulting techniques to recipe data collection. RecipeCrawler serves as the backend of a full-fledged multimedia recipe database system being developed jointly by City University of Hong Kong and Renmin University of China.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, Y., Meng, X., Wang, L., Li, Q. (2006). RecipeCrawler: Collecting Recipe Data from WWW Incrementally. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_23

  • DOI: https://doi.org/10.1007/11775300_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35225-9

  • Online ISBN: 978-3-540-35226-6

  • eBook Packages: Computer Science, Computer Science (R0)
