Mining Product Features from the Web: A Self-supervised Approach

  • Conference paper

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 140)

Abstract

Mining information available on the Web to automatically build knowledge bases is of interest to both academic research and industry. Existing wrapper induction approaches either require manual annotations or aim to build domain-specific extractors that usually do not cope with template changes. In this paper, we tackle the problem of large-scale product feature extraction from e-commerce web sites. We propose a novel self-supervised approach that relies on visual clues and a small knowledge base to automatically annotate product features. Our approach does not need an initial set of labeled pages to learn extraction rules and is robust to web site changes. Experimental results on product data extraction from 10 major French e-commerce web sites (roughly 1,000 web pages) show that the proposed method is promising. The experiments also show that our method can handle web site template changes without human intervention.




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferrez, R., de Groc, C., Couto, J. (2013). Mining Product Features from the Web: A Self-supervised Approach. In: Cordeiro, J., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2012. Lecture Notes in Business Information Processing, vol 140. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36608-6_19

  • DOI: https://doi.org/10.1007/978-3-642-36608-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36607-9

  • Online ISBN: 978-3-642-36608-6

  • eBook Packages: Computer Science, Computer Science (R0)
