Mining Product Features from the Web: A Self-supervised Approach

Ferrez, Rémi; de Groc, Clément; Couto, Javier

doi:10.1007/978-3-642-36608-6_19

Conference paper

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 140)

Abstract

Mining information available on the Web to automatically build knowledge bases is of interest to both academic research and industry. Existing wrapper induction approaches either require manual annotations or build domain-specific extractors that usually do not cope with template changes. In this paper, we tackle the problem of large-scale product feature extraction from e-commerce web sites. We propose a novel self-supervised approach that relies on visual clues and a small knowledge base to automatically annotate product features. Our approach does not need an initial set of labeled pages to learn extraction rules and is robust to web site changes. Experimental results on product data extraction from 10 major French e-commerce web sites (roughly 1 000 web pages) show that the proposed method is promising. Moreover, the experiments show that our method can handle web site template changes without human intervention.
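
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of annotating product features from a small knowledge base and a visual clue. It is not the authors' method. The seed feature names, the `TextBlock` structure with pixel coordinates, the `row_tolerance` parameter, and the "value sits to the right on the same row" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical seed knowledge base: feature names already known for a product category.
KNOWN_FEATURE_NAMES = {"marque", "couleur", "poids"}


@dataclass
class TextBlock:
    """A rendered text node; x and y are assumed pixel coordinates of its top-left corner."""
    text: str
    x: int
    y: int


def normalize(text: str) -> str:
    """Lowercase the text and drop a trailing colon and surrounding whitespace."""
    return text.strip().rstrip(":").strip().lower()


def annotate(blocks: list[TextBlock], row_tolerance: int = 2) -> list[tuple[str, str]]:
    """Pair each known feature name with the nearest block to its right on the same visual row."""
    pairs = []
    for name_block in blocks:
        if normalize(name_block.text) not in KNOWN_FEATURE_NAMES:
            continue
        # Assumed visual clue: the value is rendered on the same row, right of the name.
        same_row = [
            b for b in blocks
            if b is not name_block
            and abs(b.y - name_block.y) <= row_tolerance
            and b.x > name_block.x
        ]
        if same_row:
            value_block = min(same_row, key=lambda b: b.x)
            pairs.append((normalize(name_block.text), value_block.text.strip()))
    return pairs


# Toy page: two feature rows and one unrelated block.
page = [
    TextBlock("Marque :", 10, 100),
    TextBlock("Acme", 120, 100),
    TextBlock("Couleur :", 10, 120),
    TextBlock("Rouge", 120, 121),
    TextBlock("Livraison gratuite", 10, 300),
]
annotations = annotate(page)
```

In a self-supervised setting, pairs annotated this way could serve as automatically generated training examples for learning extraction rules, which is why no manually labeled pages are required.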

Author information

Authors and Affiliations

Syllabs, Paris, France
Rémi Ferrez, Clément de Groc & Javier Couto
Univ. Paris Sud & LIMSI-CNRS, Orsay, France
Clément de Groc
MoDyCo, UMR 7114, CNRS-Université de Paris Ouest Nanterre La Défense, France
Javier Couto

Editor information

Editors and Affiliations

Institute for Systems and Technologies of Information, Control and Communication (INSTICC), and Instituto Politécnico de Setúbal (IPS), Setúbal, Portugal
José Cordeiro
RWTH Aachen University, Ahornstr. 55, 52074, Aachen, Germany
Karl-Heinz Krempels

About this paper

Cite this paper

Ferrez, R., de Groc, C., Couto, J. (2013). Mining Product Features from the Web: A Self-supervised Approach. In: Cordeiro, J., Krempels, KH. (eds) Web Information Systems and Technologies. WEBIST 2012. Lecture Notes in Business Information Processing, vol 140. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36608-6_19

DOI: https://doi.org/10.1007/978-3-642-36608-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36607-9
Online ISBN: 978-3-642-36608-6
eBook Packages: Computer Science, Computer Science (R0)
