Robust Web Data Extraction Based on Unsupervised Visual Validation

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11431)

Abstract

Visual validation is the process of validating sets of extracted entities by means of visual information. Its main advantage is that it exploits visual information for web information extraction without compromising the robustness of extractors. In this paper, we show that unsupervised visual validation can be used to create robust web data extractors. More precisely, we evaluate the performance of visual validation on a corpus of visually heterogeneous documents. The selected extraction task consists of extracting the price, name, description, and SKU of unspecified products from unseen documents. Our corpus, which we make publicly available, contains 1000 products from 100 different sources. Results also show that visual validation improves web data extraction even when the extractor is trained with visual features.
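
The idea of validating a set of extracted entities with visual information can be illustrated with a minimal sketch. The abstract does not specify the visual features or the validation model, so everything below is an assumption made for illustration: each candidate carries a single hypothetical `font_size` feature, and a median-absolute-deviation rule stands in for whatever unsupervised method the paper actually uses. The names `visual_outliers`, `candidates`, and `threshold` are likewise hypothetical.

```python
from statistics import median


def visual_outliers(candidates, threshold=3.5):
    """Return candidates whose font size deviates strongly from the rest.

    Assumption: a modified z-score based on the median absolute deviation
    (MAD) stands in for the paper's unsupervised validation step.
    """
    sizes = [c["font_size"] for c in candidates]
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    if mad == 0:
        # Every typical value is identical, so any different value is suspect.
        return [c for c in candidates if c["font_size"] != med]
    return [
        c for c in candidates
        if 0.6745 * abs(c["font_size"] - med) / mad > threshold
    ]


# Hypothetical "price" candidates returned by an extractor on six pages.
candidates = [
    {"text": "$19.99", "font_size": 24},
    {"text": "$5.49", "font_size": 22},
    {"text": "$120.00", "font_size": 26},
    {"text": "$8.00", "font_size": 24},
    {"text": "$42.10", "font_size": 23},
    {"text": "$3.99 shipping", "font_size": 9},  # visually unlike the others
]

rejected = visual_outliers(candidates)
```

Because the check runs only on the extractor's output, the extractor itself does not need to depend on visual features; the validation step simply discards candidates that look unlike the rest of the set.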


Notes

  1. Files can be downloaded at this link: https://drive.google.com/drive/folders/1GYU6ZgZOXsNq4–F7o8v3qjr47DcLvK3.

  2. Tested classifiers for this task were: a Gaussian Naive Bayes classifier, a k-nearest neighbor classifier, a multi-class SVM classifier (one-versus-one), and the two selected classifiers.

Acknowledgments

The authors gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Correspondence to Benoit Potvin or Roger Villemaire.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Potvin, B., Villemaire, R. (2019). Robust Web Data Extraction Based on Unsupervised Visual Validation. In: Nguyen, N., Gaol, F., Hong, TP., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2019. Lecture Notes in Computer Science, vol 11431. Springer, Cham. https://doi.org/10.1007/978-3-030-14799-0_7

  • DOI: https://doi.org/10.1007/978-3-030-14799-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14798-3

  • Online ISBN: 978-3-030-14799-0

  • eBook Packages: Computer Science, Computer Science (R0)
