Scraping Relevant Images from Web Pages without Download

Published: 11 October 2023
Abstract

Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools, because it is based on textual data.
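The idea of describing each image purely through text in the surrounding HTML, without fetching the image file, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the chosen attributes (`src`, `alt`, `class`, and the parent element's tag, `class`, and `id`), the tokenization rule, and the names `ImageFeatureParser`, `extract_image_features`, and `SAMPLE_PAGE` are all hypothetical, since the abstract does not list the actual features.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ImageFeatureParser(HTMLParser):
    """Collect text-only features for every <img>, using the markup alone."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # currently open ancestors as (tag, attributes)
        self.images = []         # one feature record per <img> element

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "img":
            parent_tag, parent_attrs = (
                self.open_elements[-1] if self.open_elements else ("", {})
            )
            # Assumed feature set: the image's own attributes plus its parent's.
            text = " ".join([
                attr_map.get("src", ""),
                attr_map.get("alt", ""),
                attr_map.get("class", ""),
                parent_tag,
                parent_attrs.get("class", ""),
                parent_attrs.get("id", ""),
            ])
            self.images.append({"src": attr_map.get("src", ""),
                                "tokens": tokenize(text)})
        elif tag not in VOID_TAGS:
            self.open_elements.append((tag, attr_map))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break


def extract_image_features(html):
    """Return a list of {"src", "tokens"} records, one per <img> in the page."""
    parser = ImageFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Hypothetical page fragment: a site logo and a lead photo inside an article.
SAMPLE_PAGE = """
<html><body>
<div class="header"><img src="/static/logo.png" alt="Site logo" class="logo"></div>
<article class="story-body">
  <figure class="lead-photo"><img src="/media/2023/flood.jpg" alt="Flooded street"></figure>
</article>
</body></html>
"""

if __name__ == "__main__":
    for image in extract_image_features(SAMPLE_PAGE):
        print(image["src"], image["tokens"])
```

Token lists like these could be turned into a bag-of-words matrix and, together with the relevant/irrelevant labels a non-expert assigns on the suggested pages, passed to a classifier such as AdaBoost, which the abstract reports as the best-performing method. Because only markup is parsed, no image bytes are requested at any point.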

Published in

ACM Transactions on the Web, Volume 18, Issue 1
February 2024
448 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/3613532
Editor: Ryen White

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 11 October 2023
• Online AM: 19 August 2023
• Accepted: 2 August 2023
• Revised: 1 June 2023
• Received: 11 August 2022
