research-article

Scraping Relevant Images from Web Pages without Download

Author:
Erdinç Uzun

Tekirdağ Namık Kemal University, Turkey

ORCID: 0000-0003-4351-2244

ACM Transactions on the Web, Volume 18, Issue 1, Article No. 1, pp. 1–27. https://doi.org/10.1145/3616849

Published: 11 October 2023

Abstract

Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools, because it is based on textual data.
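The idea of describing each image by text taken from the HTML alone, rather than by properties of the downloaded file, can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module. The class `ImgFeatureParser`, the function `extract_img_features`, the sample page, and the chosen feature names (`src`, `alt`, `class`, `parent_tag`, `parent_class`, `depth`) are all assumptions made for illustration; the abstract does not list the paper's actual features, and the clustering and AdaBoost training steps are not shown.

```python
from html.parser import HTMLParser


class ImgFeatureParser(HTMLParser):
    """Collect textual features for every <img> without fetching image bytes."""

    # Void elements never get a closing tag, so they must not go on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open ancestors as (tag, class) pairs
        self.images = []   # one feature dict per <img> (hypothetical features)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            parent_tag, parent_class = self.stack[-1] if self.stack else ("", "")
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "class": attrs.get("class") or "",
                "parent_tag": parent_tag,
                "parent_class": parent_class,
                "depth": len(self.stack),
            })
        elif tag not in self.VOID:
            self.stack.append((tag, attrs.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_img_features(html):
    """Return a list of textual feature dicts, one per <img> in the page."""
    parser = ImgFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up page: a site logo in the header and a lead photo inside the article.
page = """
<html><body>
  <header><img src="/static/logo.png" alt="Site logo" class="logo"></header>
  <article class="story">
    <figure class="lead-photo">
      <img src="/media/2023/flood.jpg" alt="Flooded street in the city centre">
    </figure>
  </article>
</body></html>
"""

features = extract_img_features(page)
for f in features:
    print(f)
```

Feature dictionaries of this kind, paired with a non-expert's relevant/irrelevant labels for a few pages per website, are the sort of input a text-based classifier could be trained on. No image is requested at any point, which is where the time and storage savings come from.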

Index Terms

1. Information systems
  1. World Wide Web
    1. Web mining
      1. Data extraction and integration
      2. Site wrapping
2. Software and its engineering
  1. Software creation and management
    1. Designing software
      1. Software design engineering

Published in
ACM Transactions on the Web Volume 18, Issue 1
February 2024
448 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/3613532
Editor:
Ryen White
Microsoft Research, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 October 2023
- Online AM: 19 August 2023
- Accepted: 2 August 2023
- Revised: 1 June 2023
- Received: 11 August 2022
Author Tags
Web mining
relevant images
web data extraction
crawler design and evaluation
Qualifiers
- research-article