Abstract
Automatically scraping relevant images from web pages is error-prone and time-consuming, so experts often prefer to prepare extraction patterns for a website by hand, and existing web scraping tools are built on such patterns. This manual approach, however, is laborious and requires specialized knowledge. Automatic extraction approaches are a potential alternative, but they require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, uses small training datasets, and achieves a low error rate while saving time and storage. Our approach clusters the web pages of a website and suggests several pages on which a non-expert annotates the relevant images. It then uses these annotations to construct a learning model based on textual data from the HTML elements. Our experiments use a dataset of 635,015 images, 22,632 of them relevant, collected from 200 news websites with 100 pages each. Among the machine learning methods compared, AdaBoost yields the best performance for both the automatic approaches and our proposed approach. With automatic extraction approaches, the best F-measure achieved is 0.805, using a learning model constructed from a large training dataset of 120 websites (12,000 web pages). In contrast, our approach achieves an average F-measure of 0.958 across 200 websites with only six annotated web pages per website, so a non-expert needs to examine only 1,200 web pages to determine the relevant images for 200 websites. Because it is based on textual data, our approach also saves time and storage space by not downloading images and can be easily integrated into currently available web scraping tools.
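As a rough illustration of what "textual data from the HTML elements" could look like, the sketch below collects a few text-only features for every `<img>` tag using Python's standard `html.parser`, without fetching any image. The class `ImgFeatureParser`, the helper `extract_img_features`, the chosen features (`src`, `alt`, `class`, ancestor path), and the sample HTML are all assumptions made for illustration; the abstract does not specify the actual feature set.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImgFeatureParser(HTMLParser):
    """Collects text-only features per <img> (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open ancestor elements as (tag, class) pairs
        self.images = []  # one feature dict per <img> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "class": attrs.get("class") or "",
                # Ancestor path such as "html/body/article.story/figure"
                "path": "/".join(f"{t}.{c}" if c else t
                                 for t, c in self.stack),
            })
        elif tag not in VOID_TAGS:
            self.stack.append((tag, attrs.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_img_features(html):
    """Return one dict of textual features per <img> found in the HTML."""
    parser = ImgFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up sample page: a logo (likely irrelevant) and a lead news photo.
SAMPLE = """
<html><body>
<div class="header"><img src="/logo.png" alt="Site logo"></div>
<article class="story"><figure>
<img src="/news/fire.jpg" alt="Firefighters at the scene" class="lead">
</figure></article>
</body></html>
"""

features = extract_img_features(SAMPLE)
for f in features:
    print(f["src"], f["path"])
```

Feature dictionaries of this kind could then be vectorized and passed to a classifier such as AdaBoost, trained on the handful of pages a non-expert has annotated; that step is omitted here because it depends on details the abstract does not give.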
Index Terms
- Scraping Relevant Images from Web Pages without Download