Abstract
Web scraping is a technique for extracting data from websites, and it is a pillar of information retrieval on an ever-growing World Wide Web. There are two main ways of extracting data from a website: static and dynamic scraping. Static scraping requires input beyond the target website itself, because the user must inspect the HTML content of the target and find patterns in its templates that are then used to extract data. Static scraping is also very vulnerable to changes in the template of the web page. Dynamic scraping is a very broad topic that has been tackled from many different angles: tree-based, natural language processing (NLP), computer vision, and machine learning techniques. For most websites, the problem can be broken into two main steps: finding the template of the pages we want to extract data from, and then removing irrelevant text such as ads, text from controls, or JavaScript code.
This paper proposes a solution for dynamic scraping that uses AngleSharp for HTML retrieval and a slightly modified version of the graph technique mentioned in for template finding. Once a number of pages have been found, several heuristics can be applied for content extraction and noise filtering. Such heuristics include text and hyperlink density, removal of content common to multiple pages (usually text from controls or static JavaScript), and a final layer of NLP techniques (breaking the content into sentences, tokenization, and part-of-speech tagging).
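As a rough illustration of two of the heuristics named above (hyperlink density and removal of content shared between pages), the following minimal sketch may help. It is not the paper's implementation: it uses Python's standard `html.parser` instead of AngleSharp, and the names `link_density`, `strip_common_lines`, and the `min_share` parameter are hypothetical choices made for this example only.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextStats(HTMLParser):
    """Collects visible text and counts how much of it sits inside <a> tags."""

    SKIPPED = {"script", "style"}  # assumption: treat these as non-content

    def __init__(self):
        super().__init__()
        self.text_chars = 0   # characters of visible text
        self.link_chars = 0   # characters of visible text inside links
        self.lines = []       # visible text fragments, in document order
        self._in_link = 0
        self._skipped = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in self.SKIPPED:
            self._skipped += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.SKIPPED and self._skipped:
            self._skipped -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skipped:
            return
        self.text_chars += len(text)
        if self._in_link:
            self.link_chars += len(text)
        self.lines.append(text)


def _stats(html):
    parser = _TextStats()
    parser.feed(html)
    parser.close()
    return parser


def link_density(html):
    """Share of visible text that is hyperlink text (0.0 for empty input)."""
    stats = _stats(html)
    return stats.link_chars / stats.text_chars if stats.text_chars else 0.0


def strip_common_lines(pages, min_share=1.0):
    """Drop text fragments that occur on at least min_share of the pages."""
    per_page = [_stats(page).lines for page in pages]
    if len(pages) < 2:
        return per_page  # nothing to compare against
    counts = Counter(line for lines in per_page for line in set(lines))
    threshold = min_share * len(pages)
    return [[l for l in lines if counts[l] < threshold] for lines in per_page]
```

A block with a high `link_density` is more likely to be navigation than article text, and fragments that survive `strip_common_lines` are the ones that differ between pages sharing a template. Sentence splitting, tokenization, and part-of-speech tagging would follow as a separate step and are not shown here.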
References
Alarte, J., Insa, D., Silva, J., Tamarit, S.: Web template extraction based on hyperlink analysis. In: Escobar, S. (ed.): XIV Jornadas sobre Programación Y Lenguajes, PROLE 2014, Revised Selected Papers EPTCS, vol. 173, pp. 16–26 (2015)
Liu, Q., Shao, M., Wu, L., Zhao, G., Fan, G.: Main content extraction from web pages based on node characteristics. J. Comput. Sci. Eng. 11(2), 39–48 (2017)
Ferrara, E., De Meo, P., Fiumara, G., Baumgartner, R.F.: Web Data Extraction, Applications and Techniques: A Survey. arXiv:1207.0246v4 [cs.IR], 10 June 2014
OpenNLP Documentation. https://opennlp.apache.org/docs/. Accessed 2 Nov 2019
Uzun, E., Doruk, A., Nusret Buluş, H., Özhan, E.: Evaluation of HAP, AngleSharp and HTML document in web content extraction. In: International Scientific Conference, Gabrovo, 18 November 2017 (2017)
Soderland, S.: Learning information extraction rules for semi-structured and free text. Mach. Learn. 34(1–3), 233–272 (1999)
Muslea, I., Minton, S., Knoblock, C.: Hierarchical wrapper induction for semistructured information sources. Auton. Agents Multi-Agent Syst. 93–114 (2001). https://doi.org/10.1023/A%3A1010022931168
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cristian-Catalin, N., Dragan, M. (2020). A Dynamic Approach for Template and Content Extraction in Websites. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S., Orovic, I., Moreira, F. (eds) Trends and Innovations in Information Systems and Technologies. WorldCIST 2020. Advances in Intelligent Systems and Computing, vol 1159. Springer, Cham. https://doi.org/10.1007/978-3-030-45688-7_2
DOI: https://doi.org/10.1007/978-3-030-45688-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45687-0
Online ISBN: 978-3-030-45688-7
eBook Packages: Intelligent Technologies and Robotics (R0)