A Dynamic Approach for Template and Content Extraction in Websites

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1159)

Abstract

Web scraping is a technique for extracting data from websites, and it is a pillar of information retrieval on an ever-growing World Wide Web. There are two main approaches to extracting data from a website: static and dynamic scraping. Static scraping requires input beyond the target website itself: the user must inspect the HTML of the target and identify patterns in its templates, which are then used to extract data. As a result, static scraping is highly vulnerable to changes in the page template. Dynamic scraping is a broad topic that has been tackled from many angles, including tree-based, natural language processing (NLP), computer vision, and machine learning techniques. For most websites, the problem can be broken into two main steps: finding the template of the pages from which data should be extracted, and then removing irrelevant text such as ads, text from controls, or JavaScript code.

This paper proposes a solution for dynamic scraping that uses AngleSharp for HTML retrieval and a slightly modified version of a previously published graph technique for template finding. Once a number of pages have been found, several heuristics can be applied for content extraction and noise filtering. These heuristics include text and hyperlink density, removal of content common to multiple pages (usually text from controls or static JavaScript), and a final layer of NLP techniques (breaking the content into sentences, tokenization, and part-of-speech tagging).
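To make the first two heuristics concrete, the following is a minimal Python sketch, not the authors' implementation (which uses AngleSharp and is not reproduced on this page). It assumes a block-level view of each page, a hypothetical link-density threshold of 0.5, and the simplifying rule that a text block appearing on every page is template noise. The names `BlockCollector`, `extract_blocks`, and `extract_content` are illustrative only, and the NLP layer is left out.

```python
from collections import Counter
from html.parser import HTMLParser


def _normalize(pieces):
    """Join text pieces and collapse runs of whitespace."""
    return " ".join("".join(pieces).split())


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records each block's link density."""

    BLOCK_TAGS = {"p", "div", "li", "td", "section", "article", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}  # never treated as content

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (text, link_density)
        self._text = []    # all text pieces of the current block
        self._link = []    # text pieces that sit inside <a> elements
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = _normalize(self._text)
        if text:
            density = len(_normalize(self._link)) / len(text)
            self.blocks.append((text, density))
        self._text, self._link = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link.append(data)


def extract_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def extract_content(pages, max_link_density=0.5):
    """Keep blocks with low link density that are not shared by every page.

    The 0.5 threshold is an assumed value, not one taken from the paper.
    """
    per_page = [extract_blocks(page) for page in pages]
    # Count on how many pages each distinct block text occurs.
    seen = Counter(t for blocks in per_page for t in {t for t, _ in blocks})
    result = []
    for blocks in per_page:
        result.append([
            text for text, density in blocks
            if density <= max_link_density
            and (len(pages) == 1 or seen[text] < len(pages))
        ])
    return result


# Two hypothetical pages that share a navigation bar and a control label.
template = (
    "<html><body><div><a href='/'>Home</a> <a href='/about'>About</a></div>"
    "<p>{body}</p><p>Click to subscribe</p>"
    "<script>var x = 1;</script></body></html>"
)
pages = [
    template.format(body="First article body with a <a href='/x'>link</a>."),
    template.format(body="Second article body."),
]
result = extract_content(pages)
print(result)
```

In this sketch the navigation bar is dropped because almost all of its text sits inside links, the subscribe label is dropped because it is identical on both pages, and the script body is skipped outright. A real system would likely need fuzzier matching for shared content than exact string equality.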


References

  1. Alarte, J., Insa, D., Silva, J., Tamarit, S.: Web template extraction based on hyperlink analysis. In: Escobar, S. (ed.): XIV Jornadas sobre Programación Y Lenguajes, PROLE 2014, Revised Selected Papers EPTCS, vol. 173, pp. 16–26 (2015)

  2. Liu, Q., Shao, M., Wu, L., Zhao, G., Fan, G.: Main content extraction from web pages based on node characteristics. J. Comput. Sci. Eng. 11(2), 39–48 (2017)

  3. Ferrara, E., De Meo, P., Fiumara, G., Baumgartner, R.F.: Web Data Extraction, Applications and Techniques: A Survey. arXiv:1207.0246v4 [cs.IR], 10 June 2014

  4. OpenNLP Documentation. https://opennlp.apache.org/docs/. Accessed 2 Nov 2019

  5. Uzun, E., Doruk, A., Nusret Buluş, H., Özhan, E.: Evaluation of HAP, AngleSharp and HTML document in web content extraction. In: International Scientific Conference, Gabrovo, 18 November 2017 (2017)

  6. Soderland, S.: Learning information extraction rules for semi-structured and free text. Mach. Learn. 34(1–3), 233–272 (1999)

  7. Muslea, I., Minton, S., Knoblock, C.: Hierarchical wrapper induction for semistructured information sources. Auton. Agents Multi-Agent Syst. 93–114 (2001). https://doi.org/10.1023/A:1010022931168

Corresponding author

Correspondence to Nicolae Cristian-Catalin .

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Cristian-Catalin, N., Dragan, M. (2020). A Dynamic Approach for Template and Content Extraction in Websites. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S., Orovic, I., Moreira, F. (eds) Trends and Innovations in Information Systems and Technologies. WorldCIST 2020. Advances in Intelligent Systems and Computing, vol 1159. Springer, Cham. https://doi.org/10.1007/978-3-030-45688-7_2
