Automatically Injecting Semantic Annotations into Online Articles

  • Conference paper
Advanced Information Networking and Applications (AINA 2021)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 227)

Abstract

Extracting information from the Web is a popular topic among data analysts and scientists. Web scraping is a prominent way to do so: it parses HTML pages and extracts data from their embedded tags. To that end, software developers write customized scripts for each target website. Crafting such scripts is challenging because websites differ in structure and are often rendered dynamically. An even more challenging task is to infer semantic annotations from a news article. It would be useful, for example, for a term such as "the US" in an already published article to be automatically annotated with its population and area. We argue that these annotations should not be mere hyperlinks for human readers; rather, they can be made machine-readable using the Web standard RDF (Resource Description Framework). Embedding such RDF inside the HTML page of the news article should enrich it with automatically generated semantics that can boost its SEO (Search Engine Optimization) and are ready to be consumed by conventional Web scrapers. As a proof of concept, we built a prototype that works on the plain text of an article, without requiring structured tags or attributes, and generates a new Web document augmented with semantic annotations in RDF markup, readable by humans and machine-consumable by Web scrapers.
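
The idea of turning plain article text into an RDF-annotated Web document can be illustrated with a minimal sketch. It is not the authors' prototype: the lookup table `KNOWLEDGE`, the function names `annotate` and `to_html`, the choice of RDFa attributes with DBpedia-style identifiers, and the numeric values are all assumptions made for illustration. A real system would presumably resolve entities against a knowledge base rather than a hard-coded dictionary.

```python
import html
import re

# Hypothetical lookup table standing in for a knowledge base.
# The URI, property names, and values are illustrative assumptions,
# not data taken from the paper.
KNOWLEDGE = {
    "US": {
        "uri": "http://dbpedia.org/resource/United_States",
        "type": "dbo:Country",
        "facts": {
            "dbo:populationTotal": "331000000",  # rough, illustrative value
            "dbo:areaTotal": "9800000",          # rough, illustrative value (km^2)
        },
    },
}


def annotate(text, knowledge=KNOWLEDGE):
    """Wrap known terms found in plain text in RDFa-annotated <span> elements."""
    if not knowledge:
        return html.escape(text)
    # Match longer terms first so a short term cannot shadow a longer one.
    terms = sorted(knowledge, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between two matches.
        parts.append(html.escape(text[last:match.start()]))
        entry = knowledge[match.group(1)]
        # Each fact becomes an invisible <meta> element carrying a property.
        metas = "".join(
            f'<meta property="{html.escape(prop)}" content="{html.escape(value)}"/>'
            for prop, value in entry["facts"].items()
        )
        parts.append(
            f'<span about="{html.escape(entry["uri"])}" '
            f'typeof="{html.escape(entry["type"])}">'
            f"{html.escape(match.group(1))}{metas}</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def to_html(text):
    """Produce a small HTML fragment that declares the assumed 'dbo' prefix."""
    return (
        '<article prefix="dbo: http://dbpedia.org/ontology/">'
        f"<p>{annotate(text)}</p></article>"
    )
```

Calling `to_html("The US economy grew last year.")` yields a fragment in which a human reader still sees only the word "US", while a scraper that understands RDFa can read the attached population and area properties.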

Author information

Corresponding authors

Correspondence to Hamza Salem, Manuel Mazzara or Said Elnaffar.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Salem, H., Mazzara, M., Elnaffar, S. (2021). Automatically Injecting Semantic Annotations into Online Articles. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_61
