Abstract
Extracting information from the Web is a popular research topic among data analysts and scientists. Web scraping is a prominent way to do so: it parses HTML pages and extracts data from their embedded tags. To that end, software developers write customized scripts for each target website. Crafting such scripts is challenging because websites differ in structure and often render their content dynamically. An even more challenging task is to infer semantic annotations from a news article. For example, it would be useful if a term such as "the US" in an already published article were automatically annotated with its population and area. We argue that these annotations should not be mere hyperlinks for human readers but should be machine-readable, expressed in the Web standard RDF (Resource Description Framework). Embedding such RDF inside the HTML page of the news article should enrich it with automatically generated semantics that can boost its SEO (Search Engine Optimization) and are ready to be consumed by conventional Web scrapers. As a proof of concept, we built a prototype that works on the plain text of an article, without requiring structured tags or attributes, and generates a new Web document augmented with semantic annotations in RDF markup that is both readable by humans and machine-consumable by Web scrapers.
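To make the idea concrete, the following is a minimal sketch of embedding RDFa annotations into plain text. It is not the prototype described above. The lookup table `KNOWLEDGE_BASE`, the function names `annotate` and `to_document`, and the choice of DBpedia-style URIs and property names are assumptions made for illustration, and the property values are placeholders rather than real statistics.

```python
import html
import re

# Hypothetical knowledge base: surface form -> (resource URI, {property: value}).
# URI and property names follow DBpedia conventions (an assumption);
# the values are placeholders, not real figures.
KNOWLEDGE_BASE = {
    "the US": (
        "http://dbpedia.org/resource/United_States",
        {
            "dbo:populationTotal": "POPULATION_PLACEHOLDER",
            "dbo:areaTotal": "AREA_PLACEHOLDER",
        },
    ),
}


def annotate(text, kb=KNOWLEDGE_BASE):
    """Return an HTML fragment in which known mentions carry RDFa attributes."""
    if not kb:
        return html.escape(text)
    # Try longer surface forms first; \b keeps "the US" from matching "the USA".
    names = sorted(kb, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text between the previous match and this one.
        parts.append(html.escape(text[last:match.start()]))
        uri, props = kb[match.group(0)]
        metas = "".join(
            f'<meta property="{html.escape(p)}" content="{html.escape(v)}">'
            for p, v in props.items()
        )
        parts.append(
            f'<span about="{html.escape(uri)}">'
            f"{html.escape(match.group(0))}{metas}</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def to_document(text):
    """Wrap the annotated fragment in a container that declares the dbo prefix."""
    return (
        '<div prefix="dbo: http://dbpedia.org/ontology/">'
        f"<p>{annotate(text)}</p></div>"
    )


if __name__ == "__main__":
    print(to_document("Talks resumed in the US today."))
```

The visible text stays unchanged for human readers, while a scraper or RDFa parser can read the subject from the `about` attribute and the facts from the nested `meta` elements. A real system would replace the dictionary lookup with named-entity recognition and a query against a knowledge graph.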
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Salem, H., Mazzara, M., Elnaffar, S. (2021). Automatically Injecting Semantic Annotations into Online Articles. In: Barolli, L., Woungang, I., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2021. Lecture Notes in Networks and Systems, vol 227. Springer, Cham. https://doi.org/10.1007/978-3-030-75078-7_61
DOI: https://doi.org/10.1007/978-3-030-75078-7_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75077-0
Online ISBN: 978-3-030-75078-7
eBook Packages: Intelligent Technologies and Robotics (R0)