Abstract
Web content filtering is one of many techniques for limiting exposure to selected content on the Internet. Although filtering has become routine over time, filtering multilingual web content remains difficult, especially in the big data landscape. The sheer volume of data makes it harder to develop an effective content filtering system that works in real time. Several existing systems use artificial intelligence techniques to filter URLs and identify sites with objectionable content. Most of these systems classify only English-language URLs; when given multilingual URLs, they either fail to respond or over-block. This paper introduces a filtering system that classifies multilingual URLs based on predefined criteria for the URL, title, and metadata of a web page. Ontological approaches, together with local multilingual dictionaries, serve as the knowledge base for blocking URLs that do not meet the filtering criteria. The proposed system classifies multilingual URLs into two categories, white and black, with high accuracy. Evaluation on a large dataset shows that the system achieves promising accuracy, on a par with state-of-the-art results in the literature on semantic-based URL filtering.
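As a rough illustration of the idea of classifying a page as white or black from its URL, title, and metadata with per-language dictionaries, consider the minimal sketch below. It is not the authors' implementation (which, per the Notes, used Java SE and Protégé) and it omits the ontology entirely; the `BLOCKED_TERMS` dictionaries, the `tokenize` helper, and the `classify` function are hypothetical names with placeholder terms chosen only for the example.

```python
import re
from urllib.parse import urlsplit, unquote

# Hypothetical per-language dictionaries of blocked terms.
# These entries are placeholders, not taken from the paper.
BLOCKED_TERMS = {
    "en": {"casino", "gambling"},
    "es": {"apuestas"},
    "de": {"glücksspiel"},
}


def tokenize(text):
    """Lowercase the text and split it on non-word characters (Unicode-aware)."""
    return [token for token in re.split(r"[\W_]+", text.lower()) if token]


def classify(url, title="", metadata=""):
    """Return 'black' if any token from the URL, title, or metadata
    appears in one of the dictionaries, otherwise 'white'."""
    parts = urlsplit(url)
    # Decode percent-encoded characters so non-ASCII terms can be matched.
    url_text = unquote(f"{parts.netloc} {parts.path} {parts.query}")
    tokens = set(tokenize(url_text)) | set(tokenize(title)) | set(tokenize(metadata))
    for terms in BLOCKED_TERMS.values():
        if tokens & terms:
            return "black"
    return "white"
```

For example, `classify("https://example.com/juegos/apuestas-online")` falls into the black category because of the Spanish dictionary entry, while `classify("https://example.org/news/weather")` stays white. Matching whole tokens rather than substrings is one simple way to reduce over-blocking in a sketch like this; it is a design choice of the example, not a claim about the proposed system.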
Notes
All tests were carried out using Java SE and the Protégé 4.3 ontology editor. The test machine was a Core i7 2.2-GHz laptop with 8 GB of RAM and a 1 Mb internet connection.
Cite this article
Hussain, M., Ahmed, M., Khattak, H.A. et al. Towards ontology-based multilingual URL filtering: a big data problem. J Supercomput 74, 5003–5021 (2018). https://doi.org/10.1007/s11227-018-2338-1
DOI: https://doi.org/10.1007/s11227-018-2338-1