Abstract
Since the web is increasingly used by terrorist organizations for propaganda, disinformation, and other purposes, the ability to automatically detect terrorist-related content in multiple languages can be extremely useful. In this paper, we describe a new, classification-based approach to multi-lingual detection of terrorist documents. The proposed approach builds upon the recently developed graph-based web document representation model combined with the popular C4.5 decision-tree classification algorithm. Evaluation is performed on a collection of 648 web documents in the Arabic language. The results demonstrate that documents downloaded from several known terrorist sites can be reliably discriminated from the content of Arabic news reports using a simple decision tree.
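The combination of a graph-based document representation with decision-tree induction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified graph in which each unique term is a node and a directed edge links each pair of adjacent terms, treats the presence of an edge as a boolean feature, and selects a single split by gain ratio, the criterion C4.5 uses. The function names (doc_to_edges, gain_ratio, best_edge) and the toy English documents are hypothetical placeholders.

```python
import math
from collections import Counter


def doc_to_edges(text):
    """Assumed simplified graph: directed edges between adjacent distinct terms."""
    terms = text.lower().split()
    return {(a, b) for a, b in zip(terms, terms[1:]) if a != b}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(has_feature, labels):
    """Gain ratio of a boolean feature: information gain divided by split information."""
    parts = [
        [y for f, y in zip(has_feature, labels) if f],
        [y for f, y in zip(has_feature, labels) if not f],
    ]
    parts = [p for p in parts if p]
    if len(parts) < 2:
        return 0.0  # the feature does not split the data at all
    total = len(labels)
    remainder = sum(len(p) / total * entropy(p) for p in parts)
    split_info = -sum(len(p) / total * math.log2(len(p) / total) for p in parts)
    return (entropy(labels) - remainder) / split_info


def best_edge(docs, labels):
    """Return the edge whose presence gives the highest gain ratio (a one-level tree)."""
    graphs = [doc_to_edges(d) for d in docs]
    candidates = sorted(set().union(*graphs))
    return max(candidates, key=lambda e: gain_ratio([e in g for g in graphs], labels))


# Toy, hypothetical data standing in for the two document classes.
docs = [
    "market prices rise today",
    "market prices fall today",
    "join our cause now",
    "support our cause now",
]
labels = ["news", "news", "target", "target"]

if __name__ == "__main__":
    print(best_edge(docs, labels))
```

A full C4.5 tree would apply this selection recursively to each partition and then prune; the sketch stops at the root split to keep the link between graph features and the split criterion visible.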
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Last, M., Markov, A., Kandel, A. (2006). Multi-lingual Detection of Terrorist Content on the Web. In: Chen, H., Wang, FY., Yang, C.C., Zeng, D., Chau, M., Chang, K. (eds) Intelligence and Security Informatics. WISI 2006. Lecture Notes in Computer Science, vol 3917. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11734628_3
DOI: https://doi.org/10.1007/11734628_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33361-6
Online ISBN: 978-3-540-33362-3
eBook Packages: Computer Science, Computer Science (R0)