Abstract
Fraudulent websites that pose as legitimate sources of information, goods, products and services are proliferating and have resulted in losses of billions of dollars. Because of the many undesirable impacts of Internet fraud and scams, several studies and approaches have focused on identifying fraudulent websites, yet none of them has offered an efficient solution for suppressing these fraudulent activities. In this regard, this research proposes a fraudulent website detection model based on sentiment analysis of the textual content of a given website, natural language processing and supervised machine learning techniques. The proposed model consists of four primary phases: data acquisition, preprocessing, feature extraction and classification. A crawler is used to obtain data from the Internet, and the data is cleaned to remove non-discriminative noise and reshaped into the desired format. Next, meaningful and discriminative patterns are extracted. Finally, the classification phase applies supervised machine learning techniques to construct the fraudulent website detection model. This research employs 10-fold stratified cross-validation to validate the performance of the proposed model. Experimental results show that the proposed model, with a cross-validated accuracy of 97.67% and a false positive rate (FPR) of 3.49%, achieved satisfactory results and served the aim of this research.
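The evaluation protocol named in the abstract (stratified k-fold cross-validation, reported as accuracy and FPR) can be illustrated with a minimal, standard-library-only sketch. Nothing here is the authors' implementation: the function names, the keyword-based toy classifier, the example texts, and the choice of "fraudulent" as the positive class (label 1) are all assumptions made for illustration.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions.

    Indices of each class are dealt round-robin across the folds. This is
    deterministic; a real experiment would usually shuffle first.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy_and_fpr(y_true, y_pred, positive=1):
    """Return (accuracy, false positive rate); `positive` marks the fraud class (assumed)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


def train_keyword_model(texts, labels):
    """Hypothetical stand-in classifier: flag words seen only in fraudulent texts."""
    fraud_words, legit_words = set(), set()
    for text, label in zip(texts, labels):
        target = fraud_words if label == 1 else legit_words
        target.update(text.lower().split())
    cues = fraud_words - legit_words
    return lambda text: int(any(word in cues for word in text.lower().split()))


def cross_validate(texts, labels, train_fn, k=10):
    """Train on k-1 folds, predict the held-out fold, and pool all predictions."""
    y_true, y_pred = [], []
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([texts[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            y_true.append(labels[i])
            y_pred.append(model(texts[i]))
    return accuracy_and_fpr(y_true, y_pred)


if __name__ == "__main__":
    # Invented toy corpus, not data from the paper.
    texts = ["guaranteed profit wire money now"] * 20 + \
            ["browse our catalogue and contact support"] * 20
    labels = [1] * 20 + [0] * 20
    print(cross_validate(texts, labels, train_keyword_model, k=10))
```

Pooling the held-out predictions from all folds before computing the metrics is one common convention; averaging per-fold scores is another, and the abstract does not say which one the authors used.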
Acknowledgement
This work is supported by the Ministry of Higher Education (MOHE) and Research Management Centre (RMC) at the Universiti Teknologi Malaysia (UTM) under Fundamental Research Grant (FRGS) VOT R.J130000.7828.4F809.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Maktabar, M., Zainal, A., Maarof, M.A., Kassim, M.N. (2018). Content Based Fraudulent Website Detection Using Supervised Machine Learning Techniques. In: Abraham, A., Muhuri, P., Muda, A., Gandhi, N. (eds) Hybrid Intelligent Systems. HIS 2017. Advances in Intelligent Systems and Computing, vol 734. Springer, Cham. https://doi.org/10.1007/978-3-319-76351-4_30
DOI: https://doi.org/10.1007/978-3-319-76351-4_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76350-7
Online ISBN: 978-3-319-76351-4
eBook Packages: Engineering, Engineering (R0)