Abstract
Detecting automated users’ agent activities in a web application from its web access logs is a challenging problem. Many machine-learning-based solutions exist, but supervised methods depend heavily on fully labeled data, and labeled access-log data are scarce and costly to produce. Unsupervised solutions have also been proposed, but their accuracy is questionable. Semi-supervised self-training works with a small, partially labeled dataset, but it lacks both a suitable selection metric for identifying high-confidence predictions and a reliable base learner. In this paper, we propose a new semi-supervised self-training method, named DIstance-based SElf-Training (DISET), for detecting automated users’ agent activities. DISET combines probability-based selection criteria with Mahalanobis distance to select a high-confidence subset of predictions. The DISET framework works in four steps. First, data preprocessing performs data cleaning, session identification, feature extraction, and session labeling. Second, the data are segmented into labeled and unlabeled datasets. Third, model self-training performs the subset selection using six different supervised base learners independently. Fourth, the performance of the resulting model is tested. DISET is evaluated on the NASA95 and E-commerce weblog datasets using three-fold cross-validation for training and testing. The datasets are also divided into different ratios of labeled and unlabeled instances for the experiments. Performance is measured by accuracy, precision, recall, F1 score, and the Matthews Correlation Coefficient (MCC), and is compared across the six base classifiers.
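The self-training step can be illustrated with a minimal sketch. The abstract does not specify how the probability criterion and the Mahalanobis distance are combined, so the code below assumes one plausible reading: an unlabeled session is pseudo-labeled only if the predicted class probability is at least `p_min` and the session's Mahalanobis distance to the labeled sessions of that class is at most `d_max`. The function names, both thresholds, the two-feature sessions, and the simple centroid-based base learner are all hypothetical stand-ins, not the six base learners or settings used in the paper.

```python
import math

def class_stats(points):
    """Mean and inverse covariance of 2-D feature vectors (needs at least 2 points)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # A small ridge on the diagonal keeps the covariance matrix invertible.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1) + 1e-6
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1) + 1e-6
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis(p, mean, inv):
    """Mahalanobis distance of point p from a class with the given mean and inverse covariance."""
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(max(q, 0.0))

def predict_proba(stats, p):
    """Stand-in base learner (assumption): softmax over negative squared
    Euclidean distance to each class mean."""
    d2 = {c: (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for c, (m, _) in stats.items()}
    lo = min(d2.values())  # subtract the minimum for numerical stability
    w = {c: math.exp(-(d - lo)) for c, d in d2.items()}
    total = sum(w.values())
    return {c: v / total for c, v in w.items()}

def self_train(X, y, unlabeled, p_min=0.9, d_max=2.0, max_iter=10):
    """Grow the labeled set with pseudo-labels that pass both selection criteria."""
    X, y, pool = list(X), list(y), list(unlabeled)
    for _ in range(max_iter):
        # Refit the per-class statistics on the current labeled set.
        stats = {c: class_stats([x for x, l in zip(X, y) if l == c]) for c in set(y)}
        selected, rest = [], []
        for p in pool:
            proba = predict_proba(stats, p)
            label = max(proba, key=proba.get)
            mean, inv = stats[label]
            if proba[label] >= p_min and mahalanobis(p, mean, inv) <= d_max:
                selected.append((p, label))
            else:
                rest.append(p)
        if not selected:  # nothing confident enough: stop early
            break
        for p, label in selected:
            X.append(p)
            y.append(label)
        pool = rest
    return X, y, pool

# Hypothetical sessions with two made-up features; 0 = human, 1 = automated agent.
labeled = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
           (5.0, 5.0), (5.2, 4.8), (4.8, 5.1), (5.1, 5.3)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
unlabeled = [(1.1, 1.0), (4.9, 5.0), (3.0, 3.0)]
X_new, y_new, remaining = self_train(labeled, labels, unlabeled)
```

In this toy run the two sessions lying close to a labeled cluster are pseudo-labeled, while the ambiguous session midway between the clusters fails both criteria and stays in the unlabeled pool. Requiring both conditions is the design choice the sketch is meant to show: a probability alone can be high for a point far from all training data, and the distance check guards against that case.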
We also plot ROC and PR curves to confirm and compare the performance of the different base learners within the DISET method. Of the six base learners, XGBoost performed best on both datasets at the 30:70 data segmentation ratio. The results show that DISET achieves a minimum improvement of 1.91% in accuracy, 2.70% in precision, 3.65% in sensitivity (recall), and 1.00% in F1 score with large unlabeled datasets.
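For reference, the evaluation measures named in the abstract can all be derived from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall (sensitivity), F1 score, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC is conventionally set to 0 when any marginal sum is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Made-up counts for illustration only.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Unlike accuracy and F1 score, MCC uses all four cells of the confusion matrix, which makes it more informative when the human and automated classes are imbalanced.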
Ethics declarations
Conflicts of interest/Competing interests
None; we declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Jagat, R.R., Sisodia, D.S. & Singh, P. DISET: a distance based semi-supervised self-training for automated users’ agent activity detection from web access log. Multimed Tools Appl 82, 19853–19876 (2023). https://doi.org/10.1007/s11042-022-14258-0
DOI: https://doi.org/10.1007/s11042-022-14258-0