Abstract
Adversaries launch cyberattacks against entities such as healthcare and business institutions, and a successful attack results in a data breach. Attackers then publish their successes in public forums to boost their ranking. If these forums are analyzed properly, victim entities can be informed early about a breach. Although a few studies have already addressed this area, their data sets and code are not public. More importantly, the sources of their data sets no longer exist, which leaves their novelty unclear and their results unreliable. To address these concerns, this study reinvestigates the domain using Machine Learning, Ensemble Learning, and Deep Learning. A web crawler is developed to download the dataset from the public forum of the Nulled website. Features are extracted using TF-IDF and GloVe. Performance analysis shows that an SVM with a linear kernel achieved the best accuracy, up to 90.80%. The implementations are published on GitHub.
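As a rough illustration of the TF-IDF feature-extraction step named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation (that is in their GitHub repository); the whitespace tokenization, the smoothed IDF variant, the L2 normalization, the function name `tfidf`, and the example post titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors), where each vector is an L2-normalized TF-IDF row."""
    # Assumption: lowercase + whitespace tokenization; real forum text needs more cleaning.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of posts that contain each term at least once.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Guard against empty posts, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors


# Hypothetical forum post titles, invented for illustration only.
posts = [
    "selling leaked database dump with emails",
    "leaked combo list emails and passwords",
    "how to configure a game server",
]
vocab, vectors = tfidf(posts)
```

Terms that occur in fewer posts (such as "dump" here) receive a higher weight than terms shared across posts (such as "leaked"). Vectors of this kind could then be passed to a classifier such as a linear-kernel SVM.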
This research work is supported by University of Asia Pacific.
A. Younus, M. H. Al Kawser, and N. Adhikary contributed equally.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Adnan, M.A., Younus, A., Kawser, M.H.A., Adhikary, N., Habib, A., Haque, R.U. (2022). Identification of Data Breaches from Public Forums. In: Ryan, P.Y., Toma, C. (eds) Innovative Security Solutions for Information Technology and Communications. SecITC 2021. Lecture Notes in Computer Science, vol 13195. Springer, Cham. https://doi.org/10.1007/978-3-031-17510-7_4
DOI: https://doi.org/10.1007/978-3-031-17510-7_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17509-1
Online ISBN: 978-3-031-17510-7
eBook Packages: Computer Science (R0)