Identification of Data Breaches from Public Forums

  • Conference paper
Innovative Security Solutions for Information Technology and Communications (SecITC 2021)

Abstract

Adversaries launch cyberattacks against entities such as healthcare or business institutions, and a successful attack results in a data breach. They then publish their success stories in public forums to improve their ranking. Victim entities can be informed early about a data breach if these forums are analyzed properly. Although a few studies have already focused on this area, their datasets and code are not public. Most importantly, the sources of their datasets no longer exist, which makes their novelty unclear and unreliable. To address these concerns, this study reinvestigates the domain with Machine Learning, Ensemble Learning, and Deep Learning. A web crawler is developed to download the dataset from the public forum of the Nulled website. Feature extraction is done using TF-IDF and GloVe. Performance analysis shows that SVM with a linear kernel achieved up to 90.80% accuracy. The implementations are published via a GitHub link.
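To illustrate the TF-IDF feature extraction step named in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes whitespace tokenization and the plain weighting tf × log(N / df); the function names and the example post titles are hypothetical and are not drawn from the paper's dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's
    # preprocessing is not described in the abstract.
    return text.lower().split()


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical forum post titles, for illustration only.
posts = [
    "leaked database dump for sale",
    "selling leaked email password combo",
    "how to install a game mod",
]
vectors = tfidf(posts)
```

In this toy corpus, a term that occurs in only one post (such as "dump") receives a higher weight than a term shared by two posts (such as "leaked"), and a term present in every post would receive a weight of zero. Library vectorizers typically apply smoothing and normalization on top of this basic scheme, so their values will differ from this sketch.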

This research work is supported by the University of Asia Pacific.

A. Younus, M. H. Al Kawser, and N. Adhikary contributed equally.

Author information

Corresponding author

Correspondence to Rakib Ul Haque.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Adnan, M.A., Younus, A., Kawser, M.H.A., Adhikary, N., Habib, A., Haque, R.U. (2022). Identification of Data Breaches from Public Forums. In: Ryan, P.Y., Toma, C. (eds) Innovative Security Solutions for Information Technology and Communications. SecITC 2021. Lecture Notes in Computer Science, vol 13195. Springer, Cham. https://doi.org/10.1007/978-3-031-17510-7_4

  • DOI: https://doi.org/10.1007/978-3-031-17510-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17509-1

  • Online ISBN: 978-3-031-17510-7

  • eBook Packages: Computer Science (R0)
