Beneath the Cream: Unveiling Relevant Information Points from CrimeBB with Its Ground Truth Labels

  • Conference paper
Cyber Security, Cryptology, and Machine Learning (CSCML 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15349)

Abstract

In cybersecurity, identifying and mitigating the exploitation of vulnerabilities is crucial. Building on prior research that analyzes underground hacking forums, this study refines methods for detecting vulnerability exploitation in underground discussion forums. Using the CrimeBB dataset, previous works employed machine learning to extract insights, label textual information, build predictive models, and classify forum posts discussing Common Vulnerabilities and Exposures (CVEs). Recently, the PostCog framework was released to facilitate navigation of the CrimeBB data. The current study integrates the PostCog extension with ChatGPT and extends the labeling of posts by type, intent, and crime category to new classes such as Proof-of-Concept (PoC), Weaponization, and Exploitation. Additionally, using the SHAP explanation method, we find that keywords frequently found in the text, such as “fud”, “sell”, “buy”, and “pm”, have emerged as significant indicators of exploitation.
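To make the keyword-attribution idea in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a binary bag-of-words representation and a small logistic regression trained from scratch, and it uses the closed-form SHAP values for a linear model with independent features, phi_i = w_i * (x_i - E[x_i]) (see [13]). The posts, labels, function names, and hyperparameters are all invented for illustration; none of them come from CrimeBB or from the paper.

```python
import math

# Toy, invented posts (NOT taken from CrimeBB); label 1 = exploitation-related.
POSTS = [
    ("selling fud crypter pm me", 1),
    ("buy fud exploit pm for price", 1),
    ("want to sell private exploit pm", 1),
    ("how does this cve work", 0),
    ("tutorial on buffer overflow basics", 0),
    ("anyone tested this patch", 0),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


vocab = sorted({tok for text, _ in POSTS for tok in tokenize(text)})


def vectorize(text):
    """Binary bag-of-words vector over the toy vocabulary."""
    present = set(tokenize(text))
    return [1.0 if tok in present else 0.0 for tok in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def margin(w, b, x):
    """Linear score (log-odds) of the model for feature vector x."""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def train_logreg(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent on the logistic loss, zero-initialised."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(margin(w, b, xi)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def linear_shap(w, x, background):
    """SHAP values of a linear model on the log-odds scale:
    phi_i = w_i * (x_i - mean of feature i over the background set)."""
    means = [sum(col) / len(background) for col in zip(*background)]
    return [wj * (xj - mj) for wj, xj, mj in zip(w, x, means)]


X = [vectorize(text) for text, _ in POSTS]
y = [label for _, label in POSTS]
w, b = train_logreg(X, y)

x = vectorize("sell fud exploit pm")   # hypothetical unseen post
phi = linear_shap(w, x, X)

# Print the five tokens with the largest absolute contribution.
for tok, val in sorted(zip(vocab, phi), key=lambda kv: -abs(kv[1]))[:5]:
    print(f"{tok:10s} {val:+.3f}")
```

In this sketch the contributions sum exactly to the difference between the post's log-odds score and the average score over the background posts, which is the additivity property that makes SHAP values interpretable as per-keyword evidence. A real study would use a proper tokenizer, a weighting scheme such as TF-IDF [22], and an established SHAP implementation rather than this hand-rolled version.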

Notes

  1. The title of this paper is a reference to [15], indicating that it is a follow-up.

  2. A preliminary version of this work was presented at the Cambridge Cybercrime Conference: https://www.cambridgecybercrime.uk/conference2024.html.

  3. The full version of this paper is available at https://tinyurl.com/cscmlmoreno.

  4. In [15] we already had all of the expert labels considered in this work, but we used only three of them (exploitation, PoC, and weaponization).

  5. Accounting for slang and abbreviations that are typical in those communities is left as a subject for future work.

  6. Library to preprocess raw NLP texts: https://github.com/fmorenovr/nlpToolkit.

  7. A fully undetectable (FUD) exploit is an exploit that is not detected by any of the known antivirus tools.

References

  1. Ahmed, T., Devanbu, P.: Few-shot training LLMs for project-specific code-summarization. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–5 (2022)

  2. Allodi, L.: Economic factors of vulnerability trade and exploitation. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1483–1499 (2017)

  3. Anderson, R., et al.: Measuring the changing cost of cybercrime. In: The 2019 Workshop on the Economics of Information Security (2019)

  4. Basheer, R., Alkhatib, B.: Threats from the dark: a review over dark web investigation research for cyber threat intelligence. J. Comput. Netw. Commun. 2021, 1–21 (2021)

  5. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)

  6. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)

  7. Caines, A., Pastrana, S., Hutchings, A., Buttery, P.: Automatically identifying the function and intent of posts in underground forums. Crime Sci. 7, 19 (2018)

  8. Campobasso, M., Allodi, L.: Threat/crawl: a trainable, highly-reusable, and extensible automated method and tool to crawl criminal underground forums. In: APWG eCrime 2022 (2022). arXiv:2212.03641

  9. Chen, D.D., Woo, M., Brumley, D., Egele, M.: Towards automated dynamic analysis for Linux-based embedded firmware. In: Network and Distributed System Security Symposium (2016)

  10. Deguara, N., et al.: Threat miner: a text analysis engine for threat identification using dark web data. In: Big Data, pp. 3043–3052 (2022)

  11. Edkrantz, M., Truvé, S., Said, A.: Predicting vulnerability exploits in the wild. In: 2015 IEEE 2nd International Conference on Cyber Security and Cloud Computing, pp. 513–514 (2015)

  12. Liang, H., Pei, X., Jia, X., Shen, W., Zhang, J.: Fuzzing: state of the art. IEEE Trans. Reliab. 67, 1199–1218 (2018)

  13. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Neural Information Processing Systems (2017)

  14. Moreno-Vera, F.: Inferring discussion topics about exploitation of vulnerabilities from underground hacking forums. In: ICTC, pp. 816–821 (2023)

  15. Moreno-Vera, F., et al.: Cream skimming the underground: identifying relevant information points from online forums. In: 2023 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 66–71 (2023)

  16. OpenAI: ChatGPT: chat generative pre-trained transformer (2024). https://www.openai.com/chatgpt

  17. Pastrana, S., Hutchings, A., et al.: Measuring ewhoring. In: Proceedings of the Internet Measurement Conference, pp. 463–477 (2019)

  18. Pastrana, S., Thomas, D.R., et al.: CrimeBB: enabling cybercrime research on underground forums at scale. In: Proceedings of the 2018 World Wide Web Conference, pp. 1845–1854 (2018)

  19. Pete, I., et al.: PostCog: a tool for interdisciplinary research into underground forums at scale. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 93–104 (2022)

  20. Rahman, M.R., et al.: What are the attackers doing now? Automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: a survey. ACM Comput. Surv. 55(12), 1–36 (2021)

  21. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: Explaining the predictions of any classifier. In: SIGKDD (2016)

  22. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523 (1988)

  23. Siu, G.A., Collier, B., Hutchings, A.: Follow the money: the relationship between currency exchange and illicit behaviour in an underground forum. In: EuroS&PW, pp. 191–201 (2021)

  24. Speybroeck, N.: Classification and regression trees. Int. J. Public Health 57, 243–246 (2012)

  25. Tikhonov, A.N.: On the stability of inverse problems. In: Dokl. Akad. Nauk SSSR, vol. 39, pp. 195–198 (1943)

Acknowledgment

This project was sponsored by CAPES, CNPq, and FAPERJ (315110/2020-1, E-26/211.144/2019, and E-26/201.376/2021).

Author information

Corresponding author

Correspondence to Felipe Moreno-Vera.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Moreno-Vera, F., Menasché, D.S., Lima, C. (2025). Beneath the Cream: Unveiling Relevant Information Points from CrimeBB with Its Ground Truth Labels. In: Dolev, S., Elhadad, M., Kutyłowski, M., Persiano, G. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2024. Lecture Notes in Computer Science, vol 15349. Springer, Cham. https://doi.org/10.1007/978-3-031-76934-4_19

  • DOI: https://doi.org/10.1007/978-3-031-76934-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-76933-7

  • Online ISBN: 978-3-031-76934-4

  • eBook Packages: Computer Science, Computer Science (R0)
