Abstract
Identifying and mitigating the exploitation of vulnerabilities is a central task in cybersecurity. Building on prior research that analyzes underground hacking forums, this study refines methods for detecting discussions of vulnerability exploitation in those forums. Previous work applied machine learning to the CrimeBB dataset to extract insights, label textual information, build predictive models, and classify forum posts that discuss Common Vulnerabilities and Exposures (CVEs). More recently, the PostCog framework was released to facilitate navigation of the CrimeBB data. The current study integrates the PostCog extension with ChatGPT to enhance the labeling of posts by type, intent, and crime category with new classifications such as Proof-of-Concept (PoC), Weaponization, and Exploitation. In addition, using the SHAP explanation method, we find that keywords appearing frequently in the text, such as “fud”, “sell”, “buy”, and “pm”, are significant indicators of exploitation.
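As a rough illustration of the keyword-attribution idea mentioned above, the following sketch trains a tiny bag-of-words logistic regression and computes SHAP values in closed form. It is not the authors' pipeline: the posts are invented toy examples (not CrimeBB data), the logistic regression is a stand-in classifier, and all function names are hypothetical. For a linear model with features treated as independent, the SHAP value of feature j is w_j (x_j − E[x_j]) in log-odds space, which is what `linear_shap` computes.

```python
import math

# Hypothetical toy posts, invented for illustration (NOT taken from CrimeBB).
# Label 1 = exploitation-related, 0 = other.
POSTS = [
    ("selling fud crypter pm me to buy", 1),
    ("fud exploit for sale pm for price", 1),
    ("want to buy fud rat pm me", 1),
    ("sell private exploit pm only", 1),
    ("how do i patch this cve on my server", 0),
    ("tutorial on setting up a firewall", 0),
    ("which linux distro is best for learning", 0),
    ("help my server keeps crashing after update", 0),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (slang normalization is out of scope)."""
    return text.lower().split()


def vectorize(text, vocab):
    """Binary bag-of-words vector: 1.0 if the word occurs in the text."""
    present = set(tokenize(text))
    return [1.0 if word in present else 0.0 for word in vocab]


def train_logreg(X, y, epochs=500, lr=0.5, l2=0.01):
    """Batch gradient descent for L2-regularized logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def linear_shap(x, w, mean_x):
    """SHAP values of a linear model, assuming independent features."""
    return [wj * (xj - mj) for wj, xj, mj in zip(w, x, mean_x)]


vocab = sorted({tok for text, _ in POSTS for tok in tokenize(text)})
X = [vectorize(text, vocab) for text, _ in POSTS]
y = [label for _, label in POSTS]
w, b = train_logreg(X, y)

# Background expectation E[x_j], estimated from the training posts.
mean_x = [sum(col) / len(X) for col in zip(*X)]

query = "fud exploit pm me to buy"
x_query = vectorize(query, vocab)
phi = dict(zip(vocab, linear_shap(x_query, w, mean_x)))

# Words pushing the query post most strongly toward the exploitation class.
for word, value in sorted(phi.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{word:10s} {value:+.3f}")
```

The attributions sum to the difference between the model's log-odds for the query post and its average log-odds over the background posts, which is the additivity property that makes SHAP values comparable across keywords.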
Notes
1. The title of this paper is a reference to [15], indicating that it is a follow-up.
2. A preliminary version of this work was presented at the Cambridge Cybercrime Conference: https://www.cambridgecybercrime.uk/conference2024.html.
3. The full version of this paper is available at https://tinyurl.com/cscmlmoreno.
4. In [15] we already had all expert labels considered in this work, but we used only three of them (exploitation, PoC and weaponization).
5. Accounting for slang and abbreviations that are typical in those communities is left as a subject for future work.
6. Library to preprocess NLP raw texts: https://github.com/fmorenovr/nlpToolkit.
7. A fully undetectable exploit is an exploit that is not detected by any of the known antivirus tools.
References
Ahmed, T., Devanbu, P.: Few-shot training LLMs for project-specific code-summarization. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–5 (2022)
Allodi, L.: Economic factors of vulnerability trade and exploitation. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1483–1499 (2017)
Anderson, R., et al.: Measuring the changing cost of cybercrime. In: The 2019 Workshop on the Economics of Information Security (2019)
Basheer, R., Alkhatib, B.: Threats from the dark: a review over dark web investigation research for cyber threat intelligence. J. Comput. Netw. Commun. 2021, 1–21 (2021)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Caines, A., Pastrana, S., Hutchings, A., Buttery, P.: Automatically identifying the function and intent of posts in underground forums. Crime Sci. 7, 19 (2018)
Campobasso, M., Allodi, L.: Threat/crawl: a trainable, highly-reusable, and extensible automated method and tool to crawl criminal underground forums. In: APWG eCrime 2022 (2022). arXiv:2212.03641
Chen, D.D., Woo, M., Brumley, D., Egele, M.: Towards automated dynamic analysis for Linux-based embedded firmware. In: Network and Distributed System Security Symposium (2016)
Deguara, N., et al.: Threat miner: a text analysis engine for threat identification using dark web data. In: Big Data, pp. 3043–3052 (2022)
Edkrantz, M., Truvé, S., Said, A.: Predicting vulnerability exploits in the wild. In: 2015 IEEE 2nd International Conference on Cyber Security and Cloud Computing, pp. 513–514 (2015)
Liang, H., Pei, X., Jia, X., Shen, W., Zhang, J.: Fuzzing: state of the art. IEEE Trans. Reliab. 67, 1199–1218 (2018)
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Neural Information Processing Systems (2017)
Moreno-Vera, F.: Inferring discussion topics about exploitation of vulnerabilities from underground hacking forums. In: ICTC, pp. 816–821 (2023)
Moreno-Vera, F., et al.: Cream skimming the underground: identifying relevant information points from online forums. In: 2023 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 66–71 (2023)
OpenAI: ChatGPT: chat generative pre-trained transformer (2024). https://www.openai.com/chatgpt
Pastrana, S., Hutchings, A., et al.: Measuring ewhoring. In: Proceedings of the Internet Measurement Conference, pp. 463–477 (2019)
Pastrana, S., Thomas, D.R., et al.: CrimeBB: enabling cybercrime research on underground forums at scale. In: Proceedings of the 2018 World Wide Web Conference, pp. 1845–1854 (2018)
Pete, I., et al.: PostCog: a tool for interdisciplinary research into underground forums at scale. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 93–104 (2022)
Rahman, M.R., et al.: What are the attackers doing now? Automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: a survey. ACM Comput. Surv. 55(12), 1–36 (2021)
Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: Explaining the predictions of any classifier. In: SIGKDD (2016)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523 (1988)
Siu, G.A., Collier, B., Hutchings, A.: Follow the money: the relationship between currency exchange and illicit behaviour in an underground forum. In: EuroS&PW, pp. 191–201 (2021)
Speybroeck, N.: Classification and regression trees. Int. J. Public Health 57, 243–246 (2012)
Tikhonov, A.N.: On the stability of inverse problems. In: Dokl. Akad. Nauk SSSR, vol. 39, pp. 195–198 (1943)
Acknowledgment
This project was sponsored by CAPES, CNPq, and FAPERJ (315110/2020-1, E-26/211.144/2019, and E-26/201.376/2021).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Moreno-Vera, F., Menasché, D.S., Lima, C. (2025). Beneath the Cream: Unveiling Relevant Information Points from CrimeBB with Its Ground Truth Labels. In: Dolev, S., Elhadad, M., Kutyłowski, M., Persiano, G. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2024. Lecture Notes in Computer Science, vol 15349. Springer, Cham. https://doi.org/10.1007/978-3-031-76934-4_19
DOI: https://doi.org/10.1007/978-3-031-76934-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-76933-7
Online ISBN: 978-3-031-76934-4
eBook Packages: Computer Science, Computer Science (R0)