Beneath the Cream: Unveiling Relevant Information Points from CrimeBB with Its Ground Truth Labels

Moreno-Vera, Felipe; Menasché, Daniel Sadoc; Lima, Cabral

doi:10.1007/978-3-031-76934-4_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15349)

Included in the following conference series:

International Symposium on Cyber Security, Cryptology, and Machine Learning

Abstract

In the realm of cybersecurity, identifying and mitigating the exploitation of vulnerabilities is crucial. Building on prior research that analyzes underground hacking forums, this study refines methodologies for detecting vulnerability exploitation within underground discussion forums. Using the CrimeBB dataset, previous works employed machine learning approaches to extract insights, label textual information, build predictive models, and classify forum posts discussing Common Vulnerabilities and Exposures (CVE). Recently, the PostCog framework was released to facilitate navigation of the CrimeBB data. The current study integrates the PostCog extension with ChatGPT, enhancing the labeling of posts by type, intent, and crime category into new classifications such as Proof-of-Concept (PoC), Weaponization, and Exploitation. Additionally, using the SHAP explanation method, we uncover insights into the keywords frequently found in the text—such as “fud”, “sell”, “buy”, and “pm”—which have emerged as significant indicators of exploitation.
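
The abstract does not show the authors' pipeline, so the following is only a minimal sketch of how keyword-level attributions of this kind can be computed. It assumes a TF-IDF representation, a small logistic regression trained by gradient descent, and the closed-form SHAP values of a linear model under feature independence, phi_i = w_i * (x_i - mean(x_i)). The posts, labels, function names, and hyperparameters are all hypothetical and are not taken from CrimeBB or from the paper.

```python
import math
from collections import Counter

# Toy posts with made-up labels (1 = exploitation-related). NOT CrimeBB data.
POSTS = [
    ("selling fud crypter pm me to buy", 1),
    ("fud exploit for sale pm for price", 1),
    ("how do i learn python for beginners", 0),
    ("anyone know a good tutorial on linux", 0),
]


def tfidf(docs):
    """Return (vocabulary, rows) using term frequency times smoothed IDF."""
    tokens = [d.split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokens for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokens:
        tf = Counter(doc)
        rows.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, rows


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            grad = 1.0 / (1.0 + math.exp(-z)) - yi  # d(log-loss)/dz
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b


def linear_shap(w, X, x):
    """SHAP values of a linear margin, assuming independent features."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [wj * (xj - mj) for wj, xj, mj in zip(w, x, means)]


vocab, X = tfidf([text for text, _ in POSTS])
w, b = train_logreg(X, [label for _, label in POSTS])

# Explain the first post: which tokens push it toward the positive class?
phi = linear_shap(w, X, X[0])
top = sorted(zip(vocab, phi), key=lambda kv: kv[1], reverse=True)[:5]
for token, value in top:
    print(f"{token:10s} {value:+.4f}")
```

In this toy setting, tokens that occur only in the positively labelled posts, such as "fud", receive positive attributions for the first post. The attributions sum to the difference between that post's margin and the average margin over the background posts, which is the additivity property that SHAP guarantees.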

Notes

1. The title of this paper is a reference to [15], indicating that it is a follow-up.
2. A preliminary version of this work was presented at the Cambridge Cybercrime Conference (https://www.cambridgecybercrime.uk/conference2024.html).
3. The full version of this paper is available at https://tinyurl.com/cscmlmoreno.
4. In [15] we already had all the expert labels considered in this work, but we used only three of them (exploitation, PoC, and weaponization).
5. Accounting for the slang and abbreviations typical of those communities is left for future work.
6. Library for preprocessing raw NLP texts: https://github.com/fmorenovr/nlpToolkit.
7. A fully undetectable (FUD) exploit is one that is not detected by any known antivirus tool.

References

1. Ahmed, T., Devanbu, P.: Few-shot training LLMs for project-specific code-summarization. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–5 (2022)
2. Allodi, L.: Economic factors of vulnerability trade and exploitation. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1483–1499 (2017)
3. Anderson, R., et al.: Measuring the changing cost of cybercrime. In: The 2019 Workshop on the Economics of Information Security (2019)
4. Basheer, R., Alkhatib, B.: Threats from the dark: a review over dark web investigation research for cyber threat intelligence. J. Comput. Netw. Commun. 2021, 1–21 (2021)
5. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM (1992)
6. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
7. Caines, A., Pastrana, S., Hutchings, A., Buttery, P.: Automatically identifying the function and intent of posts in underground forums. Crime Sci. 7, 19 (2018)
8. Campobasso, M., Allodi, L.: Threat/crawl: a trainable, highly-reusable, and extensible automated method and tool to crawl criminal underground forums. In: APWG eCrime 2022 (2022). arXiv:2212.03641
9. Chen, D.D., Woo, M., Brumley, D., Egele, M.: Towards automated dynamic analysis for Linux-based embedded firmware. In: Network and Distributed System Security Symposium (2016)
10. Deguara, N., et al.: Threat miner: a text analysis engine for threat identification using dark web data. In: Big Data, pp. 3043–3052 (2022)
11. Edkrantz, M., Truvé, S., Said, A.: Predicting vulnerability exploits in the wild. In: 2015 IEEE 2nd International Conference on Cyber Security and Cloud Computing, pp. 513–514 (2015)
12. Liang, H., Pei, X., Jia, X., Shen, W., Zhang, J.: Fuzzing: state of the art. IEEE Trans. Reliab. 67, 1199–1218 (2018)
13. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Neural Information Processing Systems (2017)
14. Moreno-Vera, F.: Inferring discussion topics about exploitation of vulnerabilities from underground hacking forums. In: ICTC, pp. 816–821 (2023)
15. Moreno-Vera, F., et al.: Cream skimming the underground: identifying relevant information points from online forums. In: 2023 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 66–71 (2023)
16. OpenAI: ChatGPT: chat generative pre-trained transformer (2024). https://www.openai.com/chatgpt
17. Pastrana, S., Hutchings, A., et al.: Measuring ewhoring. In: Proceedings of the Internet Measurement Conference, pp. 463–477 (2019)
18. Pastrana, S., Thomas, D.R., et al.: CrimeBB: enabling cybercrime research on underground forums at scale. In: Proceedings of the 2018 World Wide Web Conference, pp. 1845–1854 (2018)
19. Pete, I., et al.: PostCog: a tool for interdisciplinary research into underground forums at scale. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 93–104 (2022)
20. Rahman, M.R., et al.: What are the attackers doing now? Automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: a survey. ACM Comput. Surv. 55(12), 1–36 (2021)
21. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: Explaining the predictions of any classifier. In: SIGKDD (2016)
22. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523 (1988)
23. Siu, G.A., Collier, B., Hutchings, A.: Follow the money: the relationship between currency exchange and illicit behaviour in an underground forum. In: EuroS&PW, pp. 191–201 (2021)
24. Speybroeck, N.: Classification and regression trees. Int. J. Public Health 57, 243–246 (2012)
25. Tikhonov, A.N.: On the stability of inverse problems. In: Dokl. Akad. Nauk SSSR, vol. 39, pp. 195–198 (1943)

Acknowledgment

This project was sponsored by CAPES, CNPq, and FAPERJ (315110/2020-1, E-26/211.144/2019, and E-26/201.376/2021).

Author information

Authors and Affiliations

Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil
Felipe Moreno-Vera, Daniel Sadoc Menasché & Cabral Lima

Corresponding author

Correspondence to Felipe Moreno-Vera.

Editor information

Editors and Affiliations

Ben-Gurion University of the Negev, Beer Sheva, Israel
Shlomi Dolev
Ben-Gurion University of the Negev, Be'er Sheva, Israel
Michael Elhadad
NASK National Research Institute, Warszawa, Poland
Mirosław Kutyłowski
Università di Salerno, Fisciano SA, Italy
Giuseppe Persiano

About this paper

Cite this paper

Moreno-Vera, F., Menasché, D.S., Lima, C. (2025). Beneath the Cream: Unveiling Relevant Information Points from CrimeBB with Its Ground Truth Labels. In: Dolev, S., Elhadad, M., Kutyłowski, M., Persiano, G. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2024. Lecture Notes in Computer Science, vol 15349. Springer, Cham. https://doi.org/10.1007/978-3-031-76934-4_19

DOI: https://doi.org/10.1007/978-3-031-76934-4_19
Published: 12 December 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-76933-7
Online ISBN: 978-3-031-76934-4
eBook Packages: Computer Science, Computer Science (R0)
