From Cracked Accounts to Fake IDs: User Profiling on German Telegram Black Market Channels

  • Conference paper
Data Management Technologies and Applications (DATA 2022, DATA 2021)

Abstract

Messenger apps like WhatsApp and Telegram are frequently used for everyday communication, but they can also serve as a platform for illegal activity. Telegram allows public groups with up to 200,000 participants. Criminals use these public groups to trade illegal commodities and services, which is a concern for law enforcement agencies, who manually monitor suspicious activity in these chat rooms. This research demonstrates how natural language processing (NLP) can assist in analyzing these chat rooms, providing an explorative overview of the domain and facilitating purposeful analyses of user behavior. We provide a publicly available corpus of text messages from four self-proclaimed black market chat rooms, annotated with entities and relations. Our pipeline approach aggregates the product attributes extracted from user messages into profiles and uses these profiles, together with the products each user sells, as features for clustering. The extracted structured information is the foundation for further data exploration, such as identifying the top vendors or fine-grained price analyses. Our evaluation shows that pretrained word vectors perform better for unsupervised clustering than state-of-the-art transformer models, while the latter remain superior for sequence labeling.
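The aggregation and clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract indicates that the paper builds its features from word vectors, whereas this stand-in uses plain category counts per user, and it replaces a library clusterer with a naive average-linkage agglomerative procedure written in pure Python. The vendor names, product categories, and the helper names `build_profiles` and `agglomerative` are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical extraction output: one (user, product category) pair per offer.
EXTRACTIONS = [
    ("vendor_a", "accounts"), ("vendor_a", "accounts"), ("vendor_a", "vpn"),
    ("vendor_b", "accounts"), ("vendor_b", "vpn"),
    ("vendor_c", "documents"), ("vendor_c", "documents"),
    ("vendor_d", "documents"), ("vendor_d", "accounts"), ("vendor_d", "documents"),
]


def build_profiles(extractions):
    """Aggregate per-message extractions into one category-count vector per user."""
    counts = defaultdict(Counter)
    for user, category in extractions:
        counts[user][category] += 1
    categories = sorted({category for _, category in extractions})
    return {user: [cnt[c] for c in categories] for user, cnt in counts.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; profiles always hold at least one non-zero count."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative(profiles, n_clusters):
    """Naive average-linkage clustering: merge the closest pair until n_clusters remain."""
    clusters = [[user] for user in sorted(profiles)]

    def linkage(pair):
        left, right = pair
        dists = [cosine_distance(profiles[a], profiles[b]) for a in left for b in right]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(sorted(left + right))
    return sorted(clusters)


profiles = build_profiles(EXTRACTIONS)
clusters = agglomerative(profiles, n_clusters=2)
print(clusters)
```

On this toy data the two vendors who mostly offer accounts and VPN access end up in one cluster and the two who mostly offer documents in the other. With real data, the count vectors would be swapped for embedding-based features and the loop for a library implementation of agglomerative clustering.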

Notes

  1. https://github.com/Abuesgen/From-Cracked-Accounts-to-Fake-IDs.git.

  2. A Google search for the keyphrase “telegram groups” returns many results for search engines.

  3. https://core.telegram.org/api/mentions.

  4. https://core.telegram.org/.

  5. The lower boundary of the second-best score.

  6. https://mlco2.github.io/impact#compute.

  7. Following Krippendorff’s alpha.

  8. At the time of writing this paper, the monthly premium package prices are 5.29 € for NordVPN and, for Sky in Germany, 30.00 € during the first year and 66.90 € afterward.

References

  1. Sklearn.cluster.AgglomerativeClustering. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html. Accessed 01 Mar 2022

  2. T-Systems-onsite/cross-en-de-roberta-sentence-transformer · Hugging Face. https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer. Accessed 14 Dec 2022

  3. Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: an easy-to-use framework for state-of-the-art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 54–59. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-4010. https://aclanthology.org/N19-4010

  4. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics, Santa Fe, August 2018. https://aclanthology.org/C18-1139

  5. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 2623–2631. Association for Computing Machinery, New York, July 2019. https://doi.org/10.1145/3292500.3330701

  6. Baravalle, A., Lopez, M.S., Lee, S.W.: Mining the dark web: drugs and fake ids. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 350–356, December 2016. https://doi.org/10.1109/ICDMW.2016.0056

  7. Benikova, D., Biemann, C., Reznicek, M.: NoSta-D named entity annotation for German: guidelines and dataset. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2524–2531. European Language Resources Association (ELRA), Reykjavik, May 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/276_Paper.pdf

  8. Bitkom: Neun von zehn Internetnutzern verwenden Messenger | Bitkom Main (2018). http://www.bitkom.org/Presse/Presseinformation/Neun-von-zehn-Internetnutzern-verwenden-Messenger.html. Accessed 18 Feb 2022

  9. Blankers, M., van der Gouwe, D., Stegemann, L., Smit-Rigter, L.: Changes in online psychoactive substance trade via telegram during the COVID-19 pandemic. Eur. Addict. Res. 27(6), 469–474 (2021). https://doi.org/10.1159/000516853. https://www.karger.com/Article/FullText/516853

  10. Büsgen, A., Klöser, L., Kohl, P., Schmidts, O., Kraft, B., Zündorf, A.: Exploratory analysis of chat-based black market profiles with natural language processing. In: Proceedings of the 11th International Conference on Data Science, Technology and Applications, pp. 83–94. SCITEPRESS - Science and Technology Publications, Lisbon (2022). https://doi.org/10.5220/0011271400003269. https://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0011271400003269

  11. Camacho-Collados, J., Doval, Y., Martínez-Cámara, E., Espinosa-Anke, L., Barbieri, F., Schockaert, S.: Learning cross-lingual embeddings from Twitter via distant supervision, March 2020. http://arxiv.org/abs/1905.07358

  12. Chan, B., Schweter, S., Möller, T.: German’s next language model. arXiv:2010.10906 [cs], December 2020

  13. Chauhan, P., Sharma, N., Sikka, G.: The emergence of social media data and sentiment analysis in election prediction. J. Ambient Intell. Human. Comput. 12(2), 2601–2627 (2021). https://doi.org/10.1007/s12652-020-02423-y

  14. Christin, N.: Traveling the silk road: a measurement analysis of a large anonymous online marketplace. In: Proceedings of the 22nd International Conference on World Wide Web (2013). https://doi.org/10.1145/2488388.2488408

  15. Dangi, D., Dixit, D.K., Bhagat, A.: Sentiment analysis of COVID-19 social media data through machine learning. Multimedia Tools Appl. 81(29), 42261–42283 (2022). https://doi.org/10.1007/s11042-022-13492-w

  16. Dargahi Nobari, A., Sarraf, M., Neshati, M., Daneshvar, F.: Characteristics of viral messages on Telegram; the world’s largest hybrid public and private messenger. Expert Syst. Appl. 168, 114303 (2020). https://doi.org/10.1016/j.eswa.2020.114303

  17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019

  18. Doddington, G., Mitchell, A., Przybocki, M.A., Ramshaw, L., Strassel, S., Weischedel, R.: The automatic content extraction (ACE) program - tasks, data, and evaluation. In: International Conference on Language Resources and Evaluation (2004). https://www.semanticscholar.org/paper/The-Automatic-Content-Extraction-(ACE)-Program-and-Doddington-Mitchell/0617dd6924df7a3491c299772b70e90507b195dc

  19. Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training, June 2021. https://doi.org/10.3233/FAIA200321. http://arxiv.org/abs/1909.07755

  20. Gomathi, C.: Social tagging system for community detecting using NLP technique. Int. J. Res. Appl. Sci. Eng. Technol. 6, 1665–1671 (2018). https://doi.org/10.22214/ijraset.2018.4279

  21. Griffith, V., Xu, Y., Ratti, C.: Graph theoretic properties of the darkweb. arXiv:1704.07525 [cs] (2017)

  22. Hennig, L., Truong, P.T., Gabryszak, A.: MobIE: a German dataset for named entity recognition, entity linking and relation extraction in the mobility domain. In: Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), pp. 223–227. KONVENS 2021 Organizers, Düsseldorf (2021). https://aclanthology.org/2021.konvens-1.22

  23. Hoseini, M., Melo, P., Benevenuto, F., Feldmann, A., Zannettou, S.: On the globalization of the QAnon conspiracy theory through Telegram. ArXiv, May 2021. https://www.semanticscholar.org/paper/On-the-Globalization-of-the-QAnon-Conspiracy-Theory-Hoseini-Melo/1b0f3a6da334b898ddb070657c980349d31be4e2

  24. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991 [cs], August 2015

  25. Jin, D., et al.: A survey of community detection approaches: from statistical modeling to deep learning. IEEE Trans. Knowl. Data Eng. 35(2), 1149–1170 (2021). https://doi.org/10.1109/TKDE.2021.3104155. https://ieeexplore.ieee.org/document/9511798/

  26. Kartal, G.: What’s up with WhatsApp? a critical analysis of mobile instant messaging research in language learning. Int. J. Contemp. Educ. Res. 6(2), 352–365 (2019). https://doi.org/10.33200/ijcer.599138. https://dergipark.org.tr/en/doi/10.33200/ijcer.599138

  27. Klöser, L., Kohl, P., Kraft, B., Zündorf, A.: Multi-attribute relation extraction (MARE) - simplifying the application of relation extraction. In: Proceedings of the 2nd International Conference on Deep Learning Theory and Applications, pp. 148–156 (2021). https://doi.org/10.5220/0010559201480156. http://arxiv.org/abs/2111.09035

  28. Krippendorff, K.: Reliability. In: Content Analysis: An Introduction to Its Methodology, Revised edition. Sage Publications Inc., Los Angeles, April 2012

  29. Lacoste, A., Luccioni, A., Schmidt, V., Dandres, T.: Quantifying the carbon emissions of machine learning. arXiv:1910.09700 [cs], November 2019

  30. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159 (1977). https://doi.org/10.2307/2529310. https://www.jstor.org/stable/2529310?origin=crossref

  31. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html

  32. McLean, G., Osei-Frimpong, K.: Examining satisfaction with the experience during a live chat service encounter-implications for website providers. Comput. Hum. Behav. 76, 494–508 (2017). https://doi.org/10.1016/j.chb.2017.08.005. https://linkinghub.elsevier.com/retrieve/pii/S0747563217304727

  33. Naseri, M., Zamani, H.: Analyzing and predicting news popularity in an instant messaging service. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1053–1056, July 2019. https://doi.org/10.1145/3331184.3331301

  34. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006). https://doi.org/10.1103/PhysRevE.74.036104. http://arxiv.org/abs/physics/0605087

  35. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011). http://jmlr.org/papers/v12/pedregosa11a.html

  36. Sang, E.F.T.K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv:cs/0306050, Jun 2003

  37. Su, X., et al.: A comprehensive survey on community detection with deep learning. IEEE Trans. Neural Netw. Learn. Syst. 1–21 (2022). https://doi.org/10.1109/TNNLS.2021.3137396. https://ieeexplore.ieee.org/document/9732192/

  38. Subhashini, L.D.C.S., Li, Y., Zhang, J., Atukorale, A.S., Wu, Y.: Mining and classifying customer reviews: a survey. Artif. Intell. Rev. 54(8), 6343–6389 (2021). https://doi.org/10.1007/s10462-021-09955-5

  39. Tsao, S.F., Chen, H., Tisseverasinghe, T., Yang, Y., Li, L., Butt, Z.A.: What social media told us in the time of COVID-19: a scoping review. Lancet Digit. Health 3(3), e175–e194 (2021). https://doi.org/10.1016/S2589-7500(20)30315-0. https://linkinghub.elsevier.com/retrieve/pii/S2589750020303150

  40. Vajjala, S., Majumder, B., Gupta, A., Surana, H.: Social media. In: Practical Natural Language Processing. O’Reilly Media, Inc., June 2020. https://www.oreilly.com/library/view/practical-natural-language/9781492054047/

  41. Wattenberg, M., Viégas, F., Johnson, I.: How to use t-SNE effectively. Distill 1(10), e2 (2016). https://doi.org/10.23915/distill.00002. http://distill.pub/2016/misread-tsne

  42. Zhang, X., et al.: TwHIN-BERT: a socially-enriched pre-trained language model for multilingual tweet representations, September 2022. https://doi.org/10.48550/arXiv.2209.07562. http://arxiv.org/abs/2209.07562

  43. Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 50–61 (2021). https://doi.org/10.18653/v1/2021.naacl-main.5. https://aclanthology.org/2021.naacl-main.5

Author information

Corresponding authors

Correspondence to André Büsgen, Lars Klöser, Philipp Kohl, Oliver Schmidts, Bodo Kraft or Albert Zündorf.

Appendix

Fig. 12. Affinity difference matrix between all product categories using the fine-tuned FastText embeddings. Equation 1 describes the computation yielding this matrix.

Fig. 13. We generate an HTML site for the aggregated message information. The translated excerpt shows general information about the user and an example product with prices and counts. If vendors offer the same product in several messages, we aggregate all these messages by the product. The interested viewer can inspect each message from which the NLP approach extracted information. We changed the visualization slightly for the compressed presentation.
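The per-product aggregation mentioned in the caption of Fig. 13 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the record layout, the field names, the prices, and the helper `aggregate_by_product` are hypothetical, and the rendering of the result as an HTML page is omitted.

```python
from collections import defaultdict

# Hypothetical extracted offers: one record per product mention in a message.
OFFERS = [
    {"message_id": 101, "vendor": "vendor_a", "product": "NordVPN", "price_eur": 3.0},
    {"message_id": 102, "vendor": "vendor_a", "product": "NordVPN", "price_eur": 2.5},
    {"message_id": 102, "vendor": "vendor_a", "product": "Sky", "price_eur": 10.0},
    {"message_id": 201, "vendor": "vendor_b", "product": "Sky", "price_eur": 8.0},
]


def aggregate_by_product(offers, vendor):
    """Group one vendor's offers by product, keeping prices and source messages."""
    grouped = defaultdict(lambda: {"prices": [], "message_ids": []})
    for offer in offers:
        if offer["vendor"] != vendor:
            continue
        entry = grouped[offer["product"]]
        entry["prices"].append(offer["price_eur"])
        entry["message_ids"].append(offer["message_id"])
    # Summarise each product: how often it was offered and at which prices.
    return {
        product: {
            "count": len(entry["prices"]),
            "min_price": min(entry["prices"]),
            "max_price": max(entry["prices"]),
            "message_ids": entry["message_ids"],
        }
        for product, entry in grouped.items()
    }


summary = aggregate_by_product(OFFERS, "vendor_a")
for product, info in sorted(summary.items()):
    print(product, info)
```

Keeping the `message_ids` alongside each aggregate mirrors the caption's point that a viewer can trace every aggregated value back to the messages it was extracted from.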

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Büsgen, A., Klöser, L., Kohl, P., Schmidts, O., Kraft, B., Zündorf, A. (2023). From Cracked Accounts to Fake IDs: User Profiling on German Telegram Black Market Channels. In: Cuzzocrea, A., Gusikhin, O., Hammoudi, S., Quix, C. (eds) Data Management Technologies and Applications. DATA 2022, DATA 2021. Communications in Computer and Information Science, vol 1860. Springer, Cham. https://doi.org/10.1007/978-3-031-37890-4_9

  • DOI: https://doi.org/10.1007/978-3-031-37890-4_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-37889-8

  • Online ISBN: 978-3-031-37890-4

  • eBook Packages: Computer Science, Computer Science (R0)
