Abstract
Messenger apps like WhatsApp and Telegram are frequently used for everyday communication, but they can also serve as a platform for illegal activity. Telegram allows public groups with up to 200,000 participants. Criminals use these public groups to trade illegal commodities and services, which is a concern for law enforcement agencies, who monitor suspicious activity in these chat rooms manually. This research demonstrates how natural language processing (NLP) can assist in analyzing these chat rooms by providing an exploratory overview of the domain and facilitating targeted analyses of user behavior. We provide a publicly available corpus of text messages from four self-proclaimed black market chat rooms, annotated with entities and relations. Our pipeline approach aggregates the product attributes extracted from user messages into profiles and uses these profiles, together with the products each user sells, as features for clustering. The extracted structured information is the foundation for further data exploration, such as identifying the top vendors or performing fine-grained price analyses. Our evaluation shows that pretrained word vectors perform better for unsupervised clustering than state-of-the-art transformer models, while the latter are still superior for sequence labeling.
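The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that entity and relation extraction has already produced one record per offered product; the record fields (`user`, `product`, `price`), the helper names, and all values are hypothetical and invented for illustration, not taken from the published corpus or pipeline. The clustering step is omitted here.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical extraction output: one record per (message, product) pair.
# Field names and values are invented for illustration only.
extractions = [
    {"user": "vendor_a", "product": "streaming account", "price": 5.0},
    {"user": "vendor_a", "product": "vpn account", "price": 2.0},
    {"user": "vendor_b", "product": "streaming account", "price": 7.0},
    {"user": "vendor_a", "product": "streaming account", "price": 6.0},
    {"user": "vendor_c", "product": "fake id", "price": 150.0},
]


def build_profiles(records):
    """Aggregate per-message extractions into one profile per user."""
    profiles = defaultdict(
        lambda: {"offers": Counter(), "prices": defaultdict(list)}
    )
    for rec in records:
        profile = profiles[rec["user"]]
        profile["offers"][rec["product"]] += 1
        profile["prices"][rec["product"]].append(rec["price"])
    return dict(profiles)


def top_vendors(profiles, n=3):
    """Rank users by their total number of extracted offers."""
    totals = Counter(
        {user: sum(p["offers"].values()) for user, p in profiles.items()}
    )
    return totals.most_common(n)


def median_prices(profiles):
    """Compute the median offered price per product across all profiles."""
    by_product = defaultdict(list)
    for p in profiles.values():
        for product, prices in p["prices"].items():
            by_product[product].extend(prices)
    return {product: median(prices) for product, prices in by_product.items()}


profiles = build_profiles(extractions)
print(top_vendors(profiles, n=1))
print(median_prices(profiles))
```

In such a sketch, each profile's offer counts could afterwards be turned into a feature vector for clustering; how the original pipeline builds and embeds these features is not specified in the abstract.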
Notes
- 1.
- 2.
A Google query for the keyphrase “telegram groups” returns many results for search engines.
- 3.
- 4.
- 5.
The lower boundary of the second-best score.
- 6.
- 7.
Following Krippendorff’s alpha.
- 8.
At the time of writing this paper, the monthly premium package prices are 5.29 € for NordVPN and, for Sky in Germany, 30.00 € for the first year and 66.90 € afterward.
References
Sklearn.cluster.AgglomerativeClustering. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html. Accessed 01 Mar 2022
T-Systems-onsite/cross-en-de-roberta-sentence-transformer · Hugging Face. https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer. Accessed 14 Dec 2022
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: an easy-to-use framework for state-of-the-art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 54–59. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-4010. https://aclanthology.org/N19-4010
Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics, Santa Fe, August 2018. https://aclanthology.org/C18-1139
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, pp. 2623–2631. Association for Computing Machinery, New York, July 2019. https://doi.org/10.1145/3292500.3330701
Baravalle, A., Lopez, M.S., Lee, S.W.: Mining the dark web: drugs and fake ids. In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pp. 350–356, December 2016. https://doi.org/10.1109/ICDMW.2016.0056
Benikova, D., Biemann, C., Reznicek, M.: NoSta-D named entity annotation for German: guidelines and dataset. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2524–2531. European Language Resources Association (ELRA), Reykjavik, May 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/276_Paper.pdf
Bitkom: Neun von zehn Internetnutzern verwenden Messenger | Bitkom Main (2018). http://www.bitkom.org/Presse/Presseinformation/Neun-von-zehn-Internetnutzern-verwenden-Messenger.html. Accessed 18 Feb 2022
Blankers, M., van der Gouwe, D., Stegemann, L., Smit-Rigter, L.: Changes in online psychoactive substance trade via telegram during the COVID-19 pandemic. Eur. Addict. Res. 27(6), 469–474 (2021). https://doi.org/10.1159/000516853. https://www.karger.com/Article/FullText/516853
Büsgen, A., Klöser, L., Kohl, P., Schmidts, O., Kraft, B., Zündorf, A.: Exploratory analysis of chat-based black market profiles with natural language processing. In: Proceedings of the 11th International Conference on Data Science, Technology and Applications, pp. 83–94. SCITEPRESS - Science and Technology Publications, Lisbon (2022). https://doi.org/10.5220/0011271400003269. https://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0011271400003269
Camacho-Collados, J., Doval, Y., Martínez-Cámara, E., Espinosa-Anke, L., Barbieri, F., Schockaert, S.: Learning cross-lingual embeddings from Twitter via distant supervision, March 2020. http://arxiv.org/abs/1905.07358
Chan, B., Schweter, S., Möller, T.: German’s next language model. arXiv:2010.10906 [cs], December 2020
Chauhan, P., Sharma, N., Sikka, G.: The emergence of social media data and sentiment analysis in election prediction. J. Ambient Intell. Human. Comput. 12(2), 2601–2627 (2021). https://doi.org/10.1007/s12652-020-02423-y
Christin, N.: Traveling the silk road: a measurement analysis of a large anonymous online marketplace. In: Proceedings of the 22nd International Conference on World Wide Web (2013). https://doi.org/10.1145/2488388.2488408
Dangi, D., Dixit, D.K., Bhagat, A.: Sentiment analysis of COVID-19 social media data through machine learning. Multimedia Tools Appl. 81(29), 42261–42283 (2022). https://doi.org/10.1007/s11042-022-13492-w
Dargahi Nobari, A., Sarraf, M., Neshati, M., Daneshvar, F.: Characteristics of viral messages on Telegram; the world’s largest hybrid public and private messenger. Expert Syst. Appl. 168, 114303 (2020). https://doi.org/10.1016/j.eswa.2020.114303
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019
Doddington, G., Mitchell, A., Przybocki, M.A., Ramshaw, L., Strassel, S., Weischedel, R.: The automatic content extraction (ACE) program - tasks, data, and evaluation. In: International Conference on Language Resources and Evaluation (2004). https://www.semanticscholar.org/paper/The-Automatic-Content-Extraction-(ACE)-Program-and-Doddington-Mitchell/0617dd6924df7a3491c299772b70e90507b195dc
Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training, June 2021. https://doi.org/10.3233/FAIA200321. http://arxiv.org/abs/1909.07755
Gomathi, C.: Social tagging system for community detecting using NLP technique. Int. J. Res. Appl. Sci. Eng. Technol. 6, 1665–1671 (2018). https://doi.org/10.22214/ijraset.2018.4279
Griffith, V., Xu, Y., Ratti, C.: Graph theoretic properties of the darkweb. arXiv:1704.07525 [cs] (2017)
Hennig, L., Truong, P.T., Gabryszak, A.: MobIE: a German dataset for named entity recognition, entity linking and relation extraction in the mobility domain. In: Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), pp. 223–227. KONVENS 2021 Organizers, Düsseldorf (2021). https://aclanthology.org/2021.konvens-1.22
Hoseini, M., Melo, P., Benevenuto, F., Feldmann, A., Zannettou, S.: On the globalization of the QAnon conspiracy theory through Telegram. ArXiv, May 2021. https://www.semanticscholar.org/paper/On-the-Globalization-of-the-QAnon-Conspiracy-Theory-Hoseini-Melo/1b0f3a6da334b898ddb070657c980349d31be4e2
Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991 [cs], August 2015
Jin, D., et al.: A survey of community detection approaches: from statistical modeling to deep learning. IEEE Trans. Knowl. Data Eng. 35(2), 1149–1170 (2021). https://doi.org/10.1109/TKDE.2021.3104155. https://ieeexplore.ieee.org/document/9511798/
Kartal, G.: What’s up with WhatsApp? a critical analysis of mobile instant messaging research in language learning. Int. J. Contemp. Educ. Res. 6(2), 352–365 (2019). https://doi.org/10.33200/ijcer.599138. https://dergipark.org.tr/en/doi/10.33200/ijcer.599138
Klöser, L., Kohl, P., Kraft, B., Zündorf, A.: Multi-attribute relation extraction (MARE) - simplifying the application of relation extraction. In: Proceedings of the 2nd International Conference on Deep Learning Theory and Applications, pp. 148–156 (2021). https://doi.org/10.5220/0010559201480156. http://arxiv.org/abs/2111.09035
Krippendorff, K.: Reliability. In: Content Analysis: An Introduction to Its Methodology, Revised edition. Sage Publications Inc., Los Angeles, April 2012
Lacoste, A., Luccioni, A., Schmidt, V., Dandres, T.: Quantifying the carbon emissions of machine learning. arXiv:1910.09700 [cs], November 2019
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159 (1977). https://doi.org/10.2307/2529310. https://www.jstor.org/stable/2529310?origin=crossref
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(86), 2579–2605 (2008). http://jmlr.org/papers/v9/vandermaaten08a.html
McLean, G., Osei-Frimpong, K.: Examining satisfaction with the experience during a live chat service encounter-implications for website providers. Comput. Hum. Behav. 76, 494–508 (2017). https://doi.org/10.1016/j.chb.2017.08.005. https://linkinghub.elsevier.com/retrieve/pii/S0747563217304727
Naseri, M., Zamani, H.: Analyzing and predicting news popularity in an instant messaging service. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1053–1056, July 2019. https://doi.org/10.1145/3331184.3331301
Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006). https://doi.org/10.1103/PhysRevE.74.036104. http://arxiv.org/abs/physics/0605087
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011). http://jmlr.org/papers/v12/pedregosa11a.html
Sang, E.F.T.K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv:cs/0306050, Jun 2003
Su, X., et al.: A comprehensive survey on community detection with deep learning. IEEE Trans. Neural Netw. Learn. Syst. 1–21 (2022). https://doi.org/10.1109/TNNLS.2021.3137396. https://ieeexplore.ieee.org/document/9732192/
Subhashini, L.D.C.S., Li, Y., Zhang, J., Atukorale, A.S., Wu, Y.: Mining and classifying customer reviews: a survey. Artif. Intell. Rev. 54(8), 6343–6389 (2021). https://doi.org/10.1007/s10462-021-09955-5
Tsao, S.F., Chen, H., Tisseverasinghe, T., Yang, Y., Li, L., Butt, Z.A.: What social media told us in the time of COVID-19: a scoping review. Lancet Digit. Health 3(3), e175–e194 (2021). https://doi.org/10.1016/S2589-7500(20)30315-0. https://linkinghub.elsevier.com/retrieve/pii/S2589750020303150
Vajjala, S., Majumder, B., Gupta, A., Surana, H.: Social media. In: Practical Natural Language Processing. O’Reilly Media, Inc., June 2020. https://www.oreilly.com/library/view/practical-natural-language/9781492054047/
Wattenberg, M., Viégas, F., Johnson, I.: How to use t-SNE effectively. Distill 1(10), e2 (2016). https://doi.org/10.23915/distill.00002. http://distill.pub/2016/misread-tsne
Zhang, X., et al.: TwHIN-BERT: a socially-enriched pre-trained language model for multilingual tweet representations, September 2022. https://doi.org/10.48550/arXiv.2209.07562. http://arxiv.org/abs/2209.07562
Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 50–61 (2021). https://doi.org/10.18653/v1/2021.naacl-main.5. https://aclanthology.org/2021.naacl-main.5
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Büsgen, A., Klöser, L., Kohl, P., Schmidts, O., Kraft, B., Zündorf, A. (2023). From Cracked Accounts to Fake IDs: User Profiling on German Telegram Black Market Channels. In: Cuzzocrea, A., Gusikhin, O., Hammoudi, S., Quix, C. (eds) Data Management Technologies and Applications. DATA 2022, DATA 2021. Communications in Computer and Information Science, vol 1860. Springer, Cham. https://doi.org/10.1007/978-3-031-37890-4_9
DOI: https://doi.org/10.1007/978-3-031-37890-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37889-8
Online ISBN: 978-3-031-37890-4
eBook Packages: Computer Science, Computer Science (R0)