Low-Frequency Aware Unsupervised Detection of Dark Jargon Phrases on Social Platforms

Huang, Limei; Wang, Shanshan; Liu, Changlin; Cao, Xueyang; Han, Yadi; Liu, Shaolei; Chen, Zhenxiang

doi:10.1007/978-981-99-7022-3_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14326)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

With the development of the Internet, the number of people communicating on social platforms has soared, so platform moderators must review and remove illegal content to keep the network environment clean for users. Identifying such content is complicated by the use of dark jargon: seemingly innocent or newly coined words and phrases, such as “coke” for cocaine or “vanilla sky” for synthetic cathinone, that convey illegal meanings in order to evade detection by moderators. Existing methods primarily detect dark jargon at the word level, with commendable results. However, phrase-level dark jargon is prevalent, and relying solely on word-level detection can introduce ambiguity. For example, “black” is not dark jargon, but “black bart” is. As a result, there is growing interest in techniques that specifically target phrase-level dark jargon detection. Such efforts remain relatively limited, and many low-frequency dark jargon phrases may be overlooked. To tackle this challenge, we propose the Low-Frequency Aware Dark Jargon Phrases Detection (DJPD) model. Our approach centers on finding a Transformer-based noun-phrase attention map pattern that enhances the perception of low-frequency phrases, enabling the selection of candidate dark jargon phrases. The sentence-level context of each candidate is then analyzed to detect dark jargon phrases. Our model achieves an 84.66% improvement in F1-score over the current state-of-the-art method for dark jargon phrase detection on the corpus.
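The word-level ambiguity described in the abstract can be illustrated with a toy lookup. This is a minimal sketch and not the DJPD model: it assumes a small hypothetical lexicon built from the abstract's three examples, and the names `JARGON_PHRASES`, `word_level_hits`, and `phrase_level_hits` are invented here for illustration. It only shows why matching single tokens misses “black bart”, while matching token n-grams finds it.

```python
# Toy illustration only: a hypothetical lexicon lookup, not the DJPD model.
# The entries are the three examples given in the abstract.
JARGON_PHRASES = {("coke",), ("vanilla", "sky"), ("black", "bart")}
MAX_LEN = max(len(p) for p in JARGON_PHRASES)


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def word_level_hits(tokens):
    """Flag only single tokens that are jargon on their own."""
    return [t for t in tokens if (t,) in JARGON_PHRASES]


def phrase_level_hits(tokens):
    """Greedy longest-match scan over token n-grams."""
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if gram in JARGON_PHRASES:
                hits.append(" ".join(gram))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no n-gram starting here matched
    return hits


tokens = tokenize("selling black bart and vanilla sky tonight")
print(word_level_hits(tokens))    # []
print(phrase_level_hits(tokens))  # ['black bart', 'vanilla sky']
```

A fixed lexicon like this cannot discover unseen or low-frequency phrases, which is the gap the abstract says the proposed model addresses.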

Notes

1. https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/

Acknowledgements

This work was supported by the Shandong Provincial Key R&D Program of China under Grant No. 2021SFGC0401, the TaiShan Scholars Program under Grant No. tsqnz20221146, the Project of Shandong Province Higher Educational Youth Innovation Science and Technology Program under Grant No. 2019KJN028, the Natural Science Foundation of Shandong Province of China under Grant No. ZR2023QF096, and the National Natural Science Foundation of China under Grant No. 61972176.

Author information

Authors and Affiliations

School of Information Science and Engineering, University of Jinan, Jinan, China
Limei Huang, Shanshan Wang, Changlin Liu, Xueyang Cao, Yadi Han, Shaolei Liu & Zhenxiang Chen

Corresponding author

Correspondence to Zhenxiang Chen.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Fenrong Liu
SEEK Limited, Cremorne, NSW, Australia
Arun Anand Sadanandan
MIMOS (Malaysia), Kuala Lumpur, Malaysia
Duc Nghia Pham
Universitas Indonesia, Depok, Indonesia
Petrus Mursanto
Tabcorp Holdings Limited, Melbourne, VIC, Australia
Dickson Lukose

About this paper

Cite this paper

Huang, L. et al. (2024). Low-Frequency Aware Unsupervised Detection of Dark Jargon Phrases on Social Platforms. In: Liu, F., Sadanandan, A.A., Pham, D.N., Mursanto, P., Lukose, D. (eds) PRICAI 2023: Trends in Artificial Intelligence. PRICAI 2023. Lecture Notes in Computer Science, vol 14326. Springer, Singapore. https://doi.org/10.1007/978-981-99-7022-3_18

DOI: https://doi.org/10.1007/978-981-99-7022-3_18
Published: 10 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7021-6
Online ISBN: 978-981-99-7022-3
eBook Packages: Computer Science, Computer Science (R0)
