

In recent years, the darknet, a hidden part of the deep web associated with illicit activities, has attracted study because of the myths and mysteries surrounding it. Contemporary research applies thematic analysis techniques to uncover the topics actually present in this network, which is essential for cybercrime prevention and legal action. However, the dynamic and anonymous nature of the darknet makes it difficult to navigate the Tor network effectively in order to obtain and analyze samples from hidden sites. This paper presents a new approach to studying the darknet. Assuming limited prior knowledge of the original topics, a contextual relation-comparison technique based on TinyBERT, a compact transformer language model, is used to generate super topics from previously identified hidden sites. From these super topics, keywords with contextual scores and weights are extracted and serve as input for a sensor that navigates the Tor network and aggregates new hidden sites. These sites are processed through semi-supervised learning to form clusters of sub-topics. Labels are propagated to each sub-topic based on its similarity to the super topics, and the sub-topics are finally classified in a fine-tuning layer of TinyBERT. The results demonstrate the identification of twelve classes of sub-topics in the darknet, related to drugs, hacking, marketplaces, pornography, and other areas, with a classification accuracy of 95.45%.
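
To make the label-propagation step concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that each super topic and each sub-topic cluster is represented by an embedding vector, and that a cluster inherits the label of its most similar super topic when the cosine similarity reaches a threshold. The function names, the threshold value, and the three-dimensional toy vectors (stand-ins for TinyBERT embeddings) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(super_topics, cluster_centroids, threshold=0.5):
    """Give each cluster the label of its most similar super topic.

    super_topics: dict mapping a label to its embedding vector (hypothetical).
    cluster_centroids: list of sub-topic cluster embedding vectors (hypothetical).
    threshold: assumed minimum similarity; below it the cluster stays "unlabeled".
    """
    labels = []
    for centroid in cluster_centroids:
        best_label, best_sim = max(
            ((label, cosine(centroid, vec)) for label, vec in super_topics.items()),
            key=lambda pair: pair[1],
        )
        labels.append(best_label if best_sim >= threshold else "unlabeled")
    return labels


# Toy vectors standing in for real embeddings (assumption, not data from the study).
super_topics = {"drugs": [1.0, 0.0, 0.0], "hacking": [0.0, 1.0, 0.0]}
centroids = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
labels = propagate_labels(super_topics, centroids)
print(labels)
```

In this sketch, clusters left "unlabeled" would be the candidates for new sub-topic classes; how such clusters are actually handled is not specified here.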