
Transformer-Based Topic Modeling for Urdu Translations of the Holy Quran

Published: 24 October 2024

Abstract

Topic modeling enables the discovery of concealed themes and patterns in extensive text collections. It facilitates a thorough examination of the messages present in religious texts. Topic modeling for Quranic verses is a trending study area, with various translations already explored including Bahasa, English, and Arabic. Yet, there is a need for further research, particularly in Urdu translations of the Quran. In this study, we propose applying the BERTopic framework to Urdu translations of the Holy Quran. By leveraging the BERTopic approach, which incorporates a fine-tuned BERT model, we aim to capture the contextual nuances and linguistic complexities unique to the Quran. In this study, we utilized existing Urdu translations of the Quran from eight different translators sourced from Tanzil, a renowned resource for Quranic text and translations. We assessed the performance of our proposed BERTopic model compared to traditional techniques like LDA and NMF, using coherence and diversity metrics. The results indicate that our BERT-based approach outperforms these conventional methods, achieving an average coherence improvement of 0.03 and a diversity score of 0.83. These findings highlight the effectiveness of BERTopic in extracting meaningful topics from Urdu translations of the Holy Quran and contribute to the computational analysis of religious texts, supporting scholarly endeavors in comparative studies of Quranic translations in Urdu.

1 Introduction

The Quran, a religious text revealed to the Prophet Muhammad (Peace Be Upon Him) over 14 centuries ago, holds immense significance. Muslims widely read it for its spiritual importance and comprehensive guidance in addressing various challenges across social, economic, and religious aspects of life. Because the religion is embraced by diverse nations speaking different languages, its texts are translated into numerous languages to guide its followers. However, translating a spiritual text can lead to varying interpretations, shaped by each translator's background and understanding.
Scholars delve into Quranic verses based on their areas of interest. For instance, some scholars focus on “mercy” in the Quran to identify verses, passages, and interpretations that emphasize mercy. Manually identifying topics and conducting in-depth studies from different Quranic translations is time-consuming because the meanings and ideas of different translators overlap from verse to verse and from chapter to chapter [Al Ghamdi and Khan 2022]. Hence, an automatic method is needed to identify topics from multiple translations of religious texts. One solution to this problem is topic modeling, which extracts hidden topics from an extensive collection of documents [Alhawarat 2015]. It allows scholars to gain valuable insights for focused research [Siddiqui et al. 2013]. Topic modeling has been applied to Quranic translations in various languages such as English [Al Ghamdi and Khan 2022], Arabic [Alshammeri et al. 2021b], and Indonesian [Rolliawati et al. 2020]. Still, such techniques are not available for Urdu translations of religious texts.
Urdu, a morphologically rich but resource-poor language, poses challenges due to its numerous word derivations and inflections [Khalid et al. 2017]. Unlike English, Urdu is written from right to left and follows the Nastaliq writing style, similar to Arabic [Alhawarat and Hegazi 2018; Al Qudah et al. 2022]. Therefore, many essential resources and accurate text processing toolkits developed for other languages cannot be directly applied to Urdu [Daud et al. 2017]. Limited research on Urdu topic modeling has mainly focused on classical approaches [Shakeel et al. 2018; Rehman et al. 2018; Amin et al. 2020]. Topic modeling on religious texts is more challenging than on general texts because a single topic may recur in different places in the Quran with varying details and contexts. Additionally, one verse may contain multiple topics, and the variations across Quranic translations compound the difficulty. Although traditional techniques such as Latent Semantic Analysis [Landauer et al. 1998], Probabilistic Latent Semantic Analysis [Hofmann 1999], Nonnegative Matrix Factorization (NMF) [Lee and Seung 2000], and Latent Dirichlet Allocation (LDA) [Blei et al. 2003] are widely used and have provided valuable insights, they often struggle to capture the contextual nuances and semantic representations present in Urdu translations of the Quranic text [Vayansky and Kumar 2020]. In recent years, Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al. 2019] has shown promise for context-aware language understanding and has been successfully applied in English and Arabic. This highlights the need for further research to leverage the potential of such models for Urdu translations of the Quran.
In this work, by leveraging BERT contextual understanding, we have proposed a topic modeling technique to analyze themes, concepts, and underlying meanings in the Urdu translations of the Quran. Our contributions lie in exploring the adaptability of BERT-based models to the intricacies of the Urdu language and Quranic text and in capturing the nuanced semantics of Urdu Quranic translations, thus offering a valuable resource for scholars, researchers, and practitioners of religious studies. Through rigorous experimentation, we demonstrate the potential of these models, therefore advancing the state of the art in text analysis for religious texts in Urdu while addressing the existing research gap.
The contributions of this work are outlined as follows:
We develop the Quran Urdu Topic Modeling Corpus (Quran-UTM), which comprises eight esteemed Urdu translations of the Quran by widely recognized translators. To focus on the actual content for topic modeling, irrelevant information such as metadata, punctuation, and diacritics is removed from each Urdu translation. The corpus and source code are publicly available for the research community.
We have developed a framework for context-aware topic modeling using BERT.
We have assessed the performance of our BERT-based model against traditional methods, such as LDA and NMF. The results show that our BERT-based approach outperforms these conventional methods, achieving significantly higher coherence and diversity scores.
This approach has the potential to bridge the existing gap in Quranic topic modeling for Urdu translations and unlock new insights into the Quran’s profound message in the Urdu language.
The rest of the article is organized as follows. Section 2 presents a comparative review of existing work in topic modeling. Section 3 presents the Quran-UTM dataset. Section 4 presents the proposed methodology, and Section 5 discusses the experimental setup and presents the obtained results. Section 6 provides a discussion and analysis of the results, and finally, Section 7 concludes the study and offers suggestions for future research directions.

2 Related Work

Several studies have been done on topic modeling of Quranic text and its translations in recent years. While most of these studies have focused on the original Arabic script [Abuzayed and Al-Khalifa 2021] and English translations [Alshammeri et al. 2021b] of the Holy Quran, some have also explored other domains. This section provides an overview of previous works that deal with traditional topic modeling techniques for Quranic text and its translations and the application of context-aware advanced methods.

2.1 Classical Techniques for Topic Modeling

Classical techniques for topic modeling have been employed for text in multiple languages, including English, Arabic, and Urdu.
Rahman et al. [2018] aim to classify topics in each verse of Surah-Al-Baqarah from the Quran using machine learning techniques, particularly employing Support Vector Machine (SVM) as the primary algorithm. A comparative analysis with other classification algorithms is conducted. The research contributes to developing a computational environment for text mining the Quran and enabling efficient topic identification. However, there is a need for further investigation to explore additional methods or approaches that can enhance the accuracy and comprehensiveness of topic classification in Islamic literature.
Siddiqui et al. [2015] apply a probabilistic topic modeling algorithm, LDA, to discover the thematic structure of the Quran. By considering each Surah as a document, the paper aims to automatically identify the hidden thematic structure of the Quran, which contains vast information addressing various aspects of human life. Using the Arabic Quran as the corpus, the study successfully identifies significant themes in the Surahs and determines the most important terms associated with these themes.
Seifollahi et al. [2021] introduce a two-stage algorithm for topic modeling, addressing the limitations of traditional methods. While LDA relies on simplistic bag-of-words assumptions, this innovative approach leverages word embeddings and co-occurrence patterns. The first stage determines topic–word distributions by soft-clustering embedded n-grams from documents. In the second stage, it computes document–topic distributions by sampling topics from the topic–word distributions. This method capitalizes on word embeddings, providing a more accurate representation. Nonetheless, the article does not explicitly address potential constraints or difficulties linked to the proposed approach, leaving an opportunity for further investigation in future research.
Alshammeri et al. [2021a] employ the Natural Language Processing (NLP) method Doc2vec to detect semantic-based similarity between verses of the Quran. By mapping Arabic Quranic verses to numerical vectors encoding their semantic properties, the model achieves 76% accuracy in similarity detection compared to annotated textual similarity datasets. These findings provide a foundation for further research in Quranic semantic analysis, although improvements are needed to enhance the model’s performance and robustness.

2.2 Context-aware Techniques for Topic Modeling

In recent years, researchers have focused on using context-aware techniques for topic modeling for text in various languages.
Abuzayed and Al-Khalifa [2021] conduct an experimental study on BERT for Arabic Topic Modeling, specifically focusing on the effectiveness of the BERTopic technique. The study aims to explore the application of BERTopic using different Pre-trained Arabic Language Models as embeddings. By evaluating the results against traditional techniques like LDA and NMF, the study highlights the superior performance of BERTopic in uncovering latent topics in Arabic text. The evaluation metric was the Normalized Pointwise Mutual Information (NPMI) measure. This research contributes to the literature by emphasizing the potential of BERT-based techniques for Arabic topic modeling and addressing a research gap in this area.
Alhaj et al. [2022] propose a deep-learning-based approach using BERTopic to improve the classification of cognitive distortions in Arabic content on Twitter. By enriching text representation with latent topics derived from BERTopic, the study achieves improved performance compared to baseline models. However, more research is needed to understand the specific impact of BERTopic on cognitive distortion classification in Arabic on social media platforms. Further investigation is required to explore this aspect and provide deeper insights.
Alsaleh et al. [2021] leverage AraBERT, a powerful BERT-based language model that has demonstrated remarkable performance in various Natural Language Processing tasks, to classify pairs of Quranic verses from the QurSim dataset as semantically related. The authors pre-processed the dataset and created three subsets for comparison. AraBERT, AraBERTv0.2, and AraBERTv2 are evaluated to determine the optimal version for their datasets. AraBERTv0.2 achieves an impressive accuracy score of 92% on a dataset containing labels “2” and “–1,” the latter generated outside the QurSim dataset. While this paper successfully explores Quranic verse semantic relatedness using AraBERT, there is still room for further research and improvement to better understand the contextual nuances within the Quranic text.
Alshammeri et al. [2021c] introduce a novel approach for detecting semantic similarity in the Quran using a Siamese transformer-based architecture. The study focuses on the significance of semantic similarity detection in various natural language comprehension tasks and its relevance to NLP applications. Leveraging Arabic pre-trained contextual representations, the proposed model generates verse embeddings that capture semantically meaningful information. The twin transformer networks are fine-tuned on a Quranic semantic similarity dataset, resulting in improved performance compared to previous studies. However, the article does not explicitly address any potential limitations or challenges associated with the proposed approach, leaving an opportunity for further investigation in future research.
Yang et al. [2023] introduce sDTM, a novel supervised topic modeling approach that enhances traditional methods like LDA by incorporating auxiliary data such as star ratings or post categories. Combining a neural variational autoencoder and a recurrent neural network, sDTM improves empirical estimation and predictive performance. While making valuable contributions to the field, the article could benefit from further investigation into scalability and practical limitations, ensuring its applicability to large datasets and real-world text analytics tasks. In their study, George and Sumathy [2023] incorporated LDA with clustering, BERT, and dimensionality reduction techniques (PCA, t-SNE, UMAP) to enhance topic modeling. The experiments were conducted using the open-source, cross-platform Integrated Development Environment (IDE) Spyder, with Scikit-learn and NLTK used for the machine learning analysis. While the results demonstrated improved topic extraction on benchmark datasets, there is a need for further quantitative analysis of topic quality enhancement, as well as considerations for scalability and real-world deployment challenges.
Table 1 provides a quick comparison of existing research categorized by language (English, Arabic, Indonesian, and Urdu), arranged in reverse chronological order within each language group based on their publication year. Notable observations from the existing studies include:
Table 1. Comparison of Various Aspects

Reference | Language | Sentence Semantics | Word Semantics | Context-aware Techniques | Covers Quranic Domain
Egger and Yu [2022] | English | Yes | Yes | Yes | No
Al Ghamdi and Khan [2022] | English | No | Yes | No | Yes
Weng et al. [2022] | English | No | No | No | No
Silveira et al. [2021] | English | No | No | No | Yes
Aftar et al. [2024] | Arabic | Yes | Yes | Yes | No
Abdelrazek et al. [2022] | Arabic | Yes | Yes | Yes | No
Alshammeri et al. [2021b] | Arabic | Yes | Yes | No | Yes
Abuzayed and Al-Khalifa [2021] | Arabic | Yes | Yes | Yes | No
Hutama and Suhartono [2022] | Indonesian | Yes | Yes | Yes | Yes
Rolliawati et al. [2020] | Indonesian | No | No | No | Yes
Mustafa et al. [2023] | Urdu | No | Yes | No | No
Zoya et al. [2021] | Urdu | No | No | No | No
Mustafa et al. [2021] | Urdu | No | No | No | No
Amin et al. [2020] | Urdu | No | No | No | No
Munir et al. [2019] | Urdu | No | No | No | No
Rehman et al. [2018] | Urdu | No | No | No | No
Shakeel et al. [2018] | Urdu | No | No | No | No
Limited Research on Al-Quran Urdu Translation. As shown in Table 1, there are few studies on Urdu topic modeling between 2018 and 2023, and these studies do not cover Quranic text, which indicates the research community’s limited focus on this specific NLP task. Conversely, topic modeling on religious texts has been conducted in various languages such as English, Arabic, and Indonesian, highlighting the need for more comprehensive investigations into Al-Quran Urdu translation.
Scarcity of Context-aware Topic Modeling Techniques for Urdu. Table 1 reveals that many studies in English and Arabic have applied context-aware techniques, especially transformer-based language models like BERT and its variations. Most of these studies employing context-aware topic modeling techniques have been published between 2021 and 2024, demonstrating the rapid adoption of these models in the research community. In contrast, only one recent study in Urdu utilized static word2vec embeddings [Mustafa et al. 2023]. Due to variations in Quranic translation, identifying topics from different Quranic translations is more challenging. Static embeddings assign fixed vectors to words regardless of context, lacking adaptability, while transformer-based contextual embeddings dynamically capture nuanced meanings by contextualizing words within various Quranic Urdu translations. Consequently, there is a need to evaluate the effectiveness of context-aware topic modeling techniques in the Urdu translation of the Quran, which is the focus of this research.
In conclusion, context-aware techniques have exhibited remarkable effectiveness in topic modeling for religious texts, particularly in English and Arabic, when applied to the Holy Quran. However, utilizing these techniques for topic modeling in Urdu within the religious context is an under-explored research area. Our study aims to bridge this gap by leveraging BERT-based approaches for text analysis and topic modeling for Urdu translations of the Quran. Our contributions lie in not only exploring the adaptability of BERT-based models to the intricacies of the Urdu language and Quranic text but also capturing the nuanced semantics of Urdu Quranic translations, thus offering a valuable resource for scholars, researchers, and practitioners in the field of religious studies. Through rigorous experimentation, we demonstrate the potential of these models, thus advancing the state of the art in text analysis for religious texts in Urdu while addressing this pressing research gap.

3 Quran-UTM: Quran Urdu Topic Modeling Corpus

To address the gap in topic modeling for Al-Quran Urdu translations, we have developed a new corpus called Quran-UTM. Figure 1 provides an overview of the Quran-UTM corpus creation process. The subsequent sections present detailed descriptions of each step involved in developing our dataset and its statistical analysis.
Fig. 1. Overview of the Quran-UTM dataset.

3.1 Data Acquisition

We gathered the Quran Urdu translations from the Tanzil project, accessible at https://tanzil.net/trans/. Only 8 of the 115 Quran translations available on Tanzil are in Urdu. Our study utilized all of them, specifically those by Junagarhi, Jalandhry, Jawadi, Ahmed Ali, Kanzuliman, Maududi, Najafi, and Qadri. These translations offer diverse perspectives and interpretations, providing a comprehensive collection of Quranic texts in the Urdu language. The translations were downloaded manually from the Tanzil platform, ensuring a transparent acquisition process and trustworthy, valuable source material for the subsequent stages of the topic modeling research. Each translation encompasses 6,236 verses from 114 Surahs.

3.2 Data Cleaning

After collecting the eight Quran Urdu translations, we observed that they contained some irrelevant information, such as metadata, punctuation, and diacritics. Eliminating this noise enhances the quality and relevance of the Quran Urdu translations: it reduces computational overhead and makes it easier to extract meaningful topics, ultimately leading to better model performance and more valuable insights in topic modeling. An example of cleaning a verse of Ahmed Ali's translation is given in Figure 2.
Fig. 2. Overview of cleaning of an Al-Quran Urdu translation.
Metadata Removal. To ensure consistency, we removed the metadata present at the beginning and end of each translation's text file. This step helped streamline the data and focus solely on the actual content of the translations. Python's Pandas library was utilized to accomplish these tasks efficiently.
Punctuation Removal. Punctuation marks can pose challenges during text tokenization and do not contribute to retrieval. Hence, punctuation marks were removed from the text using the Regex library.
Diacritics Removal. In Urdu, diacritics indicate pronunciation but do not change the fundamental meaning of words. To standardize the Urdu translations, we removed the diacritics using the Urduhack library, which ensures that words are treated consistently by the model, a step crucial for accurate topic detection.
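A minimal sketch of these three cleaning steps is shown below. The pipe-separated file layout follows the Tanzil text export convention, but the file name, the punctuation set, and the `remove_diacritics` import from Urduhack are illustrative assumptions rather than the authors' exact implementation.

```python
import re
import pandas as pd
from urduhack.normalization import remove_diacritics  # assumed Urduhack API

# Load one translation; Tanzil text exports use a surah|ayah|text layout
# (file name is illustrative). comment="#" drops the trailing metadata lines.
verses = pd.read_csv("ur.ahmedali.txt", sep="|", header=None, comment="#",
                     names=["surah", "ayah", "text"])

def clean_verse(text: str) -> str:
    # Drop common Urdu/Arabic and Latin punctuation marks (illustrative set).
    text = re.sub(r"[\u060C\u061B\u061F\u06D4.,;:!?\"'()\[\]{}]", " ", text)
    # Strip diacritics so identically spelled words are treated consistently.
    text = remove_diacritics(text)
    # Collapse whitespace left over from the removals.
    return re.sub(r"\s+", " ", text).strip()

verses["clean_text"] = verses["text"].apply(clean_verse)
```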

3.3 Quran-UTM Corpus Statistics

The statistics of the Quran-UTM dataset are provided in Table 2. It can be observed from the table that the Qadri translation contains the largest number of words and characters, with 232,744 words and 1,030,375 characters. The Najafi translation has the second-largest count, with 206,847 words and 902,665 characters. On the other hand, the Ahmed Ali translation has the smallest number of unique words, at 9,011.
Table 2. Statistics of Quran-UTM Dataset

Translator | Words | Characters | Unique Words
Junagarhi | 187,353 | 802,136 | 11,262
Jalandhry | 188,667 | 807,898 | 11,424
Jawadi | 186,652 | 799,235 | 9,337
Ahmed Ali | 171,124 | 716,886 | 9,011
Kanzuliman | 159,685 | 679,859 | 9,918
Maududi | 204,593 | 864,168 | 11,807
Najafi | 206,847 | 902,665 | 12,929
Qadri | 232,744 | 1,030,375 | 15,136

4 Proposed Methodology

This section presents our proposed BERTopic model [Grootendorst 2022] for topic modeling of Quranic Urdu translations. To capture the contextual nuances of the Quran in Urdu, we utilize BERT embeddings, which convert the Urdu translations into semantic vector representations. Afterward, we reduce the dimensionality of the vector representations through Uniform Manifold Approximation and Projection (UMAP), followed by clustering to group verses with similar semantics. We use descriptive analysis and evaluation metrics to infer and assess the discovered topics. Figure 3 provides the workflow of our proposed methodology; the details of each step are provided in the subsequent sections.
Fig. 3. An overview of the proposed methodology.

4.1 BERT Embeddings

First, we transform our documents into numerical representations using BERT embeddings, which differentiate between different meanings of the same word, and recognize different words with the same meaning, based on their context within a sentence.
For example, as illustrated in Figure 4, the word for “fire” in Surah Taha represents light and guidance. It symbolizes the divine illumination and the hope for spiritual insight that Prophet Musa sought when he approached the fire on Mount Sinai. Conversely, in Surah Al-Baqarah, verse 24, the same word denotes Hellfire, representing the concept of punishment and suffering in the afterlife. These two contexts show how the same word diverges into vastly different topics in topic modeling. One context is associated with light, guidance, and knowledge, while the other pertains to punishment, retribution, and the concept of Hell. It illustrates the richness of language in the Quran, where a single term can encapsulate multiple layers of meaning depending on its usage.
Fig. 4. The same word, “fire,” in different contexts.
Different translators choose different words to translate the original text based on their understanding. For example, in Figure 5, two different words in these translations serve the same purpose of conveying a successful outcome for those who follow divine guidance; both words belong to the same topic. Traditional techniques fail to capture the contextual nuances and semantic representations in different Urdu translations of the Quranic text. BERT achieves this through bidirectional analysis, considering preceding and following words to generate context-specific embeddings.
Fig. 5. Two different words in the same context.
To apply BERT embedding, we utilized the pre-trained sentence transformer model “paraphrase-multilingual-MiniLM-L12-v2” from Hugging Face [Reimers and Gurevych 2019]. Hugging Face is a popular library and platform that provides a wide range of pre-trained models for natural language processing tasks. Sentence transformers, a type of deep learning model, have gained significant attention for their ability to capture semantic representations of sentences and paragraphs. The sentence transformer model we employed, specifically the “paraphrase-multilingual-MiniLM-L12-v2,” is designed to map sentences and paragraphs to a 384-dimensional dense vector space, enabling clustering or semantic search tasks. By leveraging transformer architectures, sentence transformers encode contextual information and the semantic meaning of the text into these dense vector representations.
In our case, the sentence transformer model was employed to generate BERT embeddings of the proposed Quran-UTM dataset. These embeddings serve as high-dimensional representations of the text, capturing the intricate relationships between words and phrases. By leveraging the contextual information encoded in the embeddings, we can understand the finer details and subtle aspects of the Quranic text in Urdu, uncovering its hidden meanings and more profound significance.
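Concretely, this embedding step takes only a few lines with the sentence-transformers library; here `verses` is assumed to be the list of cleaned verse strings from Section 3.2.

```python
from sentence_transformers import SentenceTransformer

# Load the pre-trained multilingual sentence transformer.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# `verses` is assumed: a list of cleaned Urdu verse strings.
embeddings = model.encode(verses, show_progress_bar=True)
print(embeddings.shape)  # (number of verses, 384)
```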

4.2 Dimensionality Reduction

When using BERT embeddings, each word or token of the Quran-UTM dataset is represented by a high-dimensional vector. This high dimensionality can pose challenges regarding computational complexity, memory requirements, and the interpretability of the embeddings. A dimensionality reduction technique is employed to handle the high dimensionality of BERT embeddings. We utilized the UMAP model for dimensionality reduction [McInnes et al. 2018]. There are traditional approaches to reduce the dimensionality of document embedding, such as t-Distributed Stochastic Neighbor Embedding (t-SNE), which emphasizes preserving local similarities [Van der Maaten and Hinton 2008], and Principal Component Analysis (PCA), which is a linear dimensionality reduction technique that focuses on the global structure of the data by projecting it into a lower-dimensional space [Pearson 1901]. In contrast, UMAP offers a balance between preserving global and local structure in data, making it suitable for effectively visualizing and analyzing complex datasets [McInnes et al. 2018]. Additionally, research has shown that reducing high-dimensional embedding with UMAP can enhance the performance of clustering algorithms like K-means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) in terms of accuracy and processing time [Allaoui et al. 2020]. UMAP uses a cross-entropy-based loss function to optimize the low-dimensional representation. The loss function is defined in Equation (1):
\begin{equation} L = \sum _{i,j} w_{ij} \left[ d_{ij} \log \left(\frac{d_{ij}}{\sigma _{ij}}\right) + (1 - d_{ij}) \log \left(\frac{1 - d_{ij}}{1 - \sigma _{ij}}\right) \right], \end{equation}
(1)
where:
\(i, j\) represent pairs of data points.
\(d_{ij}\) is the distance between data points i and j in the high-dimensional space.
\(\sigma _{ij}\) is the similarity between data points i and j in the low-dimensional space (after applying a Gaussian kernel).
\(w_{ij}\) is a weight that adjusts the importance of each pair of data points based on their distance in the nearest neighbor graph.
UMAP works by mapping the high-dimensional space of BERT embeddings onto a lower-dimensional space, typically two or three dimensions, which can be easily visualized. The mapping process considers the underlying manifold structure of the data, capturing the essential similarities and differences between the verses. By reducing the dimensionality, we can simplify the data representation, making it more manageable for further analysis and interpretation.
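A sketch of this reduction step with the umap-learn package; the component count of 4 and the fixed seed anticipate the settings selected experimentally in Section 5.1.

```python
from umap import UMAP

# Reduce the 384-dimensional verse embeddings to a compact space.
# n_components=4 and random_state=42 follow the settings in Section 5.1.
umap_model = UMAP(n_components=4, random_state=42, low_memory=True)
reduced_embeddings = umap_model.fit_transform(embeddings)
```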

4.3 Clustering

Clustering plays a crucial role in the BERTopic model, as it groups similar Quran verses based on their semantic similarities. This step is essential for organizing the vast Quran-UTM corpus into meaningful clusters, each representing a potential topic. A suitable clustering algorithm is employed, such as K-means or HDBSCAN. By default, the BERTopic framework uses HDBSCAN [McInnes et al. 2017], a density-based clustering algorithm that builds upon the traditional DBSCAN algorithm by incorporating a hierarchical approach; it can handle clusters of varying densities and identify outliers. However, HDBSCAN sometimes produces numerous outliers, making it difficult to identify meaningful clusters accurately. In this study, we therefore chose the K-means algorithm, which assigns every data point to a cluster and thus prevents the generation of excessive outliers. K-means is a partition-based clustering algorithm that iteratively assigns data points to the nearest cluster centroid, minimizing the sum of squared distances within each cluster.
By applying K-means to the reduced-dimensional BERT embeddings, we discover distinct clusters of Quran verses that exhibit similar semantic characteristics. These clusters represent topics or themes that emerge from the Quran Urdu translations. By grouping verses with shared semantic similarities, we gain valuable insights into the underlying structure and content of the text, facilitating the exploration and analysis of the translations. Clustering helps to organize the Quran Urdu translations into coherent groups, making it easier to identify patterns, trends, and topics within the text. It provides a framework for understanding the relationships between different verses and unveiling the underlying themes present in the translations.
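In the BERTopic API, the default HDBSCAN clusterer can be swapped for K-means by passing a scikit-learn model; a minimal sketch, assuming the five topics used in Section 5.1:

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic

# K-means assigns every verse to a cluster, so no outlier topic is produced.
cluster_model = KMeans(n_clusters=5, random_state=42)

# BERTopic accepts a scikit-learn-style clusterer in place of HDBSCAN.
topic_model = BERTopic(hdbscan_model=cluster_model)
```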

4.4 Tokenizer

To ensure modularity in BERTopic's algorithm, a flexible topic representation technique is essential, one that avoids assumptions about cluster structures. We employ a bag-of-words approach, where all documents within a cluster are combined into a single document to count word frequencies, generating a cluster-level representation. This process uses the default tokenizer from CountVectorizer in the sklearn library, which breaks down the text into individual words or tokens and handles tasks like stop word removal.
Stopwords, being less meaningful and the most frequent words in any text, connect the significant parts of a sentence and build context. However, our approach uses sentence transformers that extract contextual information from sentences, making the removal of stopwords in the initial stages unnecessary.
Eliminating stopwords during the post-processing phase was crucial for enhancing the quality and relevance of the identified topics by removing less informative words and focusing on more meaningful content in the Quran-UTM dataset. In our proposed framework, we utilized a freely available, pre-existing stopwords list [Shafi et al. 2023].
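A sketch of this representation step follows; the stopword file name is an illustrative assumption, with the list itself taken from the UNLT resources [Shafi et al. 2023].

```python
from sklearn.feature_extraction.text import CountVectorizer

# Load a pre-existing Urdu stopwords list (file name is an assumption).
with open("urdu_stopwords.txt", encoding="utf-8") as f:
    urdu_stopwords = [line.strip() for line in f if line.strip()]

# Tokenize the merged cluster-level documents and drop stopwords only at
# this stage, so the sentence transformer sees complete sentences earlier.
vectorizer_model = CountVectorizer(stop_words=urdu_stopwords)
```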

4.5 Topic Inference

After tokenizing and removing the stopwords within a cluster, the next step is to analyze the verses within each cluster to identify the most representative words and themes. Topic inference is the process of identifying and extracting meaningful words from clusters. It involves analyzing the content and patterns within the data to uncover underlying themes, concepts, or subjects prevalent across the clusters of verses.
During topic inference, descriptive analysis techniques are applied to the verses within each cluster. This involves calculating the importance or significance of words within a cluster using Class-based Term Frequency-Inverse Document Frequency (C-TF-IDF).
The C-TF-IDF weight for a term t in a document d belonging to a cluster c is given by Equation (2):
\begin{equation} \text{C-TF-IDF}(t, d, c) = \text{TF}(t, d) \times \text{IDF}(t, D) \times \log \left(\frac{\text{Total number of documents in } D}{\text{No. of docs containing term $t$ in cluster $c$}}\right). \end{equation}
(2)
By examining the C-TF-IDF scores of words within a cluster, we can identify the most characteristic and informative words for that particular topic. These words provide insights into the primary themes and content represented by the cluster, effectively inferring the topics present in the Quran-UTM dataset.
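As an illustration of the idea, the following is a simplified NumPy sketch of class-based TF-IDF in the spirit of BERTopic's c-TF-IDF [Grootendorst 2022]; `merged_cluster_docs` (one concatenated string per cluster) is an assumed input, and the exact weighting in the library differs in its details.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def class_tf_idf(cluster_docs):
    """Simplified class-based TF-IDF; one merged document per cluster."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(cluster_docs).toarray()  # clusters x terms
    tf = counts / counts.sum(axis=1, keepdims=True)       # term frequency per cluster
    avg_words = counts.sum() / len(cluster_docs)          # average words per cluster
    idf = np.log(1 + avg_words / counts.sum(axis=0))      # class-based IDF
    return tf * idf, vectorizer.get_feature_names_out()

# `merged_cluster_docs` is assumed: one string of all verses per cluster.
scores, terms = class_tf_idf(merged_cluster_docs)
top_words = [terms[row.argsort()[::-1][:10]] for row in scores]  # top 10 per topic
```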

4.6 Evaluation Measures

This study uses two well-known topic modeling evaluation criteria—coherence and diversity—to evaluate the quality and performance of topic models [Lau et al. 2014; Röder et al. 2015; Terragni et al. 2021]. Coherence quantifies the interpretability and meaningfulness of topics by considering the co-occurrence of terms within topics. We employ two coherence metrics, NPMI [Lau et al. 2014] and Coherence Value (\(C_V\)) [Röder et al. 2015], and a diversity metric, Inverted Rank-Biased Overlap (IRBO) [Terragni et al. 2021], to assess the model's performance. The details of each metric are as follows:

4.6.1 NPMI Score.

The first coherence metric we use to evaluate the topics is NPMI [Bouma 2009], which has been shown to correlate with human judgment [Lau et al. 2014]. It ranges from –1 to 1, where –1 signifies that the words never appear together, 0 suggests that their occurrences are entirely independent, and 1 indicates a perfect and consistent co-occurrence of the words. The NPMI score between two words \(\left(x_{i}, x_{j}\right)\) is calculated as follows:
\begin{equation} \operatorname{NPMI}\left(x_{i}, x_{j}\right)=\frac{\log \frac{p\left(x_{i}, x_{j}\right)+\epsilon }{p\left(x_{i}\right) p\left(x_{j}\right)}}{-\log \left(p\left(x_{i}, x_{j}\right)+\epsilon \right)}. \end{equation}
(3)
The probabilities \(p\left(x_{i}\right)\) and \(p\left(x_{i}, x_{j}\right)\) represent the occurrence frequency of word \(x_i\) and the co-occurrence frequency of the word pair \(\left(x_{i}, x_{j}\right)\), respectively, estimated from the corpus. Here, \(\epsilon\) is added to avoid taking the logarithm of zero. The NPMI score is computed for each word pair and then averaged across all word pairs in all topics.
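A direct implementation of Equation (3) for a single word pair, assuming the probabilities have already been estimated from document co-occurrence counts:

```python
import math

def npmi(p_i: float, p_j: float, p_ij: float, eps: float = 1e-12) -> float:
    """NPMI for one word pair, per Equation (3)."""
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / -math.log(p_ij + eps)

# Words that each occur in 10% of documents and co-occur in 4% of them:
print(npmi(0.10, 0.10, 0.04))  # ~0.43, i.e., they co-occur more than by chance
```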

4.6.2 \(C_V\) Score.

\(C_V\) is a coherence evaluation metric that assesses the quality of topics generated by topic modeling algorithms. It utilizes a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure involving NPMI and cosine similarity. Given the top \(T\) words of a topic, \(\left(x_1, x_2, \ldots , x_T\right)\), the \(C_V\) coherence score is calculated as follows:
\begin{equation} C_V=\frac{1}{T} \sum _{i=1}^T \cos \left(\mathbf {v}_{\mathrm{NPMI}}\left(x_i\right), \mathbf {v}_{\mathrm{NPMI}}\left(\left\lbrace x_j\right\rbrace _{j=1}^T\right)\right). \end{equation}
(4)
The NPMI score is computed using Equation (3), and the average \(C_V\) score is obtained by averaging over all topics. This measure has been highly influential in evaluating topic quality: in extensive comparisons with other commonly used coherence measures, \(C_V\) aligned most closely with human evaluations of topic quality (see Röder et al. [2015] for experimental results).

4.6.3 IRBO Score.

The diversity metric indicates the semantic diversity of the generated topics. Previous studies have proposed different diversity metrics to determine the distinctiveness of the generated topics [Nan et al. 2019; Burkhardt and Kramer 2019; Dieng et al. 2020]. In this study, we use the recently proposed IRBO metric to quantify the diversity of topics [Terragni et al. 2021]. It assigns a score of 0 when the topics are identical and 1 when they are entirely dissimilar. Suppose we have a set \(\aleph\) containing T topics. Each topic is represented as a list of words, and the order of words within each list reflects their importance or ranking within the topic. The IRBO of these topics is calculated as follows:
\begin{equation} \operatorname{IRBO}(\aleph)=1-\frac{\sum _{i=2}^{T} \sum _{j=1}^{i-1} \operatorname{RBO}\left(l_i, l_j\right)}{n}, \end{equation}
(5)
where \(n = \binom{T}{2}\) represents the number of pairs of lists to be compared. The term \(\operatorname{RBO}\left(l_i, l_j\right)\) refers to the traditional Rank-Biased Overlap measure between two ranked lists \(l_i\) and \(l_j\) [Webber et al. 2010]. IRBO enables the comparison of lists even when they do not necessarily share the same items and may not cover all items in a given domain. When two lists (topics) share common words, the IRBO score is lower if the shared words are found at the top than when they are located lower in the ranked lists.
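A minimal sketch of Equation (5) over the topics' top-word lists, using a simple finite-depth form of RBO with the conventional top-weighting parameter p = 0.9 (an assumption; library implementations of RBO differ in details such as extrapolation):

```python
from itertools import combinations

def rbo(list_a, list_b, p: float = 0.9) -> float:
    """Finite-depth Rank-Biased Overlap of two ranked word lists."""
    depth = min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

def irbo(topics, p: float = 0.9) -> float:
    """IRBO over all topic pairs, per Equation (5)."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(rbo(a, b, p) for a, b in pairs) / len(pairs)

# Two completely disjoint topics are maximally diverse: IRBO = 1.0.
print(irbo([["rain", "cloud", "sky"], ["fire", "heat", "smoke"]]))
```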

5 Experimental Setup and Results

In this section, we describe the experimental setup and the results.

5.1 Experimental Setup

We compare the proposed BERTopic model with two classical models, LDA and NMF, both implemented using the Gensim library. We also employed Gensim's CoherenceModel class to calculate coherence scores. This class allows us to evaluate the coherence of topics generated by the different models, including BERTopic, LDA, and NMF. The coherence measure provided by Gensim considers the degree to which words tend to appear together within the same topic and how distinct they are from words in other topics, reflecting the semantic consistency and relevance of the generated topics.
One of the advantages of BERTopic is that it does not require specifying the number of topics in advance. However, to ensure a fair comparison across all models, we standardize by setting the same number of topics and extracting the top 10 words within each topic. We determine the optimal number of topics by iterating the LDA model over a range from 5 to 50 topics, in increments of 5, based on the \(C_V\) score, as shown in Figure 6. From the figure, it can be seen that the LDA model achieves the highest \(C_V\) score with five topics. Therefore, we compare all models using this value, setting the number of topics to five.
Fig. 6. \(C_V\) score of LDA with an increasing number of topics.
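The topic-count sweep described above can be sketched with Gensim as follows; `tokenized_docs` (one stopword-free token list per document) is an assumption about the preprocessing.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# `tokenized_docs` is assumed: one list of tokens per document.
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

scores = {}
for k in range(5, 55, 5):  # 5, 10, ..., 50 topics
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, random_state=42)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # 5 in our experiments
```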
Different reduced dimensionalities of the SBERT embeddings capture semantic information to varying degrees, which affects the quality of the resulting representation. Therefore, the default UMAP dimension of 2 may not always yield optimal results. To address this, we experimented with different dimensions for the Maududi translation, ranging from 2 to 14, in increments of 1. The \(C_V\) scores for each dimension are shown in Figure 7. As observed, dimension 4 achieves the highest \(C_V\) score, so we set the UMAP model's dimension to 4. UMAP's stochastic nature might cause variations in BERTopic results, so we fixed the random seed at 42 for reproducibility. Additionally, we enabled low-memory mode in UMAP to avoid memory issues.
Fig. 7. \(C_V\) scores on the Maududi translation with different dimensions.
We set the calculate_probabilities flag to true when instantiating BERTopic to obtain a document–topic probability matrix. This matrix provides the probability distribution of each document across different topics, allowing for a more fine-grained analysis and interpretation of the topic model results.
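Putting the pieces together, the full configuration might look as follows; this is a sketch under the settings described above, with `verses` and `urdu_stopwords` carried over from the earlier sections.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
umap_model = UMAP(n_components=4, random_state=42, low_memory=True)
cluster_model = KMeans(n_clusters=5, random_state=42)        # replaces HDBSCAN
vectorizer_model = CountVectorizer(stop_words=urdu_stopwords)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,  # document-topic probability matrix
)

topics, probabilities = topic_model.fit_transform(verses)
```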

5.2 Results

Table 3 presents the results of three topic modeling techniques—LDA, NMF, and the proposed BERTopic—applied to eight different translations of the Quran: Junagarhi, Jalandhry, Jawadi, Ahmed Ali, Kanzuliman, Maududi, Najafi, and Qadri. The models are evaluated based on \(C_V\), NPMI, and IRBO scores.
Table 3. Comparison of Proposed Model (BERTopic) with State-of-the-art Models

Model | Metric | Junagarhi | Jalandhry | Jawadi | Ahmed Ali | Kanzuliman | Maududi | Najafi | Qadri
LDA | \(C_V\) | 0.46 | 0.41 | 0.48 | 0.46 | 0.51 | 0.54 | 0.43 | 0.44
LDA | NPMI | -0.01 | -0.05 | -0.02 | -0.01 | 0 | -0.03 | -0.01 | -0.02
LDA | Diversity | 0.51 | 0.54 | 0.65 | 0.47 | 0.54 | 0.54 | 0.49 | 0.36
NMF | \(C_V\) | 0.55 | 0.57 | 0.51 | 0.54 | 0.56 | 0.59 | 0.55 | 0.56
NMF | NPMI | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | -0.01 | 0 | 0.03
NMF | Diversity | 0.83 | 0.77 | 0.84 | 0.79 | 0.75 | 0.75 | 0.78 | 0.86
BERTopic | \(C_V\) | 0.52 | 0.51 | 0.49 | 0.47 | 0.51 | 0.62 | 0.55 | 0.53
BERTopic | NPMI | 0.02 | 0.02 | -0.01 | -0.02 | 0.02 | 0.10 | 0.02 | 0.04
BERTopic | Diversity | 0.87 | 0.79 | 0.88 | 0.71 | 0.80 | 0.83 | 0.86 | 0.87
The table shows that BERTopic achieves the highest scores on the Maududi translation, with a \(C_V\) of 0.62 and an NPMI of 0.10. Additionally, the Jawadi translation achieves a high IRBO score of 0.88 with the BERTopic model. The NPMI scores of the LDA model are generally negative or zero on every translation, suggesting that its word co-occurrence patterns do not align well with human judgments of topic coherence. In contrast, the NPMI scores for the NMF and BERTopic models are mostly positive, indicating that their topics are more coherent. The varying results across translations highlight the richness and diversity of themes discovered in the Quran's Urdu translations.
Figure 8 presents the average scores for each model over all translations. The figure shows that the average NPMI and IRBO scores for BERTopic, 0.03 and 0.83, respectively, are higher than those of the classical models, LDA and NMF. Although NMF achieves the highest average \(C_V\) score of 0.55, BERTopic performs best on the remaining two metrics.
Fig. 8. Average scores over all translations.
Figure 9 displays the word scores for each topic derived from the BERTopic model applied to the Kanzuliman translation from the Quran-UTM dataset. These word scores indicate the importance or relevance of each word within a specific topic. The figure demonstrates that each topic is represented by numerous words, beginning with the most representative ones. Each word is assigned a c-TF-IDF score, with higher scores indicating greater representativeness within the topic; for example, in the first topic of the Kanzuliman translation, the first word shown has the highest score and the last word the lowest.
Fig. 9. Topic word scores of the Kanzuliman translation.

6 Discussion and Analysis

In our study, we performed topic modeling on all of these translations to comprehensively understand the variations in their content. For example, we explored the topic of “Hellfire” across the translations and made an interesting observation: in Qadri's translation, this topic was found in 126 different places within the Quranic text, whereas in Kanzuliman's translation, it appeared only 84 times, as shown in Figure 10. It is important to note that this difference in topic frequency does not imply that the translations are inherently different, as the Quran is considered the word of Allah; rather, it suggests that some translations are more explanatory than others.
Fig. 10. Comparison of the topic “Hellfire” in various translations.
Upon manual inspection of Qadri’s translation, we found that it provides additional explanations for the translated verses by including parenthetical explanations. This explanatory translation style may contribute to the higher frequency of topics identified in Qadri’s translation. Figure 11 showcases this feature in Qadri’s translation, highlighting how the inclusion of explanations enhances the richness of the translated text. This simple example shows that topic modeling can be effectively utilized for comparative analysis of the Quran and other religious texts.
Fig. 11. Manual exploration of two translations showing the difference in word frequencies between them.
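Such frequency comparisons can be read off the fitted models directly. A sketch, where `models`, `corpus_by_translation`, and the per-model `hellfire_topic_id` mapping are all hypothetical names for one fitted BERTopic model per translation, its verse list, and the manually identified topic of interest:

```python
# `models`, `corpus_by_translation`, and `hellfire_topic_id` are hypothetical:
# one fitted BERTopic model, verse list, and inspected topic ID per translation.
for name, model in models.items():
    doc_info = model.get_document_info(corpus_by_translation[name])
    count = (doc_info["Topic"] == hellfire_topic_id[name]).sum()
    print(f"{name}: 'Hellfire' topic assigned to {count} verses")
```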
Tables 4 and 5 present topics from the Kanzuliman and Ahmed Ali translations, respectively. The variations in word choice and phrasing between the two translations reflect subtle differences in interpretation. For instance, Topic 4 in the Kanzuliman translation frequently references specific figures, highlighting historical or narrative contexts, whereas Topic 3 of the Ahmed Ali translation discusses the same subject within a broader context, using terms such as “prophets” and “messengers.” Despite addressing divine entities using different terms, the topics maintain similar meanings in context. It can also be observed that Topic 5 in both translations pertains to the same overarching theme, yet the words and their relevance scores differ. This demonstrates that our BERT-based context-aware topic modeling technique effectively captures the nuances of different translations by considering the surrounding context of words, ensuring a more accurate and meaningful extraction of themes that preserves the complexity of the text. This method aids scholars in delving deeper into these topics, examining the variations and nuances in translations, and uncovering further insights. Topic modeling facilitates rigorous analysis, allowing researchers to explore the diverse interpretations and perspectives of religious texts, promoting meaningful dialogue and advancing our understanding of these sacred texts.
Table 4. Topics Extracted from Kanzuliman Translation with Their Top Keywords
Table 5. Topics Extracted from Ahmed Ali Translation with Their Top Keywords

7 Conclusion and Future Work

Topic modeling in Urdu translations of the Quran presents significant challenges. Manually identifying topics and conducting comprehensive studies across different translations is highly time-consuming due to the overlapping meanings and ideas between verses and chapters, so an automatic method for identifying topics from various translations is needed. While topic modeling has been applied to Quranic translations in English, Arabic, and Indonesian, similar techniques for Urdu translations of religious texts were not previously available. In this study, we introduced an approach for topic modeling in Urdu translations of the Quran utilizing BERT-based techniques. We applied the BERTopic model to eight different Urdu translations of the Quran and compared its performance with two traditional models, LDA and NMF, using coherence and diversity metrics. Our findings show that the BERT-based method surpasses these conventional techniques, with an average coherence improvement of 0.03 and a diversity score of 0.83. These results highlight the effectiveness of BERTopic in identifying meaningful topics from Urdu translations of the Holy Quran and contribute to the computational analysis of religious texts, aiding scholarly research in comparative studies of Quranic translations in Urdu. Despite these promising results, there is room for improvement: although monolingual BERT models exist for various languages, none has been built specifically for Urdu. In future work, developing an Urdu-specific BERT model could significantly enhance the results and improve topic modeling performance for Urdu translations.

References

[1] Aly Abdelrazek, Walaa Medhat, Eman Gawish, and Ahmed Hassan. 2022. Topic modeling on Arabic language dataset: Comparative study. In International Conference on Model and Data Engineering (MEDI'22). Springer, 61–71.
[2] A. Abuzayed and H. Al-Khalifa. 2021. BERT for Arabic topic modeling: An experimental study on BERTopic technique. In Procedia CIRP. Elsevier B.V., 191–194.
[3] Sania Aftar, Luca Gagliardelli, Amina El Ganadi, Federico Ruozzi, and Sonia Bergamaschi. 2024. A novel methodology for topic identification in Hadith. In Proceedings of the 20th Conference on Information and Research Science Connecting to Digital and Library Science (IRCDL'24).
[4] Norah Mohammad Al Ghamdi and Muhammad Badruddin Khan. 2022. Assessment of performance of machine learning based similarities calculated for different English translations of Holy Quran. International Journal of Computer Science & Network Security 22, 4 (2022), 111–118.
[5] Islam Al Qudah, Ibrahim Hashem, Abdelaziz Soufyane, Weisi Chen, and Tarek Merabtene. 2022. Applying latent Dirichlet allocation technique to classify topics on sustainability using Arabic text. In Science and Information Conference (SAI'22). Springer, 630–638.
[6] Fatima Alhaj, Ali Al-Haj, Ahmad Sharieh, and Riad Jabri. 2022. Improving Arabic cognitive distortion classification in Twitter using BERTopic. International Journal of Advanced Computer Science and Applications 13, 1 (2022), 854–860.
[7] Mohammad Alhawarat. 2015. Extracting topics from the holy Quran using generative models. International Journal of Advanced Computer Science and Applications 6, 12 (2015), 288–294.
[8] Mohammad Alhawarat and M. Hegazi. 2018. Revisiting k-means and topic modeling, a comparison study to cluster Arabic documents. IEEE Access 6 (2018), 42740–42749.
[9] Mebarka Allaoui, Mohammed Lamine Kherfi, and Abdelhakim Cheriet. 2020. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: A comparative study. In International Conference on Image and Signal Processing (ICISP'20). Springer, 317–325.
[10] Abdullah N. Alsaleh, Eric Atwell, and Abdulrahman Altahhan. 2021. Quranic verses semantic relatedness using AraBERT. In Proceedings of the 6th Arabic Natural Language Processing Workshop. Leeds, 185–190.
[11] M. Alshammeri, E. Atwell, and M. A. Alsalka. 2021a. Detecting semantic-based similarity between verses of the Quran with Doc2vec. In Procedia CIRP. Elsevier B.V., 351–358.
[12] M. Alshammeri, E. Atwell, and M. A. Alsalka. 2021b. Quranic topic modelling using paragraph vectors. In Advances in Intelligent Systems and Computing. Springer, 218–230.
[13] Menwa Alshammeri, Eric Atwell, and Mhd Ammar Alsalka. 2021c. A Siamese transformer-based architecture for detecting semantic similarity in the Quran. International Journal on Islamic Applications in Computer Science and Technology (IJASAT) 9 (2021), 1–11.
[14] Ahmad Amin, Toqir A. Rana, Natash Ali Mian, Muhammad Waseem Iqbal, Abbas Khalid, Tahir Alyas, and Mohammad Tubishat. 2020. TOP-Rank: A novel unsupervised approach for topic prediction using keyphrase extraction for Urdu documents. IEEE Access 8 (2020), 212675–212686.
[15] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan. 2003), 993–1022.
[16] Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL 30 (2009), 31–40.
[17] Sophie Burkhardt and Stefan Kramer. 2019. Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. Journal of Machine Learning Research 20, 131 (2019), 1–27. http://jmlr.org/papers/v20/18-569.html
[18] Ali Daud, Wahab Khan, and Dunren Che. 2017. Urdu language processing: A survey. Artificial Intelligence Review 47 (2017), 279–311.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL'19).
[20] Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. 2020. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8 (2020), 439–453.
[21] R. Egger and J. Yu. 2022. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Frontiers in Sociology 7 (May 2022), 886498.
[22] Lijimol George and P. Sumathy. 2023. An integrated clustering and BERT framework for improved topic modeling. International Journal of Information Technology 15 (2023), 1–9.
[23] Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794 (2022).
[24] Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99). 50–57.
[25] L. B. Hutama and D. Suhartono. 2022. Indonesian hoax news classification with multilingual transformer model and BERTopic. Informatica (Slovenia) 46, 8 (2022), 81–90.
[26] Komal Khalid, Hammad Afzal, Faiza Moqaddas, Naima Iltaf, Ahmed Muqeem Sheri, and Raheel Nawaz. 2017. Extension of semantic based Urdu linguistic resources using natural language processing. In 2017 IEEE 15th International Conference on Dependable, Autonomic and Secure Computing, 15th International Conference on Pervasive Intelligence and Computing, 3rd International Conference on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech'17). IEEE, 1322–1325.
[27] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes 25, 2–3 (1998), 259–284.
[28] Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL'14). 530–539.
[29] Daniel Lee and H. Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems 13 (2000), 205.
[30] L. McInnes, J. Healy, and S. Astels. 2017. hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 861.
[31] L. McInnes, J. Healy, N. Saul, and L. Großberger. 2018. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software 3, 29 (2018), 1–7.
[32] Siraj Munir, Shaukat Wasi, and Syed Imran Jami. 2019. A comparison of topic modelling approaches for Urdu text. Indian Journal of Science and Technology 12 (2019), 45.
[33] Mubashar Mustafa, Feng Zeng, Hussain Ghulam, and Wenjia Li. 2021. Discovering coherent topics from Urdu text. In ATAIT. 68–80.
[34] Mubashar Mustafa, Feng Zeng, Usama Manzoor, and Lin Meng. 2023. Discovering coherent topics from Urdu text: A comparative study of statistical models, clustering techniques and word embedding. In 2023 6th International Conference on Information and Computer Technologies (ICICT'23). IEEE, 127–131.
[35] Feng Nan, Ran Ding, Ramesh Nallapati, and Bing Xiang. 2019. Topic modeling with Wasserstein autoencoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6345–6381.
[36] Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572.
[37] M. I. Rahman, N. A. Samsudin, A. Mustapha, and A. Abdullahi. 2018. Comparative analysis for topic classification in Juz Al-Baqarah. Indonesian Journal of Electrical Engineering and Computer Science 12, 1 (Oct. 2018), 406–411.
[38] Anwar Ur Rehman, Zobia Rehman, Junaid Akram, Waqar Ali, Munam Ali Shah, and Muhammad Salman. 2018. Statistical topic modeling for Urdu text articles. In 2018 24th International Conference on Automation and Computing (ICAC'18). IEEE, 1–6.
[39] N. Reimers and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). 3982–3992.
[40] Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM'15). 399–408.
[41] Dwi Rolliawati, Indri Rozas, and Khalid Khalid. 2020. Text mining approach for topic modeling of corpus Al Qur'an in Indonesian translation. In Proceedings of the 2nd International Conference on Quran and Hadith Studies Information Technology and Media in Conjunction with the 1st International Conference on Islam, Science and Technology (ICONQUHAS & ICONIST'20).
[42] Sattar Seifollahi, Massimo Piccardi, and Alireza Jolfaei. 2021. An embedding-based topic model for document classification. Transactions on Asian and Low-Resource Language Information Processing 20, 3 (2021), 1–13.
[43] Jawad Shafi, Hafiz Rizwan Iqbal, Rao Muhammad Adeel Nawab, and Paul Rayson. 2023. UNLT: Urdu natural language toolkit. Natural Language Engineering 29, 4 (2023), 942–977.
[44] Khadija Shakeel, Ghulam Rasool Tahir, Irsha Tehseen, and Mubashir Ali. 2018. A framework of Urdu topic modeling using latent Dirichlet allocation (LDA). In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC'18). IEEE, 117–123.
[45] Muazzam Ahmed Siddiqui, Syed Muhammad Faraz, and Sohail Abdul Sattar. 2013. Discovering the thematic structure of the Quran using probabilistic topic model. In 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences (NOORIC'13). IEEE, 234–239.
[46] M. A. Siddiqui, S. M. Faraz, and S. A. Sattar. 2015. Discovering the thematic structure of the Quran using probabilistic topic model. In Proceedings of the 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences (NOORIC'13). IEEE, 234–239.
[47] R. Silveira, C. G. O. Fernandes, J. A. Monteiro Neto, V. Furtado, and J. E. P. Filho. 2021. Topic modelling of legal documents via LEGAL-BERT. In Proceedings of the 1st International Workshop RELATED—Relations in the Legal Domain 2021.
[48] Silvia Terragni, Elisabetta Fersini, and Enza Messina. 2021. Word embedding-based topic similarity measures. In International Conference on Applications of Natural Language to Information Systems (NLDB'21). 33–45.
[49] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
[50] Ike Vayansky and Sathish A. P. Kumar. 2020. A review of topic modeling methods. Information Systems 94 (2020), 101582.
[51] William Webber, Alistair Moffat, and Justin Zobel. 2010. A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS) 28, 4 (2010), 1–38.
[52] M.-H. Weng, S. Wu, and M. Dyer. 2022. Identification and visualization of key topics in scientific publications with transformer-based language models and document clustering methods. Applied Sciences 12, 21 (Nov. 2022), 11220.
[53] Yi Yang, Kunpeng Zhang, and Yangyang Fan. 2023. sDTM: A supervised Bayesian deep topic model for text analytics. Information Systems Research 34, 1 (2023), 137–156.
[54] Zoya, S. Latif, F. Shafait, and R. Latif. 2021. Analyzing LDA and NMF topic models for Urdu tweets via automatic labeling. IEEE Access 9 (2021), 127531–127547.

Published in ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 23, Issue 10, October 2024.

Author Keywords: Topic modeling; Quranic topic modeling; Urdu Quranic NLP; Urdu translations of the Holy Quran