1 Introduction
Large Language Models (LLMs) have revolutionized a wide range of applications in artificial intelligence, especially those related to natural language understanding, generation, translation, summarization, and even code writing. These models, powered by deep neural architectures like transformers [14], have demonstrated exceptional performance in processing and understanding vast amounts of textual data. By leveraging self-attention mechanisms, LLMs can capture intricate relationships between words, phrases, and contexts, excelling in tasks that involve sequential data. This capability has led to significant breakthroughs in conversational agents, content creation, and question-answering systems. The ability of LLMs to generalize across various language-related tasks makes them highly versatile tools in the field of AI.
However, despite their remarkable success in natural language processing (NLP) and related domains, LLMs face significant challenges when dealing with graph-structured data [4][8]. Unlike sequential data, where the relationships between entities follow a linear order, graph data represents complex networks of nodes and edges, where relationships do not adhere to a strict sequence. The non-Euclidean nature of graph topologies, with entities and their interactions spread across a multi-dimensional space, makes it difficult for LLMs to capture the dependencies and relationships inherent in graphs. This limitation renders LLMs less effective for tasks like node classification [19], link prediction, or knowledge graph reasoning, all of which require a deep understanding of both the local and global structure of graphs.
To address the unique challenges posed by graph-structured data, models like Graph Neural Networks (GNNs) [2] have been developed. GNNs are specifically designed to handle graph data by modeling the dependencies and relationships between nodes and their neighbors in a non-Euclidean space. By aggregating information from a node’s neighbors and learning to represent both the node and its connections, GNNs can capture the structural features of a graph, making them suitable for tasks such as node classification [5], link prediction [18], and community detection. Despite their effectiveness, traditional GNNs and Graph Convolutional Networks (GCNs) [5] still have notable limitations. For instance, they often struggle to capture long-range dependencies within the graph [10], leading to poor generalization on large, complex graphs. Additionally, deeper GNNs are prone to over-smoothing [6], where node representations become indistinguishable and lose their unique characteristics. Moreover, scalability becomes a concern: the number of neighbors whose information must be aggregated can grow exponentially with the number of layers, which becomes costly on large graphs.
In recent years, the combination of graph embeddings [9] and language model-based approaches has emerged as a promising direction for overcoming the limitations of traditional GNNs. Graph embeddings transform nodes, edges, or entire graphs into low-dimensional vectors, while preserving the structural and relational information of the graph. This transformation enables more efficient learning and analysis of graph data, facilitating tasks such as node classification, link prediction, and community detection across various domains, including social networks, biology, and recommendation systems. However, the challenge remains that traditional GNN-based embeddings often fail to capture both local and global contexts effectively [15], particularly in large and intricate graph structures.
Language model-based embeddings, such as those derived from transformers [13], offer a promising alternative for graph embeddings. Their self-attention mechanisms allow these models to capture long-range dependencies and global structure, addressing the limitations of GNNs. Because they can encode the structural information of the graph while simultaneously capturing the semantic attributes associated with each node, language model-based embeddings can improve results on complex graph-related tasks such as node classification and link prediction.
In applications such as network analysis, anomaly detection, and cybersecurity, language model-based embeddings can provide significant advantages. For example, they can enhance network analysis tasks by identifying key nodes and detecting anomalous patterns in large-scale networked systems. Additionally, they can support privacy-preserving techniques like federated learning by embedding sensitive data in a way that reduces exposure to raw information, thereby enhancing security and privacy. This flexible and scalable approach makes language model-based embeddings highly valuable for tasks involving privacy, security, and large-scale network applications.
In this context, WalkLM [12] introduces a uniform fine-tuning framework that leverages pre-trained language models (LMs) to generate attributed graph embeddings. The key innovation of WalkLM lies in its use of random walks on the graph to generate sequences of nodes. A language model is then fine-tuned on these sequences, enabling it to capture both the structural (topological relationships between nodes) and semantic (node attributes) information embedded in the graph. By combining random walks with language model fine-tuning, WalkLM achieves significant improvements in various graph-related tasks. For instance, it outperforms traditional GNN-based models in handling both the structure and attributes of graphs, particularly in tasks like node classification and link prediction. Random walks play a crucial role in WalkLM by capturing structural information within the graph. These walks simulate paths through the graph, reflecting neighborhood proximity and the relative importance of nodes. Furthermore, random walks capture semantic information by encoding node attributes into the sequences, allowing the language model to learn both the context and content of the graph’s nodes. However, WalkLM’s random walk approach primarily captures global context through depth-first search (DFS), which focuses on longer walks to model structural relationships. While this approach is effective in capturing the overall structure of the graph, it does not balance global context against local context.
In contrast, Node2Vec [3] offers a more flexible algorithm by combining DFS and breadth-first search (BFS), enabling the capture of both global and local contexts. While Node2Vec performs well on simpler graphs, it faces limitations when the depth of DFS exceeds 2 in complex graphs, as it struggles to represent intricate topologies effectively.
To overcome these limitations, we propose a novel algorithm that enhances random walks by using the out-degree ratio of neighboring nodes to guide the walk. This approach dynamically adjusts the walk based on the structure of the graph, allowing the resulting embeddings to better capture both global and local contexts. Our algorithm goes beyond the static nature of DFS and BFS in Node2Vec, offering a more adaptable mechanism for embedding. Additionally, we apply the k-means clustering algorithm to our enhanced embeddings for community detection. Experimental results show that our approach outperforms traditional random walks in WalkLM and biased random walks in Node2Vec, achieving better silhouette scores and producing more meaningful community structures.
2 Related Work
Graph embedding techniques have evolved significantly to address the challenges of representing graph-structured data effectively. Node2Vec and DeepWalk [9] pioneered the use of random walks to generate embeddings, capturing both local and global context through node sequences. Node2Vec employs a flexible strategy that combines depth-first search (DFS) and breadth-first search (BFS) to balance the trade-offs between exploration and exploitation. However, it often struggles with intricate graph topologies, especially when the DFS depth exceeds two levels.
In response to the limitations of shallow embeddings, GNNs have gained prominence. GNNs leverage a message-passing framework to model the relationships between nodes explicitly, allowing for the aggregation of features from neighbors. However, they face challenges like over-smoothing and scalability when applied to large graphs. Variants such as Graph Convolutional Networks (GCNs) have sought to enhance performance but still exhibit issues in capturing long-range dependencies effectively.
Recently, transformer-based models, such as Graph-BERT [17], have emerged, utilizing self-attention mechanisms to improve the capture of complex relationships in graphs. These models offer the potential to encode both structural and semantic information, addressing some limitations of GNNs. Furthermore, efforts to integrate language models with graph data have led to frameworks like WalkLM, which fine-tunes pre-trained language models to generate attributed graph embeddings. This combination of approaches illustrates the rich landscape of research aimed at enhancing graph embeddings, paving the way for advanced applications in areas such as anomaly detection [1], community detection [11], and network analysis [7].
4 Hybrid Textualization Methodology
Our hybrid approach harnesses the strengths of both WalkLM and Node2Vec, while simultaneously addressing the limitations of each in capturing graph locality and community structures. WalkLM’s random walk-based textualization excels at converting graph data into meaningful text, effectively capturing both topological structures and attribute information. However, it often falls short in understanding graph locality. This makes it challenging to distinguish nodes that are geographically or structurally close within a graph but belong to different communities. As a result, WalkLM struggles to detect nuanced community structures, particularly in complex, real-world graphs. Despite capturing attribute-based information, its lack of graph-specific insights limits its effectiveness in embedding graphs for community detection.
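To make the idea of random walk-based textualization concrete, the following is a minimal sketch, not WalkLM's actual implementation. The toy graph, the node descriptions, the relation phrases, and the function name `random_walk_text` are all hypothetical; the only idea taken from the text is that a uniform random walk over an attributed graph is turned into a sentence that a language model can later be fine-tuned on.

```python
import random

# Hypothetical attributed graph: a text description per node and labeled,
# directed edges. None of these names come from the original datasets.
NODE_TEXT = {
    "A": "a client node",
    "B": "a server node on port 443",
    "C": "a server node on port 53",
}
EDGES = {
    "A": [("connects to", "B"), ("connects to", "C")],
    "B": [("responds to", "A")],
    "C": [("responds to", "A")],
}

def random_walk_text(start, length, rng):
    """Walk up to `length` steps, choosing each outgoing edge uniformly at
    random, and join node and edge descriptions into one sentence."""
    node = start
    parts = [NODE_TEXT[node]]
    for _ in range(length):
        if not EDGES.get(node):
            break  # dead end: the walk stops early
        relation, node = rng.choice(EDGES[node])
        parts.extend([relation, NODE_TEXT[node]])
    return " ".join(parts)

# A seeded generator keeps the example reproducible.
SENTENCE = random_walk_text("A", 2, random.Random(0))
```

Because every outgoing edge is equally likely here, the walk has no notion of which neighbor is structurally more important, which is the limitation discussed above.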
Conversely, Node2Vec balances local and global context by blending BFS (Breadth-First Search) and DFS (Depth-First Search), utilizing its tunable parameters p and q to explore neighborhoods effectively. However, the depth of exploration is limited to a maximum of 2, which is often insufficient for capturing deeper, more intricate relationships within complex graph data. This restricted exploration hinders the model’s ability to fully grasp the broader context necessary for effective community detection.
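The role of p and q can be illustrated with the unnormalized transition weights of the standard Node2Vec walk. This is a minimal sketch on a hypothetical undirected toy graph; the function name `node2vec_weights` and the graph are assumptions made for illustration. Note that the bias depends only on the previous node and the current node.

```python
def node2vec_weights(graph, prev, curr, p=1.0, q=1.0):
    """Unnormalized weights for the next step of a walk that has just
    moved from `prev` to `curr`."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # step back to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stay near prev (BFS-like)
        else:
            weights[nxt] = 1.0 / q   # move away from prev (DFS-like)
    return weights

# Hypothetical undirected graph stored as adjacency sets.
TOY_GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# With q < 1 the walk prefers the outward move to "d".
WEIGHTS = node2vec_weights(TOY_GRAPH, prev="a", curr="b", p=2.0, q=0.5)
```

Dividing each weight by the sum of all weights gives the transition probabilities for the next step.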
To tackle these challenges, we propose an outgoing degree-based BFS path exploration technique. The outgoing degree of a node $v$, denoted $\deg_{\text{out}}(v)$, is defined as the number of edges directed away from $v$. The intuition behind this method is that when the ratio of the outgoing degrees between two nodes is high, these nodes likely belong to different communities and have significant roles within their respective clusters. By integrating this degree-based exploration into the embedding process, our approach captures more subtle structural features of the graph, leading to improved community detection.
Figure 2 illustrates two distinct communities within the graph. One community consists of nodes V1, V2, V3, and V5, while the other comprises nodes V6, V7, V8, V9, and V10. When our BFS algorithm starts from node V2, it prioritizes the nodes it explores based on the out-degree ratios.
In contrast, the WalkLM method would treat V1 and V6 with equal importance, overlooking the fact that V1 has no outgoing edges whereas V6 has multiple outgoing edges. Our outgoing degree ratio-based algorithm instead prefers the path from V2 to V6 at the next exploration level, as V6 contributes more to the structural relationships of the graph. By focusing on nodes that have a greater influence on those relationships, our approach captures the underlying community structure more effectively, leading to improved clustering and community detection.
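The prioritization described for V2, V1, and V6 can be sketched as follows. This is one possible reading of out-degree ratio-guided BFS, not the exact algorithm used in the experiments: the scoring rule (neighbor out-degree divided by the current node's out-degree), the `top_k` cutoff, the function names, and the toy edges are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical directed toy graph (node -> outgoing neighbors). The edges are
# illustrative assumptions and are not the edges of Figure 2.
GRAPH = {
    "V2": ["V1", "V6"],
    "V1": [],
    "V6": ["V7", "V8", "V9"],
    "V7": ["V10"],
    "V8": [],
    "V9": [],
    "V10": [],
}

def out_degree(graph, v):
    return len(graph.get(v, ()))

def rank_neighbors(graph, v, visited=()):
    """Order unvisited neighbors of v by out-degree ratio, highest first.
    Ties are broken by node name so the result is deterministic."""
    base = max(out_degree(graph, v), 1)  # guard against division by zero
    candidates = [n for n in graph.get(v, ()) if n not in visited]
    return sorted(candidates, key=lambda n: (-out_degree(graph, n) / base, n))

def degree_guided_paths(graph, start, max_hop=3, top_k=2):
    """Breadth-first path exploration that expands only the top_k
    highest-ranked neighbors at each node, up to max_hop hops."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        ranked = rank_neighbors(graph, path[-1], visited=path)
        if len(path) - 1 == max_hop or not ranked:
            paths.append(path)  # hop limit reached or dead end
            continue
        for n in ranked[:top_k]:
            queue.append(path + [n])
    return paths

PATHS = degree_guided_paths(GRAPH, "V2")
```

In this toy graph, V6 is ranked ahead of V1 when exploring from V2 because V6 has three outgoing edges and V1 has none, mirroring the preference described above. Each returned path could then be textualized into a sentence for the corpus.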
4.1 Silhouette score for community detection
Our approach employs the Hybrid Walk algorithm to generate a text corpus, which we then use to learn the graph’s representation or embedding. Subsequently, we apply K-means clustering to identify clusters or communities within the graph. To evaluate the effectiveness of this community detection method, we utilize the silhouette score, a widely recognized metric for assessing cluster quality. This measure proves particularly valuable when analyzing real-world graph datasets that lack ground truth labels.
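The silhouette score of node i is defined as
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
where a(i) is the average distance between node i and all other nodes in its cluster, and b(i) is the average distance between node i and all nodes in the nearest cluster to which i does not belong. The score ranges from -1 to 1, with higher values indicating that a node is well matched to its own cluster and poorly matched to neighboring clusters.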
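As a worked illustration of a(i) and b(i), the sketch below computes the standard silhouette score, s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all points, in plain Python. The 2-D points stand in for node embeddings and the cluster labels are given by hand; both are hypothetical, and the function name `mean_silhouette` is an assumption. It assumes Euclidean distance and at least two clusters.

```python
from math import dist
from statistics import mean

def mean_silhouette(points, labels):
    """Average silhouette score over all points (requires >= 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a(i): mean distance to the other members of the same cluster.
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(same)
        # b(i): mean distance to the members of the nearest other cluster.
        b = min(
            mean([dist(p, q) for q, l in zip(points, labels) if l == other])
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical 2-D embeddings forming two well-separated groups.
POINTS = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
GOOD = mean_silhouette(POINTS, [0, 0, 1, 1])  # labels match the groups
BAD = mean_silhouette(POINTS, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that matches the two groups yields a score close to 1, while a labeling that cuts across them yields a negative score, which is why the metric is useful when no ground truth communities are available.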
5 Experiments
5.1 Experimental Setup
Datasets
We evaluated our algorithm on two real-world datasets, PubMed and Cisco22. PubMed contains a graph of genes, diseases, chemicals, and species. The nodes and edges are extracted according to [16]. A relatively small fraction of diseases are grouped into eight categories.
The Cisco22 dataset represents a network graph where edges denote connections between client nodes (consumers) and server nodes (providers), with attributes like server ports and communication protocols (6 for TCP, 17 for UDP).
5.2 Experimental Results on PubMed Dataset
For graph embedding, we compared three graph textualization methods: the Random_Walk (RW) algorithm used in WalkLM; BFS_Walk, our out-degree ratio based BFS algorithm; and Hybrid_Walk, which combines the random walk with the out-degree ratio based BFS algorithm. We assessed the quality of the resulting graph embeddings on a link-prediction task. To ensure the reliability of our findings, each experiment was repeated five times. The average accuracies for Random_Walk, BFS_Walk, and Hybrid_Walk were 87%, 87%, and 89%, respectively. These results are shown in Figure 3.
As shown in Figure 3, link prediction accuracy varies with the β value and the Max_hop setting, where Max_hop determines the maximum number of hops considered in graph traversal. Among the configurations tested, the Max_hop=3 setting achieved the highest accuracy, peaking at 90% when β = 5. This combination provides the best results within the experimental setup, balancing out-degree ratio and computational cost.
For lower β values, the text corpus is insufficient to provide meaningful contextual information. Conversely, increasing Max_hop beyond 3 requires significantly more computational resources, exceeding the capabilities of the experimental setup, which used an NVIDIA 4080 GPU with 16 GB of memory. Additionally, smaller Max_hop values fail to capture sufficient structural information, leading to suboptimal performance.
Higher Max_hop settings also introduce noise into the model due to the inclusion of distant connections that may not be directly relevant to the target nodes. These distant connections can dilute the influence of closer, more meaningful relationships, leading to a less focused and effective representation of the graph’s structure. As such, the combination of Max_hop=3 and β = 5 is not universally optimal but represents the best configuration achievable within the scope of our experimental parameters.
5.3 Experimental Results on Cisco22 Dataset
We applied the textualization algorithms to the Cisco22 dataset to generate graph embeddings. The resulting embeddings were then projected onto a two-dimensional space using t-SNE (t-Distributed Stochastic Neighbor Embedding), which facilitates visualization and analysis of high-dimensional data. The outcomes of this process are illustrated in Figure 5, which shows clear clusters that indicate well-defined community structures within the dataset.
We evaluated clustering quality with the silhouette score, comparing three methods: WalkLM with RW, our BFS_Walk algorithm, and Hybrid_Walk.
The experimental results showed that the BFS_Walk algorithm outperformed the RW method in terms of graph clustering. This finding underscores the significance of considering local information for effective graph clustering. Furthermore, our Hybrid_Walk algorithm, which balances global and local information, demonstrated approximately 4% higher performance compared to the standalone RW and BFS_Walk methods. This indicates that an integrated approach leveraging both global and local information is beneficial for graph clustering tasks.
These results, presented as silhouette scores, are summarized in Table 3, showcasing the comparative performance of each method.
6 Conclusion
Graph representation learning has advanced significantly, primarily through the development of Graph Neural Networks (GNNs) and their application in various domains such as node classification and community detection. Among these, graph clustering remains a crucial area of research, with its ability to reveal hidden structures and patterns in diverse real-world applications, including fraud detection and social network analysis. Despite their successes, existing GNN-based methods often struggle with challenges such as oversmoothing, limited capacity for directed graphs, and an inadequate focus on node features, which can compromise clustering effectiveness.
To tackle these issues, our proposed hybrid walk approach leverages both random walks and degree-based exploration, providing a robust method for attributed graph clustering. By integrating fine-tuning techniques and enhancing representation learning, the hybrid walk ensures that learned embeddings retain essential structural information, improving clustering performance. Our experiments across benchmark datasets validate the efficacy of our method, showcasing its potential to address current limitations in graph clustering techniques.
As part of our future work, we would like to develop novel effective algorithms for community detection, enhanced LLM-based graph representation learning, and robust path exploration for the textualization of graphs, further advancing the capabilities of graph-based methods in various applications.