Betweenness Centrality is a intriguing metric to rank nodes in a network to reflect their centrality position. The intuition is that nodes with high betweenness centrality score are involved in many processes and greatly influence the flow of the network. In particular, under the assumption that item transfer follows the shortest paths, nodes with higher betweenness centrality has larger influence on the transfer of items through the network. It is widely used in many areas, including social network analysis. However, it is quite computationally expensive and time-consuming to compute betweenness centrality for large graphs.
Let the pair dependency of v to s,t pair be the fraction:
Then the equation to compute betweenness score can be represented as:
Finally, the accumulated dependency values can be computed recursively as:
Currently the fastest algorithm to compute BC score is proposed by Brandes, with O(nm) time complexity and O(n+m) space complexity for unweighted graphs, where n is the number of vertices and m is the number of edges. The algorithm consists of two phases: forward propagation and backward propagation. Forward propagation is a revised version of Breadth-First Search (BFS), which records the distance from the source node while traversing. Then in the background propagation, BC score is computed in a bottom-up manner using the equation third equation listed above.
Pesudocode for the sequential algorithm is shown below:
In this algorithm, computation of all d, sigma and delta are totally independent for different source nodes and only updating bc scores must be atomic. So one potential parallelism can be done on the outer loop of the algorithm.
In addition, the inner for loop for both forward and backward propagation can also be parallelized as long as bc and sigma is updated atomically.
The major challenge of this problem is the massive size and variant degree distribution of real-world graphs(social network, scale-free graph and small world graph etc.). Thus, effectively assign work to GPU computing resources is the key to achieve high performance. Node parallel and edge parallel are the most common approaches for graph computation. However, the performance of node parallel approach suffers significantly from the variant of degrees of vertices, especially for the SIMD pattern on the GPU. On the other hand, edge parallel apporach creates much unnecessary memory access, which also hinders performance. It is difficult to find a simple solution that is preferrable for all types of graphs. As a result, it is better to use a hybrid approach to solve the problem.
Our goal is to explore and analyze performance bottlenecks of implementing betweenness centrality on GPU and come up with a solution that utilizes computing resources efficiently to achieves non-trivial speedup comparing with multi-threaded CPU implementation or even better, GraphLab implementation.
We will demonstrate speedup graphs of using several different implementations on real-data graphs.
1. Source-based Approach
Since computation of d, delta and sigma for different source node is totally independent, it is possible to compute them in parallelism. The only synchronization here is to update bc score atomically. This approach can be easily implemented using OpenMP to enjoy the parallelism.
However, since computation for each node's bc score needs separate memory to store d, sigma, delta and bc for each thread, it is not a feasible GPU implementation, which has hundreds and thousands of threads running.
2. Node-based Approach
Instead of computing each source node in parallel, this approach focuses on parallelizing computation inside a source node. During the forward and backward phase, it assigns each node to a thread.
However, this approch can suffer greatly from load imbalance due to degree divergence. Degree of nodes in most real world graphs follows power law, which means in one GPU warp, the work done by different threads can be significantly different. Since GPU executes the threads in the same warp in SIMD manner, it can cause low hardware resource utilization. The load imbalance caused be skewed degree distribution is illustrated in the figure below:
Thread v5 needs to process node 2 and 9, while thread v6 has to process node 2, 3, 10, 11, 12, 13 and more. Supposed each warp has 64 threads and one node has degree 1,000 while others have degree 1, then for this execution, only about 1/64 of the GPU resource are effectively used. So, this approach can actually perform much worse than the sequential version implemented in CPU.
3. Edge-based Approach
Like node-based approach, this approach also focuses on parallelizing inner for loop for each source node. However, instead assign nodes to threads, it assigns edges to threads.
This approach avoids the problem of divergent execution due to skewed degree distribution in Node-based approach, which is illustrated in the figure below:
However, since it needs to loop over all edges, which are much more than the number of nodes, it requires more memory access. In addition, in backward phase, since different threads may update delta[v] at the same time, it also needs to be updated atomically. So compared to node-based approach, edge-based approach has more memory access and atomic operations. Input data to edge-based approach are in COO format, so different threads in the same warp can have more chance to coalesce memory access, which can leveage the problem of more memory access. In practice, edge-based approach performs much better than node-based approach.
4. Virtual Node Approach
Node-based approach suffers greatly from load imbalance due to skewed degrees distribution, while edge-based approach has the problem of more memory access and atomic operations. Virtual node approach is a compromise between both methods, which divides nodes with high degree into "virtual nodes", and each virtual node only computes on a subset of the original node's neighbors.
Supposed each virtual node can have up to 4 neighbors. In the figure above, node u is divided into two virtual node u1 and u2, which have 4 of the original neighbors of node u. Similarly, a node with 100 edges is divided into 25 "virtual nodes" and each virtual node has 4 of the original node's neighbors. This approach executes all virtual nodes in parallel, which substantially reduces work divergence since each virtual node has similar degrees now.
In the figure shown below, using node-based approach, thread 2 will process all eight neighbors of node 2, which are node 1,, 3, 4, 5, 6, 7, 8, 9. While in the virtual node approach, thread 2 will process neighbor node 1, 3, 4, 5 and thread 3 will process 6, 7, 8, 9. There is also an auxiliary array to records the mapping from virtual node to original node in order to correctly update bc score.
5. Interleaved memory access pattern
Since virtual nodes for the same node are contiguous, it is very likely that they are executed in the same warp. However, in the virtual node approach, they will access non-contiguous memory location in the adjcent array, which may be further improved. For example, in the figure shown above, thread 2 accesses node 1, 3, 4, 5 consecutively and thread 3 accesses node 6, 7, 8 consectively as well. Assume both threads run in the same pace, then the memory access pattern is node 1, 6, 3, 7, 4, 8, 5, 9. If a node has a lot of virtual nodes, the memory access pattern can be even more non-contiguous.
It is possible to improve memory access pattern by making threads running virtual nodes for the same node access interchange memory locations, which is illustrated in the figure below.
If we still assume both threads execute in the same pace, the memory access pattern now becomes 1, 3, 4, 5, 6, 7, 8, 9, which is contiguous in memory. This reminds us of the similar idea of assignment of program instances to loop iterations in ISPC, which is talked in class. (Interleaved assignment vs. Blocked assignment)
6. Hybrid implementation with both CPU and GPU
In approach 1, we discuss that computation for each source node can be done in parallel. We can then extend the idea to make use of both CPU and GPU. On a 12-core machine, we can spawn 24 threads, and each of them has a private structure of d, delta, sigma and bc array to avoid unnecessary atomic operations during the computation. The final bc score is the sum of all local bc scores.
In particular, there is a special thread is responsible for distributing works to GPU. Each time GPU finishes work on a source node, it will grab another source node to process. In the same time, other threads are doing compution locally using pyysical cores on CPU.
7. Removing degree-1 vertices
Intuitively, degree-1 vertices should have bc score of 0 since they are not on the shortest path of any pairs. However, these vertices still occupy CUDA threads during execution and can lead to divergent execution in a warp. As a result, we iteratively removed all degree-1 vertices and add bc score to their neighbors in advance. This can lead to significant speedup for graphs that havemany nodes with degree of 1.
BC score should be precomputed using the following equationn, whose proof can be found in the paper
8. Reordering nodes by BFS order
Grouping nodes based on their BFS order(level order) helps to keep nodes that need to be visited together in both CSR and COO representation. The preprocess helps to improve memory access pattern and thus improves lane usage during SIMD execution on GPU.
We implemented our solution on latedays cluster. It has two, six-core Xeon e5-2620 v3 processors (2.4 GHz, 15MB L3 cache, hyper-threading, AVX2 instruction support) and NVDIA Tesla K40 GPU with CUDA capability of 3.5.
In addition, the heterogeneous implementation achieved 12.23x, 13.11x, and 12.86x speedup over GraphLab implementation on web-NotreDame, web-Stanford, and Amazon0601 respectively.
Graph parameters are as follows:
p2pGnutella30: 36,682 Nodes, 88,328 Edges
email_Enron: 36,692 Nodes, 183,831 Edges
soc-Epinion1: 75,879 Nodes, 508,837 Edges
web-NotreDame: 325,729 Nodes, 1,497,134 Edges
web-Stanford: 281,903 Nodes, 2,312,497 Edges
Amazon0601: 403,394 Nodes, 3,387,388 Edges
We first implemented multi-thread CPU version of betweenness centrality using OpenMP. This approach parallelize the problem on each source node(outter loop) and maintains a frontier at each iteration, so it is work efficient. When updating the nodes, threads are tend to visit continuous memroy(neighbor array) thus also benefits from locality. As a result, the OpenMP version achieves good performance(better than GPU version) when the size of graph is small. Because the worker pool size is fixed, when graphs get large, massive SIMD execution of GPU programming starts to dominate.
As dicussed in previous section, node parallel approach has great divergence that results in low performance
From the profiler output, we can figure out that execution time of each CUDA thread in node parallel approach differs from 5.5 micro seconds to 2.6 millseconds. The upper bound is 500 times of the lower bound. Therefore it clearly demonstrate divergence of SIMD execution.
For edge parallel approach, divergence is significant lower as shown in the profiler data.
However, this approach consumes O(m) memory instead of O(n) in node parallel approach, where m and n are number of edges and nodes in a graph. m is normally much larger than n thus edge parallel approach makes more memory access. It also needs atomic instructions to add up bc value at the end. Those facts become the bottleneck of performacne of edge parallel approach.
Inorder to reduce divergence and achieve better work efficiency, we implemented the virtual_node method. It combines the advnatages of both node parallel and edge parallel approach.
As shown in the graph, virtual_node implementation reduces the divergence to a reasonable level comparing with node parallel impelmentation and meanwhile, generates less memory trafic than edge parallel method does. Through our experiments, this approach has higher performance than the other two in most cases. In order to further improve performance, we take andvantage of the coalesced memory access feature of CUDA. That is, CUDA will integrate continuous address memory requests from threads into one single request and reduce the number of memory access. The experiment shows that this approach indeed improves performance.
Finally we combined the best of both worlds and impelemented the heterogeneous version to maximize utility of computing resources. The heterogeneous implementation calculates betweenness centrality with 23 CPU threads and the GPU using virtual node apprach with coalesced memeory. One interesting observation is that the speed up of this approach on large graphs exceeded the speedup of simply adding two versions of implementation. We believe that was because in heterogeneous version, much less data was transferred between main memory, which led to a surprising speedup.
1. Betweenness centrality on GPUs and heterogeneous architectures (Catalyurek et al., 2013)
2. Scalable and high performance betweenness centrality on the GPU (McLaughlin et al., 2014)
3. Edge v. Node Parallelism for Graph Centrality Metrics (Jia et al., 2011)
4. Accelerating CUDA Graph Algorithms at Maximum Warp (Hong et al, 2011)
5. Revisiting Edge and Node Parallelism for Dynamic GPU Graph Analytics (McLaughlin et al., 2014)
6. Shattering and Compressing Networks for Betweenness Centrality (Ahmet et al., 2013)
List of Work By Each Student
Equal work was performed by both project members.