CMU 15-418/618 (Spring 2013) Final Project Report
Parellelization of Learning via Factor Graph Model
Xiaowen Ding, Yu Zhao
Main Project Page
Background
In probability theory and its applications, a factor graph is a particular type of graphical model, with applications in Bayesian inference, that enables efficient computation of marginal distributions through the sum-product algorithm. Recently several studies are focusing on some application of factor graph model on social network where we denote person as node and relationship of two people as edge such as predicting mood or activity in social network, or inferring social ties.

In most applications, we only need to consider about the factors of at most two nodes in the graph. Therefore in fact there are two kinds of factors in the graph:
• Attribute factor: It is the factor of one node, which represents the posterior probability of the label of the node given some attributes;
• Correlation factor: The factor of two nodes, which denotes the correlation between the labels of two nodes.
Here is an illustration of correlation factor. Here the edges of different colors denotes different kinds of factors between nodes. The factors with the same color denotes same correlations. Our goal is to learn the parameters(weight) of all these correlation as well as attribute factors.

Here are two examples of applications via this factor graph.
• Mood Prediction: The most common application of this model is predicting some features in the social network, i.e. people's mood. Here node in the factor graph represents a person in the social network, and the label of the node represents whether a person is happy or not. Then attribute factor of a node just represents some activities or location information of one person, while the correlation factors between two nodes represents the relationship of two people. There might be different kinds of correlation factors since there exist different types of relationships, i.e. co-workers, classmates, friends...
• Deducing Social Tie: This is a different example. Here we want to infer different kinds of relationship in social network, so we will let the node of the factor graph to be the relationship of two people. The label of the node represents whether the relationship is a specific type or not, i.e. advisor-advisee, employer-employee. The edges in the factor graph represents two relationship with a common people.
As real social networks may contain millions of vertices and edges, it is important for the learning algorithm to scale up well with large networks. Therefore we need to investigate a parallel algorithm which has good speedup as well as acceptable performance.
Serial Algorithm
Since the algorithm description contains a lot of symbols and formulas, we write it in the following pdf file. [pdf] The learning algorithm is summarized above.

Here a big challenge is that the graphical structure in this model can be arbitrary and may contain cycles, which makes it intractable to directly calculate the expectation. A number of approximate algorithms have been proposed, such as Loopy Belief Propagation(LBP). We utilize LBP to approximate marginal probabilities. With the marginal probabilities, the gradient can be obtained by summing over all nodes. Finally with the gradient, we update each parameter with a learning rate.
Parallel Approach
As real social networks may contain millions of users and relationships, it is important for the learning algorithm to scale up well with large networks. To address this, we develop a parallel learning method based on MPI.

The learning algorithm can be viewed as two steps:
• 1) compute the marginal probabilities via loopy belief propagation;
• 2) compute the gradient descents and optimize all parameters with the gradient descents.
The most expensive part is the step of calculating marginal probabilities via LBP.

We can also see that in fact there exist nested loops in our algorithm. The outer loop is repeatedly update parameters via gradient descend method until converge, The inner loop is calculating marginal probabilities via LBP which is also an iterative method.

Our first try is parallel implementation of Loopy Belief Propagation. In another word is to parallel the inner loop. The intuitive idea is that we distribute the graph into all core, and we need to pass messages via the edges between nodes in different cores in each iteration in Loopy Belief Propagation. However this method has a big problem. In fact each iteration of inner loop will not cost too much time, therefore the time cost on the message passing will become the bottleneck and intolerable.

On the other hand, the parallelization of the outer loop is not easy, because the gradient is depended on the total graph and Loopy Belief Propagation is an iterative method. Then we come up with the idea that divide the graph into some subgraphs and assign each core a subgraph. Let each core calculate marginal probabilities and compute the gradient descent on its own subgraph. Then we can accumulate all these gradient descents together to get an appropriate gradient descent of the total graph.

Notice that in fact we eliminate the edges between different subgraphs. This parallel implementation indeed affect the performance of this learning algorithm. Therefore our goal is to divide the graph into roughly equal parts (to ensure a better parallel speedup), as well as the edges between different parts should be as few as possible (to ensure a better learning performance). We finally decide to use a graph partitioning system METIS to partition our graph.  We adopt a master-slave architecture, i.e., one master node is responsible for optimizing parameters, and the other slave nodes are responsible for calculating gradients. At the beginning of the algorithm, the graphical model is partitioned into P roughly equal parts, where P is the number of slave processors. This process is accomplished by graph segmentation software METIS. The subgraphs are then distributed over slave nodes. Note that in our implementation, the edges (factors) between different subgraphs are eliminated, which results in an approximate, but very efficient solution. In each iteration, the master node sends the newest parameters to all slaves. Slave nodes then start to perform Loopy Belief Propagation on the corresponding subgraph to calculate the marginal probabilities, then further compute the parameter gradient and send it back to the master. Finally, the master node collects and sums up all gradients obtained from different subgraphs, and updates parameters by the gradient descent method.

All the codes are implemented in C++, and all experiments are conducted on six cores GHC 3K machines.
Results

Here we use a publication data set from Arnetminer and try to infer the advisor-advisee relationship from the coauthor network. (As mentioned in the second example in the Background section.) Here the nodes in the graph denotes the relationship of two coauthors. Attribute factors are some features about the publication while Correlation factors (edges) exist when two relationships have a common person.

This data set has 6,096 nodes with 2,164 positive labels and 3,932 negative labels. There exist 24,268 edges in this graph.  We now evaluate the scalability performance of our parallel learning algorithm. The first two figures show the running time and speedup of the parallel algorithm with different number of computer nodes (2,3,4,8,12 threads) used. We observe that the speedup curve is close to the perfect line at the beginning. Although the speedup inevitably decreases when the number of cores increases, it can achieve 8x speedup with 12 threads.

It is noticeable that the speedup curve is beyond the perfect line when using 4 threads, it is not strange since our distributed strategy is approximated. In our parallel implementation, graphs are partitioned into subgraphs, and the factors across different parts are discarded. Thus, the graph processed in parallel version contains fewer edges, making the computational cost less than the amount in the original algorithm. Also, LBP may require less iteration to converge in small graph.

When the number of cores increases, although we applies good graph partition algorithm, load balance and communication become important issues. Also, GHC machines have 12 virtual cores, when we make use of all of them, they would have less computation performance.  An important problem we care about is whether "discarding" some edges would affect the performance much. The effect of subgraph partition is illustrated in the last two figures. By using good graph partition algorithm such as METIS, the performance only decreases slightly (1.4% in accuracy and 1.6% in F1-score), which is much better than random partition. A theoretical study of the approximate ratio for the parallel learning algorithm would be an interesting issue and is also one of our ongoing works.

To sum up, our algorithm achieves good speedup while maintaining the quality of solution when we run with a small number of threads (say 2,3,4). But it's hard to scale well with a large number of threads because load imbalance would decrease the speedup, and discarding too many edges may decrease the performance.
References
Open MPI
http://www.open-mpi.org/
MeTis: Unstrctured Graph Partitioning and Sparse Matrix Ordering System
http://dm.kaist.ac.kr/kse625/resources/metis.pdf
ArnetMiner
http://arnetminer.org/
Deliverables
List of work
Xiaowen Ding: Serial version of algorithm, Study different ways to split the graph, Do experiment and optimize ,Checkpoint report
Yu Zhao: Study ideas about factor graph, Parallel version of algorithm, Do experiment and optimize, Final report, Maintain website