Project Proposal:

ADVENTURES IN MONTE CARLO: THE DUMPLING SIMULATION

by Kiraz Baysal and Jen Solyanik

Summary

We are going to create a Monte Carlo simulation of the Chinese Restaurant Process on Blacklight using MPI.
It is intended as an extension to cdec, an open-source machine translation decoder written in C++. We will be working with Chris Dyer, an LTI faculty member who maintains the system.
Background

The Chinese Restaurant Process (CRP) is a stochastic process that models the clustering of data. It assumes a restaurant with an infinite number of tables, each with infinite capacity, and an infinite number of (ordered) patrons. Each patron either sits at an already occupied table or at a new, unoccupied one. Thus, each table represents a cluster of data points.
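Concretely, if n patrons are already seated and table k currently seats n_k of them, the standard CRP with concentration parameter α (notation ours; the proposal does not fix symbols) seats the next patron according to:

```latex
P(\text{patron } n{+}1 \text{ joins existing table } k) = \frac{n_k}{n + \alpha},
\qquad
P(\text{patron } n{+}1 \text{ opens a new table}) = \frac{\alpha}{n + \alpha}.
```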
Machine Learning focuses on building models that predict the correct result. In Natural Language Processing, such models are used to model human languages. The most naive models weight an event by how often it occurs in a given chunk of training data. However, if a perfectly legal event never occurs in the training data, the model assigns it probability 0, and it may be discarded in practice. This leads to very poor results.

The CRP avoids zero probability weights because a new patron always has some probability of sitting down at an empty table. It can also be used as a clustering algorithm to find data-dependent clusters for building a more advanced model. Thus, CRPs help build "smart" models with better results.

We will parallelize the process so that there are p restaurants (one per processor). This way we can "seat" more customers at a time, and so process much more data. How connected these restaurants are will be decided as we write and test different implementations. The two extremes regarding communication are to exchange data at every step, so that each restaurant sees all of the same customers, or to keep the restaurants completely separate.

Challenge

The main challenge is to parallelize the program without sacrificing correctness. Since the algorithm is random, the result changes from run to run, so accuracy will also be difficult to determine. Our goal is to use as little communication as possible while still producing an accurate answer. There is also a significant amount of sophisticated math involved in this model. There is not necessarily a divergent calculation, but we will be computing the probability of a customer sitting at an empty table versus an occupied one. This is the point that must access memory, as we will have to keep track of the number of occupied tables and the number of people at each table. Each person carries a unique tag, and when they sit at a table, that tag transfers to the table.
Resources

Our starting point is the single-processor implementation of the project.
We will also be using the Boost interface for MPI to parallelize.
We do not believe we will need any more resources for this project.

Goals/Deliverables

We have two well-defined goals for this project, and one "reach".
1. Software engineering quality. Because this is an open source project, our code needs to integrate cleanly with, and be usable from, the existing code base.

2. Reasonable models. This will be evaluated by comparing the performance of our distributed implementation against the single-processor implementation we are beginning with.

REACH: Once we have fulfilled the above goals, we will measure the perplexity of the language models built by our CRP. This is similar to our second goal, but the results are more meaningful for NLP purposes.

Platform

We will be using MPI and Blacklight to do this project.
We believe that using MPI makes sense because the parallelized program should not require very much communication between processes, and allowing processors to calculate probabilities without the use of locks will improve efficiency. Also, some elementary MPI work already exists in the project via the Boost interface.

Blacklight is a good resource for us because it offers more processors than the GHC cluster machines. Since the point is to make the work as distributed as possible, more processors will give us better results.

Proposed Schedule


Week | What We Plan To Do |

Apr 1-7 | Understand the math behind the CRP algorithm, and familiarize ourselves with the existing code base. |

Apr 8-14 | Begin a simple parallel implementation. |

Apr 15-21 | Debugging + improving our implementation |

Apr 22-28 | Slowly begin more complicated algorithms |

Apr 29-May 5 | Debugging debugging debugging |

May 6-11 | Last minute panic + wrapping the project up. And winning. |