
Project 4: Guidelines


Like the previous projects, create a directory named "FirstAndrewID-SecondAndrewID" at:


If you need to submit updated versions, create new directories named "FirstAndrewID-SecondAndrewID.2", "FirstAndrewID-SecondAndrewID.3", etc. We will look at only the newest version.

Important Things to Note

  • Provide a Makefile or an equivalent build script (e.g., a build file for ant). We need to be able to compile your submitted code on Andrew machines from the command line. In particular, we will not be able to use Eclipse to import and compile your project.
  • All your source code should be placed in a directory named "src," under the root of your submission directory.
  • Avoid any hard-coded parameters, such as host names, path names, and port numbers.
  • Submit your report in PDF, named "report.pdf," under the root of your submission directory.
  • Make sure to explain the overall design of your system, not just descriptions of classes you implemented. E.g., what are the components of your system and what are their roles? How do they interact?
  • Test your implementation on Andrew machines before submission.
  • Provide step-by-step instructions for running and testing your code, so that people outside your team are able to do so easily.
  • Please do not submit a revision history with your code. E.g., no .git in your source directory.
  • Make your code readable by others, with proper indentation and comments placed as appropriate.
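One way to follow the guideline on hard-coded parameters is to read host names, paths, and port numbers from the command line. The sketch below is only an illustration: the function name `parse_args` and the flags `--input`, `--clusters`, `--host`, and `--port` are hypothetical and not part of the assignment.

```python
import argparse


def parse_args(argv=None):
    """Read run-time parameters from the command line instead of hard-coding them.

    All flag names here are hypothetical examples, not required by the project.
    """
    parser = argparse.ArgumentParser(description="K-means runner (illustrative)")
    parser.add_argument("--input", required=True, help="path to the input data set")
    parser.add_argument("--clusters", type=int, required=True, help="number of clusters")
    parser.add_argument("--host", default="localhost", help="coordinator host name")
    parser.add_argument("--port", type=int, default=0,
                        help="coordinator port; 0 asks the OS for a free port")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)
```

Calling `parse_args(["--input", "points.csv", "--clusters", "3"])` returns an object whose `input`, `clusters`, `host`, and `port` attributes can be passed to the rest of the program. The same idea carries over to Java or any other language via its own argument-parsing facilities.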

    Please note that submissions not following these guidelines can significantly affect the grading process. Make sure you name your submission directory in the way specified above, so we can identify and handle each submission properly. If you want to submit a compressed archive, place it in the correctly named directory.


    1. One critical part of the report is an analysis of the performance scalability of your parallel implementation, compared against your sequential version as a baseline. Your report needs to demonstrate this aspect, for example with graphs showing completion time for varied levels of parallelism. It should also include your explanations of the performance observed.
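As one possible way to prepare the numbers behind such a graph, the sketch below turns measured completion times into speedup and efficiency relative to the sequential baseline. The function name `speedup_table` and its input layout (a dictionary mapping worker count to seconds) are assumptions made for this illustration.

```python
def speedup_table(sequential_seconds, parallel_seconds):
    """Return (workers, time, speedup, efficiency) rows, sorted by worker count.

    sequential_seconds: completion time of the sequential baseline.
    parallel_seconds:   dict mapping number of workers -> completion time.
    """
    rows = []
    for workers in sorted(parallel_seconds):
        elapsed = parallel_seconds[workers]
        speedup = sequential_seconds / elapsed   # how many times faster than the baseline
        efficiency = speedup / workers           # 1.0 would be perfect linear scaling
        rows.append((workers, elapsed, speedup, efficiency))
    return rows
```

For instance, with placeholder timings (not real measurements) of 100 s sequentially and 40 s on 4 workers, the row for 4 workers has a speedup of 2.5 and an efficiency of 0.625. Efficiency that falls as workers are added is the kind of observation the report should explain.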

    K-Means Implementation

    1. For DNA strands, you could compute centroids in different ways. One option is to use the most common base for each position in the strand among all the strands of the corresponding cluster. Another option is to derive a probability distribution over the bases, again for each position in the strand. In either case, you would need to define how to calculate the distance between a centroid and a given strand. Please clearly describe your definitions in your report.
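The two centroid options above can be sketched as follows. This is a minimal illustration, assuming all strands in a cluster have the same length and use only the bases A, C, G, and T. The distance functions (Hamming distance for the majority centroid, expected number of mismatches for the distribution centroid) and the tie-breaking rule are one possible choice, not a required definition, and all function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def majority_centroid(strands):
    """Option 1: most common base at each position (ties go to the earlier base in BASES)."""
    centroid = []
    for i in range(len(strands[0])):
        counts = Counter(s[i] for s in strands)
        centroid.append(max(BASES, key=lambda b: counts[b]))
    return "".join(centroid)


def hamming_distance(a, b):
    """Number of positions at which two equal-length strands differ."""
    return sum(x != y for x, y in zip(a, b))


def distribution_centroid(strands):
    """Option 2: per-position probability distribution over the bases."""
    n = len(strands)
    centroid = []
    for i in range(len(strands[0])):
        counts = Counter(s[i] for s in strands)
        centroid.append({b: counts[b] / n for b in BASES})
    return centroid


def expected_mismatch(centroid, strand):
    """Expected number of mismatches between a strand and a draw from the centroid."""
    return sum(1.0 - probs[base] for probs, base in zip(centroid, strand))
```

With the majority centroid, both centroid and strand are plain strings, so the distance is cheap to compute. The distribution centroid keeps more information about each cluster, at the cost of a larger representation to exchange between parallel workers.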

    Random Data Set Generator

    1. You can implement the data set generators in any language of your choice. Please make sure they can be executed in the environment of ghc Andrew machines.
    2. For 2D points, you can use the provided data set generator, or choose to implement your own version. For DNA strands, you need to implement one on your own.
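A DNA strand generator could, for example, draw one random seed strand per cluster and emit noisy copies of it, so that the data has cluster structure for K-means to recover. The sketch below is only one possible design: the function name `generate_strands`, its parameters, and the mutation model are assumptions for illustration, not a specification.

```python
import random

BASES = "ACGT"


def generate_strands(num_clusters, strands_per_cluster, length, mutation_rate, seed=None):
    """Generate DNA strands grouped around randomly chosen seed strands.

    Each position of a copy is re-drawn uniformly from BASES with probability
    mutation_rate (the re-draw may pick the same base again).
    """
    rng = random.Random(seed)  # a fixed seed makes the data set reproducible
    strands = []
    for _ in range(num_clusters):
        seed_strand = [rng.choice(BASES) for _ in range(length)]
        for _ in range(strands_per_cluster):
            mutated = [
                rng.choice(BASES) if rng.random() < mutation_rate else base
                for base in seed_strand
            ]
            strands.append("".join(mutated))
    rng.shuffle(strands)  # avoid emitting the strands grouped by cluster
    return strands
```

A call such as `generate_strands(3, 5, 20, 0.1, seed=1)` returns 15 strands of length 20. Taking the sizes and the mutation rate as parameters, rather than fixing them in the code, also makes it easy to produce data sets of different sizes for the scalability experiments.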