We have implemented two versions of sudoku solvers in parallel using the OpenMp library: a humanistic solver and a brute force solver. The humanistic solver is based off Crook's Algorithm, which is a deterministic solver except for cases in which guessing and backtracking is necessary. The brute force solver searches through permutations of possibly correct answers until it reaches a solution. We implemented two versions because the humanistic version, although very fast, did not benefit much from parallelization while the brute force method showed considerable speedup. When run on the GHC 3000 machines with 16 threads, the humanistic solver showed 3.6x speedup while the brute force solution showed 16.0x speedup.
Sudoku is a logic-based number puzzle featuring a 9x9 board divided into rows, columns, and 3x3 square blocks. The goal of the game is to fill each row, column, and block with unique numbers in the range 1 through 9. Variations of sudoku have used larger boards, such as hexadoku: sudoku with a 16x16 board.
Here is an example of a sudoku puzzle (left) along with its solution (right):
Crook's Algorithm describes a few methods for deterministically solving cells in sudoku puzzles, four of which are used by our humanistic solver:
Sudoku has a lot of potential for parallelism because the grid layout naturally allows different regions to be computed independently. However, the humanistic solver above is a quadratic time solver, which is already very fast considering the input size is small (81). So, the brute force method definitely has more potential for parallelism. Since sudoku is NP-complete, solving the board using brute force could result in substantial speedup.
The humanistic solver tries the four methods in the order above. If any method makes changes, then the solver starts over from elimination. This is useful since most changes are made by elimination and lone rangers, and also twins and triplets take more time to find.
The following is a flow chart to visualize this process:
We parallelized each of the steps (outlined by a rectangle), and joined threads in between. Elimination was performed on each cell. Lone rangers, twins, and triplets was performed on each row, column, and block in parallel.
We realized early on that just having a board of solved cells was not a great idea. It would be better to keep a separate board in which each cell contained all possible values for that cell. This prevented our algorithm from searching for possible values every time we tried to solve a cell, and thus let us avoid synchronization in many situations.
However, we could not avoid synchronization completely at first, so our original parallel implementation actually slowed down with more threads. We went back and reviewed some of the parallel decisions we had made, particularly the reader-writer lock we were using. We were structuring the code in a way that made fine-grained synchronization necessary, but by modifying how we implemented lone rangers, we were able to replace locks with atomic updates.
The brute force solver searches through valid cell values in parallel until the entire board is solved. It does so by iterating through the following steps:
We spent a substantial amount of time debugging race conditions, so we did not have enough time to implement some of the features we would've liked to, such as:
We measured speedup on 16x16 sudoku boards that ranged from easy to expert. We used 16x16 instead of 9x9 boards because the smaller board solved too quickly to show noticeable speedup. We tested with 1, 2, 4, 8, and 16 OpenMP threads. For each data point, we took the median of five tests as the final value.
The humanistic solver achieved linear speedup for 2 threads, but tapered off with more threads. We suspect this was because of two reasons:
Here are the results:
It is interesting to note that the humanistic solver achieved better speedup on the more challenging puzzles. This is perhaps because the difficult puzzles used harder techniques (i.e. twins and triplets) which would benefit more from parallelism, and they ran for more iterations.
The brute force solver achieved linear speedup, in fact even superlinear speedup on some tests. The superlinear speedup was possible because with multiple threads, the solver could arrive at the solution faster, and kill all the other threads before they did extra work. The results are graphed below:
(Note that the expert sudoku puzzle was not tested. It took so long to run (> 1 hour) that we decided not to include it in our results.)
Because we never implemented some of the features mentioned earlier, there is still a lot more room for improvement. For example, contention for the shared stack was extremely high. We suspect that if we had implemented separate work stacks with work stealing, we would have seen better than linear speedup.