I implemented a serial algorithm to solve a Sudoku board, then parallelized it using OpenMP. I ran the parallel implementation with 1 to 12 OpenMP threads on ghc30, a Gates Machine, comparing the speed-up of the parallel implementation across Sudoku boards of varying difficulty and size. By graphing the speed-up across different boards and processor counts, I concluded that my algorithm achieves linear speed-up until it plateaus at 5x on 6 processors.
Sudoku is a puzzle game where a player must fill an NxN grid, where N is traditionally 9, so that every row, column, and 3x3 grid contains each number between 1 and N exactly once.
As you can see from this image, the black-colored numbers are the values of the board that are given to us, and the blue numbers are the values that we filled in ourselves. Together they form a board solution, where each row, column, and 3x3 grid contains the numbers 1 through 9 exactly once.
My algorithm takes as input a text file containing the sudoku puzzle and outputs the completed sudoku board. After reading in the puzzle, it stores the values in a 2-D array of shorts.
I use a stack to hold the partially-filled sudoku board states: a stack is the natural choice because I am adding new, deeper board states during the search and then removing board states to check for solution validity. The most important operations on the stack are push and pop; these require a great deal of caution in the parallel implementation because of the possibility of concurrency errors.
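The stack interface can be sketched as follows; the names here are illustrative, not the exact declarations used:

```cpp
#include <stack>
#include <vector>

// A partial board state: a 2-D array of shorts, 0 = empty cell.
using Board = std::vector<std::vector<short>>;

// Partial board states live on a stack (LIFO): new, deeper states are
// pushed as the search descends and popped to test for validity.
std::stack<Board> boardStack;

void pushState(const Board& b) { boardStack.push(b); }

// Pops and returns the top state; the caller must check empty() first.
Board popState() {
    Board b = boardStack.top();
    boardStack.pop();
    return b;
}
```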
The most computationally heavy aspect of my algorithm is that it backtracks through every possible permutation of the sudoku board until it finds the solution. If I do not allocate work properly among my threads, little speed-up will be achieved, because there will be heavy contention on the data structure containing all of the board permutations: the stack.
My approach is a brute-force backtracking method. I selected this approach because the problem is essentially a growing search tree with an enormous branching factor. I can parallelize the algorithm by dividing work among my workers, the work being pushing new board states onto the growing stack and popping board states to check whether they are valid solutions.
I use the following steps to permute the board states:
1. After popping a board state from the stack, I start at the first empty cell on the board and fill it with 1, the smallest possible value.
2. I then proceed to the next empty cell and fill it with the smallest value possible there.
3. If filling a cell with a given value violates the main Sudoku rule (no repeated number in a row, column, or 3x3 grid), I backtrack to the board state that most recently filled an empty cell successfully, increment the value in that cell by 1, and continue the algorithm from there.
4. If I cannot increment that value any further because it is already 9, I backtrack again until I find a cell whose value can still be incremented.
I repeat these steps until one of the following conditions is met:
1. There are no more empty cells left on the sudoku board. Therefore, we have found a solution and can print this board as an output.
2. We have backtracked past the very first empty cell of the board, having exhausted all of its values. This means that we were unable to find a solution, and we therefore report that no solution exists for the board.
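The steps above can be sketched as a recursive routine that behaves the same as the explicit-stack version described here; the names and the recursive formulation are illustrative, not my exact code:

```cpp
#include <array>

using Board = std::array<std::array<short, 9>, 9>;  // 0 = empty cell

// Check whether placing `val` at (row, col) keeps the board legal:
// no duplicate in the row, the column, or the enclosing 3x3 grid.
bool isValid(const Board& b, int row, int col, short val) {
    for (int i = 0; i < 9; ++i)
        if (b[row][i] == val || b[i][col] == val) return false;
    int r0 = (row / 3) * 3, c0 = (col / 3) * 3;
    for (int r = r0; r < r0 + 3; ++r)
        for (int c = c0; c < c0 + 3; ++c)
            if (b[r][c] == val) return false;
    return true;
}

// Backtracking search: fill the first empty cell with the smallest legal
// value; on a dead end, return false so the caller tries the next value
// (the "increment and retry" step in the text).
bool solve(Board& b) {
    for (int r = 0; r < 9; ++r)
        for (int c = 0; c < 9; ++c) {
            if (b[r][c] != 0) continue;
            for (short v = 1; v <= 9; ++v) {
                if (!isValid(b, r, c, v)) continue;
                b[r][c] = v;
                if (solve(b)) return true;
                b[r][c] = 0;  // backtrack
            }
            return false;  // no legal value fits: dead end on this branch
        }
    return true;  // no empty cells left: solved
}
```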
After testing and confirming that the serial implementation of the sudoku solver works, I proceeded to parallelize the algorithm. I kept everything the same, but made a few key changes:
1. I made my stack global, making it accessible to all threads.
2. I added a lock to the stack and require threads to acquire it whenever they push or pop. This is extremely important: without the lock, concurrent pushes and pops could corrupt the stack, and the solution board could be missed entirely.
3. Whenever a thread pops a board from the stack, it will find all possible permutations of the first empty cell available, then push all of these permutations onto the stack. This is different from the serial implementation, where I only pushed 1 permutation back onto the stack (from increasing the value once).
4. Any thread that is not currently working on anything pops a board off the stack and repeats the steps above.
5. Once a solution is found, all threads will terminate their work, and the thread with the correct solution will return the solution. Once again, if the stack is empty, then no solution exists.
I wrote both of my algorithms in C++. For the parallel implementation with OpenMP, I used #pragma omp parallel, which enables sections of my program to run in parallel, and #pragma omp critical, which ensures that threads have exclusive access to certain data structures in the critical region, namely the stack. This is important because having multiple workers push and pop simultaneously can create many concurrency errors. The parallel work is divided equally among the n threads available, where n ranges from 1 to 12. Ghc30, the 3K Gates Machine I am utilizing, has 6 cores, so with 12 threads, two threads share each core via hyperthreading.
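A sketch of this scheme using #pragma omp parallel and #pragma omp critical follows. The identifiers, the flattened board layout, and the in-flight counter used for termination are illustrative assumptions, not my exact code (compile with -fopenmp):

```cpp
#include <stack>
#include <utility>
#include <vector>

using Board = std::vector<short>;  // flattened N*N board, 0 = empty cell
constexpr int N = 9;

// Placing `val` at (row, col) must not duplicate a value in the row,
// column, or enclosing 3x3 grid (this sketch assumes N = 9).
bool isValid(const Board& b, int row, int col, short val) {
    for (int i = 0; i < N; ++i)
        if (b[row * N + i] == val || b[i * N + col] == val) return false;
    int r0 = (row / 3) * 3, c0 = (col / 3) * 3;
    for (int r = r0; r < r0 + 3; ++r)
        for (int c = c0; c < c0 + 3; ++c)
            if (b[r * N + c] == val) return false;
    return true;
}

int firstEmpty(const Board& b) {  // index of first empty cell, -1 if full
    for (int i = 0; i < (int)b.size(); ++i)
        if (b[i] == 0) return i;
    return -1;
}

// Each thread pops a partial board, expands its first empty cell into
// every legal successor, and pushes those back. One critical section
// guards the shared stack; inFlight prevents a premature exit while a
// peer still holds a board that may produce more work.
Board parallelSolve(const Board& start) {
    std::stack<Board> work;
    work.push(start);
    Board solution;
    bool found = false;
    int inFlight = 0;

    #pragma omp parallel
    while (true) {
        Board b;
        bool got = false, done = false;
        #pragma omp critical
        {
            if (found || (work.empty() && inFlight == 0)) {
                done = true;
            } else if (!work.empty()) {
                b = work.top();
                work.pop();
                got = true;
                ++inFlight;
            }
        }
        if (done) break;
        if (!got) continue;  // stack momentarily empty; peers still expanding

        int cell = firstEmpty(b);
        std::vector<Board> succ;  // built outside the lock to limit contention
        if (cell >= 0)
            for (short v = 1; v <= N; ++v)
                if (isValid(b, cell / N, cell % N, v)) {
                    Board next = b;
                    next[cell] = v;
                    succ.push_back(std::move(next));
                }
        #pragma omp critical
        {
            if (cell < 0) { found = true; solution = b; }  // full board: solved
            for (auto& s : succ) work.push(std::move(s));
            --inFlight;
        }
    }
    return solution;
}
```

Without -fopenmp the pragmas are ignored and the same code runs serially, which is convenient for debugging.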
For the following graphs, it is important to note that the execution times recorded are the average of 10 trials at each thread count, as there is a fair amount of variance between trials. More trials could have been run for even more accurate results, but I felt that 10 were enough. The baseline for each graph is the parallelized implementation run on 1 processor with the respective board.
Here are the results of running my parallelized algorithm on an easy-difficulty 9x9 sudoku board. Unfortunately, this data is inconclusive, because the runtime difference between the different processor counts is so small. The execution times are on the order of thousandths of a millisecond, so I cannot accurately analyze the impact my parallelized algorithm had on this board.
I next chose a moderately difficult sudoku board, one with more unfilled cells, in order to increase the execution time so that I could more accurately analyze the effects of my algorithm. After running my algorithm on this medium-difficulty 9x9 board, the speed-up is slightly noticeable. One can observe a reasonable difference in execution time from 1 to 3 processors, but once again the range of the execution time matters: it is in hundredths of a millisecond. So again, I cannot accurately analyze the impact my parallelized algorithm had on this board because of the minuscule differences in execution time.
After some more searching, I found one of the most difficult sudoku boards ever produced, designed as an extreme worst case for solvers. On this worst-case-difficulty 9x9 board, the speed-up is extremely noticeable. First, the time scale is at a much more reasonable range, now in ones and tens of milliseconds. The drop in execution time from 1 to 4 processors is very large, which demonstrates that my algorithm succeeded in decreasing execution time. Note that from around 6 processors onward, the execution time plateaus.
Here is the corresponding speed-up graph of the worst-case 9x9 board. As you can see, there is linear speed-up until 6 processors, where the amount of speed-up plateaus at 5x.
Next, I ran my algorithm on a medium-difficulty 64x64 sudoku board. I chose 64x64 arbitrarily, as I was mainly interested in testing my algorithm on a larger board; any N^2 by N^2 board with N > 3 would have sufficed. As you can see, the graph of execution times is almost identical to that of the worst-case 9x9 board. The scale of the execution time is slightly wider, with the 1-processor time at 67 ms compared to 37 ms for the previous board, but the difference in execution time across the first 6 processor counts is still very noticeable.
Here is the corresponding speed-up graph of the medium-difficulty 64x64 board. Just like the worst-case 9x9 board, there is linear speed-up until 6 processors, where the amount of speed-up plateaus at 5x.
From running my algorithm on the easy- and medium-difficulty 9x9 boards, I concluded that it is extremely difficult to measure speed-up on these simpler boards: they are solved too quickly, and the number of processors has little to no effect on the overall speed-up. The execution times for these boards were also extremely small, in the hundredths and thousandths of a millisecond, so the differences between processor counts were too small to be meaningful.
From the worst-case 9x9 board, it is clear that speed-up increases linearly, then plateaus at 5x at 6 processors. The medium-difficulty 64x64 board produced almost identical results. The fact that the 64x64 board matched the 9x9 board verifies that my algorithm scales with input size. However, my algorithm's limitations scale as well: the 64x64 board also plateaus in speed-up at 6 processors, just like the worst-case 9x9 board.
There is a lot of potential for improvement in my algorithm. One of the biggest flaws in my implementation is the high contention on my global stack. I speculate that this is due to the lock I implemented, compounded by the fact that the workload per thread is fairly small: threads only need to check validity for one cell, as opposed to an entire row, column, or 3x3 grid of cells. If a thread finishes its computation before the lock on the stack becomes free, it must idle until the lock is released, doing nothing useful. Therefore, increasing the computation each thread does, for example by making each thread check validity for a whole row, column, or 3x3 grid, would significantly decrease the time threads spend idling on the lock and thus reduce contention on the stack.
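One possible form of this coarser-grained check, sketched with illustrative names (none of these functions exist in my implementation), validates an entire unit per call instead of a single cell:

```cpp
#include <array>

using Board = std::array<std::array<short, 9>, 9>;  // 0 = empty cell

// Validate one nine-cell unit (a row, column, or 3x3 grid): no non-zero
// value may appear twice. Giving each thread unit-level checks like this
// raises the compute-to-lock ratio, reducing idling on the stack's lock.
bool unitOk(const short* vals) {  // vals points at 9 values
    int seen = 0;
    for (int i = 0; i < 9; ++i) {
        short v = vals[i];
        if (v == 0) continue;               // empty cells are fine
        if (seen & (1 << v)) return false;  // duplicate in the unit
        seen |= 1 << v;
    }
    return true;
}

bool rowOk(const Board& b, int r) { return unitOk(b[r].data()); }

bool colOk(const Board& b, int c) {
    short tmp[9];
    for (int r = 0; r < 9; ++r) tmp[r] = b[r][c];
    return unitOk(tmp);
}

bool boxOk(const Board& b, int br, int bc) {  // br, bc in 0..2
    short tmp[9];
    int k = 0;
    for (int r = br * 3; r < br * 3 + 3; ++r)
        for (int c = bc * 3; c < bc * 3 + 3; ++c) tmp[k++] = b[r][c];
    return unitOk(tmp);
}
```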
Another area for improvement is the inconsistent timing among my executions with a high number of processors. As described previously, the execution times for each processor count were gathered by averaging 10 trials, precisely because the measured times varied considerably. I speculate that this is because threads do not push and pop in a fixed order, so the thread holding the permutation that leads to the solution may reach it faster or slower across executions, depending on when it can acquire the lock relative to the other threads. This could be addressed by reducing contention on the stack, or by imposing a fixed order in which threads push and pop. The latter option would not necessarily give a faster runtime, but it would make my execution times more consistent.
I believe that my choice of machine was sound. The Gates Machine that I used, ghc30, with 6 cores (and 12 hardware threads from hyperthreading), was more than enough to observe the speed-up from varying the number of threads. My selection of OpenMP to parallelize my solver was also efficient for this problem: it provided a very friendly and convenient framework for locks, parallelization, and shared memory, and the resulting execution times were good enough that I did not need to pursue other frameworks. For future work, I could try the OpenMPI framework or design different pruning algorithms to parallelize.