Hopcroft's Algorithm is a DFA minimization algorithm known to have O(n log n) (linearithmic) worst-case time complexity. This project is an attempt to correctly implement a modified version of the algorithm, with approximately the same complexity, for multi-core systems.
I used OpenMP for parallelization and targeted the GHC machines for benchmarking. The map and partition steps were, in some cases, run in parallel across the available cores, and in the average case each instance of Seq ran sequentially, but all instances ran concurrently. This is, of course, a change to the original algorithm: it no longer meets the worst-case complexity bound, but the average-case complexity is unaffected.
The most helpful optimization was compressing the data as much as possible and then vectorizing over it. The first step was switching from boolean arrays to bitvectors. By vectorizing I do not mean SIMD, but rather operating on 64-bit words, which support very efficient operations: counting set bits (popcnt), complement (~), partition (&, ^), and even union (|). A SIMD implementation might help further, but four popcnts appear to be significantly faster than any AVX instruction sequence for counting bits, and in addition, SIMD would not support an atomic (|).
Another helpful optimization was interleaving the loading of cache lines. This behavior is controlled by CACHELINE in Config.h.
The only other change that helped was addressing the heavy divergence by using interleaved static scheduling rather than dynamic scheduling. The overhead of truly dynamic scheduling was simply too high: although a lot of divergence is expected, it does not always occur, and when it doesn't, the scheduling cost dominates. Other tweaks, such as basing the static chunk size on the cache-line size, also helped to some degree, especially when the number of threads was small.
These optimizations did more than improve the speedup: even compared to the original algorithm, the new implementation became almost 3x as fast. However, because it is less work-efficient, it became worse on some tests (really only one, the DFA that cannot be minimized).
I did not start from an existing code-base. Pseudocode for the original algorithm is available on Wikipedia.
Speedup is the measure of performance I was using while optimizing my algorithm.
The experimental setup included a set of pre-determined DFA topologies that are fairly easy to speed up, some pseudo-randomly generated DFAs, and two powerset automata. The pre-determined set and the powerset automata were used for both testing and benchmarking, while the randomly generated ones, of course, were used only for benchmarking. The benchmark inputs were made as large as possible without the benchmarks taking forever to complete. The baseline for comparison is the same implementation with the number of threads set to 1.
The number of nodes is not the only thing that varies between inputs; the topology does too, so it is hard to quantify the results solely by problem size. Still, the speedup clearly grows with the size of the input, provided the problem can be completed in a reasonable amount of time.
As I mentioned in the approach, the biggest problems were bandwidth and divergence; dependencies also limit the speedup, but that limit cannot be overcome by this algorithm, no matter the implementation. I cannot provide exact evidence for each of these separately, but checking out older revisions of the git repository (or, alternatively, comparing the differences in the results table shown below) shows that the fixes really did improve the speedup. There is direct evidence for the bandwidth limit: if you set CACHELINE to something larger, the program becomes slower, because lines are loaded from DRAM at vastly different locations, which strongly suggests the program is bandwidth-bound. And since the fixes involved reducing dependencies and reducing bandwidth, both are likely major limits on the program's speedup.
I manually added a number of debug statements that measure how long certain blocks of code take and, at the end, build and output a histogram (see FINETUNING in Config.h). Using these results, I identified the blocks of code that get WORSE as the number of threads grows (usually a result of bandwidth and cache limits), and optimized accordingly. The statements are still in the code, enabled only by the above parameter. Many dependencies remain, so unless the problem size is increased, there is a point beyond which more cores do not help. There is still a lot of room to improve in the most obvious places, map and partition: neither achieves a perfect speedup, due to bandwidth requirements.
Perhaps using a GPU would have been superior to a CPU, since bandwidth is certainly the limiting factor; but a GPU implementation would likely have suffered more from divergence, and a large amount of synchronization would be required regardless.
Speedups as a Table
| Topology | Symbols | Speedups (1 thread → more threads) |
|---|---|---|
| Structured with Sink | 2 | 1.00 / 2.09 / 3.44 / 3.47 |
| Structured Cycle with Sink | 2 | 1.00 / 1.75 / 3.06 / 3.23 |
| Structured with Sink | 2 | 1.00 / 1.42 / 1.83 / 2.09 |
| Structured Cycle with Sink | 2 | 1.00 / 1.36 / 1.58 / 1.82 |
Note that the topologies of the benchmarking automata vary, but the strongest correlation with speedup seems to be the number of symbols.