I wrote a parallel AES ctr mode implementation for nvidia gpu's, specifically targetting the gtx 670. On the gtx 670 encryption rates of around 1.4GB/s were achieved.
AES is a pre-shared key block encryption cipher. It is a common component in web security and is one of the most popular pre-shared key algorithms. Due to its prevalence in security fast implementations of AES are critical for maintaining security without a significant sacrifice in performance.
Since the aes algorithm is a block cipher it only specifies how to encrypt a single block (16 bytes) of data with a single key. In order to encrypt large amounts of data such as files there must be a specification of how subsequent blocks in the file are encrypted. In this implementation we use counter mode since it exposes significant parallelism.
The basic aes algorithm works by completing a series of rounds on the input data. Each round consist of four operations: substitution, row shifting, column mixing, and adding round keys. In aes with a 128 bit key 10 rounds are completed and the result is the encrypted data. Each block is considered as a 4x4 column major grid when performing row shifting and column mixing. The round keys are extracted from the initial key.
In order to be effectively parallelized the encryption of each block can't have a sequential dependency on the previous block. Counter mode starts with a 16 byte long random seed (size of a block). This seed is encrypted and the result is XORed with the data we want to be encrypted. For subsequent blocks of data the random seed + a counter value is encrypted and XORed with the respective input block. Thus the ith input to the AES algorithm is the random seed + i. Since the encryption of each block of data only depends on the random seed + a counter the encryption process is parallelizable across each block of data.
The most obvious source of parallelism is across different blocks. The blocks are completely independent given the random seed and counter value as an input. However, the significant problem here is how computations are mapped onto the gpu as blocks are significantly larger than the 32 bit word size. The two most basic mappings are a thread per block and a thread per byte; but a simple mapping will suffer from poor memory access and poor resource utilization respectively.
Another source of parallelism is the combining of byte size operations into word size ones. The gpu operates in 32 bits so we want to parallelize byte operations into word size operations since using word size ops for single byte operations gives poor resource utilization.
The remaining sources of parallelization and speedup are appropriately mapping the algorithm onto the hardware to fully exploit its computing resources. Good memory access and cache exploitation are candidates for optimization.
For implementing the encryption I used CUDA targetting nvidida's gpu architecture. The fastest implementation used a single thread to encrypt each block of data. Due to the data-parallel nature of encrypting different blocks the mapping of threads to blocks is irrelevant so I chose to use 64 thread blocks thus each block consumed 2 32 wide warps. Due to the small size of the working set shared memory was not utilized and instead relied on L1 cache for access to the data being encrypted. With the small working set conflict misses were not a concern. Additionally the read only cache was leveraged to provide fast access to a single t-table. The single t-table was chosen to reduce cold misses which would inevitably increase with 4 t-tables.
Initially I implemented a cpu version of aes ctr mode in order to understand the structure of the algorithm and to check the correctness of later gpu implementations.
The first implementation implemented the algorithm as depicted in the background using single byte operations. Each CUDA thread is responsible for encrypting a single block of data.
In the second implementation byte operations were coalesced into word size operations where possible. In addition the read only cache is leveraged to store the substitution table to free up cache and give better access times
T-tables are precomputed results of the substitution, shift rows and mix columns operations. By precomputing these results the amount of computation is decreased but has an increased cost of performing extra table lookups.
The four precomputed t-tables are exactly the same tables but with the words rotated by 8 bits. As such a single t-table can be used so that all lookups are to a single table. This adds the cost of performing shifts within each round
With a single thread encrypting each block, memory locations requested by threads running in the same warp map unevenly to the memory banks. As a result memory loads and stores take several cycles as there is too much demand on some banks and zero on others. In this implementation blocks are copied into shared memory with 4 byte spacing giving a 5 word stride for memory accesses.
With a single thread per word, memory access is cleaner and often contiguous. Looping is generally eliminated in favor of work distributed over threads. Threads working on the same block technically must sychronize when mixing within blocks; however, since all threads are within a single warp they execute in lockstep implicitly providing synchronization.
Leverages benefits of a single word per thread in addition to trading computation for reduced memory access and more cache hits for single table
In general I achieved my goals except for accounting for different key sizes. I initially thought the key size would have performance implications; however, they only affect the number of rounds and the round key generation and have almost no effect on performance.
From the Performance Scaling graph all implementations scale linearly with an increase in the number of blocks. This is because the independent blocks are data-parallel. Thus any significant increases in performance will come from the optimization within the encryption of a single data block.
Three different primary implementations were tested. The naive implementation, a single t-table implementation and a full t-table implementation (4 tables). The results above are the average execution time for encrypting 1GB of random data over 50 runs. Interestingly, both of the t-tables have near identical performance despite the significant difference in operations that they perform. The single table performs 4 lookups from a single table an 3 rotates on the lookups while the full table performs 4 lookups from different tables.
The extra computation of the single table implementation appears to be offset by better cache performance since all lookups are on a single table. In other words the cold misses from having 4 tables removes the benefit of reducing the number of computations.
The above results are the average execution time for encrypting 1GB over 50 runs. When I started the project I thought resolving memory access would give me the greatest performance increases; however, the results show that having well distributed memory access is not worth the cost of reorganizing the data form a good distribution. For the encryption the cost of explicitly copying to shared memory rather than relying on the L1 cache is not worth the benefit of fewer cycles when warps load or store.
I wasn't satisfied with the results of the last test so I offset the cost of copying to shared memory by encrypting the data 50 times in a row. The new results were: Fragmented: 36374 ms, Unfragmented: 37734. The tests were on encrypting 1GB 50 times averaged over 10 runs. The results show that given enough use good memory access patterns are beneficial. It is also interesting to note that Nvidia has worked to reduce effects of bad access patterns. Compute capability 3.0 devices have 32 banks that can supply 8bytes to any number of threads provided they all are accessing the same words(its ok if 8 threads are getting 8 different bytes as long as they reside within 2 contiguous words). As a result the poor access pattern imaged above would take at most 2 loads/stores for the whole warp to read/write.
The results are from encrypting 1GB of data averaged over 50 runs. In both runs nearly the same operations are performed. As such I assume the difference in runtime is primarily due to scheduling overhead. The per block implementation's threads perform 4x the work of the per word threads. Additionally assigning a single word to a thread results in more redundant caluclations for indexing in order to get full simd utilization. Like the results from the memory fragmenting tests I am surprised that the better memory access pattterns of the per word implemntation doesn't give it a noticable performance improvement.
The implementation with the best performance had a single thread per block, no memory fragmenting, and used a single t-table. With this implementation on a gtx670 a 42x performance improvement was achieved as compared to a naive single threaded implementation of the cpu. This is a reasonable improvement for a software only implementation; however given cpu hardware encryption support and dedicated asic solutions, encrypting with a gpu really isn't the most practical solution. Additionally AES was designed to be light arithmetically and have light memory requirements. As a result the AES implementation did not significantly benefit from GPU's memory latency hiding capabilities and the whole working set fit easily within shared memory, L1 and read only caches.
With implementing aes on a gpu communication and synchronization weren't really relevant in the parallelization process. The most important aspect was understanding the architecture to get the best performance out of the memory system and provide an effective mapping of the algorithm to the hardware. Simd utilization is not a limitation as almost all divergence can be eliminated. If I missed speedup opportuinities I believe they would be most likely related to the memory system. Most of my improvements came from trading off computation and memory access and exploiting the cache for faster lookups. However, from my testing with memory access patterns I think there might be some performance gains with a better mapping of blocks to memory.
I will be implementing AES encryption and decryption using counter mode on a nvida GPU.
AES encryption using integer counter mode enables a large amount of data to be encrypted using a single key. The use of counter mode enables parallelizable encrytpion since there are no dependencies between the blocks other than a counter value that is trivially computable. A randomized value combined with the counter is run through the AES algorithm (as the input) with a fixed key for all blocks. The message is then encrypted by using a XOR operation on each block of plaintext with the output of the aes algorithm for each block.
In general encryption is a computationally intensive operation even with hardware support provided by many current processors. As such it is not bandwidth bound making it reasonable to offload to the gpu especially for the encryption of large amounts of data.
The AES algorithm consists of a series of rounds made up of substitutions and permutations within each block of data. I intend to implement an efficient parallel implementation in order to effectively encrypt as many blocks of data as quickly as possible.
Initially it seems that AES incryption is trivially parallelizable by simply parallelizing the computation of each block. However each block is 16 bytes wide making the trivial parallelization ineffective due to memory access patterns. In addition the permutations and substitutions within each block are not uniform leading to potentially divergent execution, decreasing the speed and efficiency of the implementation. Additionally a parallel implementation of the permutations has to handle concurrency problems efficiently as this is a significant part of the overall workload of encryption.Overall the major challenges of the project will be effectively managing memory access for the 16 byte blocks and effectively permuting values within the blocks without significant memory cost or divergent execution.
I plan to first implement a basic single threaded implementation on a CPU to ensure a complete understanding of the algorithm and to check for correctness. Additionally I plan to implement a basic GPU implementation where the encryption is only parallelized on a per block basis. Finally I will implement a final optimized GPU implementation with support for multiple key sizes. While cataloging different intermediate implementations and their corresponding performance.
If I have time I want to implement a multithreaded CPU implementation leveraging hardware support. I would then use this implementation to compare CPU and CPU performance with respect to parallelization and single block performance to gain a better understanding of the effectiveness of using a GPU for encryption purposes.
For a demo I will present the speedups achieved for each implementation and if I have extra time graphs showing the benefits and drawbacks of using a GPU as compared to a CPU with hardware support.
I initally intend to implement the algorith on the ghc machines with GTX 670 GPU's. I intend for my implementation to be scalable to also run on the latedays titanx GPU's for additional performance testing. The specific GPU shouldn't be important as I intend for a scalable solution.
Gain a complete understanding of the AES algorithm and the specific operations that much happen in each round. Work on and hopefully finish CPU implementation.
Finalize and test CPU implementation. Start basic GPU implementation.
Finish and test basic GPU implementation. Work on intermediate GPU implementation testing different memory access patterns and techniques derived from analysis of basic GPU implementation. Catalog each different intermediate implementation and test performance.
Continue testing and improving intermediate implementations. Complete final GPU implementation
Do performance testing on final implementation with different size files and test on latedays titanX. Document all speedups and generate plots. Hopefully have time for a CPU implementation with hardware support. Perform performance testing of CPU vs GPU and characterize performance benefits/drawbacks.
The only significant schedule revision is pushing everything back about a half week. I chose to leave it at week wide granularity since further partitioning would mean specifying tasks of unknown cost and I would just be guessing at the scheduling.
I have completed researching and understanding the AES algorithm. Additionally I completed a basic CPU implementation to use for testing my GPU implementations. The CPU implementation is single threaded and does not leverage hardware support. From the tests I have performed it seems to be working fine. I also started my basic GPU implementation but have not completed nor tested it.
I am currently a half-week to a week behind schedule; however, I'm sure I will be able to produce all of my deliverables. I'm not concerned about being behind as I will have more time than anticipated during the last two weeks of the project as a result of a light schedule. My ability to produce the "nice to haves" will depend on the problems or lack thereof that I run into during my GPU implementation. At this point I don't know.
The biggest remaining unknown is the performance tuning. There is really no accurate way to predict how difficult this will be. Otherwise I don't forsee any other problems.