QuickVec C++

A modern C++ approach to explicit SIMD vectorization.
This is the final report.



Summary

QuickVec C++ is a modern C++ approach to explicit cross-platform SIMD vectorization. It enables developers to access the power of hardware SIMD feature sets through a common set of accelerated functions. It is implemented as a header-only library with a focus on usability and performance.

For the final project deliverables, I have created the header library, which implements generalized SIMD for SSE through SSE4.2, AVX, and AVX2. It includes a framework for runtime selection of a vector type. Also included in source control are a Visual Studio 2015 test suite for correctness and a Mandelbrot-based performance test.

Bitbucket Git Repo

Background

SIMD

SIMD (single instruction, multiple data) is a processor feature for executing the same instruction over multiple data elements in parallel. All Intel and AMD processors from this decade support SIMD execution, as do certain ARM processors and other architectures. It is useful for operating on multiple pieces of data at once, provided their control flow is similar (minimal vector divergence) and the operation is not memory bound. Unfortunately, writing code that directly utilizes SIMD is not straightforward.

Instruction Set Extension Intrinsics

Intrinsics are not easily understood when reading a piece of code. They are not general: each set of intrinsics is specific to a particular instruction set extension on a particular platform. Processors support different subsets of these instructions, so a single-version solution is limited by the weakest machine that may run the program. The intrinsics are also verbose, each taking between 15 and 30 characters for a simple arithmetic operation. Finally, they are not type safe: a vector mask and a vector of float values have the same type, and confusing the two produces errors that aren't noticeable until the code is run. A type-safe implementation could catch these errors at compile time.
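
For illustration (this snippet is not from the report), even a trivial SSE computation shows both problems: the operation names are long, and a comparison result has the same __m128 type as the data, so mixing them up compiles without complaint.

#include <xmmintrin.h> // SSE

// Standalone illustration of the verbosity and missing type safety of raw intrinsics.
__m128 example(__m128 values, __m128 threshold) {
  __m128 sum  = _mm_add_ps(values, threshold); // spelled out instead of 'values + threshold'
  __m128 mask = _mm_cmpgt_ps(sum, threshold);  // a mask, yet its type is still __m128
  return _mm_add_ps(mask, sum);                // nonsense (adds mask bits to data) but compiles cleanly
}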

Strong need for a generalized solution

The lack of a generalized interface means that development effort is duplicated for each set of instructions that needs to be supported. This creates extra work not only during implementation but also during maintenance of a code base. It also produces a code base that is not easily extensible due to fragmentation. Lastly, this extra code exists only to support other platforms; it adds no new functionality.

Approach

To create the library I used the SIMD intrinsics defined in the Microsoft Visual Studio headers. These intrinsics are common to all compilers, but the data types used by them are not. I make little use of the Microsoft-specific union members of these data types, but the project is nonetheless still dependent on Microsoft Visual Studio for the time being. As it is a C++ header library, I used C++ as the development language with heavy use of C++11 features. The targeted machines are x64 and x86 (32-bit Intel).

I started by creating a general sequential implementation of the floating-point and integer vectors. The float_base and int32_base classes are generalized 32-bit floating-point and 32-bit integer vector implementations. These serve as the base classes for all SIMD-accelerated vector types. Each type listed below is accelerated to the level indicated by its name. Thanks to inheritance, each feature does not need to be reimplemented at every level, but a level with a faster implementation can override the base-class methods. One example is modulo, which cannot be implemented with intrinsics until SSE2 and gets a faster implementation again in SSE4.1. If only SSE is supported, the sequential version is used; if SSE2 is supported, the SSE2 version is used; and if SSE4.1 or better is supported, the most accelerated version is used.
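
As a rough sketch of this layering (with simplified members, not the actual QuickVec definitions):

#include <cstddef>
#include <emmintrin.h> // SSE2

// Minimal sketch of the layering described above; the real QuickVec classes
// differ in their members and method sets.
template <std::size_t N>
class float_base {            // generic, sequential fallback for any width
public:
  float data[N];
  float_base operator+(const float_base& o) const {
    float_base r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = data[i] + o.data[i];
    return r;
  }
};

class float4_sse2 : public float_base<4> {   // override only what SSE2 can speed up
public:
  float4_sse2 operator+(const float4_sse2& o) const {
    float4_sse2 r;
    _mm_storeu_ps(r.data, _mm_add_ps(_mm_loadu_ps(data), _mm_loadu_ps(o.data)));
    return r;
  }
  // Operations with no fast SSE2 sequence (e.g. modulo) are simply inherited
  // from float_base<4>; an SSE4.1 subclass can override them again later.
};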

After implementing arithmetic operators, I considered how comparisons and masking would work. This involved a fair amount of change to the implemented float and int vectors. I created a boolean mask type, which is the return type of comparisons between vectors. There is an implementation for each feature level, in the same way as the accelerated float and int vectors. These masks can be used to combine vectors.

The types are listed below.


template <size_t N>
class float_base;
class float4_sse;
class float4_sse2;
class float4_sse4_1;
class float8_avx;

template <size_t N>
class int32_base;
class int32x4_sse2;
class int32x4_sse4_1;

template <size_t N>
class bool_base;
class bool4_sse;
class bool8_avx;
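
A small example of the masks in use (only operations that appear in the Mandelbrot listing later in this report are used; the header name and the vector-versus-vector compare are assumptions):

#include "quick_vec.hpp"   // header name assumed

void mask_demo() {
  float_vec a = float_vec::increment(0, 1);  // lanes 0, 1, 2, ...
  float_vec limit(4.0f);                     // broadcast a scalar to every lane
  float_vec::bool_t over = a > limit;        // comparisons return a mask type
  a.if_not_set(over, a + limit);             // update only lanes where 'over' is false
  if (over.all()) {
    // every lane exceeded the limit
  }
}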

Challenges

The major challenge of this project was bringing all of the targets together in one library while maintaining a clean interface. The target instruction sets include SSE (versions 1 through 4.2), AVX (1, 2, and 512-bit), and ARM NEON. There are many differing sets of intrinsics for the different instruction feature sets and platforms. A large part of the project was determining at runtime which features are available on a processor, and falling back with little to no overhead compared to a sequential implementation when the SIMD instructions are not available. The second challenge was achieving speed with little to no difference from a similarly written SIMD intrinsics implementation.
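
One common way to detect the available feature sets at runtime is shown below (an illustration, not necessarily QuickVec's exact mechanism). A complete AVX check must also confirm OS support for the YMM state via XGETBV, which is omitted here.

#include <intrin.h>   // MSVC: __cpuid / __cpuidex

struct simd_support {
  bool sse2, sse4_1, avx, avx2;
};

inline simd_support detect_simd() {
  int info[4];                 // EAX, EBX, ECX, EDX
  simd_support s{};
  __cpuid(info, 0);
  int max_leaf = info[0];
  __cpuid(info, 1);
  s.sse2   = (info[3] & (1 << 26)) != 0;   // CPUID.1:EDX bit 26
  s.sse4_1 = (info[2] & (1 << 19)) != 0;   // CPUID.1:ECX bit 19
  s.avx    = (info[2] & (1 << 28)) != 0;   // CPUID.1:ECX bit 28 (OS check omitted)
  if (max_leaf >= 7) {
    __cpuidex(info, 7, 0);
    s.avx2 = (info[1] & (1 << 5)) != 0;    // CPUID.7:EBX bit 5
  }
  return s;
}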

I also ran into problems with understanding operator overloads and templates in C++. When these are used incorrectly, compiler error messages are incredibly obtuse and the developer is faced with a wall of text that often is unrelated to the actual problem.

Deliverables

A Correctness Test Bench
I implemented a test suite that checks the correctness of all configurations of the implemented classes and functions. It does this by first testing that the sequential implementations produce the expected values and then testing the accelerated vector types against the sequential ones (a sketch of this comparison pattern follows this list).
A Performance Test Bench / Example Use Case
I created a performance test based on the Mandelbrot set calculation that compares the timing of QuickVec against an auto-vectorized sequential implementation and a hand-written AVX intrinsics implementation.
Presentation
My final presentation included results of the performance comparison, sample snippets of code, and a demo of the example application.
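
The comparison pattern in the correctness tests looks roughly like the sketch below (the actual suite is a Visual Studio 2015 test project; only the constructors and members visible elsewhere in this report are used, and their exact signatures are assumptions):

#include <cstddef>

template <typename SeqVec, typename AccelVec>
bool addition_matches(float a, float b) {
  float seq_out[SeqVec::size];
  float accel_out[AccelVec::size];
  (SeqVec(a) + SeqVec(b)).store(seq_out);       // broadcast, add, write lanes out
  (AccelVec(a) + AccelVec(b)).store(accel_out);
  for (std::size_t i = 0; i < SeqVec::size; ++i)
    if (seq_out[i] != accel_out[i]) return false;
  return true;
}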

Results

Match Performance
QuickVec matches the performance of sequential code when accelerated versions of operations are not available on a platform. When the accelerated functionality is available, it performs similarly to the same code written with intrinsics and faster than an auto-vectorized version whenever SIMD utilization is possible.
A Common Interface
QuickVec provides a set of accelerated vector classes with methods that are common to all supported platforms. These include common arithmetic operations, comparisons, loads, stores, and masking methods. It also provides a common framework to test for support and automatically select a vector type at runtime (a sketch of one possible selection pattern follows this list).
Modern C++ Style
QuickVec is written with the style of STL and Boost libraries in mind. QuickVec's interface maintains value semantics. Operators are overloaded in an intuitive way that allows code to be changed only minimally in order to use vectorization.
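
One possible shape for the runtime selection is sketched below (an illustration, not QuickVec's actual API): a kernel is written once as a template over the vector type, and the widest supported instantiation is chosen from the detected feature set (see the detect_simd() sketch in the Challenges section).

template <typename FloatVec>
void mandelbrot(float* results);          // the generalized loop shown below

using kernel_fn = void (*)(float* results);

inline kernel_fn pick_mandelbrot(const simd_support& s) {
  if (s.avx)    return &mandelbrot<float8_avx>;
  if (s.sse4_1) return &mandelbrot<float4_sse4_1>;
  if (s.sse2)   return &mandelbrot<float4_sse2>;
  return &mandelbrot<float_base<4>>;      // sequential fallback
}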

Performance Results

This is the result of a Mandelbrot set calculation with a maximum of 100 iterations per pixel at a resolution of 2048x2048. Each code example was run 100 times to collect the data shown in the chart. The code was compiled using Microsoft Visual Studio with Full Optimization (/Ox) enabled, which means the sequentially written code has been auto-vectorized by the compiler.

The chart shows the speedup relative to the compiler-optimized sequential implementation. As the chart shows, the QuickVec code is as fast as the code written directly with AVX intrinsics. The code examples below show that the same algorithm written with intrinsics becomes much harder to understand, while the QuickVec version maintains the speed without obfuscating the intent of the code.

The testing platform was a laptop with an Intel Core i7-4700HQ processor, which supports all SSE instruction sets, AVX, and AVX2.
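
The timing loop behind these numbers has roughly the shape sketched below (illustrative; the real harness is part of the performance test project). Each variant is run 100 times and the average wall-clock time per run is taken.

#include <chrono>

template <typename Kernel>
double average_ms(Kernel&& run, int runs = 100) {
  using clock = std::chrono::high_resolution_clock;
  auto start = clock::now();
  for (int i = 0; i < runs; ++i)
    run();
  std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
  return elapsed.count() / runs;   // average milliseconds per run
}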

Sequential


float oneOverRes = 1.0f / RESOLUTION;
for (int iy = 0; iy < RESOLUTION; iy++) {
  float y = static_cast<float>(iy) * oneOverRes;
  for (int ix = 0; ix < RESOLUTION; ix++) {
    float x = static_cast<float>(ix) * oneOverRes;
    float z = 0.0f;
    float zi = 0.0f;
    float val = 0.0f;
    for (int i = 0; i < ITERATIONS; i++) {
      float a = z*z;
      float b = zi*zi;
      if (a + b > 4.0f) {
        break;
      }
      val++;
      zi = (2.0f*z*zi) + y;
      z = a - b + x;
    }
    //Record values
    results[ix + (iy * RESOLUTION)] = val;
  }
}

QuickVec Generalized


float_vec increment = float_vec::increment(0, 1); // lanes start at 0 and step by 1
float oneOverRes = 1.0f / RESOLUTION;
for (int iy = 0; iy < RESOLUTION; iy++) {
  float_vec y = float_vec(static_cast<float>(iy) * oneOverRes);
  for (int ix = 0; ix < RESOLUTION; ix += float_vec::size) {
    float_vec x = (increment + ix) * oneOverRes;
    float_vec z, zi, vals;
    z = zi = vals = float_vec::zero();
    float_vec::bool_t finished(false);
    for (int i = 0; i < ITERATIONS; i++) {
      float_vec a = z*z;
      float_vec b = zi*zi;
      finished |= ((a + b) > 4.0f);
      if (finished.all())
        break;
      vals.if_not_set(finished, vals + 1.0f);
      zi = (2.0f*z*zi) + y;
      z = a - b + x;
    }
    //Record values
    vals.store(&results[ix + (iy * RESOLUTION)]);
  }
}

AVX Intrinsics


alignas(32) float incr[] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f };
__m256 increment = _mm256_load_ps(incr);
__m256 one = _mm256_set1_ps(1.0f);
__m256 four = _mm256_set1_ps(4.0f);
__m256 oneOverRes = _mm256_set1_ps(1.0f / RESOLUTION);
for (int iy = 0; iy < RESOLUTION; iy++) {
  __m256 y = _mm256_mul_ps(_mm256_set1_ps(static_cast<float>(iy)), oneOverRes);
  for (int ix = 0; ix < RESOLUTION; ix += 8) {
    __m256 x = _mm256_mul_ps(_mm256_add_ps(increment, _mm256_set1_ps(ix)), oneOverRes);
    __m256 z = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 vals = _mm256_setzero_ps();
    __m256 finished = _mm256_setzero_ps();
    for (int i = 0; i < ITERATIONS; i++) {
      // a= z*z;
      __m256 a = _mm256_mul_ps(z, z);
      // b= zi*zi;
      __m256 b = _mm256_mul_ps(zi, zi);
      // finished = finished | (a+b > 4)
      finished = _mm256_or_ps(finished, _mm256_cmp_ps(_mm256_add_ps(a,b), four, _CMP_GT_OQ));
      // if all lanes are finished, break
      if (_mm256_testc_ps(finished, _mm256_cmp_ps(vals, vals, _CMP_EQ_UQ)))
        break;
      // vals = vals + 1 in lanes that are not yet finished;
      vals = _mm256_blendv_ps(_mm256_add_ps(vals, one), vals, finished);
      // zi = 2.0f*z*zi + y;
      zi = _mm256_add_ps(_mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(2.0f), z), zi), y);
      // z = a - b + x;
      z = _mm256_add_ps(_mm256_sub_ps(a, b), x);
    }
    //Record values
    _mm256_storeu_ps(&results[ix + (iy * RESOLUTION)], vals);
  }
}

Resources

I started from a blank code base. For reference, I made heavy use of the intrinsics documentation and reference guides for the SSE and AVX instruction sets.
