QuickVec C++

A modern C++ approach to explicit SIMD vectorization.



Summary

QuickVec C++ is a modern C++ approach to explicit cross-platform SIMD vectorization. It enables developers to access the power of the hardware SIMD feature sets using a common set of accelerated functions.

Background

Speed

Efficiency and performance are important to many high-performance applications. QuickVec C++ is implemented with a focus on both. It exposes the power of the SIMD instruction sets available on each system's hardware without forcing the user to duplicate work for each platform.

Familiar Development

It is important for developers to work with well-designed systems that follow paradigms they already know and understand. The QuickVec C++ library attempts to match the patterns presented by the C++ standard library and STL, which greatly eases development.


Cross-Platform

When developing applications for multiple platforms, it is not only inconvenient for developers to use multiple SIMD intrinsic sets, but also inefficient and time-consuming. The QuickVec C++ library will maintain its performance while being fully cross-platform. This means that functions written using the library should run on any of the supported platforms with no extra effort.


Challenge

The major challenge of this project was bringing all of the targets together in one library. The target instruction sets include SSE (versions 1-4.2), AVX (1, 2, and 512-bit), and ARM Neon. There are many differing sets of intrinsics for the different instruction feature sets and platforms. A large part of the project is determining which features are available on a processor at runtime and, when the SIMD instructions are not available, executing with little to no overhead compared to a sequential implementation. The second challenge is matching the speed of equivalent code written directly with SIMD intrinsics.
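One common way to achieve near-zero dispatch overhead is to resolve a function pointer once, up front, based on what the CPU reports. The sketch below illustrates only the general technique, not QuickVec's actual mechanism; the function names are invented for the example, and `__builtin_cpu_supports` is a GCC/Clang builtin (MSVC code would query `__cpuid` instead).

```cpp
#include <cstddef>

// Scalar fallback: always available.
inline float sum_scalar(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; i++) s += p[i];
    return s;
}

// Stand-in for an AVX-accelerated version; a real library would
// implement this with intrinsics in a separately compiled unit.
inline float sum_avx(const float* p, std::size_t n) {
    return sum_scalar(p, n);
}

using SumFn = float (*)(const float*, std::size_t);

// Resolve the best implementation once; callers then pay only an
// indirect call, with no per-call feature check or branch.
inline SumFn select_sum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx")) return sum_avx;
#endif
    return sum_scalar;
}
```

A typical pattern is to store the result of `select_sum()` in a global resolved at startup, so hot loops never re-test the feature flags.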

Resources

I will be starting from a blank code base. For reference, I will be using the intrinsic documents and reference guides for SSE and AVX intrinsics found here, and ARM Neon intrinsics found here.

Goals

A Common Interface
QuickVec provides a set of accelerated classes and methods which are common to all platforms. These include common arithmetic operations as well as other common vector operations such as shuffle. Also, there is an accelerated masking interface.
Match Performance
QuickVec will match the performance of sequential code when accelerated versions of operations are not available on a platform. When the accelerated functionality is available, it will perform comparably to the same code written with intrinsics or with a similar vectorization library such as Yeppp!. See the performance results below.
Modern C++ Style
QuickVec is written with the style of STL and Boost libraries in mind. QuickVec's interface maintains value semantics. Operators are overloaded in an intuitive way that allows code to be changed only minimally in order to use vectorization.
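The "changed only minimally" claim rests on value semantics plus operator overloading: if a vector type overloads the same operators a scalar supports, the same generic code compiles for both. The toy 4-wide type below is only an illustration of that design choice, not QuickVec's actual implementation (which wraps SIMD registers rather than a plain array).

```cpp
// Toy value-semantics vector: copies freely, overloads the scalar operators.
struct vec4 {
    float v[4];
    vec4() : v{0.0f, 0.0f, 0.0f, 0.0f} {}
    vec4(float x) : v{x, x, x, x} {}  // broadcast a scalar, as QuickVec allows
    friend vec4 operator+(vec4 a, vec4 b) {
        vec4 r;
        for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
    friend vec4 operator*(vec4 a, vec4 b) {
        vec4 r;
        for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * b.v[i];
        return r;
    }
};

// The same expression works unchanged for float and for vec4.
template <typename T>
T axpy(T a, T x, T y) { return a * x + y; }
```

`axpy(2.0f, 3.0f, 4.0f)` and `axpy(vec4(2.0f), vec4(3.0f), vec4(4.0f))` both instantiate from the same template, which is what lets scalar code be vectorized with only minimal edits.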

Why use cross-platform C++?

C++ is the current standard language for high-performance cross-platform application development. It is also capable of levels of abstraction that are not available in other languages. Additionally, hardware intrinsics are available in C or C++ on all of the target platforms to serve as a base for the implementation.

Schedule

Week: Task
April 3 - April 10: Find commonalities and design interfaces.
April 10 - April 17: Implement sequential and SSE versions.
April 17 - April 24: Implement simple instructions for all SSE versions, AVX, and Neon. Create an Android project to test the ARM Neon implementation. Expand the test bench for all versions.
April 24 - May 1: Add automatic runtime detection of extensions. Add compile-time detection or specification of extensions. Create an interface for efficient loads and stores. Add an integer type interface for all versions.
May 1 - May 8: Add compare and masking operators. Add shuffles and shifts. Add a dynamically sized array type. Create a performance test (QuickVec, Yeppp, intrinsics).
May 8 - May 11: Prepare for the final presentation.

Current Implementation

Test bench

A test bench using Visual Studio unit tests checks each operator for consistency against the sequential implementation of that operator. The test bench covers all implemented methods.
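The core of such a consistency check is simple: run the same inputs through the reference sequential path and the candidate accelerated path, then compare element by element. The sketch below shows only that idea; it is not the actual Visual Studio test code, and `vec_add` is a placeholder standing in for an accelerated implementation.

```cpp
#include <array>

// Reference path: plain scalar loop.
inline std::array<float, 4> scalar_add(const std::array<float, 4>& a,
                                       const std::array<float, 4>& b) {
    std::array<float, 4> r{};
    for (int i = 0; i < 4; i++) r[i] = a[i] + b[i];
    return r;
}

// Candidate path: placeholder for an SSE/AVX implementation under test.
inline std::array<float, 4> vec_add(const std::array<float, 4>& a,
                                    const std::array<float, 4>& b) {
    return scalar_add(a, b);
}

// One test case: both paths must agree on every element.
inline bool consistent(const std::array<float, 4>& a,
                       const std::array<float, 4>& b) {
    auto ref = scalar_add(a, b);
    auto got = vec_add(a, b);
    for (int i = 0; i < 4; i++)
        if (ref[i] != got[i]) return false;
    return true;
}
```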

Sequential Implementation of 32-bit float and integer operators

The sequential implementation is designed to be the base class of all further specialized classes (such as SSE or AVX). The implementation performs a compile-time unroll of whichever operator is being applied across all elements of the vector, using a template metaprogramming technique that approximates a compile-time for-each. This guarantees the absence of branching when operating on the vector elements. The hope is that, even if a specific platform does not properly report support for SSE or AVX, instruction-level parallelism will still speed up the resulting straight-line sequence of instructions.
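In modern C++ the compile-time for-each described above can be written with `std::index_sequence` and a C++17 fold expression; a library targeting earlier compilers would use recursive template instantiation to the same effect. This is a sketch of the technique, not QuickVec's actual code:

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Expand f(0), f(1), ..., f(N-1) at compile time: no runtime loop,
// no branch per element.
template <typename F, std::size_t... I>
void for_each_impl(F&& f, std::index_sequence<I...>) {
    (f(I), ...);  // C++17 fold expression over the comma operator
}

template <std::size_t N, typename F>
void for_each_element(F&& f) {
    for_each_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Example: elementwise add with the loop fully unrolled at compile time.
inline std::array<float, 4> add_arrays(const std::array<float, 4>& a,
                                       const std::array<float, 4>& b) {
    std::array<float, 4> out{};
    for_each_element<4>([&](std::size_t i) { out[i] = a[i] + b[i]; });
    return out;
}
```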

SSE and AVX floating point and 32-bit integers

The SSE (all versions) implementations of 128-bit (4-wide) single-precision floating-point vectors and 32-bit integer vectors are complete, as are the AVX implementations of 8-wide vectors of 32-bit floats and 32-bit signed integers. These include common arithmetic, loads, stores, and comparisons.

Performance Results

This is the result of a Mandelbrot set calculation with a maximum of 100 iterations per pixel at a resolution of 2048x2048. Each code example was run 100 times to collect the data seen in the chart. The code was compiled using Microsoft Visual Studio with Full Optimization (/Ox) enabled. It should be noted that this means the sequentially written code has been auto-vectorized by the compiler.
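A repeated-run measurement like the one described can be sketched with `std::chrono`; the harness below is illustrative only and is not the code that produced the chart's numbers:

```cpp
#include <chrono>

// Time `work` over `runs` repetitions and return the mean in milliseconds.
// steady_clock is monotonic, so the measurement is immune to wall-clock
// adjustments during the run.
template <typename F>
double average_ms(F&& work, int runs) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int r = 0; r < runs; r++) work();
    std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / runs;
}
```

Averaging over many runs (100 in the experiment above) smooths out scheduler noise and cache warm-up effects.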

The chart to the right shows the speedup relative to the compiler-optimized sequential implementation. As the chart shows, the QuickVec code is as fast as the code written using AVX intrinsics directly. The code examples below show that the same code written with intrinsics becomes much harder to understand, while the QuickVec version maintains that speed without obfuscating the intent of the code.

Sequential


float oneOverRes = 1.0f / RESOLUTION;
for (int iy = 0; iy < RESOLUTION; iy++) {
  float y = iy * oneOverRes;
  for (int ix = 0; ix < RESOLUTION; ix++) {
    float x = ix * oneOverRes;
    float z = 0.0f;
    float zi = 0.0f;
    int i;
    for (i = 0; i < ITERATIONS; i++) {
      float temp = (z*z) - (zi*zi) + x;
      zi = 2.0f*z*zi + y;
      z = temp;
      if (((z*z) + (zi*zi)) > 4.0f) {
        break;
      }
    }
    results[ix + (iy * RESOLUTION)] = i;
  }
}

QuickVec AVX


using float_vec = QuickVec::float8_avx;
float_vec increment(0, 1, 2, 3, 4, 5, 6, 7);
float_vec one = 1.0f; // Single elements can be used to construct
float_vec four = 4.0f;
float_vec oneOverRes = 1.0f / RESOLUTION;
for (int iy = 0; iy < RESOLUTION; iy++) {
  float_vec y = iy * oneOverRes;
  for (int ix = 0; ix < RESOLUTION; ix += float_vec::size) {
    float_vec x = (increment + ix) * oneOverRes;
    float_vec z, zi, vals; // Construction initializes to 0
    for (int i = 0; i < ITERATIONS; i++) {
      float_vec temp = (z*z) - (zi*zi) + x;
      // Single elements can be used with operations
      zi = 2.0f*z*zi + y;
      z = temp;
      float_vec::bool_t finished = (((z*z) + (zi*zi)) > four);
      vals.if_not_set(finished, vals + one);
      if (finished.all())
        break;
    }
    vals.store(&results[ix + (iy * RESOLUTION)]);
  }
}

AVX Intrinsics


alignas(32) float incr[] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f };
__m256 increment = _mm256_load_ps(incr);
__m256 one = _mm256_set1_ps(1.0f);
__m256 four = _mm256_set1_ps(4.0f);
__m256 oneOverRes = _mm256_set1_ps(1.0f / RESOLUTION);
for (int iy = 0; iy < RESOLUTION; iy++) {
  __m256 y = _mm256_mul_ps(_mm256_set1_ps(iy), oneOverRes);
  for (int ix = 0; ix < RESOLUTION; ix += 8) {
    __m256 x = _mm256_mul_ps(_mm256_add_ps(increment, _mm256_set1_ps(ix)), oneOverRes);
    __m256 z = _mm256_setzero_ps();
    __m256 zi = _mm256_setzero_ps();
    __m256 vals = _mm256_setzero_ps();
    for (int i = 0; i < ITERATIONS; i++) {
      __m256 temp = _mm256_add_ps(_mm256_sub_ps(_mm256_mul_ps(z, z), _mm256_mul_ps(zi, zi)), x);
      zi = _mm256_add_ps(_mm256_mul_ps(_mm256_mul_ps(_mm256_set1_ps(2.0f), z), zi), y);
      z = temp;
      __m256 finished = _mm256_cmp_ps(_mm256_add_ps(_mm256_mul_ps(z, z), _mm256_mul_ps(zi, zi)), four, _CMP_GT_OS);
      vals = _mm256_blendv_ps(_mm256_add_ps(vals, one), vals, finished);
      if (_mm256_movemask_ps(finished) == 0xFF)
        break;
    }
    _mm256_store_ps(&results[ix + (iy * RESOLUTION)], vals);
  }
}

Deliverables

A Correctness Test Bench
I will implement a test suite that checks the correctness of all configurations of the implemented classes and functions.
A Performance Test Bench
I will create a performance test that will compare timing of specific aspects of QuickVec against Yeppp! and AVX intrinsics.
An example implementation
I will implement a common algorithm such as a Mandelbrot set or marching cubes to demonstrate the ease of use of the interface.
Documentation
I will document all classes and features as well as provide example code for common tasks.
Presentation
My final presentation will include results of the performance comparison, sample snippets of code, and a demo of the example application.