QuickVec C++

A modern C++ approach to explicit SIMD vectorization.



Summary

QuickVec C++ will be a modern C++ approach to explicit cross-platform SIMD vectorization. It will give developers access to the power of hardware SIMD feature sets through a single, common interface.

Background

Speed

Many applications depend on efficient, high-performance code, so QuickVec C++ will be implemented with a focus on both. It will expose the power of the SIMD instructions available on each system's hardware without forcing the user to duplicate any work.

Familiar Development

Developers work best with well-designed systems that follow paradigms they already know and understand. The QuickVec C++ library will attempt to match the patterns presented by the C++ standard library and STL, which should ease the development process considerably.

Cross-Platform

When developing applications for multiple platforms, it is not only inconvenient for developers to use multiple SIMD intrinsic sets, but also inefficient and time-consuming. The QuickVec C++ library will maintain its performance while being fully cross-platform. This means that functions written using the library should run on any of the supported platforms with no extra effort.


Challenge

The major challenge of this project is bringing all of the targets together in one library. The target instruction sets include SSE (versions 1-5), AVX (1, 2, and 512-bit), and ARM Neon, and each instruction feature set and platform comes with its own, differing set of intrinsics. A large part of the challenge will be determining which features are available on a processor at runtime, and executing with little to no overhead compared to a sequential implementation when SIMD operations are not available. The second challenge is getting results with little to no difference in speed from a similarly written implementation that uses SIMD intrinsics directly.
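The two kinds of detection mentioned above (what the compiler was allowed to target, and what the processor actually supports) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names SimdLevel, compiled_level, runtime_level, active_level, and level_name are hypothetical and not part of QuickVec, and the run-time query relies on the GCC/Clang x86 builtins __builtin_cpu_init and __builtin_cpu_supports. On other compilers or architectures the sketch simply falls back to the compile-time answer.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical names for illustration only; not QuickVec's actual API.
enum class SimdLevel { Scalar = 0, SSE2 = 1, AVX = 2, AVX2 = 3, Neon = 4 };

// Compile-time view: which instruction set the compiler was allowed to target.
constexpr SimdLevel compiled_level() {
#if defined(__AVX2__)
    return SimdLevel::AVX2;
#elif defined(__AVX__)
    return SimdLevel::AVX;
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return SimdLevel::SSE2;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return SimdLevel::Neon;
#else
    return SimdLevel::Scalar;
#endif
}

// Run-time view: what the processor reports. Assumes GCC or Clang on x86;
// other toolchains need their own query (for example, MSVC's __cpuid).
inline SimdLevel runtime_level() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return SimdLevel::AVX2;
    if (__builtin_cpu_supports("avx"))  return SimdLevel::AVX;
    if (__builtin_cpu_supports("sse2")) return SimdLevel::SSE2;
    return SimdLevel::Scalar;
#else
    return compiled_level();
#endif
}

// Query once and cache the answer, so later calls cost a single load
// instead of repeating the CPU query.
inline SimdLevel active_level() {
    static const SimdLevel level = runtime_level();
    // The CPU must support at least what the binary was compiled for.
    assert(static_cast<int>(level) >= static_cast<int>(compiled_level()));
    return level;
}

inline const char* level_name(SimdLevel l) {
    switch (l) {
        case SimdLevel::Scalar: return "scalar";
        case SimdLevel::SSE2:   return "sse2";
        case SimdLevel::AVX:    return "avx";
        case SimdLevel::AVX2:   return "avx2";
        case SimdLevel::Neon:   return "neon";
    }
    return "unknown";
}
```

A caller would typically read active_level() once at start-up and pick an implementation (for example, a function pointer) from it, which keeps the detection cost out of inner loops.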

Resources

I will be starting from a blank code base. For reference I will be using the intrinsic documents and reference guides for SSE and AVX intrinsics found here, and for ARM Neon intrinsics found here.

Goals

  • Find and Implement Commonalities

    I plan to look through the intrinsic functions from the different sets and find commonalities. From there I plan to bundle as much as possible together using templates and template specialization. This will allow the user to specify arguments such as the type (float or int), precision, and number of elements in a vector, and the template will instantiate that from whatever is available in the compiled feature set.

    Progress:
    So far I believe that I have been able to narrow in on a useful set of functionality accessible in most vector instruction sets. I plan to implement this set of features on each platform, and then expand the set of features. So far the features I plan to include are loads, stores, arithmetic operations, bitwise operations, masks, and shuffles. Of those I have covered arithmetic and bitwise operations.

  • Match performance

    This refers first to the sequential version of algorithms: when the library falls back to a sequential implementation because SIMD features are not available, it will introduce only minimal overhead. Second, it refers to other libraries similar to this one, such as Yeppp!, which is a C library for cross-platform SIMD utilization. Lastly, it is relative to using the platform's intrinsics directly: the overhead of using SIMD via this library will be minimal.

    Progress:
    I have not yet created a test bench for relative performance. However, when implementing classes I keep this in mind and always consider the instructions that will result from the use of certain constructs. With the current implementation there is no branching overhead for the use of operators, and the compiler output for adding four numbers is roughly the same whether they are held in the vector class or as plain floats.

  • Use modern C++ Style

    By this I mean that I will follow the interface style of the STL and Boost. This includes making sure that move and value semantics are maintained for all types, and making vectors usable in range-based for loops.

    Progress:
    This goal has been maintained in the interfaces. However, range-based for loops are not yet usable, as those will be specialized for a dynamically sized vector class that will operate on each vector as the largest vector type supported. This is a stretch goal and is part of the task to "Add iterators and other helpers".
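As an illustration of the first goal, here is a minimal sketch of how template specialization could map an element type and width onto whatever the compiled feature set provides. The names vec_traits, storage_t, and QV_SKETCH_HAS_SSE are hypothetical and chosen only for this example. The primary template is a portable std::array fallback, and a specialization swaps in the SSE register type __m128 for four floats when the compiler targets SSE.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define QV_SKETCH_HAS_SSE 1  // hypothetical macro, used only in this sketch
#endif

// Primary template: portable fallback for any element type and width.
template <typename T, std::size_t N>
struct vec_traits {
    using storage = std::array<T, N>;
    static constexpr bool is_native = false;
};

#if defined(QV_SKETCH_HAS_SSE)
// Specialization: four 32-bit floats fit exactly in one SSE register.
template <>
struct vec_traits<float, 4> {
    using storage = __m128;
    static constexpr bool is_native = true;
};
#endif

// Convenience alias: the user names only the type and the width.
template <typename T, std::size_t N>
using storage_t = typename vec_traits<T, N>::storage;

// Combinations without a specialization fall back to std::array.
static_assert(std::is_same<storage_t<int, 8>, std::array<int, 8>>::value,
              "no specialization for <int, 8> in this sketch");
// Either way, <float, 4> occupies four packed floats.
static_assert(sizeof(storage_t<float, 4>) == 4 * sizeof(float),
              "four packed floats");
```

Adding AVX or Neon support in this style would mean adding further guarded specializations, while user code written against storage_t<T, N> stays the same.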


Why use cross-platform C++?

C++ is the current standard language for high-performance cross-platform application development. It is also capable of achieving levels of abstraction that are not available in other languages. In addition, base intrinsics are available on all of the target platforms in C or C++ as a foundation for implementations.

Schedule

Week                  Task                                                      Done
April 3 - April 10    Find commonalities and design interfaces.
April 10 - April 17   Implement sequential and SSE.
April 17 - April 24   Simple instructions for all SSE versions, AVX, and Neon.
                      Create Android project to test ARM Neon implementation.
                      Expand test bench for all versions.
April 24 - May 1      Automatic detection of extensions.
                      Compile-time detection or specification of extensions.
                      Create interface for efficient loads and stores.
                      Add integer type interface for all versions.
May 1 - May 8         Add compare and masking operators.
                      Add shuffles and shifts.
                      Add dynamically sized array type.
                      Create performance test (QuickVec, Yeppp!, intrinsics).
May 8 - May 11        Prepare for final presentation.

Completed Tasks

Test bench

I have created a test bench using Visual Studio unit tests. In it I test each operator for consistency against the sequential implementation of that operator.

Sequential Implementation of float operators

The sequential implementation is designed to be the base class of all further specialized classes (such as SSE or AVX). It performs a compile-time unroll of whatever operator is being applied across all elements of the vector, using a template metaprogramming technique that approximates a compile-time for-each. The reason for this is to guarantee that there is no branching when operating on all vector elements. The hope is that, if a specific platform does not properly report its support for SSE or AVX, instruction-level parallelism (ILP) will still work smoothly on the resulting set of instructions.
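A compile-time unroll of this kind can be sketched with a recursive template. This is a generic illustration of the technique, not the actual QuickVec code; the names unroll, map2, add_op, and unroll_example are hypothetical. Each instantiation handles one lane and then instantiates the next, so the generated code contains no loop counter or loop branch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Recursive case: apply op to lane I, then continue with lane I + 1.
template <std::size_t I, std::size_t N>
struct unroll {
    template <typename T, typename Op>
    static void apply(std::array<T, N>& out, const std::array<T, N>& a,
                      const std::array<T, N>& b, Op op) {
        out[I] = op(a[I], b[I]);
        unroll<I + 1, N>::apply(out, a, b, op);
    }
};

// Base case: I == N, nothing left to do, so the recursion stops here.
template <std::size_t N>
struct unroll<N, N> {
    template <typename T, typename Op>
    static void apply(std::array<T, N>&, const std::array<T, N>&,
                      const std::array<T, N>&, Op) {}
};

// Element-wise binary operation over all N lanes, expanded at compile time.
template <typename T, std::size_t N, typename Op>
std::array<T, N> map2(const std::array<T, N>& a, const std::array<T, N>& b, Op op) {
    std::array<T, N> out{};
    unroll<0, N>::apply(out, a, b, op);
    return out;
}

// Example operator functor.
struct add_op {
    template <typename T>
    T operator()(T x, T y) const { return x + y; }
};

// Small usage example: add two 4-wide float vectors lane by lane.
inline void unroll_example() {
    std::array<float, 4> a{{1.0f, 2.0f, 3.0f, 4.0f}};
    std::array<float, 4> b{{10.0f, 20.0f, 30.0f, 40.0f}};
    std::array<float, 4> sum = map2(a, b, add_op{});
    assert(sum[0] == 11.0f && sum[3] == 44.0f);
}
```

Optimizing compilers will often unroll a short fixed-count loop on their own. The template form makes the expansion explicit rather than leaving it to the optimizer.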

I implemented all arithmetic operators for the sequential implementation. It is extensible to different floating-point and integer types, as well as to any vector width. Both the data type and the class used to access it are template parameters. For the bare sequential implementation, the data type is a std::array with an accessor class that returns elements from positions in the array. Elements can be retrieved with run-time or compile-time indexes. Compile-time indexing is implemented in order to support compile-time unrolling, as well as shuffling and other functions later on.
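The accessor idea could be sketched as follows, assuming a std::array data type as described above. The names array_accessor and shuffle4 are hypothetical. The run-time overload takes the index as an ordinary argument, while the compile-time overload takes it as a template parameter, so an out-of-range lane is rejected by the compiler and a shuffle can be written as a fixed pattern.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical accessor for std::array-backed storage.
template <typename T, std::size_t N>
struct array_accessor {
    using storage = std::array<T, N>;

    // Run-time index: checked with assert in debug builds.
    static T get(const storage& s, std::size_t i) {
        assert(i < N);
        return s[i];
    }

    // Compile-time index: an out-of-range lane fails to compile.
    template <std::size_t I>
    static T get(const storage& s) {
        static_assert(I < N, "lane index out of range");
        return s[I];
    }
};

// A 4-wide shuffle built from compile-time indexes: result lane k takes
// the source lane named by the k-th template argument.
template <std::size_t I0, std::size_t I1, std::size_t I2, std::size_t I3, typename T>
std::array<T, 4> shuffle4(const std::array<T, 4>& s) {
    using acc = array_accessor<T, 4>;
    return {{acc::template get<I0>(s), acc::template get<I1>(s),
             acc::template get<I2>(s), acc::template get<I3>(s)}};
}
```

For example, shuffle4<3, 2, 1, 0>(v) would reverse the lanes of v, with every index resolved at compile time.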

SSE 4-wide 32-bit float

I extended the sequential float class (float_base) with a class that has a data type of __m128 and implemented an accessor class for it. I then overloaded all of the arithmetic operators except for the % operator, as SSE does not support modulo.
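For illustration, a standalone 4-wide float class over __m128 might look like the sketch below. It is a simplification under stated assumptions: the class name float4, its lane function, and the QV_SKETCH_SSE macro are hypothetical, and unlike the real class it does not derive from float_base. The intrinsics used (_mm_setr_ps, _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_storeu_ps) come from the standard SSE header xmmintrin.h. A scalar branch with the same interface is included so the sketch also compiles where SSE is not available.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define QV_SKETCH_SSE 1  // hypothetical macro, used only in this sketch
#endif

// Hypothetical standalone class; there is deliberately no operator%.
class float4 {
public:
#if defined(QV_SKETCH_SSE)
    // _mm_setr_ps stores its arguments in lane order 0, 1, 2, 3.
    float4(float a, float b, float c, float d) : v_(_mm_setr_ps(a, b, c, d)) {}

    friend float4 operator+(const float4& x, const float4& y) { return float4(_mm_add_ps(x.v_, y.v_)); }
    friend float4 operator-(const float4& x, const float4& y) { return float4(_mm_sub_ps(x.v_, y.v_)); }
    friend float4 operator*(const float4& x, const float4& y) { return float4(_mm_mul_ps(x.v_, y.v_)); }
    friend float4 operator/(const float4& x, const float4& y) { return float4(_mm_div_ps(x.v_, y.v_)); }

    // Read one lane by copying the register to memory (unaligned store).
    float lane(int i) const {
        assert(i >= 0 && i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v_);
        return tmp[i];
    }

private:
    explicit float4(__m128 m) : v_(m) {}
    __m128 v_;
#else
    // Scalar fallback with the same interface.
    float4(float a, float b, float c, float d) : v_{a, b, c, d} {}

    friend float4 operator+(const float4& x, const float4& y) {
        return float4(x.v_[0] + y.v_[0], x.v_[1] + y.v_[1], x.v_[2] + y.v_[2], x.v_[3] + y.v_[3]);
    }
    friend float4 operator-(const float4& x, const float4& y) {
        return float4(x.v_[0] - y.v_[0], x.v_[1] - y.v_[1], x.v_[2] - y.v_[2], x.v_[3] - y.v_[3]);
    }
    friend float4 operator*(const float4& x, const float4& y) {
        return float4(x.v_[0] * y.v_[0], x.v_[1] * y.v_[1], x.v_[2] * y.v_[2], x.v_[3] * y.v_[3]);
    }
    friend float4 operator/(const float4& x, const float4& y) {
        return float4(x.v_[0] / y.v_[0], x.v_[1] / y.v_[1], x.v_[2] / y.v_[2], x.v_[3] / y.v_[3]);
    }

    float lane(int i) const {
        assert(i >= 0 && i < 4);
        return v_[i];
    }

private:
    float v_[4];
#endif
};
```

With either branch, an expression such as (a + b) * c reads like ordinary scalar arithmetic. On SSE targets each operator maps to a single packed instruction.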

Deliverables

A Correctness Test Bench
I will implement a test suite that will check for the correctness of all configurations for the implemented classes and functions.
A Performance Test Bench
I will create a performance test that will compare timing of specific aspects of QuickVec against Yeppp! and AVX intrinsics.
An example implementation
I will implement a common algorithm such as a Mandelbrot set or marching cubes to demonstrate the ease of use of the interface.
Documentation
I will document all classes and features as well as provide example code for common tasks.
Presentation
My final presentation will include results of the performance comparison, sample snippets of code, and a demo of the example application.