Project Proposal:
Performing Optical Flow with CUDA
Dev Shah - devs@andrew.cmu.edu, Haopeng Wang - wanghp18@andrew.cmu.edu
Summary
We optimized the performance of optical flow using scheduling techniques in Halide (a domain-specific language for image processing). We ended up with a 7x speedup over a sequential version of the code.
Background
Optical flow is the apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion between the observer and the scene. It is extremely useful in object tracking: by tracking the relative motion of the entire scene, one can keep track of where an object is with respect to previous frames given the change in motion. Our inputs are two images, which are sequential frames in a video. The output is a vector field drawn on top of the second frame: each green dot represents a location on the second image, and the red line indicates the movement from the previous frame.

Halide performs calculations by defining functions which are realized to compute images. We applied different scheduling techniques to these functions in order to improve speedup (we used variations and combinations of the first, fourth, and fifth scheduling strategies), which involved vectorizing and tiling. There are multiple parts of the optical flow algorithm that could be parallelized, the biggest of which is the calculation of the spatial gradient. The spatial gradient requires the derivative of the image with respect to the x and y directions, which means gathering a 5x5 window around each pixel and calculating the effect neighboring pixels have on that specific pixel. This is similar to the blur example commonly used to demonstrate the power of Halide. Much of the computation here can be improved by exploiting spatial locality: with an improperly scheduled pipeline, a lot of redundant work is done, because the intermediate data computed for the derivatives is needed again when calculating the spatial gradient. Much of that information can be cached and reused rather than recomputed.
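To make the blur analogy concrete, the sketch below shows the classic two-stage Halide blur; the names and tile sizes are illustrative, and the input is assumed to be a float-valued, boundary-clamped Func. The schedule is where the caching-versus-recomputation trade-off is expressed.

    #include "Halide.h"
    using namespace Halide;

    // Classic two-stage blur: blur_x is the intermediate whose values are
    // reused by blur_y. "in" is assumed to be float-valued and already
    // clamped at the boundary.
    Func blur_3x3(Func in) {
        Var x("x"), y("y"), xi("xi"), yi("yi");
        Func blur_x("blur_x"), blur_y("blur_y");
        blur_x(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3.0f;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3.0f;

        // The schedule decides how much of blur_x is cached versus recomputed:
        // computing it per tile of blur_y accepts a little redundant work at
        // tile edges in exchange for much better locality.
        blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
        blur_x.compute_at(blur_y, x).vectorize(x, 8);
        return blur_y;
    }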

Approach
We used the image-processing-specific language Halide to implement the pyramidal recursive LK optical flow algorithm. Halide runs on LLVM and relies on the CUDA library, so we can compile Halide code targeting the CUDA platform to utilize the GPU's highly parallel computation power. During the project implementation we used the GHC63 machine, which has a GeForce GTX 480 graphics processing unit.
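Switching between the CPU and GPU backends only requires changing the compilation target. A minimal sketch, assuming a recent Halide release (the function and parameter names are ours):

    #include "Halide.h"
    using namespace Halide;

    // Realize a pipeline on the GPU by adding the CUDA feature to the host
    // target. Note that the Func still needs a GPU schedule (e.g. gpu_tile)
    // for its loops to actually run on the device.
    Buffer<float> run_on_gpu(Func pipeline, int width, int height) {
        Target t = get_host_target();
        t.set_feature(Target::CUDA);
        return pipeline.realize({width, height}, t);
    }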

The pyramidal recursive LK optical flow implementation involves several parts:

1) Grayscaling the given color image: Given a color input image, we converted the three-channel color information into single-channel grayscale using the Rec. 601 luma computation. In this step the whole image is converted, and Halide is very well suited to this kind of bulk image processing: we simply use the computation-definition syntax provided by Halide, and Halide performs a simple pixel-by-pixel parallelism.
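A minimal sketch of this stage (the helper name is ours; the input is assumed to be an 8-bit, 3-channel RGB buffer):

    #include "Halide.h"
    using namespace Halide;

    // Rec. 601 luma: Y = 0.299 R + 0.587 G + 0.114 B, computed for every
    // pixel in bulk. Halide parallelizes this pixel by pixel on its own.
    Func make_gray(Buffer<uint8_t> input) {
        Var x("x"), y("y");
        Func gray("gray");
        gray(x, y) = cast<uint8_t>(0.299f * input(x, y, 0) +
                                   0.587f * input(x, y, 1) +
                                   0.114f * input(x, y, 2));
        return gray;
    }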

2) Detecting good features to track: Given the converted grayscale image, for each pixel we take a three-by-three window and compute the maximum absolute intensity difference from the central pixel, and that maximum is saved for the central pixel. We then use a double for loop to traverse the processed pixels and pick those whose maximum intensity difference is higher than a preset, tuned threshold. We also constrain every selected feature pixel to not be too close to another (we set the minimum distance to 2 pixels). The computation of the maximum intensity difference is very well suited to Halide's bulk image processing, but selecting the feature pixels is not, since we cannot use Halide to do conditional computation that generates output whose size is not predefined.
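A sketch of the split described above, with the window maximum in Halide and the selection loop in plain C++ (the names and the quadratic spacing check are illustrative):

    #include "Halide.h"
    #include <cstdlib>
    #include <vector>
    using namespace Halide;

    struct Point { int x, y; };

    std::vector<Point> detect_features(Buffer<uint8_t> g, int threshold) {
        Var x("x"), y("y");
        Func gc = BoundaryConditions::repeat_edge(g);
        RDom w(-1, 3, -1, 3);  // 3x3 window offsets
        // Bulk Halide stage: per-pixel maximum absolute difference from the
        // central pixel over the 3x3 window.
        Func score("score");
        score(x, y) = maximum(abs(cast<int>(gc(x + w.x, y + w.y)) -
                                  cast<int>(gc(x, y))));
        Buffer<int> s = score.realize({g.width(), g.height()});

        // Sequential selection: threshold plus a minimum spacing of 2 pixels.
        // Done on the CPU because the output size is not known in advance.
        std::vector<Point> feats;
        for (int j = 0; j < s.height(); j++)
            for (int i = 0; i < s.width(); i++) {
                if (s(i, j) <= threshold) continue;
                bool ok = true;
                for (const Point &p : feats)
                    if (std::abs(p.x - i) < 2 && std::abs(p.y - j) < 2) {
                        ok = false;
                        break;
                    }
                if (ok) feats.push_back({i, j});
            }
        return feats;
    }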

3) Building three layers of the pyramid (the number of layers recommended by the reference paper): We borrowed the equation from OpenCV's pyramid-down CUDA code to downsample the grayscale image. Although the output image is not the same size as the input, its dimensions are certain to be half those of the corresponding input image, so this step is very well suited to Halide.
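A sketch of one pyramid level, using the standard weighting from the Bouguet reference paper (our actual weights followed the OpenCV CUDA pyrDown code; the input Func is assumed to be float-valued and boundary-clamped):

    #include "Halide.h"
    using namespace Halide;

    // One pyramid level: each output pixel (x, y) is a weighted average of
    // the 3x3 neighborhood around (2x, 2y) in the level below, so each
    // dimension is halved.
    Func pyr_down(Func g) {
        Var x("x"), y("y");
        Func down("down");
        down(x, y) =
            0.25f   *  g(2*x, 2*y) +
            0.125f  * (g(2*x - 1, 2*y) + g(2*x + 1, 2*y) +
                       g(2*x, 2*y - 1) + g(2*x, 2*y + 1)) +
            0.0625f * (g(2*x - 1, 2*y - 1) + g(2*x + 1, 2*y - 1) +
                       g(2*x - 1, 2*y + 1) + g(2*x + 1, 2*y + 1));
        return down;
    }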

4) Computing pixel intensity derivatives: We use the central difference method to compute the intensity derivative of each pixel of the grayscale image in both the row and column directions. This step is very well suited to Halide; it is basically very similar to the blur computation.
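A sketch of the derivative Funcs (the grayscale input is assumed to be float-valued and boundary-clamped):

    #include "Halide.h"
    using namespace Halide;

    // Central-difference derivatives in x and y, defined over the whole image.
    void derivatives(Func g, Func &Ix, Func &Iy) {
        Var x("x"), y("y");
        Ix(x, y) = (g(x + 1, y) - g(x - 1, y)) / 2.0f;  // d/dx
        Iy(x, y) = (g(x, y + 1) - g(x, y - 1)) / 2.0f;  // d/dy
    }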

5) Computing the spatial gradient matrix: We use the equation in the reference paper, and this step is not well suited to Halide. First, the computation involves reading adjacent pixels multiple times and using the pixel information in different ways. Second, in our project the window size for computing the matrix is five by five, so computing a gradient matrix for every pixel would impose too much extra computation overhead. (Halide is good at whole-image processing, and we did not figure out an efficient way to apply computation only at selected points, because we did not find a way to do conditional computation like a vector mask.) So we use a double for loop to compute the matrix for each selected feature pixel; although the sequential code is not efficient, the computation workload is reduced dramatically (about 1000x smaller), so the overall performance is better than the Halide counterpart implementation.
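A sketch of the per-feature computation (plain C++; the derivative images are assumed to be realized into row-major float arrays, and the feature is assumed to be far enough from the border that the window stays in bounds):

    // Spatial gradient matrix for one feature (px, py), following the usual
    // LK formulation over a 5x5 window:
    //     G = sum [ Ix*Ix  Ix*Iy ]
    //             [ Ix*Iy  Iy*Iy ]
    struct GradMatrix { double gxx, gxy, gyy; };

    GradMatrix gradient_matrix(const float *Ix, const float *Iy,
                               int width, int px, int py) {
        GradMatrix G = {0, 0, 0};
        for (int v = -2; v <= 2; v++) {
            for (int u = -2; u <= 2; u++) {
                float ix = Ix[(py + v) * width + (px + u)];
                float iy = Iy[(py + v) * width + (px + u)];
                G.gxx += ix * ix;
                G.gxy += ix * iy;
                G.gyy += iy * iy;
            }
        }
        return G;
    }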

6) Recursive LK optical flow on the pyramid layers: Since we only have spatial gradient matrices for the selected sample points, it is natural to do this step in a sequential manner as well. We did not implement a complete sequential version first, but we compared the performance of the pthread-pool and CUDA backends and found that the pthread-pool method wins. For reasons we have not yet figured out, every Halide schedule (and every combination of schedules) we tried lowered performance; currently, the Halide code without any scheduling performs best. The algorithm comes from the referenced paper, and we borrow some equations from the OpenCV optical flow implementation.
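A sketch of one iteration of the per-feature LK update (plain C++, following the reference paper's formulation; subpixel interpolation of the second frame is omitted for brevity, and all array and parameter names are ours):

    #include <cmath>

    // One LK update for a feature at (px, py): accumulate the mismatch vector
    // b over the same 5x5 window used for G, then solve G * v = b by
    // inverting the 2x2 matrix. (gx, gy) is the guess propagated from the
    // coarser pyramid level; (vx, vy) is refined in place.
    void lk_step(const float *I, const float *J,
                 const float *Ix, const float *Iy, int width,
                 int px, int py, double Gxx, double Gxy, double Gyy,
                 float gx, float gy, float &vx, float &vy) {
        double bx = 0, by = 0;
        for (int v = -2; v <= 2; v++) {
            for (int u = -2; u <= 2; u++) {
                int i0 = (py + v) * width + (px + u);
                int i1 = (int)(py + v + gy + vy) * width
                       + (int)(px + u + gx + vx);
                double dI = I[i0] - J[i1];  // temporal image mismatch
                bx += dI * Ix[i0];
                by += dI * Iy[i0];
            }
        }
        double det = Gxx * Gyy - Gxy * Gxy;
        if (std::fabs(det) < 1e-9) return;  // untrackable point
        vx += (float)(( Gyy * bx - Gxy * by) / det);
        vy += (float)((-Gxy * bx + Gxx * by) / det);
    }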

First, we worked on implementing a correct pyramidal LK optical flow algorithm. The default parallel scheme Halide utilizes is a pthread pool, so we first tested our code with the pthread-pool method, and then compiled our code targeting the CUDA platform to compare the performance difference.

Results
Here are two frames in succession

As you can see, the car on the left is moving towards us and the cars on the right are moving away. This is in line with the data we received back from our optical flow results. The red dots on the first frame indicate the points selected by feature detection, and as mentioned before, the green dots are the end points of the predicted movement based on a neighborhood examination in the second frame.



We were able to achieve a 7x speedup relative to the version of our code that used sequential scheduling once we incorporated a highly efficient schedule in which we tiled and then vectorized our definition of the derivative calculation. This is shown in the following graph ("no scheduling" represents all code being run in for loops on a CPU, whereas the optimized code runs on the GPU with our scheduling tactics).
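The schedule was along these lines (a sketch; the tile and vector sizes are illustrative, and the exact GPU scheduling calls depend on the Halide version, as noted in the comments):

    #include "Halide.h"
    using namespace Halide;

    // CPU variant: tile for locality, vectorize the inner x loop, and run
    // rows of tiles in parallel across the pthread pool.
    void schedule_cpu(Func Ix) {
        Var x = Ix.args()[0], y = Ix.args()[1];
        Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
        Ix.tile(x, y, xo, yo, xi, yi, 64, 32)
          .vectorize(xi, 8)
          .parallel(yo);
    }

    // GPU variant: the same tiling mapped onto CUDA blocks and threads
    // (recent Halide API; older releases spelled this differently).
    void schedule_gpu(Func Ix) {
        Var x = Ix.args()[0], y = Ix.args()[1];
        Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
        Ix.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    }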



Additionally, we wanted to compare performance against the working OpenCV implementation of optical flow. The results we obtained are shown below.



Our implementation was noticeably slower than the OpenCV implementation (it may be difficult to see, but we also examined our grayscale function in comparison to OpenCV's and saw a slowdown of roughly 50x). After some research, we realized that some inherent interfacing between CUDA and Halide was bottlenecking the process: every time a method such as gpu_thread or cuda_block was called, the processing time was much slower than anticipated. This is further exemplified by the fact that the signature code from the Halide website (also found in Kayvon's notes), which is said to run a 1-megapixel blur in 0.9 ms, ran on the Gates machine in times varying between 40 and 60 ms. We experimented on multiple images/frames where we kept the number of input points consistent: feature detection injected around 300 points to track in our Halide implementation, and since the OpenCV implementation takes points as a parameter, we hard-coded in a comparable number of points. We are still searching for reasons why this speedup is low, but we believe that our scheduling tactic is highly optimized, meaning that once we solve the delay from the CUDA code, we will see reasonable speedup with respect to OpenCV.
Division of Labor
Haopeng Wang - research, feature detection, vectorized movement, implementation of the sequential optical flow code, improved scheduling for optical flow

Dev Shah - research, testing/debugging, improved scheduling for optical flow, implementation of the experimental setup comparing against OpenCV
References

Ragan-Kelley, Jonathan, and Andrew Adams. "Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines." Web. 9 May 2014.

Bouguet, Jean-Yves. "Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm." Intel Corporation. Web.

Ragan-Kelley, Jonathan, and Connelly Barnes. "Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines." MIT. Web. 9 May 2014.