Darkroom Implementation of the CMU Smart Headlight
My final project for the course 15-869: Visual Computing Systems
The CMU Smart Headlight is a reactive visual system developed at Carnegie Mellon University under the leadership of Prof. Srinivasa Narasimhan of the Robotics Institute.
I intend to implement a similar high-speed image processing/analysis pipeline using Darkroom, a language for describing hardware image processing pipelines that is embedded in Terra.
The Big Picture
A binary reactive visual system is a dynamic lighting system that consists of three major stages: image sensing, image processing, and reactive illumination/display. The display illuminates or dis-illuminates certain objects in the scene based on the outcome of the processing stage. The CMU Smart Headlight is one such application, where we track and dis-illuminate tiny, scattered particles of finite size (typically snow/rain). Another proposed use case is in theaters, for automated control of the spotlights and lighting demanded by plays.
Picture Courtesy: Illumination and Imaging Lab, Carnegie Mellon University
Quite often, impressive results come from high-speed, low-latency systems that employ very simple algorithms, when compared against systems that rely on more complex algorithms to achieve similar results at the cost of increased latency.
The pipeline stages
The following stages are explained, simplistically, in the context of a programmable headlight/illumination system.
- Image Acquisition : This is the sensing stage. We capture the input image using a high-speed monochrome CMOS camera. A typical implementation would require a capture rate of at least 500 fps at the resolution demanded by the application.
- Background Subtraction : The simplest form of background subtraction averages a few frames and then subtracts the result from subsequent incoming frames. Unless the background is fairly static, the background image needs to be recalculated periodically.
If c(x,y,t) represents the input frame at time t, then the background image can be computed as the time average of n consecutive frames, where k is the time instant from which the averaging starts. Generally k = 0 if the background is fairly constant.
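A minimal NumPy sketch of this averaging and subtraction step (the function names and the clip-to-zero choice are my own, not from the original system):

```python
import numpy as np

def background_image(frames, k, n):
    """Time average of n consecutive frames starting at index k.

    frames: array of shape (T, H, W); returns the background b(x, y).
    """
    return frames[k:k + n].mean(axis=0)

def subtract_background(frame, background):
    """Compute cs(x, y, t) = c(x, y, t) - b(x, y), clipped to [0, 255]."""
    diff = frame.astype(np.int32) - background.astype(np.int32)
    return np.clip(diff, 0, 255)
```

In a streaming hardware pipeline the average would instead be maintained incrementally (e.g. a running mean), but the arithmetic is the same.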
- Image Binarization : Most likely the reactive illumination component will be a digital light projector; we turn individual output pixels ON/OFF based on the objects' positions. So the background-subtracted images are binarized by applying a threshold. Ideally the threshold would be computed globally based on the lighting conditions, but for all practical purposes we apply a hardcoded threshold derived experimentally.
The frame cs(x,y,t) obtained by subtracting the background frame is thresholded as mentioned below. A typical value of the threshold is CTh = 100.
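The thresholding itself is a single comparison per pixel; a sketch using the CTh = 100 value from the text:

```python
import numpy as np

C_TH = 100  # hardcoded, experimentally derived threshold (per the text)

def binarize(subtracted, threshold=C_TH):
    """Pixels above the threshold become 1 (foreground), the rest 0."""
    return (subtracted > threshold).astype(np.uint8)
```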
- Dilation / Prediction : In reality, the system responds to an input image after a finite amount of time. During this period, no matter how small it is, the detected object will have changed its position in the 2D plane as seen by both the sensing element (camera) and the actuation element (DLP projector). In order to make sure that the illumination/dis-illumination is nearly accurate, we either need to predict the subsequent positions of the objects using the history of images, or dilate the input image.
"Dilation is one of the basic operations in mathermatical morphology which uses a structuring element for expanding shapes contained in an input image."[Courtesy: Wikipedia]
The most convenient and cache-friendly structuring element is a w x h rectangle. After dilation, each output pixel has a value of 1 if at least one of the pixels within the structuring window in the input frame is 1. Alternatively, we can sum up the pixels in every window and check whether the sum is non-zero. This works because the input and output images are binary.
The dilated pixel values cd(x,y,t) can be computed from Ds(x,y,t), the sum of the pixel values around cs(x,y,t) within the w x h rectangular window.
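The window-sum formulation above can be sketched directly (a reference implementation for clarity, not the line-buffered form a hardware pipeline would use):

```python
import numpy as np

def dilate_binary(img, w, h):
    """Dilate a binary image with a w x h rectangular structuring element.

    An output pixel is 1 if the window sum Ds(x, y, t) is non-zero,
    i.e. if at least one input pixel in the window is 1.
    """
    H, W = img.shape
    padded = np.pad(img, ((h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            out[y, x] = 1 if padded[y:y + h, x:x + w].sum() > 0 else 0
    return out
```

On an FPGA this maps naturally to a line buffer of h rows and a running column sum, which is exactly why the fixed rectangular window is attractive.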
A more robust way of dealing with this situation is to predict the locations of the "blobs" in the image to be projected. Below is a high-level description of "linear prediction". Blobs are nothing but sparse clusters of binary 'HIGH' pixels.
- Detect the "blobs" in the previous and current frames using any of the standard methods. (Typically, difference of gaussians)
- Generate the list of centroids of the detected blobs for every frame.
- For every blob in the current frame, find the nearest neighbor in the previous frame and compute the motion vector (the ordered pair of X difference and Y difference).
- Apply the calculated motion vector to the current blob, and repeat the same procedure for all the blobs in the current frame. When we say "apply the motion vector to a blob", we draw a similar contour (or a circle, for ease) at the newly calculated location in the output image.
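The nearest-neighbor prediction steps above can be sketched as follows, operating on centroid lists (a simplified sketch; it assumes blob detection has already produced the centroids and ignores unmatched blobs):

```python
import numpy as np

def predict_centroids(prev_centroids, curr_centroids):
    """Linear prediction of blob positions for the next frame.

    For each current blob, find its nearest neighbor in the previous
    frame, form the motion vector (dx, dy), and apply it to the
    current position.
    """
    predicted = []
    for c in curr_centroids:
        c = np.asarray(c, dtype=float)
        # nearest neighbor in the previous frame
        dists = [np.linalg.norm(c - np.asarray(p, dtype=float))
                 for p in prev_centroids]
        p = np.asarray(prev_centroids[int(np.argmin(dists))], dtype=float)
        motion = c - p  # motion vector over one frame interval
        predicted.append(tuple(c + motion))
    return predicted
```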
More complex curve-fitting techniques can be applied at the expense of increased latency. Nevertheless, there is a finite amount of uncertainty in the predicted positions of the objects. Moreover, prediction does not fit very well into an FPGA-friendly line-buffered pipeline, so as an easy alternative we use image dilation with a fixed rectangular window.
- Inversion : So far we have been detecting and trying to illuminate particles. To dis-illuminate the objects instead, the foreground pixels need to be turned off and the background turned on (only for dis-illumination). This is achieved by simply toggling all the binary pixel values.
- Image Warping / Homography : The resultant image needs to undergo a transformation from the camera plane to the projector/illumination system's plane. In most implementations, the homography is implemented with the help of a look-up table created by calibrating the system. This is the final image that is projected.
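A sketch of the look-up-table approach: precompute, for every projector pixel, the camera pixel it should sample under a 3x3 homography, then apply the table per frame (the homography matrix itself would come from calibration; nearest-neighbor sampling is an assumption for simplicity):

```python
import numpy as np

def build_lut(homography, height, width):
    """For each projector pixel (x, y), store the camera pixel it samples.

    `homography` maps homogeneous projector coordinates (x, y, 1)
    to camera coordinates; in a real system it comes from calibration.
    """
    lut = np.zeros((height, width, 2), dtype=np.int32)
    for y in range(height):
        for x in range(width):
            sx, sy, sw = homography @ np.array([x, y, 1.0])
            lut[y, x] = (int(round(sx / sw)), int(round(sy / sw)))
    return lut

def warp_with_lut(img, lut):
    """Warp a camera image into the projector plane via the table."""
    H, W = img.shape
    out = np.zeros_like(img)
    for y in range(lut.shape[0]):
        for x in range(lut.shape[1]):
            sx, sy = lut[y, x]
            if 0 <= sx < W and 0 <= sy < H:
                out[y, x] = img[sy, sx]
    return out
```

Because the table is computed once at calibration time, the per-frame cost is a single indexed read per output pixel, which is what makes this step viable at 500 fps.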