20Q-Bot: Face Detection and Gesture Recognition
Our project is an application that lets you play 20 questions by answering yes-or-no questions with either a nod or a shake of your head. It is written in C# using the OpenCV library, and can be used with a simple USB webcam.
Our project creates a user experience that goes beyond the conventional keyboard and mouse. By recognizing people and seeing their actions, we hope to create a new method of interaction between man and machine. In the future, more advanced research in face recognition and motion detection will allow the user to interact with a robot without needing any extra controls.
Our project was built using Visual C# and the OpenCV library. To create your own project or to add on to our project, you need the following installed:
1. Visual Studio 2005 or later. You can get the free Visual Studio C# Express Edition at (http://www.microsoft.com/express/vcsharp/).
2. Open Computer Vision (OpenCV) Library – Downloadable at (http://sourceforge.net/projects/opencvlibrary/). Install in the default directory to be able to use OpenCVDotNet.
3. OpenCVDotNet – This will allow you to use the library in a C# project in Visual Studio. Downloadable at (http://code.google.com/p/opencvdotnet/).
The OpenCVDotNet page has links to a couple of example projects and a small tutorial to get you started. You can download our source code (here).
At each time step, a snapshot of your head is taken with the webcam. First, the image is converted to the YCbCr color space. This color space, sometimes confused with YUV, is used primarily in video and digital photography. It represents images with a luminance component (Y) and blue and red chrominance components (Cb and Cr). The Y component represents how light or dark each pixel is. In fact, standard composite video signals (NTSC or PAL) use Y to carry the black-and-white image (black-and-white TVs only see the Y component). Converting to this color space lets us compensate for different luminance in the scene and compare different skin colors. First we find the average luminance in the image: Y = [equation image], which is normal grayscale. We determine skin color by looking at how much red chrominance is in your face. The formula is: Cr = [equation image]. We say that a pixel is skin-colored if its Cr value falls in the range of 0.04 to 0.18 (on a 0 to 1 scale). Note that this range is independent of the Y component, even though on the color space different Y values change the region that looks red. We then create a new black-and-white image in which the white pixels mark skin color.
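The skin test above can be sketched in a few lines. This is a Python illustration, not the project's C# code. Since the original equations are not shown, it assumes the standard BT.601 luma weights for Y and the BT.601 scaling Cr = 0.713 (R - Y) with no offset; the function names and the nested-list image format are hypothetical. Only the 0.04 to 0.18 range comes from the text.

```python
# Assumption: RGB channels are floats on a 0 to 1 scale.
CR_MIN, CR_MAX = 0.04, 0.18  # skin range quoted in the text


def luminance(r, g, b):
    # Assumed BT.601 luma weights (the page's own formula is not preserved).
    return 0.299 * r + 0.587 * g + 0.114 * b


def red_chrominance(r, g, b):
    # Assumed BT.601 scaling of (R - Y), so Cr spans roughly -0.5 to 0.5.
    return 0.713 * (r - luminance(r, g, b))


def is_skin(r, g, b):
    # The threshold looks only at Cr; Y is not tested directly.
    return CR_MIN <= red_chrominance(r, g, b) <= CR_MAX


def skin_mask(image):
    """image: rows of (r, g, b) tuples. Returns rows of 0/1, where 1 = white = skin."""
    return [[1 if is_skin(*pixel) else 0 for pixel in row] for row in image]
```

With these assumptions, a pixel such as (0.8, 0.6, 0.5) is marked as skin, while neutral gray, pure red, and pure blue are not. If the authors used a different scaling for Cr, the numeric range would map to different colors.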
Next we take the image and apply a noise reduction mask. We slide a 5x5 mask across the image. If most of the pixels in a 5x5 area are white, then the whole area is set to white. This gives cleaner edges and makes blobs more consistent. Blob detection is done using a wavefront algorithm. Each white pixel we encounter is labeled with a blob number, and any white pixels that touch it get the same label. After scanning the image, we have a certain number of blobs that could be a face.
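The two steps above could look roughly like the following Python sketch (again an illustration, not the project's C# code). It assumes the mask is a list of rows of 0/1 values, that the 5x5 window slides one pixel at a time, and that "touching" includes diagonal neighbors; the text does not state any of these, and the function names are hypothetical.

```python
from collections import deque


def fill_majority_white(mask, size=5):
    """Set a whole size x size window to white when most of its pixels are white."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]  # read from mask, write to a copy
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            white = sum(mask[y][x]
                        for y in range(top, top + size)
                        for x in range(left, left + size))
            if white * 2 > size * size:  # strict majority
                for y in range(top, top + size):
                    for x in range(left, left + size):
                        out[y][x] = 1
    return out


def label_blobs(mask):
    """Wavefront (breadth-first) labeling. Returns (labels, blob_count); 0 = background."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] != 1 or labels[sy][sx]:
                continue
            count += 1
            labels[sy][sx] = count
            frontier = deque([(sy, sx)])
            while frontier:  # expand the wavefront from the seed pixel
                y, x = frontier.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] == 1 and not labels[ny][nx]):
                            labels[ny][nx] = count
                            frontier.append((ny, nx))
    return labels, count
```

Writing the filtered result to a copy keeps earlier windows from influencing later ones; an in-place version would grow white regions more aggressively.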
After we have the blobs labeled, only those with a certain height-to-width ratio are considered faces. The ratio must be in the range from 0.8 to 1.5. These values were found experimentally and can be tweaked for your project. We then draw a bounding box in the image around each of those blobs.
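A minimal Python sketch of the ratio filter follows, assuming blobs arrive as a label image (rows of integers, 0 for background). The function names and the (left, top, right, bottom) box layout are hypothetical; the 0.8 and 1.5 limits are the ones quoted above.

```python
def bounding_boxes(labels):
    """Return {label: (left, top, right, bottom)} for every non-zero label."""
    boxes = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab == 0:
                continue
            if lab not in boxes:
                boxes[lab] = [x, y, x, y]
            else:
                box = boxes[lab]
                box[0] = min(box[0], x)
                box[1] = min(box[1], y)
                box[2] = max(box[2], x)
                box[3] = max(box[3], y)
    return {lab: tuple(box) for lab, box in boxes.items()}


def face_candidates(labels, min_ratio=0.8, max_ratio=1.5):
    """Keep only blobs whose height/width ratio lies in [min_ratio, max_ratio]."""
    faces = {}
    for lab, (left, top, right, bottom) in bounding_boxes(labels).items():
        width = right - left + 1
        height = bottom - top + 1
        if min_ratio <= height / width <= max_ratio:
            faces[lab] = (left, top, right, bottom)
    return faces
```

For example, a blob 2 pixels wide and 3 pixels tall (ratio 1.5) is kept, while a blob 4 pixels wide and 1 pixel tall (ratio 0.25) is rejected.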
By now, most of the blobs have been ruled out as not faces. We also attempted some feature detection before the end of the project. We tried to detect the mouth by examining segments of black pixels within a blob. For each y-coordinate (row), we count the consecutive black pixels, and from that histogram, the row with the largest width (number of consecutive black pixels with the same y-coordinate) is taken to be the mouth. This didn't work out completely, and would be one of the first things to add on to the project.
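The mouth search described above could be sketched as follows in Python. This is only an illustration of the row-histogram idea, which the text says did not work completely in practice. It assumes a 0/1 mask (0 = black) and a face box given as (left, top, right, bottom); all names are hypothetical.

```python
def longest_black_run(row):
    """Length of the longest run of consecutive black (0) pixels in a row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel == 0 else 0
        best = max(best, current)
    return best


def find_mouth_row(mask, box):
    """Return (y of the widest black run, per-row histogram) inside the box."""
    left, top, right, bottom = box
    histogram = [longest_black_run(mask[y][left:right + 1])
                 for y in range(top, bottom + 1)]
    widest = max(histogram)
    # On ties, the first (topmost) row wins.
    return top + histogram.index(widest), histogram
```

One weakness of this approach is that eyes, eyebrows, or shadows can produce a wider black run than the mouth, which may be part of why it was unreliable.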
Once we have a face, we do simple gesture recognition to detect a nod or a head shake for yes and no. For shaking your head no, we look at changes in the height/width ratio of the face. Since your head moves from a portrait view to a profile view, the camera sees more skin, and the width of the blob changes. This change in width is used to detect a head shake. The head nod works similarly: by checking for changes in the height of the blob, we can determine whether a nod was performed. Each change in ratio must occur numerous times before the program decides that the action was taken.
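One way to turn this into code is sketched below in Python. The text gives no thresholds, so the 10% change per frame and the minimum of 3 changes are purely hypothetical values, as are the function name and the input format (one (width, height) pair per frame for the face's bounding box).

```python
def classify_gesture(sizes, change=0.10, min_changes=3):
    """sizes: list of (width, height) of the face box, one entry per frame.

    Returns "no" for a shake, "yes" for a nod, or None if undecided.
    The change and min_changes defaults are assumed, not from the project.
    """
    width_changes = height_changes = 0
    for (w0, h0), (w1, h1) in zip(sizes, sizes[1:]):
        if abs(w1 - w0) / w0 > change:  # width moved noticeably: shake evidence
            width_changes += 1
        if abs(h1 - h0) / h0 > change:  # height moved noticeably: nod evidence
            height_changes += 1
    if width_changes >= min_changes and width_changes > height_changes:
        return "no"
    if height_changes >= min_changes and height_changes > width_changes:
        return "yes"
    return None
```

Requiring several changes before answering mirrors the rule above that a change must occur numerous times, which helps reject a single noisy frame.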


We set up a Windows Forms application that links to the 20q.net website. Each time a nod or shake is performed, the form clicks the appropriate link and answers the question. The 20 questions database on the website has been running since 1997, and has accumulated plenty of information over the years.
In the immediate future, the detection algorithm definitely needs to be made more robust. Fixing and adding more feature recognition should make the face detection hold up better against complex backgrounds. Later on, adding more functionality to the program will allow for different user experiences. Speech recognition would be a good addition: a robot could recognize your face and have a conversation with you, without being limited to general conversation. It could know you and ask questions specific to your needs.