Emotional DJ (E-DJ)
Or... Blind Video Expression Mapping








Intro
(intentionally less technical)

I noticed a while ago that if you distort a picture of a person's face in just the right way, it seems to change their emotional expression. I first saw this while playing with a distort filter in Photoshop, and I started to wonder whether I could automatically change the expression on someone's face.

There is a lot of work on facial expression generation, but most of it relies heavily on human-in-the-loop scenarios and complex models, such as the 3D mesh models used in "Expression Cloning" [1]. The problem with the complex models is that they won't run in real-time, and there are plenty of problems with keeping a human in the loop. The images generated using expression cloning (right) are really neat, but they can't be generated completely automatically. One really compelling paper [2] started with a human face and used geometric image warping to create new facial expressions: the original image is on the left, and the warped image is in the middle. This is called expression mapping. To generate the image in the middle, several training images of this girl's face were hand-marked with a few dozen control points. The paper also used expression ratio images (ERIs), which control for different lighting conditions so that shadows and wrinkles can be added to the facial expression, as in the right-most of the three images.

I really liked the results in that paper, so I set off to try to create similar results using only a single image that was marked automatically. I decided to use Ashish's fully automatic upper facial feature tracking [3] to mark my images. Ashish's setup tracks the eyebrows and eyes of a person in real-time video, so I thought I'd try to get my face warping to work in real-time too. The biggest drawback of this setup is that it requires an infrared camera, which means it can't be used on old video taken with a regular camera. Otherwise it works really well, tracking the eyes and eyebrows of people who sit in front of the camera most of the time. I used the tracking information to warp the face (much like the distorting Photoshop filters) into a new expression, using the 2-pass mesh warping algorithm developed by Wolberg [4]: you give it a set of input coordinates and corresponding output coordinates, and it warps the image according to a mesh built from those coordinates.

To gauge which features to warp and how far to warp them, I made measurements on the Cohn-Kanade database [5]. I measured how far people raised the corners of their mouths when smiling, and how high people raised their eyebrows when displaying certain emotions. I then normalized the distances by dividing through by the distance between the pupils, since that is roughly constant across many different people. These measurements set the warping distances for my project. I also tried to construct simple wrinkle templates by subtracting the image of a smiling person from a neutral image of the same person. I decided to warp three action units: the inner brow raiser, the lip corner puller, and the brow lowerer. From here on, the images I show you are generated by my E-DJ system.
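To make the normalization concrete, here is a minimal C sketch; the Point struct and the coordinates are made up for illustration and aren't taken from the database:

    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y; } Point;

    static double dist(Point a, Point b)
    {
        return sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
    }

    /* Divide a feature's movement (neutral -> expressive) by the
       inter-pupil distance so the measurement is roughly comparable
       across different people. */
    static double normalized_displacement(Point neutral, Point expressive,
                                          Point left_pupil, Point right_pupil)
    {
        return dist(neutral, expressive) / dist(left_pupil, right_pupil);
    }

    int main(void)
    {
        /* Made-up points standing in for hand-marked frames. */
        Point mouth_neutral = { 250.0, 330.0 };
        Point mouth_smiling = { 262.0, 318.0 };
        Point left_pupil    = { 210.0, 220.0 };
        Point right_pupil   = { 300.0, 220.0 };

        printf("normalized lip-corner pull: %.3f\n",
               normalized_displacement(mouth_neutral, mouth_smiling,
                                       left_pupil, right_pupil));
        return 0;
    }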

Results & Analysis

(more technical)














As you can see, the brow raising and lowering work really well. The lip corner pull tended to look less realistic, but still had a good effect. Remember the wrinkles/shadows added using the ERIs? I tried adding in a cheek shadow from a different person when someone smiled in my system, and it didn't look very realistic. It would probably help to segment lots of shadows from lots of people and use a matching scheme to pick the template that fits a given person best.
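For reference, the simple subtractive templates from the intro work roughly like this sketch; the buffer layout, function names, and sign convention here are mine, just to illustrate the idea:

    typedef unsigned char Pixel;

    /* Build a signed template: expressive image minus neutral image of the
       same person (both single-channel, same dimensions). */
    void build_shadow_template(const Pixel *expressive, const Pixel *neutral,
                               int *tmpl, int n)
    {
        for (int i = 0; i < n; i++)
            tmpl[i] = (int)expressive[i] - (int)neutral[i];
    }

    /* Add a scaled copy of the template onto a new face, clamping to 0..255.
       strength is 0..1, analogous to the warp intensity. */
    void apply_shadow_template(Pixel *face, const int *tmpl,
                               double strength, int n)
    {
        for (int i = 0; i < n; i++) {
            int v = (int)face[i] + (int)(strength * tmpl[i]);
            if (v < 0)   v = 0;
            if (v > 255) v = 255;
            face[i] = (Pixel)v;
        }
    }

As noted above, a naive template taken from one person doesn't transfer convincingly to another, which is why a matching scheme over many templates seems necessary.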






I created a real-time system in C, so that I could control the left and right inner eyebrows, as well as the corners of the mouth, from the keyboard. To warp, I used C code based on Wolberg's original 2-pass mesh-warping algorithm [4]. The code is distributed with xmorph, which uses it as the major substep of image morphing, and it is open source. I used Ashish Kapoor's face tracking code [3] to find the facial features. The main challenges in getting the system to where it is were the learning curves of real C programming in Linux; I had used C long ago under Windows, but in a visual environment. One of my biggest hang-ups, for example, was a linking error in a Makefile.
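The keyboard control amounts to keeping a per-feature intensity that the arrow keys nudge up and down. A hypothetical sketch (the key codes, step size, and names are assumptions, not the actual E-DJ bindings):

    typedef enum { LEFT_BROW, RIGHT_BROW, LEFT_MOUTH, RIGHT_MOUTH,
                   NUM_FEATURES } Feature;

    /* Made-up key codes; the real values came from the capture loop. */
    enum { KEY_ARROW_UP = 1000, KEY_ARROW_DOWN = 1001, KEY_NEXT_FEATURE = 'n' };

    double  intensity[NUM_FEATURES];   /* fraction of the maximum offset, 0..1 */
    Feature selected = LEFT_BROW;      /* which feature the arrow keys move    */

    void handle_key(int key)
    {
        const double step = 0.05;      /* change per key press (assumed)       */

        if (key == KEY_NEXT_FEATURE)
            selected = (Feature)((selected + 1) % NUM_FEATURES);
        else if (key == KEY_ARROW_UP && intensity[selected] < 1.0)
            intensity[selected] += step;
        else if (key == KEY_ARROW_DOWN && intensity[selected] > 0.0)
            intensity[selected] -= step;
    }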

The tracking didn't always work, but when it did, the warping always produced a fun effect. The effect wasn't always realistic-looking, though. I used a mesh grid for warping and varied the number of mesh points from 4 by 4 to 16 by 16; values near 8 by 8 worked pretty well. A high number of mesh points gave very accurate control over the warping, but didn't let you warp very far before it looked bad because the mesh splines crossed each other. A low number of mesh points gave you more freedom to warp a feature far, but less accurately, since unintended parts of the image tended to warp along with the intended feature. I set the system up so you could control the warping with the keyboard, to allow real-time exploration of emotional expressions. While I didn't run any scientific experiments with the system, I did let a dozen or so people try it while I observed. When the tracking worked, everyone found it fun and tended to laugh or generally be amused at the funny faces E-DJ was making; some people played with it for a very long time (just over 5 minutes). But when the tracking failed consistently, people were frustrated. When the tracking failed, it was usually because something reflective in the background confused the pupil detector. It helped a lot when I blocked out the background with a black poster board, so the background wouldn't reflect infrared light. The detector was also far more accurate in diffuse light.
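Coming back to the mesh resolution trade-off: the source mesh is just a regular n-by-n grid of control points spread over the image. A minimal sketch of laying one out (the struct and data layout are mine, not Wolberg's):

    #include <stdlib.h>

    typedef struct { double x, y; } MeshPoint;

    /* Lay out an n-by-n grid of control points spread evenly over a
       width-by-height image (n >= 2). Caller frees the result. */
    MeshPoint *make_regular_mesh(int n, int width, int height)
    {
        MeshPoint *mesh = malloc((size_t)n * n * sizeof *mesh);
        if (!mesh)
            return NULL;

        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                mesh[row * n + col].x = col * (width  - 1) / (double)(n - 1);
                mesh[row * n + col].y = row * (height - 1) / (double)(n - 1);
            }
        }
        return mesh;
    }

With n = 8 the grid spacing is coarse enough that a single moved point pulls a fairly large neighborhood of the image, which is the freedom-versus-accuracy trade-off described above.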

The algorithm used to warp the facial features was pretty simple. The mesh point closest to the facial feature being warped was chosen and assigned the coordinate of that feature; the corresponding output mesh point was assigned the same coordinate plus some offset. The maximum offset was empirically determined by measuring a few images in the Cohn-Kanade database, and the percentage of the maximum offset actually applied was set by the user with the up and down arrow keys. The maximum eyebrow offsets were typically about 40 pixels (assuming a 640x480 image), and the maximum mouth corner offsets were typically between 50 and 60 pixels.
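In code, the assignment step looks roughly like this sketch (same made-up MeshPoint layout as above; the actual warping is then done by Wolberg's 2-pass code, which is not reproduced here):

    typedef struct { double x, y; } MeshPoint;   /* same layout as the grid sketch */

    /* Index of the mesh point closest to the tracked feature at (fx, fy). */
    int nearest_mesh_point(const MeshPoint *mesh, int count, double fx, double fy)
    {
        int best = 0;
        double best_d2 = 1e30;

        for (int i = 0; i < count; i++) {
            double dx = mesh[i].x - fx, dy = mesh[i].y - fy;
            double d2 = dx * dx + dy * dy;
            if (d2 < best_d2) { best_d2 = d2; best = i; }
        }
        return best;
    }

    /* Pin the nearest source mesh point to the tracked feature and move the
       corresponding destination point by intensity * maximum offset; the two
       meshes are then handed to the 2-pass mesh warp. */
    void warp_feature(MeshPoint *src, MeshPoint *dst, int count,
                      double fx, double fy,
                      double max_dx, double max_dy, double intensity)
    {
        int i = nearest_mesh_point(src, count, fx, fy);

        src[i].x = fx;
        src[i].y = fy;
        dst[i].x = fx + intensity * max_dx;
        dst[i].y = fy + intensity * max_dy;
    }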

The system ran at around 4 frames per second at the 160 x 120 image size I used. I wrote some simple code to downsample and upsample the images without any interpolation, for maximal speed. The computer has (I believe) a 1.4 GHz processor; a new machine would probably double the frame rate, creating a smoother video experience.
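The resampling really is as simple as picking the nearest source pixel. Roughly (single-channel 8-bit buffers shown for brevity):

    /* Nearest-neighbor resampling: used both to shrink frames for speed and
       to blow the result back up for display. No interpolation at all. */
    void resample_nearest(const unsigned char *src, int sw, int sh,
                          unsigned char *dst, int dw, int dh)
    {
        for (int y = 0; y < dh; y++)
            for (int x = 0; x < dw; x++)
                dst[y * dw + x] = src[(y * sh / dh) * sw + (x * sw / dw)];
    }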

I think there are two major challenges for the future of this system. The first is to create a more complex scheme for mapping mesh points, so that warping a feature affects more than just one mesh point. The splines do some of this work for you, but not enough.

 

I like the project for its artistic appeal alone. For example, I can imagine performing by pointing cameras at people and then E-DJ'ing their faces in funny ways while projecting the output onto the wall. E-DJ also stands up fine just as a simple toy with no explicit purpose.

But, perhaps more importantly, I think we can learn a lot from using and playing with E-DJ. One fun idea is that seeing our own faces warped may evoke the facial expression we see on the monitor. If that happens, E-DJ will then push the evoked response a little further than it really is, creating the beginning of a positive feedback loop by leading the facial feature.

A more concrete way to use E-DJ for learning might be to pair it with facial expression detectors and purposely exaggerate a detected expression. This could be used to train people who have a hard time distinguishing emotional expressions, such as people with autism. It's also possible that people will learn about their emotions just by exploring their facial expressions without any explicit goal.

I can also imagine E-DJ being used to subtly change the look of a crowd of people. If you changed each facial expression just a little, the change in the overall look of the picture might be noticeable even though no individual expression looks noticeably different. This would play to E-DJ's strengths, since smaller changes look more realistic.

The same principle could be applied to video. A video with slightly different facial expressions might be really useful for affective research experiments. Imagine being able to condition a person toward a good or bad mood by showing them exactly the same video with only slightly different facial expressions. The similarity between the two videos would help ensure that no biases were introduced haphazardly by showing two entirely different videos.

What's Next?

I can imagine a couple of future paths that might bear fruit. The first is to continue the project by adding control over more action units. The action units could then be coordinated to express complex emotional "words" like surprise. The intensity of the emotional words could still be controlled by the user, or emotional stories could be preprogrammed with a timeline editor not unlike a simple music editor. To get all of this to work properly, though, a better mesh point assignment algorithm will need to be written. To make the expressions convincing, some kind of wrinkle/shadow system will also need to be implemented. That said, when a feature is warped it tends to "bunch up" wrinkles or shadows that already exist, giving the illusion of increased wrinkling. There is a physical analogy for image warping that says to think of the image as a rubber sheet that can be pulled and stretched; the face is particularly well suited to this, because it is quite pliable and not too far from being a rubber sheet.

 

Another direction I would like to take this system is to switch the facial feature tracking to software that works on regular (non-infrared) video and images. The image above was marked automatically by face tracking software licensed to Dr. Breazeal's group and available for use throughout the Media Lab. This would allow me to create plug-ins for existing image and video software, which would really share the technology, giving it away for free in a form people can actually use. It is enticing to imagine my favorite (or least favorite) politician giving a supposedly sad speech with a "joker smile" on her face.

 

Wrap Up

Even with just the three action units demonstrated in this project, it seems promising that blind expression mapping can be done realistically in real-time. The open questions are how to add more action units and how to gain better control over the mesh points. The feature tracking is a separate module that can be improved or swapped out without much disruption.

 

Most Important Resources

[1] J. Y. Noh and U. Neumann. Expression Cloning. In Proceedings of SIGGRAPH '01, pages 277-288, 2001.

[2] Z. Liu, Y. Shan, and Z. Zhang. Expressive Expression Mapping with Ratio Images. In Computer Graphics, Annual Conference Series (SIGGRAPH), pages 271-276, August 2001.

[3] A. Kapoor and R. W. Picard. Real-Time, Fully Automatic Upper Facial Feature Tracking. In Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition (FG '02), Washington, D.C., 2002.

[4] G. Wolberg. Digital Image Warping. IEEE Computer Society Press, Los Alamitos, CA, 1990.

[5] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive Database for Facial Expression Analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG '00), Grenoble, France, March 2000.