Affective Labeling

Cory D. Kidd

Spring 2004

Problem

When humans interact with one another, we infer a great deal from the emotional cues in each other's speech. Imagine a parent presenting two toys to a child who has never seen them before. Pointing out the first one, the parent sounds excited and says "Here is a toy!" Handing over the second, she sounds somewhat disgusted and says "Here is a toy" in a lower, flatter voice. Although the parent introduces both toys with the same sentence, a great deal of information is carried by the other qualities of the utterance. If the child trusts his parent's opinions, he is likely to be more interested in the first toy.

My project idea is to see whether a computer can use this kind of affective appraisal to judge items around it in a similar way. The specific case I have in mind is a person interacting with our robot Leonardo. The person could show Leo several items that are Leo's "toys." As the person tells Leo about each toy, the robot would use the emotional qualities of the person's voice to assign an affective value to that object. You could then ask Leo which object he would like to play with, and he would choose the one you sounded most interested in when introducing the objects.

Previous Work

This project builds on my previous work and on the work of the Robotic Life Group. Our robot Leonardo is a highly expressive robot intended to interact with people and learn from them. We are working to incorporate several kinds of social cues into these interactions, and this project continues in that tradition.

Implementation

The implementation of this system is based on the Ph.D. thesis of Raul Fernandez (Fernandez 2004). In his work, he identified 105 features of human speech and evaluated which of them are most useful for classifying affect. Using his thesis and conversations with him, I chose five features that can be computed with reasonable accuracy and that I have some chance of running in real time in our robotic system. The five features, illustrated in the MATLAB sketch after this list, are:

1. IQR of voiced F0 values. (inter-quartile range)

2. Skewness of voiced F0 values. (asymmetry of curve)

3. Fraction of voiced F0 values above the mean (a value <= 1). (asymmetry of curve)

4. The range between the max voiced F0 value and the mean.

5. The range between the mean and the min voiced F0 value. (The 95th- and 5th-percentile values are used instead of the true max and min to minimize outlier effects.)
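
To make the feature set concrete, here is a minimal MATLAB sketch of the computation, assuming the voiced F0 values for an utterance have already been extracted into a vector. This is my own illustration, not Fernandez's code; prctile and skewness are functions from the MATLAB Statistics Toolbox.

    function feats = f0_features(f0)
    % F0_FEATURES  Five pitch statistics for a vector of voiced F0 values.
    %   f0 is a vector of F0 estimates with unvoiced frames already removed.
    q = prctile(f0, [5 25 75 95]);        % robust substitutes for min/max
    m = mean(f0);
    feats = zeros(1, 5);
    feats(1) = q(3) - q(2);               % 1. inter-quartile range
    feats(2) = skewness(f0);              % 2. skewness (asymmetry)
    feats(3) = sum(f0 > m) / length(f0);  % 3. fraction of values above mean
    feats(4) = q(4) - m;                  % 4. 95th percentile minus mean
    feats(5) = m - q(1);                  % 5. mean minus 5th percentile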

The system is currently implemented in MATLAB. It allows the user to record three short samples of speech (presumably labeling an object, though not necessarily) and analyzes the speech to determine which object should be liked most and least. The interface is shown below:

[Screenshot of the MATLAB interface]

After labeling three objects, you hit the 'Calculate' button and MATLAB processes the sound files and labels the toys. The labels are as follows (a sketch of this step appears after the list):

most liked = green

intermediate = yellow

least liked = red
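
As a minimal sketch of the labeling step, assuming each of the three recordings has already been reduced to a single scalar "liking" score (s1, s2, and s3 here are placeholders, not variables from the actual code):

    % Order the three scored recordings and attach the color labels.
    scores = [s1 s2 s3];              % one scalar score per recording
    [sorted, order] = sort(scores);   % ascending: least liked first
    labels = cell(1, 3);
    labels{order(1)} = 'red';         % least liked
    labels{order(2)} = 'yellow';      % intermediate
    labels{order(3)} = 'green';       % most liked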

[The three recorded sound files and the resulting toy labels appeared here.]

Evaluation

The system was trained on data from four people. Each was recorded speaking the same sentence three times: once happy, once neutral, and once sad. This data was run through the analysis described above to determine the relative importance of the five features, and a weight was chosen for each feature accordingly. For each new sample, a weighted sum of the features is calculated, and the samples are ordered by the resulting scores.
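As a sketch of this scoring step: with the five statistics from f0_features above and a weight vector w, each utterance reduces to one number. The weight values below are placeholders, not the weights actually chosen from the training data.

    w = [0.3 0.2 0.2 0.15 0.15];   % placeholder weights (not the trained values)
    feats = f0_features(f0);       % five statistics for one utterance
    score = w * feats';            % weighted sum: higher = more positive affect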

The system was then tested on four more subjects. Two orderings were entirely correct, one confused positive and neutral, and one confused neutral and sad. In no case in this very limited test set were positive and negative affect confused. After listening to the misclassified recordings, I suspect it would be difficult even for a human to order them correctly without more context than the voice alone.

Future Work

The system currently runs very slowly in MATLAB. I would like to port it to C for speed, but so far I have not succeeded in doing so. If this can be done, the system can be integrated into our robot's code base, allowing Leonardo to assign an affective appraisal value to objects that a person introduces to him.

One aspect of the system that may have to change is that it currently only ranks the objects against one another. For an ongoing system, the software would have to assign a value to each object independent of other speech acts, which may be challenging given my experience with the system so far.

References

Breazeal, Cynthia. Designing Sociable Robots. Cambridge: MIT Press. 2002.

Fernandez, Raul. A Computational Model for the Automatic Recognition of Affect in Speech. Ph.D. Thesis. MIT Media Lab. 2004.

Kidd, Cory D. Improving Human-Robot Interaction with a Social Attention System. Submitted to DIS 2004. 2004. Available at http://web.media.mit.edu/papers/DIS 2004 Kidd preprint.pdf

Picard, Rosalind W. Affective Computing. Cambridge: MIT Press. 1997.

Acknowledgements

Thanks to Professor Roz Picard for her suggestions and direction.

Many thanks also to Raul Fernandez for the base MATLAB code and advice on how to use it to support this project.