Cheng-Zhi Anna Huang
Arm Gestures for Human-Robot Interaction
Data and project design provided by Andrea Thomaz and Guy Hoffman   



We use k-nearest neighbor (k-NN) classification to distinguish three gestures (grasping, pointing, and pushing) performed at five locations, for a total of 15 classes.  With dynamic time warping (DTW) as the distance measure, we explore three questions.  First, how well can we classify the 15 classes using all the basic features?  Second, how early in a gesture can we recognize which gesture is being performed?  Third, which "parts" of a gesture contribute the most to classification?

Our data was collected with the Vicon Motion Capture System, tracking one person performing all gestures in a single session.  The data set consists of ten samples for each of the 15 classes.  The samples range from 44 to 84 frames, with an average of roughly 60 frames.  Each frame contains four sets of (x, y, z) coordinates, one each for the shoulder, elbow, wrist, and pinky joints.

Since the number of frames varies across sequences, we align the sequences pair-wise using dynamic time warping.  The per-frame distance is the Euclidean distance over the 12 coordinates (3 per joint), and the distance between two sequences is the sum of these per-frame distances along the optimal warping path.
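As a minimal sketch (not the original implementation), the DTW alignment with this per-frame Euclidean distance over the 12 coordinates can be written as:

```python
import numpy as np

def frame_dist(a, b):
    """Euclidean distance between two frames of 12 coordinates
    (x, y, z for each of shoulder, elbow, wrist, pinky)."""
    return np.sqrt(np.sum((a - b) ** 2))

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping between two (frames x 12) sequences.
    Returns the accumulated cost along the optimal warping path."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

This sketch pays the full frame distance on every step; the classifier only needs relative distances between sequence pairs, so no normalization by path length is applied.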

We use 120 samples (8 per class, 40 per gesture, 24 per location) to find the best k by leave-one-out validation.  When training on entire sequences, the highest average recognition rates for classes, gestures, and locations are 86.67%, 92.50%, and 93.33%, with best ks of 1, 1, and 2 respectively.  Five grasping samples are misclassified as pointing, and one pointing sample as grasping; of the pushing samples, one is misclassified as grasping and one as pointing.  Note that no grasping or pointing samples are misclassified as pushing.  This suggests that under our distance metric, grasping and pointing are more similar to each other than any other pair of gestures.  Next, the best ks are used to test the remaining 30 samples (2 per class, 10 per gesture, 6 per location).  With k set to 1, perfect recognition rates are achieved.  With k set to 2, locations are perfectly classified, while classes and gestures both reach an average recognition rate of 96.67%.
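A minimal sketch of the leave-one-out search, assuming a precomputed pairwise DTW distance matrix (the names here are illustrative, not from the original code):

```python
import numpy as np

def loo_accuracy(dist, labels, k):
    """Leave-one-out accuracy of k-NN given a precomputed pairwise
    distance matrix `dist` and integer class labels.  For each sample,
    its k nearest *other* samples vote on its class."""
    n = len(labels)
    correct = 0
    for i in range(n):
        order = np.argsort(dist[i])
        neighbours = [j for j in order if j != i][:k]  # drop the held-out sample
        votes = np.bincount([labels[j] for j in neighbours])
        if np.argmax(votes) == labels[i]:
            correct += 1
    return correct / n
```

Scanning `loo_accuracy` over candidate values of k (here 1 and 2) selects the best k for each of the three groupings (classes, gestures, locations).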

Moreover, we wanted to find out how soon a gesture can be recognized.  Using the best k for classes from above, we run 1-NN and 2-NN on prefixes of each gesture, growing the observed length in increments of five percent of each sample's total number of frames.  During training, 1-NN outperforms random guessing from the very first frame onward.  The recognition rates fluctuate until half the total length, then increase monotonically until 95% of the total length.  For the classification of classes, 1-NN and 2-NN reliably achieve an average recognition rate of 35.8% at one quarter of the total length, and 45% at half the total length.  For gestures, an average recognition rate of 58.3% is reached at 15% of the total length, and 70% at full length.  For locations, 1-NN and 2-NN start at a reliable average recognition rate of 46.67% and reach 50% at 40% of the total length.  The best average recognition rates for classes, gestures, and locations are achieved at 95%, 95%, and 100% of the total length, with best ks of 1, 1, and 2 respectively.  In testing, at 50% of the total length, classes, gestures, and locations reach average recognition rates of 43.33%, 73.33%, and 50%.  Perfect recognition rates are reached at 90% of the total length for all three groups.
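The prefix evaluation above can be sketched as follows; `classify` is a hypothetical stand-in for the DTW-based k-NN classifier, and the names are illustrative:

```python
def prefix(seq, fraction):
    """Keep the first `fraction` of a gesture's frames (at least one)."""
    n_keep = max(1, int(round(fraction * len(seq))))
    return seq[:n_keep]

def early_recognition_curve(test_set, classify, steps=20):
    """Recognition rate as a function of observed gesture length,
    evaluated at 5% increments when steps=20.  `test_set` is a list of
    (sequence, label) pairs; `classify` maps a (partial) sequence to a
    predicted label."""
    rates = []
    for s in range(1, steps + 1):
        frac = s / steps
        hits = sum(classify(prefix(x, frac)) == y for x, y in test_set)
        rates.append(hits / len(test_set))
    return rates
```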

Furthermore, since the number of samples is small compared to the dimensionality of our feature space (3 coordinates * 4 joints * 60 average frames = 720), we attempt to reduce the dimensionality and, at the same time, explore which features contribute the most to classification.  However, principal component analysis (PCA) failed, performing only as well as random guessing during testing.  Hence, we approach dimensionality reduction ad hoc in three directions: training and testing with only one joint, with only one axis, or with a sliding window to find the portion of the frame series that contributes the most to classification.  During training, the highest average recognition rates using only one joint were 90.83%, 94.17%, and 97.5% for classes, gestures, and locations, and perfect results were achieved during testing.  When using only one axis for training, the highest average recognition rates were 86.67%, 92.50%, and 95.00%, using the x, x, and z axes respectively.  When using these axes for testing, the highest average recognition rate was 96.67% for classes, with perfect results for gestures and locations.  Note that we ran three separate tests using the three best axes from above.  We also trained a 1-NN with a sliding-window approach to find a sub-series of frames that achieves the highest average recognition rates.  For classes, an average recognition rate of 95.00% is reached for the window spanning 60-85% of the total length; for gestures, 95.83% was achieved at the same window.  For locations, a perfect recognition rate was achieved for the window spanning 40-95% of the total length.
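The three ad-hoc reductions amount to column and frame slices of each (frames x 12) sequence.  A minimal sketch, assuming the joint and axis orderings from the data description (shoulder, elbow, wrist, pinky, each as x, y, z):

```python
import numpy as np

def joint_slice(seq, joint):
    """Keep one joint's (x, y, z) columns; joints assumed stored in the
    order shoulder, elbow, wrist, pinky (3 columns each)."""
    return seq[:, 3 * joint : 3 * joint + 3]

def axis_slice(seq, axis):
    """Keep one axis (0=x, 1=y, 2=z) across all four joints."""
    return seq[:, axis::3]

def window_slice(seq, start_frac, end_frac):
    """Keep the frames between start_frac and end_frac of the sequence,
    e.g. 0.60-0.85 for the best class window reported above."""
    n = len(seq)
    lo = int(round(start_frac * n))
    hi = max(lo + 1, int(round(end_frac * n)))
    return seq[lo:hi]
```

Each reduced representation is then fed to the same DTW-based k-NN pipeline, so the three reductions can be compared under identical training and testing splits.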

In the near future, we hope to explore how Fisher's Discriminant Analysis and Multi-linear Analysis may reduce the dimensions of the feature space.  We also hope to generalize our method to different coordinate settings by extracting higher-level features.


Further illustrations of the background and setup can be found in the proposal, as linked here.
Results shown in graphs can be found in the PowerPoint presentation, as linked here.