Cheng-Zhi Anna Huang
Data and project design provided by Andrea Thomaz and Guy Hoffman
ABSTRACT / SUMMARY
We use k-nearest neighbor (k-NN) to classify three
gestures (grasping, pointing and pushing) performed at five locations, resulting
in 15 classes in total. With dynamic time warping (DTW), we
explore three questions. First, how well can we classify the classes using all
the basic features? Second, how soon do we know what gesture is being
performed? Third, which "parts" of the gesture contribute the most to classification?
Our data is collected by using the Vicon Motion Capture
System to track the motion of one person performing the gestures at a single
session. Our data consists of ten samples for each of the 15 classes. The
samples range from 44 frames to 84 frames, with a rounded average of 60 frames.
Each frame includes four sets of x, y, z coordinates that represent the
shoulder, elbow, wrist and pinky joints.
Since the number of frames varies for each sequence, we
align the sequences using pair-wise dynamic time warping. The cost between a
pair of aligned frames is the Euclidean distance over their 12 coordinates
(3 per joint), i.e. the square root of the sum of squared coordinate
differences; the distance between two sequences is the sum of these frame
costs along the warping path.
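The alignment and distance computation can be sketched as follows: a minimal DTW implementation, with the per-frame cost being the Euclidean distance over the 12 coordinates. The function name and array shapes are illustrative, not taken from the original project code.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two joint-coordinate sequences.

    a, b: arrays of shape (n_frames, 12) -- x, y, z for each of the four
    joints (shoulder, elbow, wrist, pinky). The per-frame cost is the
    Euclidean distance over the 12 coordinates, as described above.
    """
    n, m = len(a), len(b)
    # cost[i, j] = Euclidean distance between frame i of a and frame j of b
    cost = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m]
```

Because DTW may match one frame against several, sequences of 44 to 84 frames can be compared without resampling.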
We use 120 samples (8 per class, 40 per gesture, 24 per
location) to find the best k by performing leave-one-out validation. When using
the entire lengths of sequences for training, the highest average recognition
rates for the classes, gestures and locations are 86.67%, 92.50% and 93.33%,
while the best ks are 1, 1 and 2 respectively. Five samples in grasping are
misclassified as pointing. One pointing gesture is misclassified as grasping.
For pushing, one is misclassified as grasping, and one as pointing. Note that
no grasping or pointing samples are misclassified as pushing. This reveals that
under our distance metric, grasping and pointing are more similar than any
other pair of gestures. Next, the best ks are used to test the
remaining 30 samples (2 per class, 10 per gesture and 6 per location). When k
is set to 1, perfect recognition rates were achieved. When k is set to 2,
locations are perfectly classified, while classes and gestures both received an
average recognition rate of 96.67%.
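The model-selection step can be sketched as leave-one-out k-NN over a precomputed matrix of pairwise sequence distances. This is a hypothetical helper, and the tie-breaking rule (prefer the tied class with the nearest member) is an assumption; the report does not state one.

```python
import numpy as np

def loo_knn_accuracy(dist, labels, k):
    """Leave-one-out accuracy of k-NN from a precomputed distance matrix.

    dist: (n, n) matrix of pairwise sequence distances (e.g. from DTW);
    labels: length-n array of class labels. Vote ties are broken in favour
    of the tied class whose member is nearest (an assumption of this sketch).
    """
    labels = np.asarray(labels)
    n = len(labels)
    correct = 0
    for i in range(n):
        order = np.argsort(dist[i])
        order = order[order != i][:k]        # k nearest, excluding sample i itself
        votes = labels[order]
        classes, counts = np.unique(votes, return_counts=True)
        tied = classes[counts == counts.max()]
        pred = votes[np.isin(votes, tied)][0]   # nearest member of a tied class
        correct += pred == labels[i]
    return correct / n
```

With dist set to the 120x120 DTW matrix over the training samples, scanning k over a small range reproduces the model-selection loop described above.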
Moreover, we wanted to find out how soon a gesture can be
recognized. We use the best k for classes from above to perform 1,2-NN on
varying lengths of the gestures from their beginnings, at an increasing interval
of five percent of the total number of frames for each sample. During the
training phase, 1-NN outperforms random guessing from the very first frame
onward. The recognition rates fluctuate until half the total length, and then
increase monotonically until 95% of the total length. For the classification of
classes, 1,2-NN consistently achieves an average recognition rate of 35.8% at
one-fourth the total length, 45% at half the total length. For gestures, an
average recognition rate of 58.3% is reached at 15% of the total length, and
70% at full length. For locations, 1,2-NN starts with a stable average
recognition rate of 46.67% and achieves 50% at 40% of the total length. The best average
recognition rates for classes, gestures and locations are achieved at 95%, 95%,
100% total length, with best ks of 1, 1 and 2 respectively. For testing, at the
50% length, the classes, gestures and locations reach an average recognition
rate of 43.33%, 73.33% and 50%. Perfect recognition rates were reached at
90% total length for all three groups.
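The early-recognition experiment can be sketched as 1-NN on prefixes of the test sequences. Keeping the training sequences at full length is an assumption of this sketch (the report only states that classification used prefixes at 5% steps), and `seq_dist` stands in for whatever sequence distance, such as DTW, is used.

```python
import numpy as np

def prefix_1nn_accuracy(train_seqs, train_labels, test_seqs, test_labels,
                        fraction, seq_dist):
    """1-NN accuracy when only the first `fraction` of each test sequence
    has been observed.

    seq_dist: a sequence-distance function (e.g. DTW). Training sequences
    are kept at full length, an assumption of this sketch.
    """
    train_labels = np.asarray(train_labels)
    correct = 0
    for seq, label in zip(test_seqs, test_labels):
        n = max(1, int(round(fraction * len(seq))))   # prefix length in frames
        d = [seq_dist(seq[:n], t) for t in train_seqs]
        correct += train_labels[int(np.argmin(d))] == label
    return correct / len(test_seqs)
```

Sweeping fraction over 0.05, 0.10, ..., 1.00 traces out the recognition-rate curves described above.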
Furthermore, since the number of samples is small compared
to the dimensionality of our feature space (3 coordinates * 4 joints * 60
frames on average = 720 dimensions), we attempt to reduce the dimensions
and at the same time explore what features contribute the most to
classification. However, principal component analysis (PCA) failed by only
performing as well as random guess during the testing phase. Hence, we
approach dimension reduction ad hoc in three directions: training and testing
with only one joint, with only one axis, or with a sliding window approach to
find a portion of the frame series that contributes the most to classification.
During training, the highest average recognition rates using only one
joint were 90.83%, 94.17% and 97.5% for classes, gestures and locations, and
perfect results were achieved during testing. When using only one axis for
training, the highest average recognition rates were 86.67%, 92.50% and
95.00%, using the x, x and z axes respectively. When using these axes for testing, the
highest average recognition rates were 96.67% for classes, and perfect results
for gestures and locations. Note that we ran three separate tests using
the three best axes from above. We also trained a 1-NN on a sliding window
approach to find a sub-frame series that achieves the highest average
recognition rates. For classes, an average recognition rate of 95.00% was
reached for the window spanning 60-85% of the total length. For
gestures, 95.83% was achieved at the same window. For locations, a perfect
recognition rate was found at the window spanning 40-95% of the total length.
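The sliding-window search can be sketched as a grid over window start/end fractions, scoring each candidate window by leave-one-out 1-NN accuracy. The function name and the keep-at-least-one-frame clipping rule are illustrative assumptions, and `seq_dist` again stands in for the sequence distance (e.g. DTW).

```python
import numpy as np

def best_window_1nn(seqs, labels, seq_dist, step=0.05):
    """Grid-search a window [start, end], expressed as fractions of each
    sequence's length in 5% steps, maximizing leave-one-out 1-NN accuracy.

    seqs: list of (n_frames, d) arrays; seq_dist: a sequence-distance
    function. Returns (start, end, accuracy) of the best window found.
    """
    labels = np.asarray(labels)
    fracs = np.arange(0.0, 1.0 + 1e-9, step)
    best = (0.0, 1.0, -1.0)
    for s in fracs:
        for e in fracs:
            if e <= s:
                continue
            # clip every sequence to the [s, e] fraction of its frames,
            # keeping at least one frame per clip
            clips = [q[int(s * len(q)):max(int(s * len(q)) + 1,
                                           int(e * len(q)))] for q in seqs]
            dist = np.array([[seq_dist(a, b) for b in clips] for a in clips])
            np.fill_diagonal(dist, np.inf)   # leave-one-out: no self votes
            acc = (labels[dist.argmin(axis=1)] == labels).mean()
            if acc > best[2]:
                best = (s, e, acc)
    return best
```

The grid has about 200 windows at 5% steps, each requiring a full pairwise-distance matrix, so caching or vectorizing the distance computation matters in practice.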
In the near future, we hope to explore how Fisher's
Discriminant Analysis and Multi-linear Analysis may reduce the dimensions of the
feature space. We also hope to generalize our method to different
coordinate settings by extracting higher-level features.
Further illustrations of the background and setup can be found in the
proposal, as linked here.
Results shown in graphs can be found in the PowerPoint presentation, as linked here.