MAS622J/1.126J: PATTERN RECOGNITION AND ANALYSIS
Final Project Proposal 2006
Cheng-Zhi Huang (huangcza -at- media -dot- mit -dot- edu)
Arm Gesture Classification for Human-Robot
Interaction:
Provided by Andrea Thomaz & Guy Hoffman
(alockerd,guy -at- media -dot- mit -dot- edu)
Scenario:
We want to interact with a virtual robot displayed
on a plasma screen by letting it recognize our arm gestures. There are 5 locations indicated in the
robot's environment on the screen, and we want to be able to communicate one of
the 3 gestures (grasping, pointing, pushing down) at
each of the 5 locations. In order
to achieve this goal, the human interactor wears a coat with markers attached
to the back and right arm, and is surrounded by Vicon Motion Capture Cameras to
track his/her movement.
Setup:
The Plasma Screen displays the virtual robot, and
indicates the 5 locations on the screen.
The human interactor stands facing the plasma screen,
and moves his/her arm towards one of the 5 locations and performs one of the 3
gestures in front of the location.
The 5 on-screen locations are located in the bottom half of the screen. They are labeled 1 to 5 from the left of
the screen to the right, and they span the full screen width.
The Vicon Motion Capture System consists of 9
cameras that surround a rectangular area.
The interactor stands within the rectangular area, and wears a coat with
markers attached. In order for the
motion of a marker to be captured, it must be visible to at least three
cameras. For each gesture, the
system starts capturing from the arm's natural position hanging next to the
body, continues to capture through the arm's raising and forward movement, and
stops as soon as the gesture is established and stable; the withdrawal motion
is not included. The cameras capture the positions of the markers at a rate of
100 Hz, but the network transfer rate is lower, so the collected frames arrive
at a lower effective rate.
11 markers are attached to the coat: 4 on the back, and the remaining 7
on the right "arm", namely 3 on the shoulder and upper arm, 2 on the
lower arm and 2 on the back of the hand.
2 sets of features are presented for each
frame. The first set is the lower-level
raw data captured from the 11 markers.
The second set is the higher-level calibrated skeletal data derived from
the raw data. The skeletal data
represents the positions of 4 joints, i.e. wrist, pinky, elbow, shoulder. Each feature is presented as a position
vector (x, y, z), with x as the horizontal axis, y as the height, and z as the
depth.
There will be an initialization set of data where
the human interactor points to the four corners of the
plasma screen.
Data set:
100 examples (files) for each of the
following 15 classes.
class 1-5 = grasping to location 1-5
class 6-10 = pointing to location 1-5
class 11-15 = pushing down at location 1-5
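The class numbering above can be expressed as a small mapping (the gesture indices 0/1/2 below are a hypothetical encoding, not part of the data format):

```python
def class_id(gesture, location):
    """Map (gesture, location) to the 1-based class label listed above.
    gesture: 0 = grasping, 1 = pointing, 2 = pushing down (hypothetical
    encoding); location: 1..5, left to right on the screen."""
    return gesture * 5 + location
```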
The data set for each class will be divided into
the training set, the test set and the validation set.
Input file format:
The first line of each file
gives the number of sequences (frames) recorded.
For each sequence, there are 4 skeletal points,
followed by 11 raw points. Each
point is presented as a comma-separated list, giving its x, y, z
positions. Points are separated from one another by a colon.
The 4 skeletal points are presented in the order
of wrist, pinky, elbow, shoulder.
However, the 11 raw points may be in any order, and one or more points
may be missing in a frame.
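Assuming one frame per line after the count (the layout within a file is not fully specified above), a minimal parser sketch:

```python
def parse_gesture_file(path):
    """Parse one gesture example. Assumes one frame per line: 4 skeletal
    points followed by up to 11 raw points, each point 'x,y,z', points
    separated by ':'. Missing raw points simply shorten the list."""
    with open(path) as f:
        n_frames = int(f.readline())
        frames = []
        for _ in range(n_frames):
            points = [tuple(float(v) for v in p.split(","))
                      for p in f.readline().strip().split(":") if p]
            frames.append({"skeletal": points[:4],  # wrist, pinky, elbow, shoulder
                           "raw": points[4:]})      # unordered, possibly incomplete
    return frames
```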
Goal:
By using the (i) skeletal data and (ii) raw data
separately, and (iii) in combination,
(1) classify 15 classes
of gestures (3 gestures at 5 locations).
(2) explore which parts
(sub-sequences / sub-gestures) of the gestures most distinguish between the
gestures (location-wise and/or by type of gesture).
(3) determine how soon
we can distinguish between the arm gestures (location-wise and/or by type of
gesture).
Compare how tasks (1)-(3) perform using each of the
data sets (i)-(iii).
Drawbacks:
The most distinguishing characteristics of our
"arm" gestures are actually "finger" gestures. However, in the setup, we do not have
markers on the fingers, but only two markers on the back of the hand, and the
rest are all on the arm and back.
Assumptions:
Data Collection:
We assume the inter-sequence interval is constant
throughout the whole data set.
We assume the human interactor's lower body does
not move, causing the back markers to only slightly "rotate".
We assume the human interactor will perform the
gestures "rationally", meaning the human interactor would perform the
gesture through a "shortest path" from his/her arm's preparation
gesture to the goal gesture. In addition,
as his/her arm travels along the path, he/she would try to maintain minimal
moments between the hand, arm, shoulder and back in order to minimize the
amount of work needed.
Feature Extraction:
We assume parts of the body are rigid bodies. For example, the distance between the
two markers on the lower arm would stay more or less the same because the
length of the arm does not change through time. However, since the markers are attached to clothing, the clothing will slide, contract and extend
with the movement, and thus the distance between the two markers will
actually vary slightly.
Classification:
We assume Gaussian class-conditional distributions when using
Bayesian decision theory.
Preprocessing:
Classification of raw points:
The raw points are presented in an unknown
order. We do not know which data
point corresponds to which marker.
However, we do know the joints that the skeletal
points refer to, and we can try to cluster the raw points around these points,
and since we also know that the distances between some of the
markers stay relatively constant (due to rigid bodies), we can also identify
those markers that are not the closest to the joints.
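The joint-based labeling above can be sketched as a greedy nearest-joint assignment (a hypothetical baseline, not a full clustering; function name is ours):

```python
import numpy as np

def assign_to_joints(raw_points, joints):
    """Label each unordered raw point with the index of its nearest
    skeletal joint (0=wrist, 1=pinky, 2=elbow, 3=shoulder). Both
    arguments are arrays of (x, y, z) rows."""
    raw = np.asarray(raw_points, float)
    jts = np.asarray(joints, float)
    # pairwise Euclidean distances: raw points x joints
    d = np.linalg.norm(raw[:, None, :] - jts[None, :, :], axis=2)
    return d.argmin(axis=1)
```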
We may also try to cluster the raw points by
themselves, without the help of the skeletal data.
We will perform clustering by calculating the
normalized dot product between the markers.
Furthermore, if some of the points of a raw data
sequence are missing, we may estimate its positions through the movement of
this particular point in preceding and following sequences, and also by the
tendencies of nearby points.
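A minimal sketch of the missing-point estimation, using only the point's own preceding and following frames (ignoring nearby markers, which a fuller version would also consult):

```python
import numpy as np

def interpolate_missing(track):
    """Fill missing (None) marker positions by linear interpolation
    between the nearest observed frames, per coordinate axis. `track`
    is a list of (x, y, z) tuples or None per frame."""
    arr = np.array([(np.nan,) * 3 if p is None else p for p in track], float)
    idx = np.arange(len(arr))
    for axis in range(3):
        col = arr[:, axis]
        good = ~np.isnan(col)
        col[~good] = np.interp(idx[~good], idx[good], col[good])
    return [tuple(p) for p in arr]
```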
Divide and conquer:
We may divide the classification problem into
two: (a) identifying the location
only, and (b) identifying the type of gesture (grasping, pointing, pushing)
only. For example, for (a), movement along a particular axis may be strongly
correlated with the location at which the gesture will be performed.
We may also subdivide the sequence to identify
sub-gestures, as described below.
Extract higher-level features:
Calculate the amount of movement of a marker
through sequences (time). For
example, calculate the Euclidean distance between successive points to study
the displacement, velocity and acceleration.
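The displacement, velocity and acceleration features above can be sketched as finite differences over one marker track (dt assumes the nominal 100 Hz rate; the effective rate may be lower, as noted in the setup):

```python
import numpy as np

def kinematics(track, dt=0.01):
    """Per-frame displacement, speed, and acceleration magnitudes for
    one marker track (n x 3 positions), via finite differences."""
    pos = np.asarray(track, float)
    disp = np.linalg.norm(np.diff(pos, axis=0), axis=1)  # Euclidean distance between successive points
    speed = disp / dt
    accel = np.diff(speed) / dt
    return disp, speed, accel
```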
Derive moments from the data points in each frame:
moment = F × d = force × perpendicular distance from the pivot.
For example, the moment of the upper arm may be calculated with the shoulder
joint as the pivot.
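For 3-D positions, the moment magnitude about a pivot can be computed with a cross product (a sketch; the force vector itself would have to be estimated, e.g. from marker accelerations and an assumed segment mass):

```python
import numpy as np

def moment_about(pivot, point, force):
    """Magnitude of the moment of `force` applied at `point` about
    `pivot` (all 3-vectors): |r x F|, with lever arm r = point - pivot."""
    r = np.asarray(point, float) - np.asarray(pivot, float)
    return float(np.linalg.norm(np.cross(r, np.asarray(force, float))))
```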
Classification of sub-gestures:
By using the raw data or skeletal data, and the
higher-level features, and domain-specific knowledge, we may want to derive
sub-gestures.
However, how many sub-gestures should we derive
and what kind of sub-gestures are meaningful to our
classification problem?
Moreover, sub-gestures may not require data from
all the markers.
We may want to isolate motion of various groups of
markers that represent different parts of the body. For example, we would divide the body
into 5 parts, namely the hand, the lower arm, the upper arm, the shoulder and
the back, and study their gestures separately.
We may also divide our overall gesture into
semantic sub-gestures, such as initial pose, chiefly raising motion, chiefly
forward motion, characteristic hand gesture (grasping, pointing, pushing
down). We may also divide gestures
by points of change of "directions" in high-level features.
Furthermore, the speed of a sub-gesture may
relate to its gesture (class).
Dimension Reduction:
At some point through the above procedures, we may
want to reduce dimensionality by using principal component analysis (PCA) or
Fisher's Linear Discriminant (FLD).
However, we must be very careful when choosing the principal components
in PCA, since the most distinguishing features that correspond to the hand
gesture may have the lowest variance, and we do not want to drop that. Hence, we may want to whiten the data
before reducing the dimensions, depending on the classification goal.
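A sketch of PCA with optional whitening via SVD (whitening rescales each retained component to unit variance, so a low-variance but discriminative direction is not dwarfed):

```python
import numpy as np

def pca(X, k, whiten=False):
    """Project feature vectors X (n x d) onto the top-k principal
    components. With whiten=True, each component is rescaled to unit
    sample variance."""
    Xc = X - X.mean(axis=0)               # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                     # scores on the top-k components
    if whiten:
        Z /= S[:k] / np.sqrt(len(X) - 1)  # divide by per-component std
    return Z
```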
Classification:
There are two ways to classify the gestures. One is to classify them blindly after
minimum preprocessing, as if we do not know anything about the data. The second is to use domain-specific
knowledge to analyze the data first, which may or may not lead to more
satisfying results.
K-Nearest Neighbours (KNN)
We will also try different feature spaces of
various dimensions for calculating the nearest neighbours.
Using the first approach, for each test example
we will compare it against each training example by summing the distances
between corresponding frames of the two sequences.
The total distance is normalized by the number of frames, and the
shortest distance reveals which class the test example belongs to.
Using the second approach, the feature space may
consist of normalized summations of moments of different parts of the body.
Moreover, we will try different values for K.
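The sequence-distance KNN above can be sketched as follows (helper names are ours; truncating to the shorter sequence stands in for a proper alignment such as dynamic time warping):

```python
import numpy as np
from collections import Counter

def seq_distance(a, b):
    """Average per-frame Euclidean distance between two feature
    sequences, truncated to the shorter length."""
    n = min(len(a), len(b))
    return float(np.mean(np.linalg.norm(
        np.asarray(a[:n], float) - np.asarray(b[:n], float), axis=1)))

def knn_classify(test_seq, train_seqs, train_labels, k=3):
    """Label a test sequence by majority vote among its k nearest
    training sequences under seq_distance."""
    d = [seq_distance(test_seq, s) for s in train_seqs]
    nearest = np.argsort(d)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```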
Bayesian Belief Network
In our initial design of the Bayes net, there are
three layers. The root layer consists of sub-gestures, which may be gestures of
different parts of the body, or some semantic sub-gesture. The layer below the root layer consists
of two nodes, location and hand gesture, and the last layer has 15 leaves for
the 15 classes.
n-order Markov Model (nMM)
We will start with a 1st-order Markov chain, building a Markov model for each of
the 15 classes. The
states may be sub-gestures, or high-level features.
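A per-class 1st-order model over discrete states can be sketched as follows (assuming the sub-gestures or features have already been discretized into state indices; add-one smoothing and a uniform initial distribution are our simplifications):

```python
import numpy as np

def fit_transitions(seqs, n_states):
    """Maximum-likelihood 1st-order transition matrix, with add-one
    smoothing, from the state-index sequences of one class."""
    T = np.ones((n_states, n_states))  # Laplace smoothing
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def log_likelihood(seq, T):
    """Log-probability of a state sequence under transition matrix T
    (uniform initial-state distribution assumed, so it is dropped)."""
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))
```

Classification would then pick, for a new sequence, the class whose fitted matrix gives the highest log-likelihood.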
Voting
Finally, with the results from the above methods,
we will perform a voting procedure to decide the final classification.
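A minimal majority-vote combiner over the per-method predictions (unweighted; a weighted scheme could favor the more reliable methods):

```python
from collections import Counter

def vote(predictions):
    """Combine class predictions from several classifiers by majority
    vote; ties go to the class seen first in the input list."""
    return Counter(predictions).most_common(1)[0][0]
```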
Plan of action:
We will first use only the skeletal data (i) to
solve problems (1) to (3).
Specifically, when we try to decide how soon an
arm gesture can be determined, we will try different numbers of frames
(lengths of time), and compare the results.
As time permits, we will attempt the 3 problems by
using also the raw data (ii) and then consider how to use the raw and skeletal
data in combination (iii) to maximize the results.
For the results, we will analyze the hit,
false alarm, miss and correct rejection rates where they apply.