Final Project Proposal 2006

Cheng-Zhi Huang (huangcza -at- media -dot- mit -dot- edu)



Arm Gesture Classification for Human-Robot Interaction:

Provided by Andrea Thomaz & Guy Hoffman (alockerd,guy -at- media -dot- mit -dot- edu)





We want to interact with a virtual robot displayed on a plasma screen by letting it recognize our arm gestures.  There are 5 locations indicated in the robot's environment on the screen, and we want to be able to communicate one of the 3 gestures (grasping, pointing, pushing down) at each of the 5 locations.  In order to achieve this goal, the human interactor wears a coat with markers attached to the back and right arm, and is surrounded by Vicon Motion Capture Cameras to track his/her movement.  





The Plasma Screen displays the virtual robot, and indicates the 5 locations on the screen.  The human interactor stands facing the plasma screen, moves his/her arm towards one of the 5 locations, and performs one of the 3 gestures in front of that location.  The 5 on-screen locations lie in the bottom half of the screen.  They are labeled 1 to 5 from the left of the screen to the right, and they span the full screen width.


The Vicon Motion Capture System consists of 9 cameras that surround a rectangular area.  The interactor stands within the rectangular area and wears a coat with markers attached.  For the motion of a marker to be captured, it must be visible to at least three cameras.  For each gesture, the system starts capturing with the arm in its natural position, hanging next to the body, and continues through the raising and forward movement.  Capture stops as soon as the gesture is established and stable, so the withdrawal motion is not included.  The cameras capture the positions of the markers at a rate of 100 Hz, but the network transfer rate is lower, so the collected frames arrive at a lower rate.


11 markers are attached to the coat: 4 on the back and 7 on the right "arm", of which 3 are on the shoulder and upper arm, 2 on the lower arm and 2 on the back of the hand.


2 sets of features are presented for each frame.  The first set is the lower-level raw data captured from the 11 markers.  The second set is the higher-level calibrated skeletal data derived from the raw data.  The skeletal data represents the positions of 4 joints, i.e. wrist, pinky, elbow, shoulder.  Each feature is presented as a position vector (x, y, z), with x as the horizontal axis, y as the height, and z as the depth.


There will be an initialization set of data where the human interactor points to the four corners of the plasma screen.



Data set:


100 examples (files) for each of the following 15 classes.

class 1-5 = grasping to location 1-5 

class 6-10 = pointing to location 1-5

class 11-15 = pushing down at location 1-5
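
The class numbering above can be decoded into a (gesture, location) pair with simple integer arithmetic.  The following is a minimal Python sketch; the function and constant names are hypothetical and only illustrate the labeling scheme.

```python
GESTURES = ("grasping", "pointing", "pushing down")
NUM_LOCATIONS = 5


def decode_class(class_id):
    """Map a class label 1..15 to a (gesture, location) pair."""
    if not 1 <= class_id <= len(GESTURES) * NUM_LOCATIONS:
        raise ValueError("class_id must be between 1 and 15")
    gesture_index, location_index = divmod(class_id - 1, NUM_LOCATIONS)
    return GESTURES[gesture_index], location_index + 1


def encode_class(gesture, location):
    """Inverse of decode_class: (gesture, location) -> class label."""
    if not 1 <= location <= NUM_LOCATIONS:
        raise ValueError("location must be between 1 and 5")
    return GESTURES.index(gesture) * NUM_LOCATIONS + location
```

Such a mapping also makes it easy to relabel the same files by location only or by gesture type only.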


The data set for each class will be divided into the training set, the test set and the validation set.



Input file format:


The first line of each file gives the number of sequences (frames) recorded.

For each sequence, there are 4 skeletal points, followed by 11 raw points.  Each point is presented as a comma-separated list giving its x, y, z position.  Points are separated from one another by a colon.


The 4 skeletal points are presented in the order of wrist, pinky, elbow, shoulder.  However, the 11 raw points may be in any order, and one or more points may be missing in a frame.
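
As an illustration of the file format described above, a reader might be sketched in Python as follows.  It assumes (this is not stated in the format description) that each sequence occupies one line after the frame count, and that missing raw points are simply absent from the line.  The names JOINTS, parse_point and parse_gesture are hypothetical.

```python
JOINTS = ("wrist", "pinky", "elbow", "shoulder")


def parse_point(text):
    """Parse one 'x,y,z' field into a tuple of floats."""
    x, y, z = (float(value) for value in text.split(","))
    return (x, y, z)


def parse_gesture(text):
    """Parse the contents of one example file into a list of frames.

    Each frame is a dict with the 4 named skeletal points and a list of
    the raw points that were present (in file order, unlabeled).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    num_frames = int(lines[0])
    frames = []
    for line in lines[1:1 + num_frames]:
        points = [parse_point(field) for field in line.split(":") if field.strip()]
        if len(points) < len(JOINTS):
            raise ValueError("frame has fewer than 4 skeletal points")
        frames.append({
            "skeletal": dict(zip(JOINTS, points[:len(JOINTS)])),
            "raw": points[len(JOINTS):],
        })
    if len(frames) != num_frames:
        raise ValueError("file contains fewer frames than its header states")
    return frames
```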






By using the (i) skeletal data and (ii) raw data separately, and (iii) in combination,


(1) classify 15 classes of gestures (3 gestures at 5 locations).

(2) explore which parts (sub-sequences / sub-gestures) of the gestures best distinguish between the gestures (location-wise and/or by type of gesture).

(3) determine how soon we can distinguish between the arm gestures (location-wise and/or by type of gesture).


Compare how (1)-(3) perform when using each of the data sets (i)-(iii).






The most distinguishing characteristics of our "arm" gestures are actually "finger" gestures.  However, in the setup, we do not have markers on the fingers, but only two markers on the back of the hand, and the rest are all on the arm and back.






Data Collection:

We assume the inter-sequence interval is constant throughout the whole data set.

We assume the human interactor's lower body does not move, causing the back markers to only slightly "rotate".

We assume the human interactor will perform the gestures "rationally", meaning that he/she takes the "shortest path" from the arm's preparation gesture to the goal gesture.  In addition, as the arm travels along the path, he/she tries to keep the moments between the hand, arm, shoulder and back to a minimum in order to minimize the amount of work done.


Feature Extraction:

We assume parts of the body are rigid bodies.  For example, the distance between the two markers on the lower arm would stay more or less the same because the length of the arm does not change through time.  However, since the markers are attached to clothing, and the clothing slides, contracts and extends with the movement, the distance between the two markers will actually vary slightly.



We assume Gaussian distributions when using Bayesian decision theory.






Classification of raw points:

The raw points are presented in an unknown order.  We do not know which data point corresponds to which marker.

However, we do know which joints the skeletal points refer to, so we can try to cluster the raw points around these points.  Since we also know that the relationships (distances) between some of the markers stay relatively constant (due to rigid bodies), we can also identify those markers that are not the closest to the joints.

We may also try to cluster the raw points by themselves, without the help of the skeletal data.

We will perform clustering by calculating the normalized dot product between the markers.
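
The normalized dot product mentioned above is the cosine of the angle between two vectors.  A minimal Python sketch is given below.  Which vectors are compared is left open here; one possibility, stated only as an assumption, is to express each marker position relative to a skeletal joint before comparing.  The function name is hypothetical.

```python
import math


def normalized_dot(u, v):
    """Dot product of u and v after scaling both to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("normalized dot product is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate nearly perpendicular vectors, and values near -1 indicate opposite directions.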

Furthermore, if some of the points of a raw data sequence are missing, we may estimate their positions from the movement of each such point in the preceding and following sequences, and also from the tendencies of nearby points.
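
One simple way to estimate a missing point from the preceding and following sequences is linear interpolation.  The sketch below assumes that the raw points have already been assigned to markers, so that each marker has a track with None where it was not captured; it does not use the tendencies of nearby points.  The function name is hypothetical.

```python
def fill_missing(track):
    """Fill None entries in a list of (x, y, z) positions.

    Gaps between two observed frames are linearly interpolated; gaps at
    the start or end of the track repeat the nearest observed position.
    """
    known = [i for i, point in enumerate(track) if point is not None]
    if not known:
        raise ValueError("track has no observed positions")
    filled = list(track)
    for i, point in enumerate(track):
        if point is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            a, b = before[-1], after[0]
            t = (i - a) / (b - a)
            filled[i] = tuple(pa + t * (pb - pa) for pa, pb in zip(track[a], track[b]))
        else:
            nearest = after[0] if after else before[-1]
            filled[i] = track[nearest]
    return filled
```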



Divide and conquer:

We may divide the classification problem into two:  (a) identifying the location only, and (b) identifying the type of gesture (grasping, pointing, pushing) only.  For example for (a), movement along a particular axis may be strongly correlated with the location at which the gesture will be performed.

We may also subdivide the sequence to identify sub-gestures, as described below.



Extract higher-level features:

Calculate the amount of movement of a marker through sequences (time).  For example, calculate the Euclidean distance between successive points to study the displacement, velocity and acceleration.
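
The displacement, velocity and acceleration features can be sketched with finite differences, as below.  The frame interval dt is an assumption: it is taken to be constant, and its value is not known because the collected frame rate is lower than the 100 Hz capture rate.  Note that the sketch computes speed and change of speed (scalars), not vector velocity and acceleration.  The function names are hypothetical.

```python
import math


def displacements(track):
    """Euclidean distance between each pair of successive positions."""
    return [math.dist(p, q) for p, q in zip(track, track[1:])]


def speeds(track, dt):
    """Displacement per unit time, assuming a constant frame interval dt."""
    return [d / dt for d in displacements(track)]


def accelerations(track, dt):
    """Change in speed per unit time between successive frame pairs."""
    v = speeds(track, dt)
    return [(b - a) / dt for a, b in zip(v, v[1:])]
```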

Derive moments from the data points in each frame.

moment = F x d = force x perpendicular distance from the pivot.

For example, the moment of the upper arm may be calculated with the shoulder joint as the pivot.
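
In vector form, the moment about a pivot is the cross product of the lever arm (the vector from the pivot to the point where the force acts) with the force; its magnitude equals the force times the perpendicular distance from the pivot.  The sketch below shows this calculation.  The motion capture data contains positions only, so the force vector is an assumption that has to be supplied separately (for instance, an estimated weight of an arm segment).  The function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def moment_about(pivot, point, force):
    """Moment vector of `force` applied at `point`, taken about `pivot`."""
    lever_arm = tuple(p - o for p, o in zip(point, pivot))
    return cross(lever_arm, force)


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))
```

For example, with the shoulder joint as the pivot and the elbow joint as the point of application, moment_about(shoulder, elbow, force) gives a moment for the upper arm under the assumed force.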



Classification of sub-gestures:

By using the raw data or skeletal data, and the higher-level features, and domain-specific knowledge, we may want to derive sub-gestures.

However, how many sub-gestures should we derive and what kind of sub-gestures are meaningful to our classification problem? 

Moreover, sub-gestures may not require data from all the markers.

We may want to isolate the motion of various groups of markers that represent different parts of the body.  For example, we would divide the body into 5 parts, namely the hand, the lower arm, the upper arm, the shoulder and the back, and study their gestures separately.

We may also divide our overall gesture into semantic sub-gestures, such as initial pose, chiefly raising motion, chiefly forward motion, and characteristic hand gesture (grasping, pointing, pushing down).  We may also divide gestures at points where the "direction" of a high-level feature changes.

Furthermore, the speed of a sub-gesture may relate to its gesture (class).



Dimension Reduction:

At some point in the above procedures, we may want to reduce dimensionality by using principal component analysis (PCA) or Fisher's Linear Discriminant (FLD).  However, we must be very careful when choosing the principal components in PCA, since the most distinguishing features that correspond to the hand gesture may have the lowest variance, and we do not want to drop them.  Hence, we may want to whiten the data before reducing the dimensions, depending on the classification goal.






There are two ways to classify the gestures.  One is to classify them blindly after minimum preprocessing, as if we do not know anything about the data.  The second is to use domain-specific knowledge to analyze the data first, which may or may not lead to more satisfying results. 


K-Nearest Neighbours (KNN)

We will also try different feature spaces of various dimensions for calculating the nearest neighbours. 

Using the first approach, for each test example, we will find the training examples that have the closest number of sequences, and then sum the distances between corresponding sequences of the training example and the test example.  The total distance is normalized by the number of sequences, and the training example at the shortest distance indicates which class the test example belongs to.

Using the second approach, the feature space may consist of normalized summations of the moments of different parts of the body.

Moreover, we will try different values for K.
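
The first approach can be sketched in Python as follows.  Each frame is assumed to be a flat feature vector (for example, the 12 coordinates of the 4 skeletal points).  As a simplification that differs from the description above, the sketch compares only the frames that both examples share, rather than first selecting training examples of similar length.  The function names are hypothetical.

```python
import math
from collections import Counter


def sequence_distance(a, b):
    """Mean frame-wise Euclidean distance over the frames both examples share."""
    shared = min(len(a), len(b))
    if shared == 0:
        raise ValueError("both examples must contain at least one frame")
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / shared


def knn_classify(test_example, training_set, k=1):
    """Label a test example by majority vote among its k nearest training examples.

    training_set is a list of (example, label) pairs.
    """
    ranked = sorted(
        training_set,
        key=lambda item: sequence_distance(test_example, item[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```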


Bayesian Belief Network

In our initial design of the Bayes net, there are three layers.  The root layer consists of sub-gestures, which may be gestures of different parts of the body or semantic sub-gestures.  The layer below the root layer consists of two nodes, location and hand gesture, and the last layer consists of 15 leaves for the 15 classes.


n-order Markov Model (nMM)

We will start with a 1st-order Markov chain.  We will build a Markov model for each of the 15 classes.  The states may be sub-gestures or high-level features.
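
A per-class 1st-order Markov chain classifier might be sketched as below.  It assumes that each example has already been converted into a sequence of integer state indices (how to define the states is still open), and it adds additive smoothing so that unseen transitions do not get zero probability; the smoothing is an addition of this sketch.  The initial-state distribution is ignored for brevity.  The function names are hypothetical.

```python
import math


def fit_markov_chain(sequences, num_states, smoothing=1.0):
    """Estimate a row-normalized transition matrix from state sequences."""
    counts = [[smoothing] * num_states for _ in range(num_states)]
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood(seq, transitions):
    """Log-probability of the transitions in seq under one class model."""
    return sum(
        math.log(transitions[current][following])
        for current, following in zip(seq, seq[1:])
    )


def classify(seq, models):
    """Return the label whose model gives seq the highest log-likelihood.

    models maps each class label to its transition matrix.
    """
    return max(models, key=lambda label: log_likelihood(seq, models[label]))
```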



Finally, with the results from the above methods, we will perform a voting procedure to decide the final class classifications.
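
The voting procedure could be as simple as a majority vote over the class labels proposed by the individual methods, as in the sketch below.  The tie-breaking rule (the label that appears first in the list wins) is an assumption of this sketch, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]
```

For example, the list might hold the labels proposed by KNN, the Bayes net and the Markov model for one test example, in that order.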




Plan of action:


We will first use only the skeletal data (i) to solve problems (1) to (3).

Specifically, when we try to decide how soon an arm gesture can be determined, we will try different numbers of sequences (lengths of time), and compare the results.
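
One way to organize this comparison is to truncate every test example to a fraction of its length and measure accuracy at each fraction.  The sketch below works with any classifier passed in as a function; the choice of fractions and the function names are hypothetical.

```python
def truncate(example, fraction):
    """Keep the leading fraction of an example, but always at least one frame."""
    keep = max(1, round(len(example) * fraction))
    return example[:keep]


def accuracy_by_prefix(classify, test_set, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Accuracy of `classify` when it only sees the start of each example.

    test_set is a list of (example, label) pairs; classify maps an
    example to a predicted label.
    """
    results = {}
    for fraction in fractions:
        correct = sum(
            classify(truncate(example, fraction)) == label
            for example, label in test_set
        )
        results[fraction] = correct / len(test_set)
    return results
```

The smallest fraction at which accuracy stops improving gives one estimate of how soon a gesture can be determined.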

As time permits, we will attempt the 3 problems using the raw data (ii) as well, and then consider how to use the raw and skeletal data in combination (iii) to improve the results.

For the results, we will analyze the hit, false alarm, miss and correct rejection rates where they apply.
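
For a multi-class problem, these four rates can be computed per class by treating one class as the target and all others as non-targets.  The sketch below takes that one-versus-rest view, which is an assumption of this sketch; the function name is hypothetical.

```python
def detection_rates(true_labels, predicted_labels, target):
    """Hit, miss, false alarm and correct rejection rates for one target class.

    Hit and miss rates are fractions of the examples that truly belong
    to the target class; false alarm and correct rejection rates are
    fractions of the examples that do not.  A rate is None when its
    denominator is zero.
    """
    hits = misses = false_alarms = correct_rejections = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target:
            if predicted == target:
                hits += 1
            else:
                misses += 1
        elif predicted == target:
            false_alarms += 1
        else:
            correct_rejections += 1
    positives = hits + misses
    negatives = false_alarms + correct_rejections
    return {
        "hit": hits / positives if positives else None,
        "miss": misses / positives if positives else None,
        "false_alarm": false_alarms / negatives if negatives else None,
        "correct_rejection": correct_rejections / negatives if negatives else None,
    }
```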