Classifying Individuality from Movement Patterns in Video

Sophia Yuditskaya

MAS.622J Final Project

Fall 2008



Introduction

The Human Speechome Project has collected nearly 100,000 hours of raw video recordings over the past three years, for the purpose of enabling longitudinal studies of behavior and child development in a situated context [1]. To gather meaningful information from this vast sea of unscripted, complex real-world data, it is essential to find effective methods for automating computer vision tasks such as object recognition, tracking movement patterns through space over time, and identifying which movement patterns correspond to which person in the video. The ability to computationally associate a specific identity (or an anonymous but distinct "individuality") with observed patterns in video is crucial for enabling automated analysis of interaction between people. Such a mechanism is relevant to problems like distinguishing child from adult data in longitudinal studies of child language development or autism behavior, and classifying the foot-traffic patterns of customers and employees in a bank.

Inextricable from the study of movement in space is the subdivision of the space into meaningful regions. The specific activity a person is doing is usually highly correlated with that person's location. If we want to apply classification of individuality to the study of interaction, we need to focus on meaningful regions that segment movement into self-contained action and activity primitives [5]. Among other things, it is useful to know which classification method works better or worse within each region. It is also useful to discover which region subdivisions may be more effective for HMMs derived from region sequences.

In this analysis, I attempted to answer two questions:

1. Which method is best for classifying a person's individuality from their location and movement patterns?

To answer this question, I experimented with HMMs, SVMs, Fisher Linear Discriminant, Generalized Linear Discriminant, and k-Nearest Neighbor.

2. How do different methods perform within a region, and which choice of regions provides the best overall discriminability?

To answer this question, I worked with 6 different region subdivision strategies: a 4x4 grid, a meaningful grouping of the grid into 9 regions, and k-Means Clustering for k = 3, 5, 7, and 9.

Dataset

The data used in this analysis takes the form of a set of (x,y) points and timestamps representing the walking movements of a person across the floor within a video segment. I collected this data from the Human Speechome Project's raw video data, using a computer vision tracking tool developed by a fellow graduate student in my research group. The data I collected for this project consists of one day's worth of tracking data for both Rupal and Deb, within a single room -- the kitchen.

To collect data, I ran the tracker tool on a section of video data; clicking on the desired person began the tracking process for that person. The tracker required supervision, as it is only semi-automated: sometimes the tracking element would jump over to another person passing by the person being tracked, and at other times it would mistakenly decide to track a pot on the stove or the person's reflection in the microwave door and get stuck there. The tracker also had trouble keeping up with sudden increases in walking speed. Additionally, it could not see the child well at all, which left me without the (semi-)automation required to collect tracking data for the child. For this reason, I was left with only two of the three classes of data to analyze: Rupal and Deb.

The complete dataset is shown in Figure 1.
Figure 1. The complete dataset.


Pre-processing of the Data

Each minute's worth of collected XY data was extracted separately and stored as a distinct sequence. Sequences varied in length depending on whether the tracker tool had crashed at some point during that minute, but generally consisted of ~300 samples. The first 5 samples in each minute sequence were discarded, because whenever the tracker was restarted (after a crash or when resuming data collection after a hiatus), it took a few clicks to get it to latch onto the correct person.

There were 154 such sequences for Rupal and 146 sequences for Deb. The number of individual samples for each person was roughly equivalent: ~32,000 to 33,000 data points.

First, the ordering of the sequences was randomized to prevent bias from entering the training and testing process. Next, a region number was computed for each XY data point. Six different region subdivision strategies were used for this purpose: a 4x4 grid yielding 16 rectangular regions (Figure 2), a meaningful grouping of these original 16 regions into 9 new regions (Figure 3), and k-Means Clustering for k = 3, 5, 7, and 9 (Figure 4).

The 9-region meaningful grouping roughly subdivided the kitchen as follows:

Region 1: Baby Chair
Region 2: Coffee station
Region 3: Sink and Dishwasher
Region 4: Door to dining room
Region 5: Table
Region 6: Stove
Region 7: Cabinets at lower left corner
Region 8: Refrigerator and Pantry Cabinet
Region 9: Door to hallway

K-Means Clustering was used to derive meaningful regions based on natural clusters in the data. The kmeans() Matlab function was used to perform the clustering.
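As a concrete illustration, the following is a minimal Matlab sketch of how a region number can be assigned to each sample under the grid and k-means strategies. The variable names (xy, gridRegion, kmeansRegion) are illustrative rather than taken from the actual project code.

    % Assign a region number to each (x,y) sample (illustrative data).
    xy = rand(1000, 2) * 100;              % N-by-2 tracked floor coordinates

    % 4x4 grid: bin each coordinate into 4 intervals, yielding regions 1..16.
    xmin = min(xy(:,1)); xmax = max(xy(:,1));
    ymin = min(xy(:,2)); ymax = max(xy(:,2));
    col = min(floor((xy(:,1) - xmin) / (xmax - xmin) * 4), 3);
    row = min(floor((xy(:,2) - ymin) / (ymax - ymin) * 4), 3);
    gridRegion = row * 4 + col + 1;

    % k-means regions: kmeans() returns one cluster index per sample.
    kmeansRegion = kmeans(xy, 5);          % e.g. k = 5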

Figure 2. 4x4 grid region subdivision.
Figure 3. Meaningful grouping of the 4x4 grid into 9 regions.
Figure 4. K-means clustering for k = 5.

Methods

The pattern classification techniques used in this analysis are Hidden Markov Models (HMM) and Markov Processes, Support Vector Machines (SVM), the Fisher Linear Discriminant, the Generalized Linear Discriminant (GLD), and k-Nearest Neighbor (kNN) Classification. The methodology with which each of these techniques was applied is described below.

Hidden Markov Models

Only the computed region sequences were used for the HMM classification analysis. Using the Baum-Welch training algorithm [2], a distinct HMM was trained for each person on that person's first 120 sequences, leaving 34 sequences for Rupal and 26 for Deb for testing. Each test sequence was applied to the trained HMMs of both people; the Forward algorithm [2] was used to compute the log probability of each individual's HMM having produced that test sequence. The HMM that produced the larger log probability determined the identity labeling for that test sequence; in other words, a correct classification occurred when the larger log probability for an individual's test sequence was produced by that individual's HMM. For example, if Rupal's test sequence produced a higher log probability on Rupal's HMM than on Deb's HMM, that was marked as a correct classification. The number of correct classifications was tallied over all 60 test sequences, and probability of error was calculated as (# misclassifications) / (# test sequences).
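The following is a minimal sketch of this train-and-compare procedure, assuming Kevin Murphy's HMM Toolbox for Matlab (the package ultimately used for the HMM analysis, as described below) is on the path. The sequence variables and the choice of 9 states/symbols are illustrative.

    % seqsRupal, seqsDeb: cell arrays of region-index sequences (illustrative).
    Q = 9;  O = 9;                          % # hidden states, # region symbols
    prior0    = normalise(rand(Q, 1));
    transmat0 = mk_stochastic(rand(Q, Q));
    obsmat0   = mk_stochastic(rand(Q, O));
    [LL, priorR, transR, obsR] = dhmm_em(seqsRupal(1:120), ...
        prior0, transmat0, obsmat0, 'max_iter', 50);
    [LL, priorD, transD, obsD] = dhmm_em(seqsDeb(1:120), ...
        prior0, transmat0, obsmat0, 'max_iter', 50);

    % Classify a held-out test sequence by the larger log likelihood.
    llR = dhmm_logprob(testSeq, priorR, transR, obsR);
    llD = dhmm_logprob(testSeq, priorD, transD, obsD);
    if llR > llD, label = 'Rupal'; else label = 'Deb'; end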

Training HMMs using Baum-Welch requires defining an initial HMM. This initial HMM can simply be a set of random values, but because Baum-Welch only finds a local optimum [3], it is preferable for the initial HMM to be meaningfully derived, if possible. As discovered in this analysis, deriving a good initial HMM for a given dataset can be tricky. The time frame for this project allowed for three initialization options to be considered:
1. Randomized initial HMM with 3 states
2. Randomized initial HMM with # states = # regions
3. Markov Process with transitional and prior probabilities derived from training data.

The latter requires further explanation. A Markov Process differs from an HMM in that it does not have a hidden layer of states. Because all states are directly observable, each observation value corresponds to a single state; in other words, each state has only one possible output. In the initial HMM provided as input to Baum-Welch, the observation probability matrix for a Markov Process model is therefore the identity matrix.

My primary reason for choosing a Markov Process as one of the initialization options is that this model made it possible to derive meaningful transitional and prior probabilities directly from the training data. The number of transitions from each region to every other region was tallied and then normalized to obtain the matrix of transitional probabilities. Prior probabilities were calculated from the region affinity numbers: the number of the person's samples within each region was divided by the total number of the person's samples.
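A minimal sketch of this derivation, assuming seqs is a cell array of region-index training sequences and R is the number of regions (names are illustrative):

    % Tally transitions and region affinities over the training sequences.
    R = 9;
    transCounts = zeros(R, R);
    priorCounts = zeros(R, 1);
    for i = 1:numel(seqs)
        s = seqs{i}(:);                    % force column vector
        priorCounts = priorCounts + accumarray(s, 1, [R 1]);
        transCounts = transCounts + accumarray([s(1:end-1) s(2:end)], 1, [R R]);
    end
    % Normalize each row so a state's outgoing probabilities sum to 1
    % (eps guards against regions with no observed transitions).
    transmat0 = (transCounts + eps) ./ repmat(sum(transCounts + eps, 2), 1, R);
    prior0  = priorCounts / sum(priorCounts);   % region-affinity priors
    obsmat0 = eye(R);                           % one observable symbol per state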

To implement the HMM analysis, I started out using the Jahmm library of HMM functions, due to its promise of scaling to avoid underflows. However, because Jahmm proved intractable due to various bugs (one of which I was able to fix in the Jahmm source code), and because of the utility of Matlab for data analysis (as opposed to Java), Kevin Murphy's HMM Toolbox for Matlab was the package that I ultimately used for the HMM analysis.

Static Discriminants

In addition to the time-dynamic analysis given by HMMs, a separate static (time-independent) classification analysis was done on the XY data, using four different discriminants: Support Vector Machines, Fisher Linear Discriminant, Generalized Linear Discriminant, and k-Nearest Neighbor. These methods were applied to the overall XY data, as well as per-region for each of the 6 different region subdivisions.

Support Vector Machines

In Figure 1 we can see that the XY data is not linearly separable in two dimensions. SVMs were chosen as one of the methods for this classification because, through kernel functions, they can find linear decision boundaries in a higher-dimensional feature space even when the data is not linearly separable in its native dimensions. In addition, SVMs resist overfitting and generally do well in real-world applications.
The SVM functions provided by the Matlab Bioinformatics Toolbox were used for the SVM analysis. First, due to the memory limitations of these SVM functions, 5000 data points were selected randomly from each person’s dataset as the input data. Next, the data was labeled, combined, and input to the crossvalind() function, which randomly divided the data into training and testing sets. The training data was passed to svmtrain(), which returned as output the trained SVM parameters. Finally, the test data was classified using svmclassify(), taking the trained parameters as input. The probability of error was computed using the classperf() function. The final probability of error calculation was the average of 10 trials, each trial selecting a different random sampling of the overall data as input.
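A condensed sketch of this pipeline, using the Bioinformatics Toolbox functions named above (the data variables and the 50/50 split are illustrative):

    % xyRupal, xyDeb: 5000 randomly sampled points per person (illustrative).
    data   = [xyRupal; xyDeb];
    labels = [ones(5000, 1); 2 * ones(5000, 1)];
    [trainIdx, testIdx] = crossvalind('HoldOut', numel(labels), 0.5);
    svmStruct = svmtrain(data(trainIdx, :), labels(trainIdx), ...
                         'Kernel_Function', 'rbf');
    pred = svmclassify(svmStruct, data(testIdx, :));
    cp = classperf(labels(testIdx), pred);
    pError = cp.ErrorRate;                 % averaged over 10 such trials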

All available kernel functions were tested on the overall XY data: linear, quadratic, polynomial, Gaussian radial basis, and multilayer perceptron. The kernel that performed best, the radial basis function, was used in the per-region analysis.

Fisher

The Fisher Linear Discriminant finds the linear projection of the data that provides the best discriminability between classes. Fisher assumes that all class distributions are Gaussian. Thus, the mean and covariance of each person's complete dataset were used to calculate the Fisher projection w and a Gaussian PDF for each person. Each sample was then projected onto w and plugged into both Gaussian PDFs. The PDF that produced the larger probability value determined the classification of that sample point. Probability of error was computed as (# misclassifications) / (# of sample points).
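A minimal Matlab sketch of this procedure (xyRupal, xyDeb, and xyTest are illustrative names):

    % Fisher projection from the class means and covariances.
    m1 = mean(xyRupal)';  m2 = mean(xyDeb)';
    S1 = cov(xyRupal);    S2 = cov(xyDeb);
    w = (S1 + S2) \ (m1 - m2);             % projection direction

    % Fit a 1-D Gaussian to each class in the projected space, then classify
    % each test sample by the larger PDF value.
    z1 = xyRupal * w;  z2 = xyDeb * w;
    p1 = normpdf(xyTest * w, mean(z1), std(z1));
    p2 = normpdf(xyTest * w, mean(z2), std(z2));
    pred = 1 + (p2 > p1);                  % 1 = Rupal, 2 = Deb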

Generalized Linear Discriminant

GLD is another method that is useful for classifying data that is not easily separable in its native dimensions (2D in this case), because it computes a decision boundary in a higher-dimensional space [3]. That optimal higher dimensionality is a parameter k that was trained using leave-one-out validation. For my analysis, I used the GLD training and testing functions that I implemented for Problem Set 5b. Training was done on 1/4 of each person's dataset; for the overall XY data, this came out to a total of 16,250 training samples combined.
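My Problem Set 5b functions are not reproduced here, but the following sketch conveys the idea under one common formulation: expand each (x,y) point into polynomial terms up to degree k, then fit a linear discriminant in that feature space by least squares.

    % Polynomial feature expansion of degree k for an N-by-2 matrix xyTrain.
    k = 3;                                 % chosen by leave-one-out validation
    Phi = ones(size(xyTrain, 1), 1);
    for d = 1:k
        for i = 0:d
            Phi = [Phi, xyTrain(:,1).^(d-i) .* xyTrain(:,2).^i];
        end
    end
    % yTrain in {-1,+1}; least-squares weights via the pseudoinverse.
    w = pinv(Phi) * yTrain;
    % Test points are expanded the same way and classified by sign(PhiTest * w).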

k-Nearest Neighbor

One of the simplest and oldest machine learning algorithms, k-Nearest Neighbor often outperforms more complex methods in real-world applications. I used an optimized implementation of kNN written by Yi Cao of Cranfield University [4]. As with GLD, I used 1/4 of each person's dataset to train the parameter k -- the optimal number of nearest neighbors -- using leave-one-out validation. As an extra verification step, I repeated the training and testing process four times, each time using a different, mutually exclusive, equal-sized portion (i.e. 1/4) of the dataset as the training data and the remaining 3/4 as the testing data. The final probability of error was the average of the probabilities of error given by these four trials.
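The logic of the leave-one-out selection of k is sketched below in plain Matlab (Yi Cao's implementation [4] is much faster; the variable names here are illustrative):

    % Leave-one-out selection of k: classify each training point by its
    % nearest neighbors, excluding the point itself.
    D = squareform(pdist(xyTrain));        % pairwise distances
    D(1:size(D,1)+1:end) = Inf;            % a point is never its own neighbor
    [dsorted, idx] = sort(D, 2);           % neighbors ordered by distance
    bestErr = Inf;
    for k = 1:2:15                         % odd k avoids ties in a 2-class vote
        votes = mode(yTrain(idx(:, 1:k)), 2);
        err = mean(votes ~= yTrain);
        if err < bestErr, bestErr = err; bestK = k; end
    end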

Results

HMM Analysis Results

Figure 5 shows the probabilities of error for HMM classification across the 6 different region subdivision strategies. Overall, the HMM analysis produced error rates well above 50% across the board, with all region subdivision strategies but the 4x4 grid producing higher than 80% error. The 3-state HMM had 97% error for the 9 meaningful groupings and 98% error for 9 k-Means clusters. The Markov Processes for 7 and 9 k-Means clusters both had 98.5% error.

Figure 5. HMM Classification Probability of Error (see Addendum #2 for corrected Markov Model results).

Corrected Results

The Markov Model results above were affected by a bug in my code; see Addendum #2 and Figure 19 for the corrected analysis.

Discussion

As an example, the initial Markov Process models derived for 3 k-Means Clusters are shown in Figure 6. Looking closely at Figure 6, it should be no surprise that HMMs perform poorly on the current dataset. Although Figure 6 describes just the Markov Process instance, I think the issues it presents are symptoms of general problems that also affect the other variations that were tested.

First of all, the Markov Processes for both Rupal and Deb are very similar to each other, with the biggest discrepancy being a difference of just 0.001; specifically, the transitional probability from State 1 to State 2. Given very similar initial HMMs, which were derived from the training data in the first place, it is to be expected that the trained HMMs would remain very similar to each other. And if we have two very similar HMMs, classification between them cannot be expected to be reliable or informative at all.

Figure 6. Markov Process Models for Rupal and Deb, 3 k-Means Clusters.


The second problematic feature of the Markov Processes in Figure 6 is that the between-state transitional probabilities are very low. Of course, in the case of 3 k-Means Clusters, the clusters are large, which decreases the probability of transitions. However, even for the other region strategies, the self-state transitional probabilities never go below 0.97. The consistently low between-state probabilities and high self-state probabilities seem to be clear symptoms of insufficient data.

A third problem is that priors calculated from training data (first 120 sequences) are not consistent with the region affinities for the overall dataset (Figure 7). In other words, the patterns in the training data do not accurately reflect the patterns in the overall dataset. This is yet another sign that classification with HMMs requires more data.
Figure 7. Cluster Affinities, 3 k-Means Clusters.

Future directions with HMM Analysis

To resolve these issues and ultimately train HMMs that can classify individuality from tracking data, I would like to try several things. Even though I did randomize the sequence ordering, I'd like to try randomly selecting the 120 training sequences and see how that changes the results. I would also like to try deriving the Markov Process probabilities from data that is neither training nor testing data. Further, it might help to study the actual log likelihood values that are compared during classification, i.e. how close the scores from the two HMMs are for each test sequence. It is also important to think about ways to design a better initial HMM to capture the differences in individuality in this context. And finally, I plan to collect more data -- at least a week's worth, but ideally more like a month's.

Static Analysis Results

Overall XY Data Results

Figures 8 and 9 show the evidence curves of kNN and GLD for the overall XY data. The k values chosen based on these graphs were k=7 for kNN and k=3 for GLD. The results of the static classification methods on the overall XY data are shown in Figure 10. The clear winner is kNN, with just 2% probability of error. SVM with a Radial Basis kernel did next best, with 8% probability of error. The worst performance was also by an SVM, using a linear kernel, with 17% probability of error. GLD and Fisher came in at 13% and 15%, respectively.
Figure 8. kNN Evidence Curve for Overall XY Data.
Figure 9. GLD Evidence Curve for Overall XY Data.

Figure 10. Overall XY Results for Static Classification.

Within-Region Results

Figures 11 through 16 show the static classification results within each region for each of the region subdivision strategies. Again, kNN is the clear winner in all regions, for all region subdivision strategies, consistently scoring below 5% probability of error. Also, all of the methods achieve less than 45% error in the Meaningful Groupings across all 9 regions, while 5 regions in the 4x4 Grid had greater than 45% error. Every k-Means Clustering instance ends up with at least one region scoring greater than 45% error in at least one method. In addition, if we set kNN aside for a moment, different classification methods do far better in some regions than in others.

As expected, due to its linear nature, Fisher tends to perform worst overall, by a huge margin in several cases, at times coming close to 100% error -- in Region 6 of the 4x4 Grid and Region 5 of the 9 k-Means Clustering. GLD and SVM perform worse than Fisher in a few instances: the highest error rate for GLD is ~75% in Region 13 of the 4x4 Grid, and the highest error rate for SVM is ~50% in Region 3 of the 9 k-Means Clustering.

Figure 11. Within-Region Classification, 4x4 Grid.
Figure 12. Within-Region Classification, Meaningful Grouping.
Figure 13. Within-Region Classification, 3 k-Means Clustering.
Figure 14. Within-Region Classification, 5 k-Means Clustering.
Figure 15. Within-Region Classification, 7 k-Means Clustering.
Figure 16. Within-Region Classification, 9 k-Means Clustering.

Discussion

The first question for discussion is: why does kNN do so well, so consistently, on this data? One possible reason for its success is that it may capture the height of a person. Due to the nature of the recording and its two-dimensional overhead representation of activity, if two people of different heights follow an identical physical trajectory, their tracks will present as the same shape at slightly different scales. It also seems that k-Nearest Neighbor would naturally be good at capturing the small-scale habitual quirks of movement that are unique to each person.

However, at this stage, kNN's success on this dataset should be taken with a grain of salt. k-Nearest Neighbor tends to overfit, so as the amount of data increases, its classification accuracy may degrade. Testing with more data is needed to verify kNN's good performance more conclusively.

Secondly, why do some regions do better or worse than others? To gain some insight into this issue, let's compare the within-region data for Regions 2 and 3 of the 3 k-Means Clustering. Notice in Figure 13 that Region 2 generally does much better than Region 3. The kNN, SVM, and GLD error rates are all less than 10% in Region 2, and Fisher comes in at roughly 20%. In contrast, within Region 3, the probability of error is roughly 45% for SVM, just over 70% for Fisher, and 65% for GLD. In Figure 17, we can see that the data for the two classes have a more complex spatial relationship with each other in Region 3 than in Region 2. In fact, Deb's data has two visible clusters on opposite ends of the space, as well as samples surrounding Rupal's data in an overlapping semicircle. This configuration would challenge Fisher greatly, as well as GLD and SVM. Because of the curved progression of Deb's data around Rupal's data, both GLD and SVM would have a hard time finding a higher dimensionality in which an adequate decision boundary can be found. In contrast, Region 2 shows a relatively simple relationship between Deb's data and Rupal's data. To note, there is a fair amount of overlap in Region 2, which accounts for Fisher's nontrivial error rate there.

Figure 17. Comparison of within-region data for Regions 2 and 3 of the 3 k-Means Clusters.


Which choice of regions is the best? Figures 11 and 12 clearly demonstrate that the meaningful groupings are better than a plain 4x4 grid for accuracy in static classification. The 9 meaningful groupings of the 4x4 grid regions provided the most consistent accuracy across all regions for all methods, with all error rates below 45%.

Answering this question may not be the ideal end-goal, however. Throughout all the different subdivision strategies, there were certain regions that did extremely well for certain methods, with error rates less than 5%. It may be preferable instead to learn from this within-region analysis which method is best for which particular region.


Figure 18. Unequalized (left) vs. equalized dataset sizes, 4x4 Grid.


One issue that came up during within-region analysis is that some regions have far more data for one person than for the other, in some cases by several orders of magnitude. To check whether this adversely affected the validity of the within-region results, I re-ran the per-region static classifications with equalized datasets, where a random sampling of the larger dataset was taken to match the size of the smaller dataset. Figure 18 shows the comparison for the 4x4 grid. The results are similar, with kNN still performing far better than the other methods and Fisher performing the worst. In some respects, the equalized analysis actually performs worse than the unequalized one. Equalizing the dataset sizes, at the very least, does not improve classification. This makes sense, because the difference in region affinities between two people is itself meaningful information that can help with classification.
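The equalization itself is a simple downsampling step; a minimal sketch with illustrative names:

    % Downsample the larger per-region dataset to match the smaller one.
    nMin = min(size(xyRupal, 1), size(xyDeb, 1));
    pR = randperm(size(xyRupal, 1));
    pD = randperm(size(xyDeb, 1));
    xyRupalEq = xyRupal(pR(1:nMin), :);
    xyDebEq   = xyDeb(pD(1:nMin), :);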

Final Reflections

In conclusion, this analysis found that k-Nearest Neighbor performs very well at classifying individuality in a one-day corpus of tracking data. While HMMs performed rather poorly on this same corpus, we have reason to believe that repeating the analysis with 7 to 30 times more data will improve HMM classification performance. Developing more meaningful region subdivision strategies would also benefit HMM performance, and would improve the consistency and accuracy of within-region static classification.

Future directions for this research involve both short term and long term goals. In the short term, I would like to collect more data, henceforth using a new tracker written by Philip DeCamp, a Ph.D. student in my group. Philip’s tracker has been shown to successfully track the child, so this will enable me to evaluate classification of three individuals, as well as adult vs. child. Given more data, I will also be able to verify the performance of k-Nearest Neighbor and more conclusively evaluate the performance of HMMs in this classification.

In the long term, I would like to apply these results to classifying interaction-specific activities, with individual identity as one of the features. It will benefit this future study of interaction to design a region subdivision strategy whose regions isolate meaningful segments of activity. Toward developing such a meaningful segmentation of space, I plan to also test irregularly shaped (non-rectangular) meaningful regions.

Addendum (after discussing with my advisor Deb Roy)

A cursory interpretation of the HMM results may suggest that HMMs perform poorly on this data, due to the high probabilities of error. However, since the probabilities are significantly higher than 50% -- near 100%, in fact -- it follows that we could use the trained HMMs for reverse classification. If we rename Rupal's HMM to be Deb's new HMM, and Deb's old HMM to be Rupal's new HMM, then these high probabilities of error become high probabilities of correct classification. The question remains, however: why does an HMM trained on Rupal's data produce a higher probability for Deb's sequences than for Rupal's, and vice versa? Such a flip is suspicious, in spite of the promising numbers obtained under reverse classification. If Rupal's HMM sees itself as more likely to have produced Deb's sequences than Rupal's, and vice versa, this suggests an inaccurate generalization of the training data. Further investigation into this reversal is necessary to verify its validity and whether it would generalize reliably to a larger dataset.

Addendum #2 (after fixing a bug in my code)

Figure 19 shows the final Markov Model results after the bugfix, for one day’s worth of data. Randomized initial HMMs give roughly 50-60% probability of error, a classification accuracy that is equivalent to random chance. This is truly poor performance for HMM Classification with randomized initial HMMs. The Markov Process performs better across the board, even producing as low as 20% probability of error in the 4x4 grid, and giving 30-45% error in the rest of the subdivision strategies. Consistent with this trend, the grid surprisingly had the best overall performance in the Markov Model results. The discussion and conclusions made in the original writeup therefore continue to hold, although a Markov Process trained on 4x4 grid region data now demonstrates much promise towards classifying individuality using HMMs.
Figure 19. HMM Classification Probability of Error.

Acknowledgements

I would like to thank Leo Tsourides for lending me his tracker for this project, helping me set it up on my machine, and answering my many questions about how it works. Thanks to my advisor, Professor Deb Roy, and the entire Cognitive Machines group for the Human Speechome data. Finally, I would like to thank Professor Picard and the TAs for this class, Hyungil Ahn and Dawei Shen, for their guidance and insights throughout the evolution of this project.

References

[1] Roy, D., Patel, R., DeCamp, P., Kubat, R., Fleischman, M., Roy, B., Mavridis, N., Tellex, S., Salata, A., Guinness, J., Levit, M., Gorniak, P. The Human Speechome Project. 28th Annual Meeting of the Cognitive Science Society, Vancouver, Canada. July 2006.

[2] Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.

[3] Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification. New York, NY: John Wiley & Sons, 2001.

[4] Cao, Y. Efficient K-Nearest Neighbor Search using JIT. March 27, 2008. http://www.mathworks.com/matlabcentral/fileexchange/19345

[5] Baldwin, D. A. & Baird, J. A. (2001). Discerning intentions in dynamic human action. Trends in Cognitive Sciences, 5(4), 171-178.