Motivation

This part of the project aims to explore how simple clustering methods can help alleviate an area of complexity in this research that is often taken for granted: the complexity and expense of labeling activities from simple sensor firings. The manual labeling process is inexact and may take as much as five hours of labeling per hour of sensor activations, even for the most basic activity labels. This creates a strong incentive for methods that reduce the amount of data that must be labeled. This project begins to tackle the problem with the most elementary of semi-supervised techniques, cluster-and-label. The idea is to assess how some of the more basic unsupervised clustering methods group the sensor activity. Given these clusters and a small set of labeled data, we want to see whether the clusters can be labeled automatically and used with reasonable accuracy; in short, whether we can find structure in the data. This project is only a first approach to the problem (a minimal sketch of the cluster-and-label step is given below, after the discussion of features).

Features

The clustering methods used in this project are well defined and well studied. However, they still leave some free variables whose selection requires an artistic touch. The first design choice is which features to use with each clustering method. The second is how to measure the similarity of features. The first question is discussed here, while discussion of the second is deferred to the sections on the individual clustering methods.
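To make the cluster-and-label idea concrete, the following sketch assigns every point in a cluster the majority activity label among whatever labeled points fall inside that cluster. This is only a minimal illustration, not the project's actual code: the names X, y, and k are placeholders, and it assumes Matlab's Statistics Toolbox kmeans routine.

    % Cluster-and-label sketch (hypothetical variable names):
    %   X - N x d matrix of features computed from sensor firings
    %   y - N x 1 vector of activity labels; NaN where no manual label exists
    k = 8;                            % number of clusters (free parameter)
    clusterId = kmeans(X, k);         % unsupervised clustering of all points
    yHat = nan(size(y));              % predicted activity for every point
    for c = 1:k
        inC = (clusterId == c);
        lab = y(inC & ~isnan(y));     % manual labels available inside this cluster
        if ~isempty(lab)
            yHat(inC) = mode(lab);    % majority label propagates to the whole cluster
        end
    end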
Approaches

All code is in Matlab. All methods were evaluated on a test set held out during training, and 10-fold cross-validation was used to estimate the free parameters.

1. K-Means Clustering
Method Overview: K-Means is one of the most basic methods for the unsupervised classification problem. It classifies a given data set through k clusters fixed a priori. The main idea is to define k centroids, one for each cluster. Each point in the data set is then associated with the nearest centroid. Once all points have been assigned, k new centroids are computed from the resulting groups, and the points are re-assigned to the nearest of the new centroids. Iterating these two steps, the k centroids move step by step until convergence. The main things we varied here were the number of clusters and the similarity measure. The two metrics we used were the Minkowski distance and a weighted sum. The motivation behind the weighted sum was to model the competition between feature types such as x,y location and period of sensor activation: while either may be a good measure on its own, a combination of the two, with unequal contributions, may be more insightful. We varied these free parameters (the number of clusters and the choice of metric) to see how they affected our results.
The detailed results show that the best measure seems to be based on location. As would be expected, k-means tends to be very sensitive to feature scaling: the weight matrix seems to increase accuracy when two features of different scales, such as time and distance, are considered together, but decreases accuracy when only time or only distance is used as the metric. This is logical, since in the single-feature case all the scaling does is clump the data together, making the differences harder to distinguish.

2. Fuzzy C-Means Clustering

Method Overview: A generalization of K-Means is fuzzy C-Means. Fuzzy C-Means clustering is a form of overlapping clustering; in other words, a single data point may belong to more than one cluster. This method is based on minimizing the following objective function:

J_m = sum_{i=1..N} sum_{j=1..C} (u_ij)^m * ||x_i - c_j||^2
where m is any real number greater than 1, u_ij is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured data point, c_j is the d-dimensional center of cluster j, and ||*|| is any norm expressing the similarity between a measured data point and a center. We varied the free parameters, chiefly the number of clusters and the fuzziness exponent m, to see how they affected our results.
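A minimal sketch of this method, assuming the Fuzzy Logic Toolbox's fcm routine is available (X and the settings are placeholders); the fuzzy memberships are hardened to one cluster per point so that cluster-and-label can be applied as before.

    % Fuzzy C-Means sketch using the Fuzzy Logic Toolbox fcm function.
    %   X - N x d feature matrix, one row per sensor firing
    m    = 2.0;                        % fuzziness exponent (the free parameter m, "b" in the figures)
    Nc   = 8;                          % number of clusters
    opts = [m, 100, 1e-5, 0];          % [exponent, max iterations, min improvement, no display]
    [centers, U] = fcm(X, Nc, opts);   % U is Nc x N: u_ij, degree of membership of x_i in cluster j

    % Harden the memberships for cluster-and-label: each point goes to the
    % cluster in which it has the largest membership.
    [~, assign] = max(U, [], 1);
    assign = assign(:);                % column vector of hard cluster ids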
It was also interesting to note that the degree of fuzziness (m) was helpful for features that are more stochastic with respect to which activities they belong to. Thus, for the duration feature we saw an increase in accuracy over k-means. This is because a sensor activation of ten seconds can belong to any activity more readily than, say, being in the kitchen can; I (probably) won't be in the kitchen and taking a bath. Conversely, the fuzziness hindered classification with features that tend to be more strongly correlated with certain activities, such as rooms. Apart from features such as duration, our data does not seem to overlap much, and activities seemed to be fairly sequential; thus, once again, low values of m (referenced as b in the figures) tended to be more accurate due to less blending, and k-means (essentially a blending of zero) did better.

3. Hierarchical Clustering

Method Overview: Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering is as follows:

1. Start by assigning each item to its own cluster, so that N items give N clusters, with the distances between clusters equal to the distances between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, leaving one less cluster.
3. Recompute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are merged into a single cluster.
This type of hierarchical clustering is called agglomerative clustering because we cluster from the bottom up, starting with N clusters and working our way towards one cluster. As with k-means, the way in which distances between clusters are measured can be changed to alter which clusters are formed. The specific form we chose was complete-linkage clustering. In complete-linkage clustering (also called the diameter or maximum method), the distance between two clusters is taken to be the greatest distance from any member of one cluster to any member of the other. This was used in order to discourage the growth of elongated clusters. The complete-linkage method is so named because it essentially produces a complete subgraph. We implemented hierarchical agglomerative clustering with the same features as the previous methods.
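A minimal sketch of complete-linkage clustering, assuming the Statistics Toolbox pdist, linkage, and cluster routines (names and the cut level k are placeholders); the choice of k corresponds to the level of the hierarchy discussed in the results below.

    % Complete-linkage agglomerative clustering sketch.
    %   X - N x d feature matrix (e.g. x,y sensor locations)
    D = pdist(X, 'euclidean');           % condensed vector of all pairwise distances
    Z = linkage(D, 'complete');          % merge clusters by their greatest pairwise distance
    k = 8;                               % how many clusters to keep (the level of the hierarchy)
    assign = cluster(Z, 'maxclust', k);  % cut the tree so that k clusters remain
    % dendrogram(Z);                     % optional: visualize the full hierarchy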
The main thing we varied here was the number of clusters, i.e., which level of the hierarchy was optimal. The results show that our success with hierarchical clustering was less than pleasing, which actually makes sense given our features. The x,y feature did the best, because the distances can be thought of as a hierarchy: the sensors around the fridge and sink form small clusters, which can in turn be grouped into the kitchen. Conversely, the way the duration feature was constructed does not lend itself to hierarchical clustering. A major problem with agglomerative clustering is that it is O(N^2) and highly susceptible to local minima, since each stage builds on the previous one.

4. Mixture Modeling

Method Overview: This approach models the data as a mixture of Gaussians: we assume K Gaussian components and use EM to learn the parameters of each component so that the model fits the data. We implemented the Gaussian mixture model with the same features as the previous methods.
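A minimal sketch of the mixture-of-Gaussians fit, assuming the Statistics Toolbox (fitgmdist on recent releases; older releases provide the equivalent gmdistribution.fit(X, K)); names and settings are placeholders.

    % Gaussian mixture model fit by EM.
    %   X - N x d feature matrix
    K  = 8;                                         % number of Gaussian components (free parameter)
    gm = fitgmdist(X, K, 'Replicates', 5, ...
                   'RegularizationValue', 1e-6);    % small ridge keeps covariances well conditioned
    ll = -gm.NegativeLogLikelihood;                 % log-likelihood of the data; EM stops when this stops improving
    fprintf('K = %d components, log-likelihood = %.1f\n', K, ll);
    assign = cluster(gm, X);                        % hard component assignment, used for cluster-and-label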
The main thing we varied here was the number of components, i.e., how many Gaussians best explained the data. The stopping criterion used for this algorithm was the log-likelihood of the data: the log of the probability of the data given the model parameters measures how well those parameters explain the data, and the higher the log-likelihood, the better the fit. The results show that almost all of our features did better with smaller K. This is again most likely due to the large number of sensors in one area of the home and the sparse coverage in the rest of the home. The accuracies tended to be slightly worse than the K-means algorithm, but much better than fuzzy C-means. Also, the mixture of Gaussians has the advantage of incorporating prior information, which may come in handy in later work.

Conclusions

Overall, the methods explored here seemed mediocre, but given the nature of the data, which is closer to real data than to toy problems, some results were encouraging. Judging by accuracy across all the methods, the distance features seem to be the best features. The most promising methods appear to be the simple k-means clustering and the more robust mixture of Gaussians. The nature of the data did not mesh well with hierarchical clustering, but different features may be more successful with that method.

Future directions

Future work will need to focus on better feature selection and combination. It would also be very useful to try more robust semi-supervised learning algorithms instead of unsupervised methods with naive cluster-and-label. Balancing the data may bring better results as well.