Conclusions

Our results show that despite trying different algorithms and data reduction techniques, our results do not improve much. In some cases, the results even get worse. It s hard to identify any patterns in the way the different algorithms are influenced by the data set and vice versa. The outcome so far suggests the following problems:

The data is not linear.

This problem can be solved by using algorithms that do not require linear input data. There are a number of algorithms to choose from. Support Vector Machines (SVM) and Neural Nets are two examples. These approaches will eventually be tested and described by Aaron.

The features are not sufficient for classification.

Our results indicate that there is a lot of overlap within our data set; it is hard to find natural clusters. Our initial assuption that the content of the MySpace profiles can be used for distinguishing social individual from spam did not work as well as we exptected. For example, some companies fill in just as many personal details as individual. In many cases they fill in information that corresponds to their target group. At the same time, there are individuals who do not bother about filling in information, even though they are very social.

The distribution within the data set is too uneven.

The charts presented on this webpage show that the distribution between the three different groups is fairly even. However, the within-distribution is very uneven; mostly due to the way we picked out the profiles. Maybe, since we are already manipulating the data set, it is worth creating a more even data set and see what the outcome would be?

The data set is too small.

This problem is related to the previous one. We used a total of 800 data points in the study; 600 for training and 200 for testing. Nowadays, 800 data points is not a lot in the domain of pattern recognition. A larger data set may generate a more evenly distributed data set.

Bad self-esteem.

The overall conclusion is that it is important to know the data set that you are analyzing. In addition to testing different algorithms, it is important to look at the data properly, plot graphs, and make tables. In this project we did not trust our own intuition and domain knowledge enough to take it into account. Most likely, we would have reached better results faster if we had.