Data analysis

This section describes and illustrates the data set used in the project, as well as the features used to describe and classify the data. We highlight the most significant problems that we encountered during the initial phase of the project and describe how we solved or worked around them.

Data distribution

[Figures: "Raw data relative" and "Raw data"]

This graph illustrates the distribution of the 800 data points used in this project. As you can see, the distribution is rather uneven, with a high number of [1,5]-profiles (low social, high promotion) and [2:5,1]-profiles (social 2, 3, 4 or 5, low promotion). The large quantity of [1,5]-profiles is a direct consequence of the way in which we picked out the 400 adds used in the project, since we specifically looked for profiles that we thought were straightforward.
Further, we divided the data into 600 training data points and 200 testing data points. The graphs below illustrate the absolute distribution within each data set.

[Figures: "Training" and "Testing"]

The charts above show that the distribution is fairly similar across the two data sets. However, as pointed out before, the distribution within each data set is very uneven.
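
The split and the per-set tallies can be sketched as follows. This is only an illustration: the report does not say how the 600/200 split was drawn, so the random shuffle, the function name split_and_count, and the 1 to 5 range assumed for both labels are our assumptions, and the labels below are random placeholders rather than the project data.

```python
import random
from collections import Counter

def split_and_count(labels, n_train=600, seed=0):
    """Shuffle (social, promotion) labels, split them, and tally each part."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)  # assumption: a random split
    train, test = shuffled[:n_train], shuffled[n_train:]
    return train, test, Counter(train), Counter(test)

# Placeholder labels only: 800 random pairs, NOT the project data.
rng = random.Random(1)
labels = [(rng.randint(1, 5), rng.randint(1, 5)) for _ in range(800)]

train, test, train_counts, test_counts = split_and_count(labels)
print(len(train), len(test))
print(train_counts.most_common(3))  # the most frequent label pairs in training
```

Comparing train_counts and test_counts pair by pair is one way to check that the two sets have a similar distribution.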

Feature analysis # 1

The next step in the project involved choosing features. After having labeled some 2000 profiles, we had some ideas regarding which features might be significant in the classification process. We focused on "low-level" (content-based) features that could easily be extracted from the HTML code of the profiles. Below is a list of the original set of features, as well as the data type of each feature.

number of friends integer
number of YouTube movies integer
number of personal details integer
number of comments integer
number of "thanks for the add" in the comments integer
number of surveys integer
status text
children text
number of "I", "I'm", "me", "my" (first person) integer
number of "we", "our", "he", "she" (first person plural/third person) integer
missing image boolean
mp3 player boolean
static url boolean
school information boolean
blurbs boolean
customized page boolean
network information boolean
company information boolean
blogs boolean
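
As an illustration of how the word-count features in the list above might be computed, here is a minimal sketch. It assumes the profile text has already been stripped of HTML tags; the function name count_text_features, the dictionary keys, and the sample sentence are hypothetical and are not taken from the project code.

```python
import re

# Whole-word, case-insensitive patterns for the counted words.
FIRST_PERSON = re.compile(r"\b(?:i'm|i|me|my)\b", re.IGNORECASE)
OTHER_PERSON = re.compile(r"\b(?:we|our|he|she)\b", re.IGNORECASE)
THANKS = re.compile(r"thanks for the add", re.IGNORECASE)

def count_text_features(text):
    """Count pronoun and phrase occurrences in plain profile text."""
    return {
        "first_person": len(FIRST_PERSON.findall(text)),
        "other_person": len(OTHER_PERSON.findall(text)),
        "thanks_for_the_add": len(THANKS.findall(text)),
    }

# Made-up sample text, for illustration only.
sample = "Thanks for the add! I'm glad we met. My band and I play on Friday; she sings."
counts = count_text_features(sample)
print(counts)
```

The word boundaries keep "met" from counting as "me" and "the" from counting as "he".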

The features listed above are all very different, both in data type and in distribution. How do you compare and mix data that are so different in nature? One approach is to normalize the data so that all features share the same range. We normalized the data by translating true/false into 1/0 and by dividing each integer feature by the maximum value within that feature. As a result, all feature values lie within the range [0, 1].
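
The normalization described above can be sketched like this. The function name normalize_features, the feature keys, and the three example profiles are made up for illustration. The guard against a maximum of zero is our assumption, since the report does not say how that case was handled.

```python
def normalize_features(rows, integer_keys, boolean_keys):
    """Scale integer features by their maximum and map booleans to 1/0."""
    maxima = {k: max(r[k] for r in rows) for k in integer_keys}
    normalized = []
    for r in rows:
        new = {}
        for k in integer_keys:
            # Assumption: a feature whose maximum is 0 stays 0 everywhere.
            new[k] = r[k] / maxima[k] if maxima[k] else 0.0
        for k in boolean_keys:
            new[k] = 1.0 if r[k] else 0.0
        normalized.append(new)
    return normalized

# Made-up profiles, for illustration only.
profiles = [
    {"friends": 50, "comments": 10, "mp3_player": True},
    {"friends": 200, "comments": 0, "mp3_player": False},
    {"friends": 100, "comments": 40, "mp3_player": True},
]
normalized = normalize_features(profiles, ["friends", "comments"], ["mp3_player"])
for row in normalized:
    print(row)
```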
The two text-based features, "status" and "children", are not as easy to normalize. As a matter of fact, we are still not sure that we did it correctly. The MySpace template allows the user to choose from 5 options for "status" (swinger, in a relationship, single, divorced, married) and 6 options for "children" (I don't want kids, Someday, Undecided, Love kids but not for me, Proud parent, No answer). We chose to encode each value by its position in the template. More specifically, "I don't want kids" was translated into 1, whereas "Someday" was translated into 2. Thereafter, the integers were normalized. The major problem with this approach is that the status and children options have no natural order. The numbers that we assigned to them are therefore, in a sense, wrong or meaningless. For this reason, we eventually decided to exclude these two features.
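
The position-based encoding can be sketched as below. The option lists follow the order given in the text. The function name encode_by_position is hypothetical, and dividing by the number of options is our assumption about how the integers were normalized.

```python
STATUS_OPTIONS = ["swinger", "in a relationship", "single", "divorced", "married"]
CHILDREN_OPTIONS = [
    "I don't want kids", "Someday", "Undecided",
    "Love kids but not for me", "Proud parent", "No answer",
]

def encode_by_position(value, options):
    """Map a template option to its 1-based position, scaled into (0, 1]."""
    position = options.index(value) + 1
    # Assumption: normalize by the number of options in the template.
    return position / len(options)

print(encode_by_position("I don't want kids", CHILDREN_OPTIONS))  # position 1 of 6
print(encode_by_position("Someday", CHILDREN_OPTIONS))            # position 2 of 6
print(encode_by_position("married", STATUS_OPTIONS))              # position 5 of 5
```

The sketch also shows the weakness discussed above: "Someday" ends up numerically closer to "I don't want kids" than to "Proud parent", purely because of the template order.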

Extreme values were another problem that we encountered while collecting features. For example, in the "number of friends" feature, one profile had over 90,000,000 friends. When we normalized the data, this led to extremely small numbers for all other profiles. To solve this problem, we took the log of all values after having normalized them.
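
The effect of one extreme value, and of taking the log afterwards, can be illustrated as follows. The friend counts are made up, apart from the rounded 90,000,000 outlier. The natural log and the small floor that keeps log(0) defined are our assumptions, since the report specifies neither the base nor how zeros were handled.

```python
import math

def max_normalize(values):
    """Divide every value by the largest value in the list."""
    top = max(values)
    return [v / top for v in values]

def log_transform(values, floor=1e-12):
    """Natural log of each value; floor is an assumed guard against log(0)."""
    return [math.log(max(v, floor)) for v in values]

# Made-up friend counts with one extreme outlier.
friends = [120, 450, 3000, 90_000_000]

normalized = max_normalize(friends)
logged = log_transform(normalized)

for raw, norm, lg in zip(friends, normalized, logged):
    print(raw, norm, round(lg, 2))
```

After normalization, the ordinary profiles are squeezed into a tiny interval near 0. The log spreads them out again while preserving their order, at the cost of making the values negative.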

Feature analysis # 2

In this section Aaron will explain how we extracted new features and why we chose them.