Introduction

Background

This project was conducted as part of the MIT class MAS 622J: Pattern recognition and analysis. The project focuses on a relatively new phenomena: MySpace - an online community that "lets you meet your friends' friends". Even though MySpace is first and foremost a community that aims at supporting social activities and interactions, people are using it in many different ways and with different intentions. More specifically, there are individuals and companies that are utilizing the popularity of MySpace primarily to promote products and services. MySpace has become a subject to spam.

The scope of the project was to examine whether it is possible to distinguish socially oriented profiles from promotional and commercial profiles. Would it be possible to create a spam filter, similar to the filters that are being used for filtering out unwanted email? One thing that makes the case of MySpace somewhat more complex is the fact that there is no natural classifcation. There are no obvious "unwanted" profiles, since "unwanted" has a different meaning to different people. Take DJ's for example. It is plausible that a DJ uses his MySpace profile to promote his services at the same time as he is interacting with friends and fans on a social level. Would he or she still be classified as spam? Below, are five examples of MySpace profiles that illustrate the wide range of profiles, as well as the challenge in classifying them.

Click the images to view the profiles.

Regular Private Motorola DJ Empty
Regular Private Commercial Social and promotional Empty

The "empty" profile illustrates what a default profile looks like. All MySpace profiles are based on a template. The user chooses how much information he or she wants to fill in. Each profile has a specific section for personal information, friends, and comments. In addition to adding personal information, the user can customize the profile by adding for example images, surveys and an mp3-player.

Project outline

After having defined the scope of the project and the questions that we wanted to answer, we translated the problem into a more pattern classification-friendly format. We created two categories for rating the profiles: "social" and "promotion", both having the range [1 2 3 4 5]. Thus, each profile is rated with two digits; one for how social the profile is and one for how promotional it is. For example, the Motorola add above would be rated [1,5]; 1 for low social activity and 5 for commercial features. In a similar way, the empty profile would be rated [1,1]. Our initial assumption was that the number of spam-like profiles is increasing and will become a significant problem in the near future.

Data collection

We collected MySpace profiles by creating a scrapper that automatically extracted ~20,000 profiles. We chose a block of fairly recent profiles (none older than 1,5 years) with the intention of maximizing the number of high-promotional profiles. We then created a web-based interface for manual labeling of profiles. After having labeled about 2000 profiles we started realizing that the rate of spam was not as high as we had hoped for. Out of the 20,000 profiles, not even 2% had been labeled as spam. Hence, we decided to manually create a "futuristic scenario" where the distribution is 50/50. We manually picked out 400 high-promotion profiles (promotion rating 4 or 5), and randomly picked out another 400 from the original data set, giving us a total of 800 profiles or data points. We also realized that the number of [1,1]-profiles was much higher than we expected. Since these profiles have a very significant look, and thus are very easy to detect, we decided to exclude them from the data set. Thus, the randomly extracted set of 400 data points did not include any [1,1] profiles. The images below show the distribution of the initial 17 features, in relation to the two "classes" social and promotion.

Click on the image to see a larger version of the image. There is one chart per feature. All charts have the same axes: x: values (0:1), y: social rating, z: promotion rating.

data_all