Feature selection

The "Original 11"

After determining the character of the data using the descriptive statistics above, we used a greedy forward feature search to arrive at 11 features which seemed to best discriminate the listings which were fully funded from those which closed unfunded. These included:

  1. Credit Grade
  2. Amount Requested
  3. Borrower Rate
  4. Debt to Income Ratio
  5. Group membership (true/false)
  6. Has an image (true/false)
  7. Current delinquencies
  8. Delinquencies last 7 years
  9. Open credit lines
  10. Income
  11. Bid count*

While "bid count" proved to be a very useful feature in distinguishing unfunded listings from funded loans, we ultimately removed it, since it is an artifact of the bidding process rather than a characteristic of the loan application.
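A greedy forward search of this kind repeatedly adds whichever feature most improves an evaluation score. The following is a minimal sketch of the general technique, with the evaluation function `score_fn` supplied by the caller (names and structure here are illustrative, not our actual code):

```python
import numpy as np

def greedy_forward_select(X, y, score_fn, k):
    """Repeatedly add the feature whose inclusion maximizes score_fn,
    until k features have been chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score_fn(X[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Any classifier accuracy or class-separation measure can serve as `score_fn`; the greedy search only assumes it scores a candidate feature subset.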

We performed a principal component analysis (PCA) of these features to project the data down to 2 and 3 dimensions and gauge the separability of the data.

[Figure: 2-dimension PCA] [Figure: 3-dimension PCA]

The following shows the portion of the variance in the 10-feature population contributed by each of the top 10 principal components:

[Figure: variance and PCA]
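These per-component variance fractions can be computed from the singular values of the centered data matrix. A minimal sketch of that computation (illustrative, not the code used here):

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component,
    computed from the singular values of the mean-centered data."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2                             # proportional to component variances
    return var / var.sum()
```

The returned ratios sum to one; a steep drop-off after the first few components indicates the data lies close to a low-dimensional subspace.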

The "96"

Unlike traditional bank loans, in which applicants are scrutinized largely on the basis of their financial profile, peer-to-peer lending involves much more direct human communication, including pictures and descriptions. We were therefore interested in whether we could take advantage of the descriptive textual features available in the data set.

To do this, we calculated the frequency of every word, word pair, and word triple in the listing descriptions, titles, and member endorsements, and used these frequencies to construct tables of the words which most discriminate the data set (full description results, full title results).

Description words

    0.44    0.58    0.14    cards and other
    0.29    0.42    0.14    monthly expenses housing
    0.44    0.58    0.14    clothing household expenses
    0.41    0.54    0.13    and other loans
    0.42    0.56    0.14    other expenses
    0.29    0.43    0.14    expenses housing
    0.44    0.57    0.13    clothing household
    0.45    0.58    0.13    car expenses

Title words

    0.024   0.058   0.010   cards and other
    0.022   0.042   0.007   monthly expenses housing
    0.021   0.058   0.005   clothing household expenses
    0.009   0.054   0.005   and other loans
    0.062   0.056   0.022   other expenses
    0.049   0.043   0.017   expenses housing
    0.015   0.057   0.012   clothing household
    0.026   0.058   0.011   car expenses

An additional problem was the presence of null values in the database. Many of the features, including numeric ones, have null values, often with high priors. To preserve the information carried by a null, we split each such feature into two columns: a binary value indicating whether the field was null, and the value of the feature (or zero if it was null). With these transformations, we arrived at a list of 96 features.
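A minimal sketch of this null-splitting transformation, assuming the data sits in a pandas DataFrame (illustrative, not our actual code):

```python
import pandas as pd

def split_nullable(df, column):
    """Expand a nullable column into an is-null indicator column plus a
    zero-filled value column, preserving the information in the null."""
    out = df.copy()
    out[column + "_isnull"] = df[column].isna().astype(int)
    out[column] = df[column].fillna(0)
    return out
```

Because the indicator column records where the zeros came from, a downstream classifier can still distinguish "missing" from "genuinely zero".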

We then implemented a sequential forward floating search (SFFS) to choose a subset of those 96 features which best describes the data (forward floating search code). The algorithm turned out to be very slow, despite using a simple linear discriminant as the evaluation function. To speed the calculation, it proved necessary either to further subsample the feature set (and thus increase the bias of the feature selection) or to set a cap on the number of features considered (thus decreasing the optimality).
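A condensed sketch of the sequential forward floating search, with a simplified backtracking rule (the linked code is the actual implementation; this version is illustrative only):

```python
import numpy as np

def sffs(X, y, score_fn, max_features):
    """Sequential forward floating search: after each forward inclusion,
    backtrack by removing features while that improves on the best score
    previously seen at the smaller subset size."""
    selected, best_at_size = [], {}
    while len(selected) < max_features:
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        if not remaining:
            break
        # forward step: include the single best remaining feature
        added = max(remaining, key=lambda f: score_fn(X[:, selected + [f]], y))
        selected = selected + [added]
        best_at_size[len(selected)] = max(best_at_size.get(len(selected), -np.inf),
                                          score_fn(X[:, selected], y))
        # floating (backward) step: drop the least useful feature while that
        # strictly beats the best score seen at the smaller subset size
        while len(selected) > 2:
            worst = max(selected,
                        key=lambda f: score_fn(X[:, [g for g in selected if g != f]], y))
            if worst == added:  # removing it would just undo the forward step
                break
            reduced = [g for g in selected if g != worst]
            s = score_fn(X[:, reduced], y)
            if s > best_at_size.get(len(reduced), -np.inf):
                selected, best_at_size[len(reduced)] = reduced, s
            else:
                break
    return selected
```

The cost is dominated by the repeated calls to `score_fn`, which is why even a cheap linear discriminant made the full 96-feature search slow.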

This graph shows a small portion of the forward floating search algorithm's progress. The backtracking mechanism is what allows the algorithm to escape some local minima. However, better clusters of features are often not found until more features have been chosen, and those improvements may then propagate all the way back to smaller feature counts.

[Figure: trawl]

This graph shows the number of times each particular feature was chosen, giving an indication of how well the feature discriminates the data. Had the features been statistically independent, the graph would show a straight, decreasing line; the extent to which it instead exhibits a stepping pattern indicates dependence among some of the features.

[Figure: feature frequency]

Results: the best feature sets found for each subset size from 1 to 96 features, for: