EXPERIMENTS ON THE RESTAURANT GAME 


Table of Contents

Introduction: The Restaurant Game
Pattern Classification Problems
Game Personality Classification
    Binary Classification
        Feature Extraction
        Preprocessing
        Classification
        Applications
    Multi-Class Classification
    Multi-label Classification using OVA
Speech Act Classification
    Attribute Selection
    Active Learning
    Classification Experiments
    Incorporating Context
Implementation Details
References



Introduction: The Restaurant Game

This report deals with two pattern recognition problems derived from the context of the Restaurant Game [1]. The Restaurant Game is an online two-player game: one player assumes the role of a waitress and the other the role of a customer, and the players interact in a restaurant. The goal of the game is to study social interactions in order to better understand and model in-game AI characters.

Each game consists of a session in which the players order food, pay, and carry out other social interactions in the restaurant. The conversation between the waitress and the customer is logged to a file. As of [date] there are 8000 log files corresponding to 8000 unique games. At the end of each game the players also rate each other's personality. The ratings were typed directly into a text box and hence represent an open-ended choice of possible personalities. Table 1 shows some of the most frequently occurring ratings for the waitress; less common ratings are omitted.


Rating      Frequency
polite      3200
nice        1560
friendly    432
funny       421
crazy       410
kind        367
rude        314
slow        304
fast        291
quick       289
good        279

Table 1. Popular personality ratings for the waitress
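The frequencies above can be reproduced by a simple tally once the rating strings have been parsed out of the log files. A minimal sketch in Python (the log-parsing step is assumed):

    # Sketch: tallying free-text personality ratings. Assumes the rating
    # strings have already been parsed out of the game log files.
    from collections import Counter

    def tally_ratings(ratings):
        """Count how often each (lower-cased) rating string occurs."""
        return Counter(r.strip().lower() for r in ratings).most_common()

    # e.g. tally_ratings(all_waitress_ratings) yields pairs such as
    # ("polite", 3200), ("nice", 1560), ...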

Pattern Classification Problems

This report details the exploration of two pattern classification problems associated with the Restaurant Game. The first is the classification of a player's personality into a set of classes based on the conversation and actions of the players. The second is the classification of the players' "speech acts" into a set of possible speech act types. A speech act is a meaningful sequence of words occurring in speech (spoken, written, etc.); for this problem I considered each continuous stream of typed text to be a speech act.

Game Personality Classification

Pattern Classification Problem: to classify the personality of players based on their conversation and actions.
Three methods of classification were attempted: binary classification, multi-class classification, and multi-label classification using multiple one-versus-all classifiers. All three used the same method of feature extraction and preprocessing.

1. Binary Classification
Reduction of the class space: Players rated each other's personality using three tags, with no restriction on the text of the rating. The most common ratings are given in Table 1. Because the rating text was unconstrained, some players entered meaningless ratings such as "????" and "asdf". The first task at hand was therefore to reduce the ratings to a meaningful set of classes. To start, one set of ratings was grouped into "good" and another into "bad", as given in Table 2.

Good    polite, nice, kind, friendly, helpful, good
Bad     rude, bad, horrible

Table 2. Grouping of ratings into "good" and "bad" classes


Feature Extraction:
The features in this pattern recognition problem fall broadly under conversation and actions. Conversation was modelled using the bag-of-words approach. Actions were restricted to the built-in actions each character could perform (examples include GOTO, STOP, PICKUP, PUTDOWN, GIVE, SITON, EAT, etc.).

Features from conversation: The bag-of-words representation with TF-IDF weighting was used; no stemming was done. The NLTK Python toolkit was used to extract words from the game log files. Each personality instance was represented by a vector whose i-th component is the TF-IDF weight of word i: v_i = tf_i * idf_i, where tf_i is the term frequency of word i in the instance and idf_i is its inverse document frequency across the log files. The vocabulary size was 2902.
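A minimal sketch of this representation; scikit-learn's TfidfVectorizer is substituted purely for illustration (the report itself used NLTK for word extraction and WEKA for classification), and the example documents are hypothetical:

    # Sketch: bag-of-words with TF-IDF weighting and no stemming.
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One string per personality instance (hypothetical example data):
    # the concatenated utterances of one player in one game.
    documents = [
        "hello welcome to the restaurant what can i get you",
        "gimme the steak and hurry up",
    ]

    vectorizer = TfidfVectorizer(lowercase=True)  # no stemming, as in the report
    X = vectorizer.fit_transform(documents)       # one row per instance

    # Each component of a row is v_i = tf_i * idf_i for vocabulary word i.
    print(X.shape)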

Features derived from actions:
For the waitress, I considered the act of serving the customer an item he orders as yielding a positive score:
{order --> waitress PUTSDOWN --> customer EATS : +1}
Conversely, if the waitress puts down an item but the customer does not eat it (chances are the customer never ordered that item, or ordered it and then changed his mind to order something else), the waitress has not been attentive enough to the customer's order and earns a negative point (-1).

For the customer, after going through a number of log files, I found that the most striking action affecting a customer's personality rating was the act of eating something he didn't order: for instance, eating the flowers on the table or eating trash. For this experiment I used this action to penalize the customer's behavior:
{customer EATS item --> if (item = non-menu item) : -1}
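A sketch of both action features under an assumed event representation; the (actor, action, item) tuples and the menu set are hypothetical stand-ins for the parsed game log, and event ordering is simplified:

    # Sketch: deriving the two action features described above.
    MENU_ITEMS = {"steak", "salmon", "soup", "pie", "beer", "water"}  # assumed menu

    def waitress_score(events):
        """+1 when an ordered item is put down and eaten; -1 for each
        put-down item that was not ordered or is never eaten."""
        ordered, eaten, put_down = set(), set(), []
        for actor, action, item in events:
            if actor == "customer" and action == "ORDER":
                ordered.add(item)
            elif actor == "waitress" and action == "PUTDOWN":
                put_down.append(item)
            elif actor == "customer" and action == "EAT":
                eaten.add(item)
        return sum(1 if item in ordered and item in eaten else -1
                   for item in put_down)

    def customer_score(events):
        """-1 each time the customer eats something not on the menu."""
        return -sum(1 for actor, action, item in events
                    if actor == "customer" and action == "EAT"
                    and item not in MENU_ITEMS)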

Future work in the area of feature extraction for this project could include extracting other actions that may influence the personality classification of the character.

 

Preprocessing: The vocabulary was too large to allow meaningful and fast experiments. To reduce the number of attributes, the information gain of each attribute was computed and attributes with low information gain were filtered out. Intuitively, the information gain of an attribute is the amount of information the attribute contributes about the class. It is computed as the difference between the entropy of the class distribution without the attribute and the entropy with it:

InformationGain(S, A) = H(S) - Σ_j (|S_j| / |S|) H(S_j)


InformationGain(S, A) is the information gain if attribute A is known. H(S) is the entropy of the class distribution over the training set S. Attribute A takes on values indexed by j, S_j denotes the training examples in which A takes on its j-th value, and H(S_j) is the entropy of the class distribution over S_j.
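A minimal sketch of this computation for a single discrete attribute (pure Python; in practice WEKA's attribute selection was used):

    # Sketch: information gain of one attribute, following the formula above.
    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(attr_values, labels):
        """IG(S, A) = H(S) - sum_j (|S_j| / |S|) * H(S_j)."""
        n = len(labels)
        gain = entropy(labels)
        for v in set(attr_values):
            subset = [y for x, y in zip(attr_values, labels) if x == v]
            gain -= (len(subset) / n) * entropy(subset)
        return gain

    # A word that perfectly separates the classes has maximal gain:
    print(information_gain([1, 1, 0, 0], ["good", "good", "bad", "bad"]))  # 1.0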


Figure 1 shows the effect of feature reduction on the error rate of Naive Bayes and SVM. These two classifiers were chosen because Naive Bayes is a weak learner whereas SVM is a high-variance learner. As the results show, SVM needs a large number of attributes to reach its optimal error rate, while Naive Bayes needs far fewer attributes to do so.


Fig 1. Comparing accuracy (%) for SVM and Naive Bayes for varying numbers of attributes

Classification

Learning Curve: Figure 2 shows the performance of Naive Bayes and SVM as the number of training examples is increased. As the figure shows, SVM overtakes Naive Bayes as the number of examples grows, while with very little data Naive Bayes performs better. The reason is that the space of functions representable by an SVM is much larger than that representable by Naive Bayes. SVM therefore needs a large amount of data to narrow down on the best possible function in this space, while Naive Bayes does so with only a little data; but because the SVM's function space is much richer, SVM also gets closer to the true hypothesis than Naive Bayes can.


Fig 2. Learning curve comparison for SVM and Naive Bayes

Effect of Kernels:
Table 3 shows the effect of the kernel on SVM accuracy using 1000 files. As the order of the polynomial increases, one can see a decrease in accuracy. This at first appeared counter-intuitive, but on closer examination I attributed the trend to the fact that the feature vector was already very large (on the order of 2500 features), and increasing the polynomial degree maps the data points to a still higher dimension. Given such a large feature vector, gaining accuracy in a higher dimension would require a commensurate increase in the number of input samples. However, running SVM classification on that much data is time-consuming, so I did not attempt to increase the data samples for the higher-order kernels.

Polynomial degree    SVM Accuracy %
1                    85.8
2                    85.8
3                    84
4                    80
5                    78.7

Table 3. Test performance for varying degrees of polynomial kernels

Performance using SVM with a Radial Basis Function kernel
I also performed 10-fold cross-validation using an SVM classifier with an RBF kernel to compare its performance with the polynomial kernels. The results in Table 4 below indicate that it does not outperform the SVM classifier with a polynomial kernel (in particular, the linear kernel).

Gamma    Accuracy % (RBF, C = 1)    Accuracy % (RBF, C = 2)
0.01     75.3                       79.3
0.1      84.6                       85.1
1.0      82.3                       83.7

Table 4. Performance of the SVM classifier with an RBF kernel for varying gamma and C

Here, C is the complexity (regularization) parameter and gamma is the width parameter of the RBF kernel, which controls how far the influence of a single training example reaches.

Overall, for the classification of a personality as good or bad (positive or negative), I found that SVM with a linear kernel worked best, giving an accuracy of 85.8%. SVM with an RBF kernel (C = 2, gamma = 0.1) performs nearly as well, giving an accuracy of 85.1%.
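A sketch of how such a kernel comparison can be run with 10-fold cross-validation; scikit-learn's SVC stands in for the SVM implementation used in the report, and X, y are the assumed feature matrix and class labels:

    # Sketch: 10-fold cross-validation over the kernels compared above.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def compare_kernels(X, y):
        configs = {
            "linear":          SVC(kernel="linear", C=1),
            "poly (degree 3)": SVC(kernel="poly", degree=3, C=1),
            "rbf (gamma 0.1)": SVC(kernel="rbf", gamma=0.1, C=2),
        }
        for name, clf in configs.items():
            scores = cross_val_score(clf, X, y, cv=10)
            print(f"{name}: {scores.mean():.3f}")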

A Note on Applications: With the proliferation of user-generated text content on the web in the form of comments, blogs, etc., a system that automatically segregates text into classes such as polite versus rude would greatly enhance the web experience.
 
2. Multi-Class Classification

To experiment with a larger number of classes, I sorted the top personality rating categories into 5 main buckets, viz. Rude, Kind, Funny, Fast and Crazy. Some of these buckets include more than one label, as shown below.

Rude : Rude, Bad, Horrible
Kind : Kind, Polite*, Nice*, Friendly, Good
Funny : Funny, Fun
Fast : Fast, Quick, Efficient
Crazy : Crazy

* A caveat with using the label Polite: the survey listed "polite" as one of the examples of a personality rating, and when skimming through game log files I found unjustified uses of this label. A typical example was "rude, polite, crazy". Out of politeness, or for lack of a better label, participants may simply have copied one of the sample labels.

I first tried this experiment with 1000 samples randomly picked from the set of 8000 log files. The result was a highly skewed set, with most of the samples belonging to the 'Kind' bucket, and the classification accuracy was only 34.2%.

In order to better represent each of the classes in the dataset, I then used equal priors for the 5 buckets, picking 200 files at random from each category. When picking files tagged Polite or Nice, I made sure there was at least one other tag from the Kind bucket, to reduce the possibility of the label being ambiguous.

Table 5 shows the results of classifying files into each of the 5 buckets, again using Naive Bayes and SVM.

Features    SVM Accuracy %    Naive Bayes Accuracy %
100         48.6              44.4
500         54.9              43.9
1000        55.9              41.1
2000        58                44.4
2300        59.3              39.3
2500        57.8              43.7
2900        57.8              44.4

Table 5. Multi-class classification accuracies using SVM and Naive Bayes for different numbers of attributes

From Table 5 we observe an accuracy as high as 59.3% for SVM. This result is reasonably good, considering that with equal priors chance accuracy would be 20%. Here again, Information Gain was used to reduce the number of attributes. The fact that the highest accuracy was obtained with fewer attributes than the total suggests that attribute selection using information gain removes noisy features that add no information to the overall classification.

3. Multi-label classification using Multiple One Versus All Classifiers

I experimented with one-versus-all classification because it could be a useful way of extending binary classification to multi-label classification problems. The idea is to build n one-versus-all classifiers and output the final classes based on the highest posterior probabilities; I also wanted to explore outputting the top 3 labels based on the 3 highest posteriors. To get a baseline accuracy for multi-label classification I considered the 5 classes used in the previous section (Rude, Kind, Funny, Fast and Crazy) and the intersections between them, so each instance could now belong to more than one class. The class of an instance was a five-bit string indicating the presence of these personality ratings; for example, if a player was rated Rude, Funny and Crazy, the class string for the player was "10101". The accuracy of random classification was 23%, while the accuracy of SVM with a linear kernel was 25.64%. This indicates that the naive approach of treating each class-label combination as a separate class may not be suitable for this problem. One reason this method fails could be that some class labels are dependent on others; for example, if a person is labelled as rude, it is highly unlikely that the same person will be labelled as kind.
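A sketch of the one-versus-all setup described above, with five independent binary classifiers trained from the five-bit class strings; scikit-learn is used for illustration, and probability=True enables the posterior estimates needed for the top-3 output:

    # Sketch: multi-label classification with five one-versus-all SVMs,
    # one per personality bucket.
    import numpy as np
    from sklearn.svm import SVC

    BUCKETS = ["Rude", "Kind", "Funny", "Fast", "Crazy"]

    def fit_ova(X, bitstrings):
        """One binary SVM per bucket, e.g. "10101" -> Rude, Funny, Crazy."""
        Y = np.array([[int(b) for b in s] for s in bitstrings])
        return [SVC(kernel="linear", probability=True).fit(X, Y[:, k])
                for k in range(len(BUCKETS))]

    def predict_top3(classifiers, x):
        """Return the 3 buckets with the highest posterior probabilities."""
        probs = [clf.predict_proba(x.reshape(1, -1))[0, 1]
                 for clf in classifiers]
        return [BUCKETS[k] for k in np.argsort(probs)[::-1][:3]]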


Speech Act Classification

The problem consists of classifying a speech act into one of several classes. A speech act is an utterance of a string of words. For the Restaurant Game data, a speech act was taken to be a continuous, uninterrupted input of text from a player, ranging from a single word to multiple sentences. The different classes of speech acts are [2]:
  1. Greeting
  2. Assertion
  3. Expressive
  4. Question
  5. Confirmation
  6. Directive
  7. Other
  8. Denial
  9. Promise
1. Attribute Selection:

The data set consisted of 4299 speech acts, each belonging to one of the above 9 classes. The bag-of-words representation was used to represent speech acts; the vocabulary size across the data set was 1942 words. To reduce the number of features, the top n words with the highest information gain were chosen. In all these experiments a Naive Bayes classifier was used to evaluate the number of features to use, and all multi-class accuracies reported in this section were computed using one-against-all classification.


Table 6 below shows the accuracy obtained using different numbers of features; the accuracies were obtained using ten-fold cross-validation. The accuracy obtained using all the features with Naive Bayes is 67.19%.

No. of words    Naive Bayes Accuracy %
5               42.60
25              57.42
50              61.93
75              64.90
100             66.03
150             66.823
200             66.829

Table 6. Speech act classification accuracy for different numbers of selected words


2. Active Learning

The labelled speech act data came from 100 files of game play, while the entire Restaurant Game dataset consists of 8000 files, so the currently labelled set represents a small fraction of the whole. To evaluate whether any form of active learning was required, I performed a small learning-curve experiment to check whether classification accuracy had saturated. Table 7 shows the accuracy obtained using different numbers of training instances out of the 4299 examples; the different training-set sizes were obtained by cross-validating with different numbers of folds. As the table shows, using more examples for training does not increase classification accuracy, so it was decided not to pursue labelling more examples through active learning.

Training Examples    Folds    J48       SVM
2150                 2        65.16%    74.84%
3440                 5        71.5%     -
3870                 10       76.17%    77.65%
4012                 15       74.67%    -
4084                 20       74.67%    -

Table 7. Accuracy of J48 and SVM for different training-set sizes (obtained by varying the number of folds)

J48 is Weka's module for constructing a C4.5 decision tree. Pruning was used to construct the tree. The confidence factor used for pruning was 0.25. The minimum number of instances per leaf was set to 2. The training data was divided into 3 folds. One fold was used in pruning the tree and the remaining two folds were used in growing the tree. The subtree raising operation was considered during pruning.

The SVM used was a linear kernel SVM with C set to 1. The training data was normalized. 
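A sketch of this fold-based learning-curve check; scikit-learn's DecisionTreeClassifier and LinearSVC stand in for WEKA's J48 and linear SVM, and X, y are the assumed features and labels:

    # Sketch: with k folds, each training run uses (k-1)/k of the data,
    # so sweeping k traces out a crude learning curve.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    def fold_learning_curve(X, y, fold_counts=(2, 5, 10, 15, 20)):
        for k in fold_counts:
            tree = DecisionTreeClassifier(min_samples_leaf=2)  # J48 analogue
            svm = LinearSVC(C=1)
            print(k, "folds:",
                  cross_val_score(tree, X, y, cv=k).mean(),
                  cross_val_score(svm, X, y, cv=k).mean())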

3. Classification Experiments

Adaboost
To increase the accuracy of Naive Bayes, I experimented with AdaBoostM1 using 100 features and a random sample one third the size of the whole dataset. Table 8 shows the accuracy of AdaBoostM1 as the number of iterations varies. The threshold on the weights was set to 100 to perform weight pruning, and no resampling was done.

Iterations    Accuracy %
1             66.61
50            68.7
100           68.72

Table 8. AdaBoostM1 accuracy with a Naive Bayes base classifier for varying numbers of iterations
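A sketch of this boosting setup; scikit-learn's AdaBoostClassifier with a multinomial Naive Bayes base learner is an analogue of WEKA's AdaBoostM1 (the weight-pruning threshold has no direct equivalent here):

    # Sketch: boosting a Naive Bayes learner over bag-of-words features.
    # The "estimator" keyword is "base_estimator" in older scikit-learn.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def boosted_nb_accuracy(X, y, iterations=50):
        clf = AdaBoostClassifier(estimator=MultinomialNB(),
                                 n_estimators=iterations)
        return cross_val_score(clf, X, y, cv=10).mean()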

Support Vector Machines
I also experimented with the kernel of the SVM algorithm. The table given below shows the effect of changing the degree of the polynomial kernel on accuracy. As can be seen from the table, the linear kernel performs best. One explanation might be that the input data is already high-dimensional, and projecting it into still higher dimensions decreases accuracy, since even more data would be needed to find the best SVM in those higher dimensions.



4. Incorporating Context

Given that a speech act belongs to the "Question" class, one might infer that the probability of the next speech act belonging to the "Assertion" class is high. Similarly, given that the current speech act is a "Greeting", there is a high probability that the next speech act also belongs to the "Greeting" class. To take this intuitive notion of context into account, I tried modifying the feature vector of a speech act with information from past speech acts. None of my initial results were positive. Of the two combinations I tried, one augmented the feature vector with a scaled feature vector of the past speech act; the other represented the current feature vector as a linear combination of the past and current feature vectors.

One method that was not tried is to use the classification of the previous speech act, together with transition probabilities estimated from the training data, to modify the posterior probabilities. For example, if the previous speech act was classified as "Greeting", one might incorporate P("Greeting" followed by class j) into the current posterior for class j.
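A sketch of this untried method: estimate P(next class | previous class) by counting consecutive class pairs in the labelled dialogues, then multiply the current posteriors by the transition probability from the previous class and renormalize. Names and data structures here are illustrative:

    # Sketch: transition-probability rescoring of speech act posteriors.
    from collections import Counter, defaultdict

    def transition_probs(label_sequences):
        """P(next | prev) from consecutive speech act labels."""
        counts = defaultdict(Counter)
        for seq in label_sequences:
            for prev, nxt in zip(seq, seq[1:]):
                counts[prev][nxt] += 1
        return {p: {n: c / sum(ctr.values()) for n, c in ctr.items()}
                for p, ctr in counts.items()}

    def rescore(posteriors, prev_class, trans, floor=1e-3):
        """Multiply P(class | features) by P(class | prev) and renormalize."""
        scores = {c: p * trans.get(prev_class, {}).get(c, floor)
                  for c, p in posteriors.items()}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}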


Implementation Details

I used WEKA, a Java-based toolbox, for classification, and NLTK, which is Python-based, for text processing.

References

[1] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. New York, NY: John Wiley & Sons.

[2] R. Bekkerman and J. Allan, "Using Bigrams in Text Categorization."

[3] Andrew W. Moore, "Support Vector Machines" (tutorial), School of Computer Science, Carnegie Mellon University.