EXPERIMENTS ON THE RESTAURANT GAME
Introduction: The Restaurant Game
Pattern Classification Problems
Game Personality Classification
Binary Classification
Feature Extraction
Preprocessing
Classification
Applications
Multi-Class Classification
Multi-label Classification using OVA
Speech Act Classification
Attribute Selection
Active Learning
Classification Experiments
Incorporating Context
Implementation Details
Introduction: The Restaurant Game
This report deals with two pattern recognition problems derived from the context of the Restaurant Game [1].
The Restaurant Game is an online two-player game: one player assumes the role of a waitress and the other assumes the role of a customer, and the two interact in a restaurant. The goal of the game is to study social interactions in order to better understand and model in-game AI characters.
Each game consists of a session in which a transaction takes place in the restaurant: ordering food, paying, and other social interactions. The conversation between the waitress and the customer is logged to a file. As of [date] there are 8000 log files corresponding to 8000 unique games. At the end of the game the players also rate each other's personality. The ratings were typed directly into a text box and hence represent an open choice of possible personalities. The following table shows some of the most frequently occurring ratings for a waitress; less common ratings are not included.
Rating | Frequency
polite | 3200
nice | 1560
friendly | 432
funny | 421
crazy | 410
kind | 367
rude | 314
slow | 304
fast | 291
quick | 289
good | 279
Table 1. Popular personality classifications for the waitress
Pattern Classification Problems
This report details the exploration of two pattern classification problems associated with the Restaurant Game. The first is the classification of a player's personality into a set of classes based on the conversation and actions of the players. The second is the classification of the players' "speech acts" into a set of possible speech act types. A speech act is a sequence of words which makes sense and occurs in speech (spoken, written, etc.). For this problem I considered each continuous stream of typed-in text to be a speech act.
Game Personality Classification
Pattern Classification Problem: to classify the personality of players based on their conversation and actions.
Three methods of classification were attempted: binary classification, multi-class classification, and multi-label classification using multiple one-versus-all classifiers. All three used the same method of feature extraction and preprocessing.
1. Binary Classification
Reduction of the class space: Players rated each other's personality using three tags, with no restriction on the text of the rating. The most common ratings are given in Table 1. Because the rating text was unrestricted, some players input meaningless ratings such as "????" and "asdf". The first task at hand was to reduce the number of classes to a meaningful set. To start off, one set of ratings was grouped into "good" and another set into "bad". The ratings which fell under each group are given in Table 2.
Good | Bad
polite, nice, kind, friendly, helpful, good | rude, bad, horrible
Table 2. Ratings grouped into the "good" and "bad" classes
Feature Extraction:
The two main sources of features in this pattern recognition problem fall broadly under conversation and actions. Conversation was modelled using the bag-of-words approach. Actions were restricted to the built-in actions each of the characters could perform (examples include GOTO, STOP, PICKUP, PUTDOWN, GIVE, SITON, EAT, etc.).
Features from conversation: The bag-of-words representation with TF-IDF weighting was used. No stemming was done. The NLTK Python toolkit was used to extract words from the game log files. Each personality instance was represented by a vector whose i-th component held the TF-IDF score for the i-th word: v_i = tf_i * idf_i. The vocabulary size was 2902.
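As an illustration, here is a minimal sketch of this bag-of-words/TF-IDF representation using scikit-learn's TfidfVectorizer rather than the NLTK-based pipeline used in the experiments; the log directory and file layout are assumptions.

# Minimal sketch of the bag-of-words + TF-IDF representation described above.
# Uses scikit-learn's TfidfVectorizer instead of the NLTK pipeline the report
# used; the log-file loading is a hypothetical stand-in for the real layout.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

def load_conversations(log_dir):
    """Read each game log as one document (hypothetical file layout)."""
    return [p.read_text(errors="ignore") for p in Path(log_dir).glob("*.log")]

docs = load_conversations("restaurant_logs")  # one string per game
vectorizer = TfidfVectorizer(lowercase=True)  # no stemming, as in the report
X = vectorizer.fit_transform(docs)            # component v_i = tf_i * idf_i
print(X.shape)                                # (n_games, vocabulary size)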
Features derived from actions:
For the waitress, I considered the act of serving a customer an item he orders as yielding a positive score:
{order --> waitress PUTSDOWN --> customer EATS : +1}
Requiring the full sequence ensures that if the waitress puts down an item which the customer does not eat (chances are the customer did not order that item, or ordered it and then changed his mind and ordered something else), the waitress is deemed not to have been attentive enough to the customer's order and earns a negative point (-1).
For the customer, after going through a number of log files, I found that the most striking action affecting the customer's personality rating was the act of eating something he didn't order, for instance eating the flowers on the table or eating trash. For this experiment I used this action to penalize the customer's behavior:
{customer EATS item --> if (item = non-menu item) : -1}
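The sketch below shows how these two rules could be computed from a parsed log; the (actor, verb, item) event format and the menu set are hypothetical stand-ins for the actual log schema.

# Sketch of the two action-scoring rules above. The event tuples and the
# MENU set are assumptions about the parsed log, not the actual schema.
MENU = {"steak", "salmon", "pie", "beer", "water"}  # hypothetical menu items

def waitress_score(events, orders):
    """+1 when an ordered item is put down and then eaten, -1 otherwise."""
    score = 0
    for i, (actor, verb, item) in enumerate(events):
        if actor == "waitress" and verb == "PUTDOWN":
            eaten = any(a == "customer" and v == "EAT" and it == item
                        for a, v, it in events[i + 1:])
            score += 1 if (item in orders and eaten) else -1
    return score

def customer_score(events):
    """-1 each time the customer eats a non-menu item (flowers, trash, ...)."""
    return -sum(1 for actor, verb, item in events
                if actor == "customer" and verb == "EAT" and item not in MENU)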
Future
work in the area of feature extraction for this project could include
extracting other actions that may influence the personality
classification of the character.
Preprocessing: The vocabulary size was too large to allow meaningful and fast experiments. To reduce the number of attributes, the information gain of each attribute was computed and attributes with low information gain were filtered out. The information gain of a feature is, intuitively, the amount of information that the feature contributes to the class if it is included. It is computed as the difference between the entropy of the class distribution without the feature and with it:

InformationGain(S, A) = H(S) - Σ_j (|S_j| / |S|) H(S_j)

InformationGain(S, A) is the information gain if attribute A is known. H(S) is the entropy of the class distribution. Attribute A takes on values indexed by j, S_j is the set of training examples in which A takes on its j-th value, and H(S_j) is the entropy of the class distribution over S_j.
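The formula can be implemented directly, as in the sketch below (the experiments presumably used WEKA's information-gain attribute filter; this is an illustrative stand-alone version).

# Direct implementation of the information-gain formula above for one
# discrete attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """InfoGain(S, A) = H(S) - sum_j (|S_j| / |S|) * H(S_j)."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [y for x, y in zip(attribute_values, labels) if x == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Example: a word-presence feature against good/bad personality labels.
present = [1, 1, 0, 0, 1, 0]
label = ["good", "good", "bad", "bad", "good", "bad"]
print(information_gain(present, label))  # 1.0 bit: perfectly informative here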
Figure 1 shows the effect of feature reduction on the error rates of Naive Bayes and SVM. These two classifiers were chosen because Naive Bayes is a weak learner whereas SVM is a high-variance learner. As the results show, SVM needs a large number of attributes to reach its optimal error rate, while Naive Bayes needs fewer attributes to do so.

Fig 1. Comparing accuracy (%) for SVM and Naive Bayes for varying numbers of attributes

Classification
Learning curve: The following figure shows the performance of Naive Bayes and SVM as the number of training examples increases. As can be seen from the figure, SVM outperforms Naive Bayes once enough examples are available, while with very little data Naive Bayes performs better. The reason is that the space of functions representable by an SVM is much larger than that representable by Naive Bayes: the SVM therefore needs a large amount of data to narrow down on the best function in this space, whereas Naive Bayes does so with only a little data. But because the SVM's hypothesis space is so much richer, it also gets closer to the true target function than Naive Bayes can.

Fig 2. Learning curve comparison for SVM and Naive Bayes
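A learning curve like the one in Fig 2 could be generated as follows with scikit-learn (the report's own curves came from its WEKA/NLTK pipeline); the synthetic data stands in for the TF-IDF features and good/bad labels.

# Sketch of the Naive Bayes vs. SVM learning-curve comparison. The synthetic
# data is a stand-in for the real TF-IDF matrix and good/bad labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
X = np.abs(X)  # MultinomialNB needs non-negative feature values

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    sizes, _, test_scores = learning_curve(
        clf, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    print(name, dict(zip(sizes, test_scores.mean(axis=1).round(3))))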
Effect of kernels:
Table 3 shows the effect of polynomial kernels on SVM accuracy using 1000 files. As the order of the polynomial increases, accuracy decreases. This at first appeared counter-intuitive to me, but on closer examination I attributed the trend to the fact that the feature vector was already very large (on the order of 2500 features), and increasing the polynomial degree maps the data points to an even higher dimension. Given such a large feature vector, gaining any accuracy from that mapping would require a commensurate increase in the number of input samples. However, running SVM classification on this data is time-consuming, so I did not attempt to increase the number of data samples for the higher-order kernels.
Polynomial degree | SVM accuracy (%)
1 | 85.8
2 | 85.8
3 | 84
4 | 80
5 | 78.7
Table 3. Test performance for varying degrees of polynomial kernels
Performance using SVM with a radial basis function kernel
I also performed 10-fold cross-validation using an SVM classifier with an RBF kernel to compare its performance with the polynomial kernels. The results in Table 4 below indicate that it does not outperform the SVM classifier with a polynomial kernel (in particular, the linear kernel).
Gamma | SVM accuracy (%), RBF kernel, C = 1 | SVM accuracy (%), RBF kernel, C = 2
0.01 | 75.3 | 79.3
0.1 | 84.6 | 85.1
1.0 | 82.3 | 83.7
Table 4. Performance of the SVM classifier with an RBF kernel for varying gamma and C
Here, C is the complexity parameter and gamma is the width parameter of the RBF kernel, exp(-gamma * ||x - x'||^2).
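A sweep over C and gamma like the one in Table 4 could be run with scikit-learn's GridSearchCV, as sketched below on synthetic stand-in data; the report ran the equivalent search with WEKA.

# Sketch of the C/gamma sweep from Table 4 with 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and good/bad labels.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

param_grid = {"C": [1, 2], "gamma": [0.01, 0.1, 1.0]}  # the grid of Table 4
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)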
Overall, for the classification of a personality as good or bad (positive or negative), I found that SVM with a linear kernel worked best, giving an accuracy of 85.8%. SVM with an RBF kernel (C = 2, gamma = 0.1) performs nearly as well, giving an accuracy of 85.1%.
A Note on Applications:
With the proliferation of user-generated text on the web in the form of comments, blogs, etc., a system that automatically segregates text into classes such as polite vs. impolite or rude would greatly enhance the web experience.
2. Multi-Class Classification
To experiment with a larger number of classes I sorted the top personality rating categories into 5 main buckets, viz. Rude, Kind, Funny, Fast, and Crazy. Some of these include more than one label, as shown below.
Rude : Rude, Bad, Horrible
Kind : Kind, Polite*, Nice*, Friendly, Good
Funny : Funny, Fun
Fast : Fast, Quick, Efficient
Crazy : Crazy
* A caveat with using the label Polite: the survey listed "polite" as one of the examples of a personality rating, and when skimming through game log files I found unjustified use of this label. A typical example was "rude, polite, crazy". Participants might have included one of the sample labels simply to be polite, or for lack of a better label.
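The bucket mapping, together with a guard against these ambiguous sample labels, could be expressed as follows; the exact label-cleaning rules used in the experiments are not specified, so this is illustrative.

# Sketch of mapping raw free-text ratings to the five buckets, with a guard
# against the ambiguous survey-example labels "polite" and "nice".
BUCKETS = {
    "rude":  {"rude", "bad", "horrible"},
    "kind":  {"kind", "polite", "nice", "friendly", "good"},
    "funny": {"funny", "fun"},
    "fast":  {"fast", "quick", "efficient"},
    "crazy": {"crazy"},
}
AMBIGUOUS = {"polite", "nice"}  # survey examples, often used carelessly

def buckets_of(ratings):
    """Return the buckets hit by a player's three free-text ratings."""
    tags = {r.strip().lower() for r in ratings}
    hits = {b for b, words in BUCKETS.items() if tags & words}
    # Keep 'kind' only if a non-ambiguous label also supports it.
    if "kind" in hits and not (tags & (BUCKETS["kind"] - AMBIGUOUS)):
        hits.discard("kind")
    return hits

print(buckets_of(["Polite", "crazy", "rude"]))  # {'crazy', 'rude'}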
I first tried this experiment with 1000 samples randomly picked from the set of 8000 log files. This produced a highly skewed set, with most of the samples belonging to the 'Kind' bucket, and a classification accuracy of only 34.2%.
In order to better represent each of the classes in the dataset, I then used equal priors for the 5 buckets, picking 200 files from each category at random. When picking files tagged Polite or Nice, I made sure there was at least one other tag from the Kind bucket, to reduce the possibility of it being an ambiguous label.
Table 5 shows the results of classifying files into each of the 5 buckets, again using Naive Bayes and SVM.
Features | SVM accuracy (%) | Naive Bayes accuracy (%)
100 | 48.6 | 44.4
500 | 54.9 | 43.9
1000 | 55.9 | 41.1
2000 | 58 | 44.4
2300 | 59.3 | 39.3
2500 | 57.8 | 43.7
2900 | 57.8 | 44.4
Table 5. Multi-class classification accuracies using SVM and Naive Bayes for different numbers of attributes
From Table 5 we observe an accuracy as high as 59.3% for SVM. This result is reasonably good, considering that with equal priors random guessing would achieve 20%. Here again, when experimenting with the number of features, I used information gain to reduce the number of attributes. The fact that the highest accuracy was obtained with fewer attributes than the total shows that attribute selection via information gain removes noisy features which add no information to the overall classification.
3. Multi-Label Classification Using Multiple One-Versus-All Classifiers
I experimented with one-versus-all classification because it is a natural way of extending binary classification to multi-label problems. The idea is to build n one-versus-all classifiers and then output the final classes based on the highest posterior probabilities. I also wanted to explore the possibility of outputting the top 3 labels based on the 3 highest posterior probabilities. To get a baseline accuracy for multi-label classification I considered the 5 classes used in the previous section (Rude, Kind, Funny, Fast, and Crazy) and the intersections between them, so each instance could now belong to more than one class. The class of an instance was a five-bit string indicating the presence of these personality ratings: for example, if a player was rated Rude, Funny, and Crazy, the class string for the player was "10101". The accuracy of random classification was 23%; the accuracy of SVM with a linear kernel was 25.64%. This indicates that the naive approach of treating each class-label combination as a separate class may not be suitable for this problem. One reason this method fails could be that some of the class labels depend on the others: for example, if a person is labelled rude it is highly unlikely that the same person will also be labelled kind.
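For reference, here is a sketch of the one-versus-all scheme with per-label posteriors and a top-3 readout, using scikit-learn on synthetic stand-in data; logistic regression stands in for the SVMs because it exposes posterior probabilities directly.

# Sketch of one-versus-all multi-label classification over the five buckets.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["Rude", "Kind", "Funny", "Fast", "Crazy"]

# Synthetic stand-in: rows of Y are the 5-bit label vectors, e.g. [1,0,1,0,1]
# for "10101" (Rude, Funny, Crazy).
X, Y = make_multilabel_classification(n_samples=500, n_features=50,
                                      n_classes=5, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
probs = clf.predict_proba(X[:1])[0]  # one posterior per label
top3 = np.argsort(probs)[::-1][:3]   # the 3 most probable labels
print([LABELS[i] for i in top3])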
Speech Act Classification
The problem consists of classifying a speech act into one of many classes. A speech act is an utterance of a string of words; for the Restaurant Game data, a speech act was taken to be a continuous, uninterrupted input of text from a player, ranging from single words to multiple sentences. The different classes of speech acts are [2]:
- Greeting
- Assertion
- Expressive
- Question
- Confirmation
- Directive
- Other
- Denial
- Promise
1. Attribute Selection:
The data set consisted of 4299 speech acts, each belonging to one of the above 9 classes. The bag-of-words representation was used to represent speech acts. The vocabulary size across the data set was 1942 words. To reduce the number of features, the top n words with the highest information gain were chosen. In all these experiments a Naive Bayes classifier was used to evaluate the number of features to use. All the multi-class accuracies reported in this section were computed using one-against-all classification.
Table 6 below shows the accuracy obtained with different numbers of features, computed using ten-fold cross-validation. The accuracy obtained using all the features with Naive Bayes is 67.19%.
No. of words | Naive Bayes accuracy (%)
5 | 42.60
25 | 57.42
50 | 61.93
75 | 64.90
100 | 66.03
150 | 66.823
200 | 66.829
Table 6. Naive Bayes accuracy for varying numbers of features
2. Active Learning
The labelled data for speech acts came from 100 files of game play, while the entire dataset for the Restaurant Game consists of 8000 game-play files: the currently labelled set represents a small fraction of the whole. To evaluate whether any form of active learning was required, I performed a small learning-curve experiment to check whether the classification accuracy had saturated. Table 7 shows the accuracy obtained with different numbers of training instances out of the 4299 examples; the different training-set sizes were obtained by running cross-validation with different numbers of folds. As the table shows, using more examples for training does not increase the classification accuracy, so it was decided not to pursue labelling more examples via active learning.
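These training-set sizes follow directly from the fold count: with k-fold cross-validation each model is trained on N(k - 1)/k of the N = 4299 examples, e.g. 4299 * 1/2 ≈ 2150 for k = 2 and 4299 * 9/10 ≈ 3870 for k = 10, matching the rows below.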
Training examples | Folds | J48 accuracy | SVM accuracy
2150 | 2 | 65.16% | 74.84%
3440 | 5 | 71.5% | -
3870 | 10 | 76.17% | 77.65%
4012 | 15 | 74.67% | -
4084 | 20 | 74.67% | -
Table 7. Learning-curve accuracies for J48 and SVM with varying numbers of cross-validation folds
J48 is Weka's module for constructing a C4.5 decision tree. Pruning was used to construct the tree, with a confidence factor of 0.25 and a minimum of 2 instances per leaf. The training data was divided into 3 folds: one fold was used in pruning the tree and the remaining two were used in growing it. The subtree-raising operation was considered during pruning.
The SVM used a linear kernel with C set to 1. The training data was normalized.
3. Classification Experiments
AdaBoost
To increase the accuracy of Naive Bayes I experimented with AdaBoostM1 using 100 features and a random sample one third the size of the whole dataset. Table 8 below shows the accuracy of AdaBoostM1 as a function of the number of iterations. The threshold on the weights was set to 100 to perform weight pruning, and no resampling was done.
Iterations | AdaBoostM1 (Naive Bayes) accuracy (%)
1 | 66.61
50 | 68.7
100 | 68.72
Table 8. AdaBoostM1 accuracy with a Naive Bayes base learner for varying numbers of iterations
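A rough scikit-learn analogue of this experiment is sketched below on synthetic stand-in data; WEKA's AdaBoostM1 weight-pruning threshold has no direct scikit-learn equivalent, so that option is omitted.

# Sketch of boosting Naive Bayes with AdaBoost (scikit-learn >= 1.2 keyword).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative stand-in for the 100-feature bag-of-words data.
X, y = make_classification(n_samples=1400, n_features=100, random_state=0)
X = np.abs(X)

for n in (1, 50, 100):  # iteration counts from Table 8
    booster = AdaBoostClassifier(estimator=MultinomialNB(), n_estimators=n)
    print(n, cross_val_score(booster, X, y, cv=10).mean().round(3))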
Support Vector Machines
I also experimented with the kernel of the SVM. Varying the degree of the polynomial kernel showed that the linear kernel performs best. One explanation is that the input data is already high-dimensional in nature, and projecting it into even higher dimensions decreases accuracy, since even more data would be needed to find the best SVM there.
4. Incorporating Context
Given that a speech act belongs to the "Question" class, one might infer that the probability of the next speech act belonging to the "Assertion" class is high. Similarly, given that the current speech act is a "Greeting", there is a high probability that the next speech act is also a "Greeting". To capture this intuitive notion of context I tried modifying the feature vector of a speech act with information from past speech acts. None of my initial results were positive. Of the two combinations I tried, one augmented the feature vector with a scaled feature vector of the past speech act; the other represented the current feature vector as a linear combination of the past and current feature vectors.
One method which was not tried is to use the classification of the previous speech act, together with transition probabilities estimated from the training data, to modify the posterior probabilities. For example, if the previous speech act was classified as "Greeting", one might incorporate P("Greeting" followed by class j) into the current posterior for class j.
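A sketch of this untried idea, assuming access to the classifier's posterior vector and the labelled training sequences:

# Sketch: reweight the current posterior by transition probabilities
# estimated from the training label sequences (the untried method above).
import numpy as np

CLASSES = ["Greeting", "Assertion", "Expressive", "Question", "Confirmation",
           "Directive", "Other", "Denial", "Promise"]

def transition_matrix(label_sequences, k=len(CLASSES)):
    """Estimate P(class j follows class i) with add-one smoothing."""
    counts = np.ones((k, k))
    for seq in label_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[CLASSES.index(a), CLASSES.index(b)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def contextual_posterior(posterior, prev_class, T):
    """Combine the classifier's posterior with the transition prior."""
    combined = posterior * T[CLASSES.index(prev_class)]
    return combined / combined.sum()

T = transition_matrix([["Greeting", "Greeting", "Question", "Assertion"]])
flat = np.full(len(CLASSES), 1 / len(CLASSES))    # uninformative posterior
print(contextual_posterior(flat, "Greeting", T))  # boosts Greeting, Question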
Implementation Details
I used the Java-based toolbox WEKA for classification. For text processing I used NLTK, which is Python-based.
References
[1] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. New York, NY: John Wiley & Sons.
[2] R. Bekkerman and J. Allan, Using Bigrams in Text Categorization.
[3] Andrew W. Moore, Support Vector Machines (Tutorial), School of Computer Science, Carnegie Mellon University.