Empaceptor

Empaceptor performs a series of operations on the text it is given to build up a set of associations. Empaceptor makes use of two main packages to do its text processing: the General Architecture for Text Engineering (GATE) package, and WordNet. GATE is a very powerful and full-featured set of tools for performing various kinds of information extraction, among other things. It provides word tokenizing, sentence splitting, part-of-speech tagging, and entity extraction out of the box, and it can be easily extended to extract other kinds of information. The capabilities of WordNet have been described in depth in many other sources.

The main premise of the Empaceptor text processing is simple: words that occur in the same sentence as the base set of emotional keywords should take on a partial meaning of the nearby keyword. The more often a word occurs in one emotional context, the stronger the association should be, and a single word may take on many different associations.

Before text can be processed, some setup is required. An editable list of emotional keywords, the base set of emotions, is loaded. Each keyword is looked up in the WordNet dictionary to retrieve the associated synsets. This allows for much broader coverage than simply using the original keywords to search. The initial set of 68 keywords expands to nearly 200 synsets. Once these are loaded, text can be processed either as a document to be learned from, or as a document to be scored for emotional content. Here is how text is processed in learning mode:

The document, as a Java String object, is handed to the GATE text processing system.
GATE returns a Document object which contains a set of Annotations of different types such as Token, Whitespace, Sentence, Person, etc.
The Empaceptor code walks this Document, finding all of the Tokens that represent words (as opposed to punctuation or numbers).
Each word Token is passed to WordNet to retrieve its synsets (adjectival only).
The synsets of the Token are compared with the pre-loaded emotional synsets to see if there is overlap. If so, every word in the Sentence (whose boundaries are determined by the GATE annotations) is marked as having that emotional content. The occurrence count is also incremented for each of the Tokens. If more than one emotional word occurs in the sentence, all of the words are counted for each emotional occurrence.
If none of the Tokens in a Sentence belong to the emotional synsets, then the words are simply marked as having occurred.
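
The learning steps above can be sketched in a few lines of Java. This is only an illustration, not the Empaceptor source: the class name AssociationLearner and its fields are hypothetical, sentences are assumed to arrive already split into word tokens (the job GATE does in the real system), and a plain word-to-emotion map stands in for the WordNet synset overlap test. The sketch counts each distinct emotion once per sentence, which is one possible reading of the multi-emotion rule.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of learning mode; names are illustrative only. */
final class AssociationLearner {
    // word -> total number of occurrences seen during training
    final Map<String, Integer> totalCounts = new HashMap<>();
    // emotion -> (word -> occurrences in sentences carrying that emotion)
    final Map<String, Map<String, Integer>> emotionCounts = new HashMap<>();
    // Assumed stand-in for the synset overlap test: lowercase word -> emotion
    private final Map<String, String> emotionLexicon;

    AssociationLearner(Map<String, String> emotionLexicon) {
        this.emotionLexicon = emotionLexicon;
    }

    /** Learns from one sentence, given as its word tokens. */
    void learnSentence(List<String> words) {
        // First pass: find which emotions are present in this sentence.
        Set<String> emotionsHit = new LinkedHashSet<>();
        for (String word : words) {
            String emotion = emotionLexicon.get(word.toLowerCase(Locale.ROOT));
            if (emotion != null) {
                emotionsHit.add(emotion);
            }
        }
        // Second pass: every word is marked as having occurred, and is
        // additionally credited to each emotion found in the sentence.
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            totalCounts.merge(key, 1, Integer::sum);
            for (String emotion : emotionsHit) {
                emotionCounts.computeIfAbsent(emotion, e -> new HashMap<>())
                             .merge(key, 1, Integer::sum);
            }
        }
    }
}
```

A sentence with no emotional word only touches totalCounts, which matches the last step of the list above.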

The learned associations are stored in a simple XML format so that the server can be started and stopped without losing previous associations. The associations can be reset by simply discarding the XML file. Later in the project, the learning step was slightly modified to note whether a given Token was also marked with a Person annotation. The assumption was that if the name of a person occurs in the same sentence as an emotion, it is more likely that the emotion is related to that Person. Therefore, Person tokens are counted as twice as likely to be emotional when occurring with an emotion, as compared to a regular token.
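
As an illustration of persisting counts to XML, here is a minimal sketch that uses the XML support of java.util.Properties. The actual Empaceptor file format is not described in this report, so the class name AssociationStore and the flat "emotion/word" key scheme are assumptions made only for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical XML persistence sketch; not the real Empaceptor format. */
final class AssociationStore {
    /** Writes counts as XML; keys are assumed to look like "happy/car". */
    static void save(Map<String, Integer> counts, OutputStream out) {
        Properties props = new Properties();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            props.setProperty(entry.getKey(), Integer.toString(entry.getValue()));
        }
        try {
            props.storeToXML(out, "association counts (illustrative)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads counts back from XML previously written by save. */
    static Map<String, Integer> load(InputStream in) {
        Properties props = new Properties();
        try {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            counts.put(key, Integer.parseInt(props.getProperty(key)));
        }
        return counts;
    }
}
```

With a store like this, resetting the system amounts to deleting the file, as described above.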

With the trained system, it is possible to determine the emotional valence of any word:

If a word did not occur in the training corpus, it is given a score of 0.0 for every emotional category.
For any other word, the score is the number of emotional occurrences divided by the total number of occurrences.

Here is an example, assuming that the emotional list contains only 'happy'.

I am happy about my new car. I put it in my garage.

Here is the resulting table:

Word	Happy Weight	Word	Happy Weight
I	1/2 (0.5)	car	1/1 (1.0)
am	1/1 (1.0)	put	0/1 (0.0)
happy	1/1 (1.0)	it	0/1 (0.0)
about	1/1 (1.0)	in	0/1 (0.0)
my	1/2 (0.5)	garage	0/1 (0.0)
new	1/1 (1.0)

One feature of this system, which is beginning to be evident in this table, is that common words such as 'I' are used so frequently that they lose any emotional association.
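
The word-level score and the example table above can be reproduced with a small Java sketch. It is a simplification under stated assumptions: the class name HappyWeights is hypothetical, the only emotion is 'happy', the literal word "happy" replaces the synset lookup, and each call to learn receives exactly one sentence.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical single-emotion sketch of the word score. */
final class HappyWeights {
    private final Map<String, Integer> total = new HashMap<>();
    private final Map<String, Integer> happy = new HashMap<>();

    /** Learns from exactly one sentence of plain text. */
    void learn(String sentence) {
        String[] words = sentence.toLowerCase(Locale.ROOT).split("[^a-z']+");
        boolean sentenceIsHappy = Arrays.asList(words).contains("happy");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            total.merge(word, 1, Integer::sum);
            if (sentenceIsHappy) {
                happy.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Emotional occurrences divided by total occurrences; 0.0 if unseen. */
    double score(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        int occurrences = total.getOrDefault(key, 0);
        if (occurrences == 0) {
            return 0.0;
        }
        return happy.getOrDefault(key, 0) / (double) occurrences;
    }
}
```

Calling learn on each of the two example sentences and then score("I") gives 1/2, score("car") gives 1/1, and score("garage") gives 0/1, in line with the table.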

To score an entire document:

The document is chopped up in a quick and dirty fashion (GATE is not used) into a set of tokens.
Each word token is passed to the scoring algorithm above, once for each of the available emotional synsets.
The score for each token for each emotion is aggregated, and then divided by the number of tokens, in order to normalize for the length of the document. This gives a final rating for the document for each of the available emotions.
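
A rough sketch of the document scoring steps above follows. The class name DocumentScorer is hypothetical, the per-word scores are assumed to have been computed already, and the tokenizer (split on anything that is not a letter or apostrophe) is only a guess at what "quick and dirty" might look like.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of length-normalized document scoring. */
final class DocumentScorer {
    // emotion -> (lowercase word -> word score between 0 and 1)
    private final Map<String, Map<String, Double>> wordScores;

    DocumentScorer(Map<String, Map<String, Double>> wordScores) {
        this.wordScores = wordScores;
    }

    /** Returns one rating per emotion for the whole document. */
    Map<String, Double> score(String document) {
        // Crude tokenization: no GATE, just split on non-letters.
        List<String> tokens = new ArrayList<>();
        for (String token : document.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        Map<String, Double> ratings = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> entry : wordScores.entrySet()) {
            double sum = 0.0;
            for (String token : tokens) {
                // Words never seen in training contribute 0.0.
                sum += entry.getValue().getOrDefault(token, 0.0);
            }
            // Divide by the token count to normalize for document length.
            ratings.put(entry.getKey(), tokens.isEmpty() ? 0.0 : sum / tokens.size());
        }
        return ratings;
    }
}
```

Because the sum is divided by the number of tokens, a long message with one emotional word rates lower than a short one, and very short messages can swing sharply on a single word.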

Interacting with Empaceptor - mbox import

Importing of mbox files is relatively straightforward. A reference to the file itself is passed to Empaceptor. Empaceptor uses a library from the GNU project that implements one of the Java Mail API service provider interfaces. The library parses the mbox file and returns an array of Message objects. These Message objects must be further parsed to throw away most of the header information, basically everything except the Subject: line. The rest of the data is formatted as a String and handed to the learning algorithm above. This service can be invoked from the EmpaceptorGUI described below.
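
The header-stripping step can be illustrated without the mail library. The sketch below is a standard-library stand-in that works on the raw text of a single message rather than on Java Mail Message objects; the class name MessageText is hypothetical, and folded (multi-line) headers are deliberately ignored to keep the example short.

```java
/** Hypothetical sketch: keep only the Subject line and the body. */
final class MessageText {
    private static final String SUBJECT = "Subject:";

    /** Turns one raw message into the text handed to the learner. */
    static String toLearningText(String rawMessage) {
        String normalized = rawMessage.replace("\r\n", "\n");
        // Headers end at the first blank line.
        int headerEnd = normalized.indexOf("\n\n");
        String headers = headerEnd < 0 ? normalized : normalized.substring(0, headerEnd);
        String body = headerEnd < 0 ? "" : normalized.substring(headerEnd + 2);

        String subject = "";
        for (String line : headers.split("\n")) {
            // Header names are case-insensitive.
            if (line.regionMatches(true, 0, SUBJECT, 0, SUBJECT.length())) {
                subject = line.substring(SUBJECT.length()).trim();
                break;
            }
        }
        return subject.isEmpty() ? body : subject + "\n" + body;
    }
}
```

With real Message objects the same information would presumably come from the message's subject and content accessors instead of string slicing.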

Interacting with Empaceptor - email service

In order to invoke the email processing service, Empaceptor must be used with an email program that supports the execution of arbitrary shell commands as filter steps. The Empaceptor Java files must be placed in a standard location that can be referenced from the email client. When an incoming message is received, the email program starts the email client service class and pipes the message to it on standard input, passing the desired emotion for scoring as an argument to the email service class. The email service reads the desired emotion argument and the message from the incoming pipe, and then opens a standard TCP/IP socket to Empaceptor. The client service sends the information, and waits for a response of either "yes" or "no" from Empaceptor. Empaceptor uses the scoring algorithm described above to determine the strength of each emotion in the text. It then looks up the value for the desired emotion, and compares it with a configurable threshold. It returns "yes" if the value is above the threshold, and "no" otherwise. Empaceptor does not attempt to learn from messages coming through the email service.

If the client gets back a "yes", it exits with code '1', otherwise it exits with code '0'. If something goes wrong, it exits with code '15' to avoid confusion. The email program can then use the result to take an action, such as tagging the message with a color. The server that accepts incoming requests from the email service clients can be started and stopped from the EmpaceptorGUI described below.
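
The client side of this exchange might look roughly like the following sketch. The exit codes come from the description above, but everything else is an assumption: the class name EmailServiceClient, the host and port parameters, and the wire format (emotion on the first line, then the message, then end of stream) are invented for illustration, since the real protocol is not documented here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the email service client. */
final class EmailServiceClient {
    static final int EXIT_NO_MATCH = 0;
    static final int EXIT_MATCH = 1;
    static final int EXIT_ERROR = 15;

    /** Maps the server reply to the exit code seen by the email program. */
    static int exitCodeFor(String reply) {
        if (reply == null) {
            return EXIT_ERROR; // connection closed without an answer
        }
        return "yes".equals(reply.trim()) ? EXIT_MATCH : EXIT_NO_MATCH;
    }

    /** Sends the emotion and message, then waits for "yes" or "no". */
    static int query(String host, int port, String emotion, String message) {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     socket.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader in = new BufferedReader(new InputStreamReader(
                     socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.print(emotion + "\n"); // assumed format: emotion first
            out.print(message);
            out.flush();
            socket.shutdownOutput();   // tell the server the message is complete
            return exitCodeFor(in.readLine());
        } catch (IOException e) {
            return EXIT_ERROR;         // anything that goes wrong maps to 15
        }
    }
}
```

A main method would read the piped message from standard input, take the emotion from its first argument, and pass the result of query to System.exit so that the mail filter can act on it.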

Interacting with Empaceptor - SMTP server

Empaceptor will open a port to allow incoming SMTP connections. An email program simply needs to be told to send its outgoing mail to that port. The SMTP server is a slightly modified version of the jes code developed by Eric Daugherty. It is modified only to send any incoming message to Empaceptor before sending it out to the world. It sends the content of the message to both the learning algorithm and the scoring algorithm in order to gather historical data about trends (currently unimplemented). The SMTP server can be started and stopped from the EmpaceptorGUI described below.

The EmpaceptorGUI

The EmpaceptorGUI is the user interface to the Empaceptor system. When launched it opens a small window on the desktop with a menu to access a set of tasks. It can do the following:

Open an mbox file to process
Start & stop the server to handle incoming email client requests.
Set the threshold for what qualifies as having emotional content (default is 0.1)
Start & stop the SMTP server
Configure the list of available emotions (although old messages will not be reprocessed to check for newly added emotions)
Open a test box for sending messages to Empaceptor without needing to use email. Text entered will be scored, and the results shown in a table.

Usage Story & Shortcomings

I attempted to use Empaceptor to see how it would perform on my daily volume of email. I began by parsing my Sent mail file, approximately 440 messages, as the initial training data. I ran the Empaceptor email service and SMTP server, and configured my email client, Evolution, to invoke the service looking for "happy" messages, and set the success threshold at 0.1. I set the filter rule to set the message color to purple if the Empaceptor email service returned a positive result (a return value of 1) so that the results would be easy to discern at a glance.

The system was too slow at first, so I reconfigured the message processing to make scoring very fast, and learning happen asynchronously in a separate thread so that the email process would not have to wait. This made the system usable for day-to-day use. I ran the system for approximately 5 days to see what it thought were happy messages. Some interesting notes:

It did a decent job. I couldn't come up with a metric for determining 'accuracy', but it erred too much toward false positives, especially on short messages.
440 messages was not really enough training data. It was too biased toward certain words such as my name.
It was easily fooled by idiomatic usage of emotional terms. I had an email conversation about putting together a group to go to Happy Hour, and it marked all of those messages as 'Happy'. The same was true of one-liners such as "I'd be happy to take care of that".
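
The speed fix mentioned earlier in this section, fast scoring with learning moved to a separate thread, can be sketched as follows. This is a generic illustration using a single-thread executor, not the actual Empaceptor code; the class name AsyncPipeline and the scorer and learner callbacks are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: score synchronously, learn in the background. */
final class AsyncPipeline {
    private final ExecutorService learnerThread = Executors.newSingleThreadExecutor();
    private final ToDoubleFunction<String> scorer;
    private final Consumer<String> learner;

    AsyncPipeline(ToDoubleFunction<String> scorer, Consumer<String> learner) {
        this.scorer = scorer;
        this.learner = learner;
    }

    /** Returns the score immediately and queues the message for learning. */
    double process(String message) {
        double score = scorer.applyAsDouble(message);        // fast path, caller waits
        learnerThread.submit(() -> learner.accept(message)); // slow path, queued
        return score;
    }

    /** Stops accepting work and waits for queued learning to finish. */
    boolean shutdownAndWait(long seconds) {
        learnerThread.shutdown();
        try {
            return learnerThread.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A single worker thread keeps learning updates in arrival order, which avoids concurrent writes to the association counts from multiple learners.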

Future Work

As with any software project, there are many features that did not make it into this prototype. They include:

The ability to look at the last 10, 100, or 1000 messages to see trends. What emotions are the strongest? What is gaining strength? What is losing strength?
More sophisticated text processing. This might include a deeper parse of the individual sentences, or making more use of the GATE Annotations to extract more of the structure, giving fewer false positives.
Take advantage of recency. Emotional associations should weaken over time, so that the most recent usages are more indicative of the user's current state of mind.
Get it out into the world! I am going to make it available under the GPL (since it relies on GPL software) for download, if others want to try to interface with it. I need to figure out all of the licensing issues before I make a full release.

Related Work

Liu et al. (2003) have done quite a bit of inspiring work on affective text processing. My work differs from their email processing work in two ways: I am using a low-knowledge approach, as opposed to their use of common sense reasoning, and I am attempting to interface with existing end-user tools, while they produced a new email client in which the analysis takes place.

The Eudora mail client uses 'Chili Peppers' in its Mood Watch feature to rate the level of vitriol in an incoming message. It uses simple keyword spotting to find vulgarity, racial epithets, and other strong offensive language. While I share in the spirit of making affective computing tools available with little support from the user, my approach is not aimed necessarily at scoring of emails (although that is a feature), but more on learning the user's associations between words and evoked emotions, albeit in a very simple way. I am attempting a knowledge-based approach that is somewhere in between the common sense reasoning of Liu, and the keyword spotting of Mood Watch.

Thanks

I would like to give particular thanks (as I often do in these rapid development projects) to the open source community and the wealth of tools that are available, especially to the developers of GATE, jes, and the GNU classpath, classpathx, and inetlib libraries.

I would also like to thank Prof. Picard for her feedback and support.

Empaceptor - Lightweight Automated Emotional Association

Elias Holman - MAS630 Final Project - Powerpoint summary
