Spam Filters

Recall & Accuracy vs. N-grams - Naive Bayes - Feature Hashing

The goal of this project is to build and evaluate some online spam filter algorithms that can deal with infinite examples and features. The spam filters are supposed to operate in the following setting: whenever the user receives an e-mail, the filter should classify it as spam or ham (= legitimate e-mail). Then, the user labels the e-mail as spam or ham and the filter should be updated with this information. The (base) features of the spam filter are the n-grams (groups of words) that appear in the e-mail. It is not known beforehand which words will appear, how many there will be, or which ones will prove to be important. The words can even change over time: for example, when the spammers have learned that all their e-mails with “FREE TRY NOW” get blocked, they might switch to “WINNER WINNER WINNER”.

We will implement two techniques for spam filtering in Java: Naive Bayes and the Perceptron classifier. In addition, we will also use two techniques to deal with infinite and unknown features: Feature Hashing and Count-Min Sketch.
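Concretely, each classifier has to support the predict-then-update loop described above. A minimal sketch of such an interface is shown below; the interface and method names are assumptions for illustration, not the names used in the provided skeleton.

```java
// Minimal sketch of the online setting: classify first, then learn from the user's label.
// Names are illustrative, not those of the provided skeleton.
public interface OnlineTextClassifier {

    /** Classify an incoming e-mail (given as its tokens) as spam (true) or ham (false). */
    boolean predictSpam(String[] tokens);

    /** Update the model with the label the user assigned after the prediction. */
    void update(String[] tokens, boolean isSpam);
}
```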

Naive Bayes

Naive Bayes is a simple, standard method to train a classifier. It was one of the original methods for spam filtering [2].

In the traditional formulation of Naive Bayes with binary features, a feature being ‘true’ and it being ‘false’ are considered equally informative. In our case, the features are the words that do or do not appear in the e-mail. Naively, the presence of a word in an e-mail is thus considered as informative as the absence of that word:

$$\operatorname{Pr}(S \mid \operatorname{text})=\frac{\operatorname{Pr}(S)\, \prod_{w \in \operatorname{text}} \operatorname{Pr}(w \mid S)\, \prod_{w \notin \operatorname{text}}\bigl(1-\operatorname{Pr}(w \mid S)\bigr)}{\operatorname{Pr}(\operatorname{text})}$$

However, with a large vocabulary, the absence of a specific word is not informative. Intuitively, an e-mail only contains a subset of the words that could have been used to deliver the message. Therefore, a word that does not occur in an e-mail is considered unobserved, i.e. it may or may not be in the e-mail. Under this assumption, the Naive Bayes formula simplifies to:

$$\operatorname{Pr}(S \mid \operatorname{text})=\frac{\operatorname{Pr}(S)\, \prod_{w \in \operatorname{text}} \operatorname{Pr}(w \mid S)}{\operatorname{Pr}(\operatorname{text})}$$

The calculations for Naive Bayes should be done in the logarithmic domain; otherwise they might suffer from underflow. Hint: sums should be calculated as follows:

$$\log (a+b)=\log a+\log (1+\exp (\log b-\log a))$$
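A direct Java translation of this hint could look as follows; this is a minimal sketch, and the class and method names are illustrative. Swapping the arguments so the larger one comes first keeps the exponential from overflowing.

```java
// Sketch of the hinted log-domain sum: log(a + b) computed from log a and log b.
public final class LogMath {

    private LogMath() {}

    /** Returns log(a + b) given logA = log(a) and logB = log(b). */
    public static double logSum(double logA, double logB) {
        if (logA == Double.NEGATIVE_INFINITY) return logB;   // a == 0
        if (logB == Double.NEGATIVE_INFINITY) return logA;   // b == 0
        if (logA < logB) {                                    // make sure logA is the larger one
            double tmp = logA; logA = logB; logB = tmp;
        }
        return logA + Math.log1p(Math.exp(logB - logA));      // log a + log(1 + exp(log b - log a))
    }
}
```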

The probabilities in the formula are estimated from counts in the data. However, we must avoid zero-counts. Naively, a zero-count corresponds to a zero probability. If a spam e-mail comes in with the words “free”, “viagra” and “roses” in it and “roses” never occurred in spam before, then this mail would be falsely classified as ham. To prevent such situations, methods such as Laplace estimates can be used where ‘1’ is added to every count. This is equivalent to initializing the classifier with a spam e-mail that has all possible words and a ham e-mail that has all possible words.
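Putting the log-domain computation and the Laplace estimates together, a scoring routine could look roughly like the sketch below. It deliberately uses unbounded HashMaps of counts, ignoring for now the infinite-feature problem addressed later by feature hashing; all names are illustrative and not taken from the provided skeleton.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Rough sketch of log-domain Naive Bayes with Laplace ("add one") estimates.
public class NaiveBayesSketch {

    private final Map<String, Integer> spamWordCount = new HashMap<>();
    private final Map<String, Integer> hamWordCount = new HashMap<>();
    private int nbSpam = 0;   // number of spam e-mails seen so far
    private int nbHam = 0;    // number of ham e-mails seen so far

    public void update(String[] words, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamWordCount : hamWordCount;
        // Binary features: count each distinct word at most once per e-mail.
        for (String w : new HashSet<>(Arrays.asList(words))) {
            counts.merge(w, 1, Integer::sum);
        }
        if (isSpam) nbSpam++; else nbHam++;
    }

    /** log Pr(class) + sum_w log Pr(w | class); the shared log Pr(text) term is dropped. */
    private double logScore(String[] words, Map<String, Integer> counts, int nbClass) {
        // The "+1" terms are the Laplace estimate: one virtual spam and one virtual
        // ham e-mail that contain every possible word.
        double score = Math.log((nbClass + 1.0) / (nbSpam + nbHam + 2.0));
        for (String w : words) {
            score += Math.log((counts.getOrDefault(w, 0) + 1.0) / (nbClass + 1.0));
        }
        return score;
    }

    public boolean predictSpam(String[] words) {
        return logScore(words, spamWordCount, nbSpam) > logScore(words, hamWordCount, nbHam);
    }
}
```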

Perceptron

The perceptron classifier dates from 1958 and is one of the first artificial neural networks, laying the foundations of the current deep neural networks. For this project, we will implement learning with the delta rule and stochastic gradient descent.

Attention! The perceptron expects the classes to be 1 and -1, while the implementation provides them as 1 and 0. Keep this in mind when writing the perceptron code.
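A minimal sketch of the delta rule with stochastic gradient descent is given below. It assumes the e-mail has already been mapped to a fixed-size feature vector (for example via feature hashing), and the class and field names are illustrative rather than those of the provided skeleton.

```java
// Sketch of a single perceptron trained online with the delta rule.
public class PerceptronSketch {

    private final double[] weights;
    private double bias;
    private final double learningRate;

    public PerceptronSketch(int nbFeatures, double learningRate) {
        this.weights = new double[nbFeatures];
        this.learningRate = learningRate;
    }

    /** Raw activation w·x + b; the e-mail is predicted to be spam when this is positive. */
    public double output(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    /** Delta rule: w <- w + eta * (y - output) * x, with y in {-1, +1} rather than {0, 1}. */
    public void update(double[] x, int labelZeroOne) {
        int y = (labelZeroOne == 1) ? 1 : -1;   // remap the 0/1 labels provided by the skeleton
        double error = y - output(x);
        bias += learningRate * error;
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }
}
```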

Dealing with Infinite features

In this project we will use two ways to deal with the infinite and beforehand-unknown vocabulary used in the e-mails: Feature Hashing and Count-Min Sketch.

Feature Hashing

E-mails can be represented as word vectors, where every entry of the vector represents a word and has value ‘1’ if that word appears in the e-mail and ‘0’ otherwise. With an infinite stream of e-mails, the size of this vector is unknown and potentially infinite. To deal with this, feature hashing maps the word vector to an array of finite size, using a hash function. This finite array is then used as the feature vector in the learning algorithm.

Feature hashing can be interpreted as a sketching method for classifiers with infinite features: for Naive Bayes, it approximates the count of each word with an overestimate, namely the combined count of all the words that hash to the same value. For the perceptron classifier, it approximates the weight of each word by the single weight shared by all the words that hash to the same value.
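For the Naive Bayes case, this sketching view could be realised roughly as follows. The sketch is illustrative only: the names are assumptions, and `String.hashCode()` merely stands in for the Murmur hash used in the project.

```java
// Feature hashing as a sketching method for Naive Bayes counts: instead of one
// counter per word, a fixed array of counters is kept, and a word's count is
// (over)estimated by the counter of its bucket.
public class HashedCounts {

    private final int[] counts;

    public HashedCounts(int logNbOfBuckets) {
        this.counts = new int[1 << logNbOfBuckets];
    }

    private int bucket(String word) {
        return Math.floorMod(word.hashCode(), counts.length);
    }

    /** Called once for every word of, say, an incoming spam e-mail. */
    public void add(String word) {
        counts[bucket(word)]++;
    }

    /** Overestimate of the word's count: all colliding words share this counter. */
    public int estimate(String word) {
        return counts[bucket(word)];
    }
}
```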

In our implementation we can use Murmur hash for hashing words. However, the provided library only offers hash functions with ranges of size $2^{32}$ or $2^{64}$. The range therefore has to be reduced to the desired size, $2^{logNbOfBuckets}$, inside the method hash(String str), where logNbOfBuckets is given as an input parameter.
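Since the number of buckets is a power of two, the range reduction can be done by keeping only the low-order bits of the full hash value. A possible shape of the method is sketched below; `murmur32` is a hypothetical placeholder standing in for the provided Murmur hash call, and here it simply delegates to `String.hashCode()` so the sketch is self-contained.

```java
// Sketch of reducing the 2^32 hash range to 2^logNbOfBuckets with a bit mask.
public class BucketHasher {

    private final int logNbOfBuckets;

    public BucketHasher(int logNbOfBuckets) {
        this.logNbOfBuckets = logNbOfBuckets;
    }

    private int murmur32(String str) {
        return str.hashCode();   // placeholder for the real Murmur hash function
    }

    /** Maps str to a bucket index in [0, 2^logNbOfBuckets). */
    public int hash(String str) {
        int mask = (1 << logNbOfBuckets) - 1;   // keep only the logNbOfBuckets lowest bits
        return murmur32(str) & mask;
    }
}
```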

Count-Min Sketch

Count-min sketch [1] is a sketching algorithm for keeping approximate counts, or numbers in general. It is related to feature hashing, but reduces the estimation error by using multiple hash functions. The name “count-min” comes from the non-negative case, where the number, like a count, can only increase. In this case the estimate of a feature's value can only be an overestimate. By taking the minimum across the counts stored for the different hash values of this feature, the estimation error is minimized. The general case, where the numbers can also decrease, like the weights of the perceptron, uses the median instead of the minimum.

As for feature hashing, we can use Murmur hash, but the hash range again has to be reduced to the desired number of buckets.
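A minimal count-min sketch for the non-negative case could look as follows. The seeded hash below is purely illustrative; in the project each row would instead use Murmur hash with a different seed, reduced to the bucket range as described above.

```java
import java.util.Random;

// Minimal count-min sketch for non-negative counts: several rows of buckets,
// one hash function (seed) per row; the estimate is the minimum over the rows.
public class CountMinSketchDemo {

    private final int[][] counts;   // one row of buckets per hash function
    private final int[] seeds;      // one seed per hash function
    private final int nbBuckets;

    public CountMinSketchDemo(int nbHashes, int logNbOfBuckets, long randomSeed) {
        this.nbBuckets = 1 << logNbOfBuckets;
        this.counts = new int[nbHashes][nbBuckets];
        this.seeds = new Random(randomSeed).ints(nbHashes).toArray();
    }

    private int bucket(String word, int row) {
        // Stands in for Murmur hash of 'word' with seed seeds[row], reduced to the bucket range.
        return Math.floorMod(word.hashCode() ^ seeds[row], nbBuckets);
    }

    public void add(String word) {
        for (int row = 0; row < counts.length; row++) {
            counts[row][bucket(word, row)]++;
        }
    }

    /** Every row can only overestimate the true count, so the minimum is the tightest estimate. */
    public int estimate(String word) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            min = Math.min(min, counts[row][bucket(word, row)]);
        }
        return min;
    }
}
```

For the general case, such as perceptron weights, the same structure would store doubles and the estimate would be the median over the rows instead of the minimum.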

Classifier Evaluation

Because the spam filters are online learners, the algorithm should be evaluated periodically. In contrast to using a fixed test set, a classifier is evaluated against examples that will subsequently be used to update the model. The provided skeleton implementation uses a reporting period r: every r examples the classifier is evaluated on the next r examples, before adding them to the model and repeating the process.

The provided code evaluates the classifiers using accuracy (#correctly classified / #total classified). However, this is not the only way to evaluate the performance of a classifier. Implement at least two additional evaluation metrics based on the values of the contingency table (#true positives, #false positives, #true negatives, and #false negatives) and discuss why these evaluation metrics are suitable for evaluating a spam filter.
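As one possible starting point, two candidate metrics that can be computed from the contingency table are precision and recall, sketched below with spam treated as the positive class; the field and method names are illustrative.

```java
// Sketch of two possible additional metrics derived from the contingency table.
public class ContingencyMetrics {

    public int truePositives, falsePositives, trueNegatives, falseNegatives;

    /** Precision: of all e-mails flagged as spam, the fraction that really was spam. */
    public double precision() {
        int flagged = truePositives + falsePositives;
        return flagged == 0 ? 0.0 : (double) truePositives / flagged;
    }

    /** Recall: of all spam e-mails, the fraction that the filter caught. */
    public double recall() {
        int spam = truePositives + falseNegatives;
        return spam == 0 ? 0.0 : (double) truePositives / spam;
    }
}
```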

References

[1] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[2] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, volume 62, pages 98–105, 1998.
