Saturday, September 29, 2012

Spam or Ham? Naive Bayes - the Base Case for Spam Classification


Sources:
1. “Week 3: Naive Bayes, Laplace Smoothing, APIs and Scraping data off the web”

2. “Reflections after Jake’s lecture”


People need spam classifiers because they don’t want spam in their inbox or important messages thrown out. But how should we build one? Should we use linear regression, k-nearest neighbors, or Naive Bayes?

Why is linear regression not going to work?

Say you make a feature for each lowercase word you see in any email and then fit a model with R’s lm function:      lm(spam ~ word1 + word2 + …)

However, that’s far too many variables compared to observations: we have on the order of 10,000 emails but on the order of 100,000 words, so the model will badly overfit. Technically, this corresponds to the fact that the X’X matrix in the linear regression normal equations is not invertible. Moreover, we may not even be able to store it, it’s so huge.
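To see the problem concretely, here is a minimal R sketch with made-up data (not the class’s actual emails): far more word columns than email rows, so lm() can’t estimate most of the coefficients.

    # Hypothetical data: 20 "emails", 100 word-indicator features (p >> n).
    set.seed(1)
    n_emails <- 20
    n_words  <- 100
    X <- matrix(rbinom(n_emails * n_words, 1, 0.1), nrow = n_emails)
    colnames(X) <- paste0("word", 1:n_words)
    d <- data.frame(spam = rbinom(n_emails, 1, 0.5), X)

    fit <- lm(spam ~ ., data = d)
    # The design matrix is rank-deficient, so lm() returns NA
    # for most of the coefficients (it has to drop those columns).
    sum(is.na(coef(fit)))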

Why not k-nearest neighbors?

To use k-nearest neighbors we would still need to choose features, probably corresponding to words, and we’d likely set each feature to 0 or 1 depending on whether the word is present. This leads to a strange notion of “nearness” between emails.

Again, with 10,000 emails and 100,000 words we run into trouble: the issue this time isn’t a singular matrix but that the space we’d be working in has far too many dimensions. For example, it takes a lot of computational work just to compute the distances between points, but even that isn’t the real problem.

The real problem is more basic: in such a high-dimensional space, even your nearest neighbors are really far away, so the “nearest” ones tell you very little. This is “the curse of dimensionality,” and it makes k-nearest neighbors a poor choice here.
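Here is a minimal R sketch of that effect, again with made-up binary word features rather than real emails: the distance from one email to its nearest neighbor is almost the same as the distance to its farthest one.

    # Hypothetical data: 200 "emails", 10,000 binary word features.
    set.seed(1)
    n_emails <- 200
    n_words  <- 10000
    X <- matrix(rbinom(n_emails * n_words, 1, 0.05), nrow = n_emails)

    # Euclidean distances from the first email to all of the others.
    d <- sqrt(colSums((t(X[-1, ]) - X[1, ])^2))
    range(d)          # nearest vs. farthest neighbor
    min(d) / max(d)   # close to 1: everything is roughly equally far away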

How about Naive Bayes?

Naive Bayes is considered the base case for classification. It makes a simplifying assumption that the features are independent of one another, yet it still manages to perform fairly well. In the machine learning literature, researchers introducing new classification algorithms compare their performance against Naive Bayes. Note also that the goal is to label or classify unlabeled observations; the parameters themselves (the estimated probabilities) usually aren’t of interest. The classification is evaluated using a misclassification rate. For spam detection in particular, consider which is worse: false positives (calling something spam when it’s not, e.g. putting email from your boss in the spam folder) or false negatives (letting spam into your inbox).
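Here is a minimal hand-rolled sketch of the idea in R (not the lecture’s code): it assumes a binary word matrix X, with rows for emails and columns indicating whether each word is present, plus a 0/1 label vector spam. It estimates the per-word probabilities with Laplace smoothing and labels a new email by comparing log-posteriors.

    # Train: estimate P(spam) and smoothed P(word present | class).
    naive_bayes_train <- function(X, spam, alpha = 1) {
      n_spam <- sum(spam == 1)
      n_ham  <- sum(spam == 0)
      list(
        p_spam      = mean(spam),
        p_word_spam = (colSums(X[spam == 1, , drop = FALSE]) + alpha) / (n_spam + 2 * alpha),
        p_word_ham  = (colSums(X[spam == 0, , drop = FALSE]) + alpha) / (n_ham  + 2 * alpha)
      )
    }

    # Classify one new email, given as a binary word-indicator vector x.
    naive_bayes_classify <- function(model, x) {
      log_spam <- log(model$p_spam) +
        sum(x * log(model$p_word_spam) + (1 - x) * log(1 - model$p_word_spam))
      log_ham  <- log(1 - model$p_spam) +
        sum(x * log(model$p_word_ham) + (1 - x) * log(1 - model$p_word_ham))
      if (log_spam > log_ham) "spam" else "ham"
    }

The “naive” independence assumption is exactly what lets us add up the per-word log-probabilities as if each word were evidence on its own.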

K-nearest neighbors has one tuning parameter (k, the number of neighbors), while Naive Bayes can include two smoothing hyper-parameters, one in the numerator and one in the denominator of its estimated probabilities. Naive Bayes is a linear classifier, while k-nearest neighbors is not. The curse of dimensionality and large feature sets are a problem for k-nearest neighbors, while Naive Bayes handles them well. K-nearest neighbors requires no “training” (you just load in the data set), while Naive Bayes does. All the classification problems we’ve looked at so far have been supervised learning: the data comes labeled.
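To spell out those two smoothing hyper-parameters (a sketch; the names alpha and beta are mine, not the lecture’s):

    # Smoothed estimate of P(word | spam): alpha sits in the numerator,
    # beta in the denominator; alpha = 1, beta = 2 gives Laplace smoothing
    # for a binary word feature.
    p_word_given_spam <- function(n_spam_with_word, n_spam, alpha = 1, beta = 2) {
      (n_spam_with_word + alpha) / (n_spam + beta)
    }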

More references:
   “Idiot’s Bayes – not so stupid after all?”
   “Naive Bayes at Forty: The Independence Assumption in Information Retrieval”
   “Spam Filtering with Naive Bayes – Which Naive Bayes?”
