Naïve Bayes Spam Filter - From Scratch
A library-less approach
Probability is all around us, from the weather updates we receive daily to sports team strategies and insurance premium calculations. It is also heavily present in the field of Machine Learning, as one of the mathematical driving forces powering modern algorithms.
It also powers earlier algorithms, such as the Naïve Bayes Classifier. Applying Bayes’ Theorem to real-world problems in this way became popular in the 1990s, and the classifier is best known for its application as an early method of spam filtering - the topic of this article.
Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, whose work An Essay towards solving a Problem in the Doctrine of Chances was published posthumously in 1763. In this essay, Bayes describes how conditional probability can be used to estimate the likelihood of one event occurring, given that certain related events have also occurred. The formula for the theorem is as follows: $$ \begin{equation} \Pr(A|B)=\frac{\Pr(B|A)\Pr(A)}{\Pr(B)} \end{equation} $$ Where:
- Pr(A) = the probability of event A occurring,
- Pr(B) = the probability of event B occurring,
- Pr(A|B) = the conditional probability of event A occurring, given event B also occurs,
- Pr(B|A) = the conditional probability of event B occurring, given event A also occurs.
This formula can alternatively be expanded out into:
$$ \begin{equation} \Pr(A|B)=\frac{\Pr(B|A)\Pr(A)}{\Pr(B|A)\Pr(A)+\Pr(B|\neg A)\Pr(\neg A)} \end{equation} $$
This alternative equation also computes Pr(A|B), but is more useful in the context of spam filtering than the original, simpler equation. We will look at how this is done in more depth below.
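As a quick worked example with made-up numbers: if Pr(B|A) = 0.8, Pr(A) = 0.4, Pr(B|¬A) = 0.3 and Pr(¬A) = 0.6, the expanded formula gives: $$ \Pr(A|B)=\frac{0.8 \times 0.4}{0.8 \times 0.4 + 0.3 \times 0.6}=\frac{0.32}{0.50}=0.64 $$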
Spam Filtering
The concept of spam filtering is simple - distinguish spam emails from authentic (non-spam, or ‘ham’) emails. To do this, the goal is to get a measure of how ‘spammy’ an incoming email is, and this is where the extended form of Bayes’ Rule comes into play. $$ \begin{equation} \Pr(S|W)=\frac{\Pr(W|S)\Pr(S)}{\Pr(W|S)\Pr(S)+\Pr(W|\neg S)\Pr(\neg S)} \end{equation} $$ With Bayes’ Rule, we want to find the probability that an email is spam, given that it contains certain words. We do this by finding the probability that each word in the email is a spam word, and then multiplying these probabilities together to get the overall email spam metric used in classification.
The probability of an email being spam S, given that a certain word W appears, is defined by the left-hand side of the above equation, Pr(S|W).
The right hand side gives the formula to compute this probability. This is:
- the probability the word occurs in the email given it is a spam email Pr(W|S) multiplied by the probability of an email being spam Pr(S),
- divided by the probability the word occurs in the email given it is a spam email multiplied by the probability of an email being spam,
- plus the probability the word occurs in the email given it is a non-spam email Pr(W|¬S) multiplied by the probability of an email being non-spam Pr(¬S).
Probabilities range between 0 and 1. For this spam filter, we will define that any email with a total ‘spamicity’ of over 0.5 (50%) will be deemed a spam email.
When Pr(S|W) has been found for each word in the email, these values are multiplied together to give the overall probability that the email is spam. If this probability is over the ‘spam threshold’ of 0.5, the email is classified as spam.
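To make the combination step concrete, here is a minimal sketch with hypothetical per-word probabilities (the numbers are made up; calculating the real per-word probabilities is covered in the implementation below):
word_spam_probs = [0.9, 0.7, 0.8]  # hypothetical Pr(S|W) values for the words of one incoming email
email_spam_prob = 1
for p in word_spam_probs:
    email_spam_prob = email_spam_prob * p  # multiply the individual word probabilities together
print(email_spam_prob)
0.504
Here 0.504 is just over the 0.5 threshold, so this hypothetical email would be classified as spam.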
Coding a Spam Filter from Scratch:
ML libraries such as scikit-learn are brilliant for testing out-of-the-box algorithms on your data. However, it can be beneficial to explore the inner workings of an algorithm by writing it from scratch, in order to better understand and appreciate it. Here we will write an implementation of the Naïve Bayes Classifier for spam filtering in pure Python, without the aid of external libraries.
We’ll also see where the classifier gets the “Naïve” part of its moniker from.
Training and testing data:
Define some training and test data for each class, spam and ham. These emails will be short for brevity and clarity.
train_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
train_ham = ['Your activity report','benefits physical activity', 'the importance vows']
test_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}
Iterate over the labelled spam emails and collect every word that appears in them; later, for each unique word w, we will count how many of the spam emails contain w:
# make a vocabulary of unique words that occur in known spam emails
vocab_words_spam = []
for sentence in train_spam:
sentence_as_list = sentence.split()
for word in sentence_as_list:
vocab_words_spam.append(word)
print(vocab_words_spam)
['send', 'us', 'your', 'password', 'review', 'our', 'website', 'send', 'your', 'password', 'send', 'us', 'your', 'account']
Convert each list element to a dictionary key. This removes duplicates, as dictionaries cannot have multiple keys with the same name. Then convert the remaining keys back to a list:
vocab_unique_words_spam = list(dict.fromkeys(vocab_words_spam))
print(vocab_unique_words_spam)
['send', 'us', 'your', 'password', 'review', 'our', 'website', 'account']
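An equivalent one-line alternative is shown below (a sketch; note that a set does not preserve order, whereas dict.fromkeys keeps the order in which words first appear):
vocab_unique_words_spam_alt = list(set(vocab_words_spam))  # same unique words, arbitrary order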
How do you determine how ‘spammy’ a word is?
According to this overview on Naïve Bayes spam filtering, ‘spamicity’ can be calculated by taking a body of emails that have already been hand-labelled as either spam or ham, and using that data to compute per-word spam probabilities by counting how frequently each word appears in each class.
This uses the same type of conditional probability found in Bayes’ Rule. We can count how many spam emails have the word “send” and divide that by the total number of spam emails - this gives a measure of the word’s ‘spamicity’, or how likely it is to be in a spam email.
The same approach can be applied to evaluate the ‘hamicity’, or authenticity, of a word.
Smoothing
Smoothing should be applied to avoid the scenario where a word appears zero times in the spam training examples but does appear in the ham training examples, or vice-versa.
We want to avoid this scenario, as it would put a 0 in the numerator of the spamicity calculation, and zero divided by anything is zero. A spam-bot could then send us an email containing this word, and when we multiplied all the word spam probabilities together, the final email spam score would come out as 0 - anything multiplied by zero is zero. This fraudulent email, now having 0% spamicity, would be classified as ham and pass quietly into our inbox.
The solution is to add 1 to every word count, so there is never a zero count. To balance this, add 2 to the denominator - one for each of the two possible outcomes being counted (the word appearing in an email, or not appearing) - so that the smoothed probabilities of these two outcomes still sum to 1. $$ \Pr( send | S ) = (3+1)/(4+2) = 0.6666 $$ In this example, the word ‘send’ has a spamicity of 75% (66.7% with smoothing applied). This is over the defined spam threshold of 0.5, meaning it has a high probability of being a ‘spammy’ word.
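A quick numeric sketch of that smoothing calculation in Python, using the counts for the word ‘send’ from the training data above:
emails_with_word = 3   # 'send' appears in 3 of the 4 spam training emails
total_spam_emails = 4
print(emails_with_word / total_spam_emails)              # 0.75 - raw spamicity, no smoothing
print((emails_with_word + 1) / (total_spam_emails + 2))  # 0.666... - spamicity with smoothing applied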
dict_spamicity = {}
for w in vocab_unique_words_spam:
    emails_with_w = 0     # counter
    for sentence in train_spam:
        if w in sentence:  # note: this is a substring check, so e.g. 'our' also matches inside 'your'
            emails_with_w += 1
    print(f"Number of spam emails with the word {w}: {emails_with_w}")
    total_spam = len(train_spam)
    spamicity = (emails_with_w+1)/(total_spam+2)  # smoothing applied
    print(f"Spamicity of the word '{w}': {spamicity} \n")
    dict_spamicity[w.lower()] = spamicity
Number of spam emails with the word send: 3
Spamicity of the word 'send': 0.6666666666666666
Number of spam emails with the word us: 2
Spamicity of the word 'us': 0.5
Number of spam emails with the word your: 3
Spamicity of the word 'your': 0.6666666666666666
Number of spam emails with the word password: 2
Spamicity of the word 'password': 0.5
Number of spam emails with the word review: 1
Spamicity of the word 'review': 0.3333333333333333
Number of spam emails with the word our: 4
Spamicity of the word 'our': 0.8333333333333334
Number of spam emails with the word website: 1
Spamicity of the word 'website': 0.3333333333333333
Number of spam emails with the word account: 1
Spamicity of the word 'account': 0.3333333333333333
print(dict_spamicity)
{'send': 0.6666666666666666, 'us': 0.5, 'your': 0.6666666666666666, 'password': 0.5, 'review': 0.3333333333333333, 'our': 0.8333333333333334, 'website': 0.3333333333333333, 'account': 0.3333333333333333}
Calculate Hamicity of non-spam words:
# make a vocabulary of unique words that occur in known ham emails
vocab_words_ham = []
for sentence in train_ham:
sentence_as_list = sentence.split()
for word in sentence_as_list:
vocab_words_ham.append(word)
vocab_unique_words_ham = list(dict.fromkeys(vocab_words_ham))
print(vocab_unique_words_ham)
['Your', 'activity', 'report', 'benefits', 'physical', 'the', 'importance', 'vows']
dict_hamicity = {}
for w in vocab_unique_words_ham:
    emails_with_w = 0     # counter
    for sentence in train_ham:
        if w in sentence:
            print(w+":", sentence)
            emails_with_w += 1
    print(f"Number of ham emails with the word '{w}': {emails_with_w}")
    total_ham = len(train_ham)
    Hamicity = (emails_with_w+1)/(total_ham+2)  # Smoothing applied
    print(f"Hamicity of the word '{w}': {Hamicity} ")
    # Use built-in lower() to keep all words lower case - useful later when
    # comparing spamicity vs hamicity of a single word - e.g. 'Your' and 'your'
    # would be treated as 2 different words if not normalized to lower case.
    dict_hamicity[w.lower()] = Hamicity
Your: Your activity report
Number of ham emails with the word 'Your': 1
Hamicity of the word 'Your': 0.4
activity: Your activity report
activity: benefits physical activity
Number of ham emails with the word 'activity': 2
Hamicity of the word 'activity': 0.6
report: Your activity report
Number of ham emails with the word 'report': 1
Hamicity of the word 'report': 0.4
benefits: benefits physical activity
Number of ham emails with the word 'benefits': 1
Hamicity of the word 'benefits': 0.4
physical: benefits physical activity
Number of ham emails with the word 'physical': 1
Hamicity of the word 'physical': 0.4
the: the importance vows
Number of ham emails with the word 'the': 1
Hamicity of the word 'the': 0.4
importance: the importance vows
Number of ham emails with the word 'importance': 1
Hamicity of the word 'importance': 0.4
vows: the importance vows
Number of ham emails with the word 'vows': 1
Hamicity of the word 'vows': 0.4
print(dict_hamicity)
{'your': 0.4, 'activity': 0.6, 'report': 0.4, 'benefits': 0.4, 'physical': 0.4, 'the': 0.4, 'importance': 0.4, 'vows': 0.4}
Compute Probability of Spam P(S):
This computes the probability of any one email being spam, by dividing the total number of spam emails by the total number of all emails.
prob_spam = len(train_spam) / (len(train_spam)+(len(train_ham)))
print(prob_spam)
0.5714285714285714
Compute Probability of Ham P(¬S):
This computes the probability of any one email being ham, by dividing the total number of ham emails by the total number of all emails.
prob_ham = len(train_ham) / (len(train_spam)+(len(train_ham)))
print(prob_ham)
0.42857142857142855
Given a set of unlabelled test emails, gather them into a single list and then split each one into its individual words:
tests = []
for i in test_emails['spam']:
tests.append(i)
for i in test_emails['ham']:
tests.append(i)
print(tests)
['renew your password', 'renew your vows', 'benefits of our account', 'the importance of physical activity']
# split emails into distinct words
distinct_words_as_sentences_test = []
for sentence in tests:
sentence_as_list = sentence.split()
senten = []
for word in sentence_as_list:
senten.append(word)
distinct_words_as_sentences_test.append(senten)
print(distinct_words_as_sentences_test)
[['renew', 'your', 'password'], ['renew', 'your', 'vows'], ['benefits', 'of', 'our', 'account'], ['the', 'importance', 'of', 'physical', 'activity']]
test_spam_tokenized = [distinct_words_as_sentences_test[0], distinct_words_as_sentences_test[1]]
test_ham_tokenized = [distinct_words_as_sentences_test[2], distinct_words_as_sentences_test[3]]
print(test_spam_tokenized)
[['renew', 'your', 'password'], ['renew', 'your', 'vows']]
Ignore the words that you haven’t seen in the labelled training data:
The reason for ignoring completely unseen words is that there is nothing on which to base their spamicity or hamicity. Simply dropping them will not push the overall probability one way or the other, and will not have a significant impact on the class the algorithm selects. It therefore makes more sense to drop unseen words than to give them a smoothed count of 1.
reduced_sentences_spam_test = []
for sentence in test_spam_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled training data")
    reduced_sentences_spam_test.append(words_)
print(reduced_sentences_spam_test)
'renew', word not present in labelled training data
'your', ok
'password', ok
'renew', word not present in labelled training data
'your', ok
'vows', ok
[['your', 'password'], ['your', 'vows']]
reduced_sentences_ham_test = []  # repeat for ham words
for sentence in test_ham_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled training data")
    reduced_sentences_ham_test.append(words_)
print(reduced_sentences_ham_test)
'benefits', ok
'of', word not present in labelled training data
'our', ok
'account', ok
'the', ok
'importance', ok
'of', word not present in labelled training data
'physical', ok
'activity', ok
[['benefits', 'our', 'account'], ['the', 'importance', 'physical', 'activity']]
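As an aside, the same filtering can be written more compactly with a set of every word seen in training - a sketch that is equivalent in behaviour to the loops above, minus the diagnostic prints:
known_words = set(vocab_unique_words_spam) | set(vocab_unique_words_ham)  # all words seen in either training set
reduced_sentences_spam_test = [[w for w in email if w in known_words] for email in test_spam_tokenized]
reduced_sentences_ham_test = [[w for w in email if w in known_words] for email in test_ham_tokenized]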
Stemming - remove non-key words:
Removal of non-key words can help the classifier focus on the words that matter most. (Strictly speaking this step is stop-word removal rather than stemming, but the code below keeps the ‘stemmed’ naming.) Deciding which words to remove in this specific case can involve some trial and error, as including or excluding words from a small toy dataset like this will influence the overall classification result.
test_spam_stemmed = []
non_key = ['us', 'the', 'of','your'] # non-key words, gathered from spam,ham and test sentences
for email in reduced_sentences_spam_test:
email_stemmed=[]
for word in email:
if word in non_key:
print('remove')
else:
email_stemmed.append(word)
test_spam_stemmed.append(email_stemmed)
print(test_spam_stemmed)
remove
remove
[['password'], ['vows']]
test_ham_stemmed = []
non_key = ['us', 'the', 'of', 'your']
for email in reduced_sentences_ham_test:
email_stemmed=[]
for word in email:
if word in non_key:
print('remove')
else:
email_stemmed.append(word)
test_ham_stemmed.append(email_stemmed)
print(test_ham_stemmed)
remove
[['benefits', 'our', 'account'], ['importance', 'physical', 'activity']]
Bayes’ Rule
We now use the formula for Bayes’ Rule to compute the probability of spam given a certain word from an email. We have already calculated all the necessary probabilities and conditional probabilities needed for the right-hand side of the equation. $$ \begin{equation} \Pr(S|W)=\frac{\Pr(W|S)\Pr(S)}{\Pr(W|S)\Pr(S)+\Pr(W|\neg S)\Pr(\neg S)} \end{equation} $$ The overall probability for the entire email is then obtained by multiplying all of these individual word probabilities together.
This approach speaks to the reason this Bayes’ classifier is ‘Naïve’ - it does not take into account the relationship of any one word to the next. It assumes there is no order to the words appearing in a sentence, and that they are all independent of each other, as if language were a random pass through a dictionary. This is of course not the case, which is why this early form of spam filtering has been overtaken by more advanced forms of NLP, such as those using word embeddings.
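Formally, the naïve assumption is that, given the class, each word occurs independently of every other word, so the joint likelihood of all the words in an email factorises into a product of individual word likelihoods: $$ \begin{equation} \Pr(W_1, W_2, \ldots, W_n \mid S)=\prod_{i=1}^{n}\Pr(W_i \mid S) \end{equation} $$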
Classify spam test emails:
def mult(list_):  # function to multiply all word probs together
total_prob = 1
for i in list_:
total_prob = total_prob * i
return total_prob
def Bayes(email):
probs = []
for word in email:
Pr_S = prob_spam
print('prob of spam in general ',Pr_S)
try:
pr_WS = dict_spamicity[word]
print(f'prob "{word}" is a spam word : {pr_WS}')
except KeyError:
pr_WS = 1/(total_spam+2) # Apply smoothing for word not seen in spam training data, but seen in ham training
print(f"prob '{word}' is a spam word: {pr_WS}")
Pr_H = prob_ham
print('prob of ham in general ', Pr_H)
try:
pr_WH = dict_hamicity[word]
print(f'prob "{word}" is a ham word: ',pr_WH)
except KeyError:
pr_WH = (1/(total_ham+2)) # Apply smoothing for word not seen in ham training data, but seen in spam training
print(f"WH for {word} is {pr_WH}")
print(f"prob '{word}' is a ham word: {pr_WH}")
prob_word_is_spam_BAYES = (pr_WS*Pr_S)/((pr_WS*Pr_S)+(pr_WH*Pr_H))
print('')
print(f"Using Bayes, prob the the word '{word}' is spam: {prob_word_is_spam_BAYES}")
print('###########################')
probs.append(prob_word_is_spam_BAYES)
print(f"All word probabilities for this sentence: {probs}")
final_classification = mult(probs)
if final_classification >= 0.5:
print(f'email is SPAM: with spammy confidence of {final_classification*100}%')
else:
print(f'email is HAM: with spammy confidence of {final_classification*100}%')
return final_classification
for email in test_spam_stemmed:
print('')
print(f" Testing stemmed SPAM email {email} :")
print(' Test word by word: ')
all_word_probs = Bayes(email)
print(all_word_probs)
Testing stemmed SPAM email ['password'] :
Test word by word:
prob of spam in general 0.5714285714285714
prob "password" is a spam word : 0.5
prob of ham in general 0.42857142857142855
WH for password is 0.2
prob 'password' is a ham word: 0.2
Using Bayes, prob the word 'password' is spam: 0.7692307692307692
###########################
All word probabilities for this sentence: [0.7692307692307692]
email is SPAM: with spammy confidence of 76.92307692307692%
0.7692307692307692
Testing stemmed SPAM email ['vows'] :
Test word by word:
prob of spam in general 0.5714285714285714
prob 'vows' is a spam word: 0.16666666666666666
prob of ham in general 0.42857142857142855
prob "vows" is a ham word: 0.4
Using Bayes, prob the word 'vows' is spam: 0.35714285714285715
###########################
All word probabilities for this sentence: [0.35714285714285715]
email is HAM: with spammy confidence of 35.714285714285715%
0.35714285714285715
Result: 1 out of 2 SPAM emails correctly classified.
Next, we test how likely the stemmed HAM test emails are to be SPAM. These scores should be much closer to zero than to 0.5: most of the stemmed HAM words never appear in the SPAM training data, so they have no entry in the spamicity dictionary, and smoothing assigns them a count of just 1, giving them a low spamicity.
for email in test_ham_stemmed:
print('')
print(f" Testing stemmed HAM email {email} :")
print(' Test word by word: ')
all_word_probs = Bayes(email)
print(all_word_probs)
Testing stemmed HAM email ['benefits', 'our', 'account'] :
Test word by word:
prob of spam in general 0.5714285714285714
prob 'benefits' is a spam word: 0.16666666666666666
prob of ham in general 0.42857142857142855
prob "benefits" is a ham word: 0.4
Using Bayes, prob the word 'benefits' is spam: 0.35714285714285715
###########################
prob of spam in general 0.5714285714285714
prob "our" is a spam word : 0.8333333333333334
prob of ham in general 0.42857142857142855
WH for our is 0.2
prob 'our' is a ham word: 0.2
Using Bayes, prob the word 'our' is spam: 0.847457627118644
###########################
prob of spam in general 0.5714285714285714
prob "account" is a spam word : 0.3333333333333333
prob of ham in general 0.42857142857142855
WH for account is 0.2
prob 'account' is a ham word: 0.2
Using Bayes, prob the word 'account' is spam: 0.689655172413793
###########################
All word probabilities for this sentence: [0.35714285714285715, 0.847457627118644, 0.689655172413793]
email is HAM: with spammy confidence of 20.873340569424727%
0.20873340569424728
Testing stemmed HAM email ['importance', 'physical', 'activity'] :
Test word by word:
prob of spam in general 0.5714285714285714
prob 'importance' is a spam word: 0.16666666666666666
prob of ham in general 0.42857142857142855
prob "importance" is a ham word: 0.4
Using Bayes, prob the word 'importance' is spam: 0.35714285714285715
###########################
prob of spam in general 0.5714285714285714
prob 'physical' is a spam word: 0.16666666666666666
prob of ham in general 0.42857142857142855
prob "physical" is a ham word: 0.4
Using Bayes, prob the word 'physical' is spam: 0.35714285714285715
###########################
prob of spam in general 0.5714285714285714
prob 'activity' is a spam word: 0.16666666666666666
prob of ham in general 0.42857142857142855
prob "activity" is a ham word: 0.6
Using Bayes, prob the word 'activity' is spam: 0.2702702702702703
###########################
All word probabilities for this sentence: [0.35714285714285715, 0.35714285714285715, 0.2702702702702703]
email is HAM: with spammy confidence of 3.447324875896305%
0.03447324875896305
Result: 2 out of 2 HAM emails correctly classified.
Observations:
This implementation of the Naïve Bayes classifier worked well for most cases. It classified both HAM test emails as HAM, but it could not accurately classify one of the SPAM emails - specifically the ‘renew your vows’ email. This is because the word ‘vows’ occurs only in the HAM training set and ended up having the major influence on the classifier’s decision, as the word ‘renew’ was dropped at the start for not featuring in either the HAM or SPAM training sets, and ‘your’ was removed as a non-key word.
Different methodologies for spam filtering have since evolved, including different versions of Naïve Bayes (for example Multinomial Naïve Bayes, which has a handy scikit-learn implementation).
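For reference, a minimal sketch of what that scikit-learn route might look like on the toy data from this article (a dataset this small will not produce meaningful accuracy, and this is just one reasonable pipeline rather than the definitive one):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = train_spam + train_ham
train_labels = ['spam'] * len(train_spam) + ['ham'] * len(train_ham)

vectorizer = CountVectorizer()                   # bag-of-words counts
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB()                            # uses Laplace smoothing (alpha=1.0) by default
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(tests)             # the four test emails defined earlier
print(clf.predict(X_test))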
Conclusions:
Naïve Bayes is a useful algorithm in many respects, especially for solving low-data text classification problems. The way in which it can make accurate predictions with limited training data by assuming naïve independence of words is its main advantage, but ironically also a disadvantage when comparing it to more advanced forms of NLP. Other approaches, such as classifiers which use word embeddings, can encode meaning such as context from sentences to obtain more accurate predictions. Still, there is no denying that Naïve Bayes can act as a powerful spam filter.