Email Spam Classification

In this post, I will use the Python programming language, Natural Language Processing (NLP), the Bag of Words model, and the Multinomial Naive Bayes algorithm to classify whether an email is spam or not.

Problem:

Spam emails are unsolicited or junk emails sent out in bulk. A classic example is the so-called Nigerian prince email, which promises to send you huge amounts of money if you provide your bank account details. Spam is annoying because it takes up a lot of storage space and communication bandwidth. According to Awad (2011), “40% of all emails are spam which about 15.4 billion email per day and that cost internet users about $355 million per year.” According to Winder (2020), “Over half of all global email traffic is spam.” As you can tell, it is a huge problem that needs to be solved.

The Machine Learning Process:

(Diagram from https://martink.me/articles/machine-learning-is-for-muggles-too)

Data:

Above is a visual representation of the steps in machine learning. The first step is to have data to work on. For this project, I used this dataset from Kaggle. You can also find many good datasets on UC Irvine and Google Datasets. Our data is a large CSV file with one column containing the email text and a second column containing binary labels (1 = spam, 0 = not spam).

Data Pre-processing:

Data preprocessing is a technique that involves transforming raw data into an understandable format.

The various text preprocessing steps are:

  1. Tokenization: The process of breaking a piece of text down into small units called tokens. A token can be a word, a sentence, or even a punctuation mark.
  2. Noise Removal: Data can be messy, so it’s important to get rid of the unuseful parts of the data, also known as “noise”, by lowercasing all letters, removing punctuation marks, and removing stop words such as “the” and “is”.
  3. Normalization: Normalization is the process of converting a word into its root form, typically via stemming or lemmatization. We want the words “walking” and “walk” to be interpreted the same way.

Cleaning text is especially helpful for text classification when your gathered data is not clean, such as social media comments or emails.
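The three steps above can be sketched in plain Python. This is a toy illustration: the stop-word list and the crude “ing”-stripping rule are simplifications, and a real project would use a library such as NLTK for stop words and stemming.

```python
import re

# A minimal stop-word list for illustration; a real project would use a
# fuller list (e.g. NLTK's English stop words).
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in"}

def preprocess(text):
    # Noise removal: lowercase everything and strip punctuation/digits
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization: split on whitespace
    tokens = text.split()
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Very crude normalization: strip a trailing "ing" (a real stemmer
    # such as NLTK's PorterStemmer handles far more cases correctly)
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

print(preprocess("The prince is sending YOU $1,000,000!!!"))
# → ['prince', 'send', 'you']
```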

Bag of Words

What is BoW?

A bag of words is a representation of text that describes the occurrence of words within a document. It is called a “bag” of words because it only records whether (and how often) words occur in the document, discarding word order.

Why is BoW used?

Text can be messy and unstructured, but machine learning algorithms prefer structured and fixed-length inputs. We use the Bag-of-Words technique to convert variable-length texts into a fixed-length vector by counting how many times each word appears. This process is known as vectorization.

Naive Bayes Algorithm

Naive Bayes is a supervised learning classification algorithm. The Naive Bayes algorithm is “naive” because it assumes the occurrence of one feature does not affect the probability of occurrence of other features. In other words, features are assumed to be unrelated to each other. “Bayes” refers to the English mathematician Thomas Bayes, due to his work on Bayes’ theorem.
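To make the independence assumption concrete, here is a toy calculation (all probabilities below are made up for illustration): each word’s conditional probability is simply multiplied in, as if the words appeared independently of each other.

```python
# Toy illustration of the "naive" independence assumption:
# P(spam | words) is proportional to P(spam) * P(w1|spam) * P(w2|spam) * ...
# All probabilities here are invented purely for illustration.
p_spam = 0.4
p_word_given_spam = {"free": 0.30, "prize": 0.20}
p_ham = 0.6
p_word_given_ham = {"free": 0.02, "prize": 0.01}

words = ["free", "prize"]

score_spam = p_spam
score_ham = p_ham
for w in words:
    score_spam *= p_word_given_spam[w]  # multiply, ignoring word interactions
    score_ham *= p_word_given_ham[w]

print("spam" if score_spam > score_ham else "not spam")  # → spam
```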

Why use Naive Bayes?

Naive Bayes works really well even with small sample sizes, has the capability to provide a better accuracy score, and is faster than its alternatives. It also performs well in multi-class prediction. Last but not least, it performs well with categorical input variables (Yes/No) compared to numerical variables (100, 440, etc.).

Code:


Step by Step with Outputs
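To follow along without the original notebook, here is a minimal sketch of the loading-and-inspection step. The file name "emails.csv" and the column names "text" and "spam" are assumptions based on the dataset description, and the tiny frame below is a stand-in for the real 5731-row CSV.

```python
import pandas as pd

# The real project reads the Kaggle CSV, roughly:
#   df = pd.read_csv("emails.csv")   # assumed columns: "text", "spam"
# A tiny stand-in frame keeps this sketch self-contained.
df = pd.DataFrame({
    "text": ["Win a FREE prize now!!!", "Meeting moved to 3pm",
             "Win a FREE prize now!!!", "Nigerian prince needs your help"],
    "spam": [1, 0, 1, None],  # 1 = spam, 0 = not spam; one missing label
})

df.info()  # prints the number of entries and the non-null count per column
```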


Note: There are 5731 entries in total and most of our rows are “non-null” (not empty). Let’s see exactly how many rows are empty.
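A quick way to count the empty values per column, again sketched on a small stand-in frame:

```python
import pandas as pd

# Stand-in frame (the real one has 5731 rows)
df = pd.DataFrame({
    "text": ["Win a FREE prize now!!!", "Meeting moved to 3pm",
             "Win a FREE prize now!!!"],
    "spam": [1, 0, None],
})

print(df.isnull().sum())  # number of empty values in each column
```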


Note: We notice there are 2 empty values in the column “spam” and we must remove them to prevent errors. Additionally, we must also remove duplicate data.
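Removing empty values and duplicates is a one-liner each in pandas, sketched here on a stand-in frame with one missing label and one duplicate row:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Win a FREE prize now!!!", "Meeting moved to 3pm",
             "Win a FREE prize now!!!", "Nigerian prince needs your help"],
    "spam": [1, 0, 1, None],
})

df = df.dropna()            # remove rows with empty values
df = df.drop_duplicates()   # remove exact duplicate rows
print(len(df))  # → 2
```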


Note: Now we shouldn’t have any null values or duplicates. We notice that our data drops from 5731 to 5696 rows.

Now it’s time for the longest process: cleaning the data.
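Applied to a DataFrame column, the cleaning step might look like this. This is a simplified cleaner with a made-up stop-word list, standing in for the fuller preprocessing described earlier:

```python
import re
import pandas as pd

# Stand-in for the deduplicated frame from the previous step
df = pd.DataFrame({
    "text": ["Win a FREE prize now!!!", "Meeting moved to 3pm"],
    "spam": [1, 0],
})

STOP_WORDS = {"a", "to", "the", "now"}  # tiny illustrative list

def clean(text):
    # Lowercase and drop punctuation/digits, then remove stop words
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

df["text"] = df["text"].apply(clean)
print(df["text"].tolist())  # → ['win free prize', 'meeting moved pm']
```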


Notes: It is hard to see the difference here; however, if you select specific rows, you will be able to see there is a big difference.
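With the text cleaned, the remaining steps are vectorization, training, and prediction. Here is a minimal end-to-end sketch with scikit-learn on toy data (the real post trains on the cleaned Kaggle emails, so the corpus and the two test messages below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in corpus: three spam and three non-spam "emails"
texts = ["win free prize", "free money now", "claim free prize money",
         "meeting moved pm", "lunch tomorrow", "project status update"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Bag of Words vectorization, then Multinomial Naive Bayes training
vec = CountVectorizer()
X = vec.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# Classify two unseen messages (must use the SAME fitted vectorizer)
new = vec.transform(["free prize inside", "see you at the meeting"])
print(model.predict(new))  # → [1 0]  (spam, not spam)
```

On the real dataset you would also hold out a test set (e.g. with `train_test_split`) and report an accuracy score rather than eyeballing predictions.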


Sources: