Using the Naïve Bayes Machine Learning Algorithm to Design a Smart Filter that Automates the Removal of Irrelevant Responses and Spam

TheBlueyed is a talent acquisition platform focused on high-potential candidates. Candidates go through a six-stage verification process before their dossier is generated and they are pitched to employers. In Stage 3, candidates answer questions about their work and are expected to write 500 words for each field. These fields are prone to spam and irrelevant responses. We employed the Naïve Bayes machine learning algorithm to filter out these irrelevant responses. The system eliminated 95% of spam.

Naïve Bayes Supervised Learning Algorithm

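Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. Given a class variable y and a dependent feature vector x_1 through x_n, Bayes’ theorem states the following relationship:

P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

Using the naive independence assumption that

P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)

for all i, this relationship is simplified to

P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}

Since P(x_1, \dots, x_n) is a constant given the input, we can use the following classification rule:

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)

and we can use Maximum A Posteriori (MAP) estimation to estimate P(y) and P(x_i \mid y); the former is then the relative frequency of class y in the training set.

The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(x_i \mid y).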
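To make the classification rule concrete, here is a minimal sketch in plain Python that scores each class as the prior P(y) multiplied by the per-feature likelihoods P(x_i \mid y) and picks the largest. The toy data, the two binary features and the add-one smoothing are assumptions made for illustration only; they are not taken from TheBlueyed's system.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each row is (features, label). The two binary features
# could stand for "mentions a project" and "contains a link" (illustrative only).
train = [
    ((1, 0), "relevant"),
    ((1, 0), "relevant"),
    ((1, 1), "relevant"),
    ((0, 1), "irrelevant"),
    ((0, 1), "irrelevant"),
    ((0, 0), "irrelevant"),
]


def fit(rows):
    """Estimate P(y) by relative frequency and P(x_i = 1 | y) with add-one smoothing."""
    class_counts = Counter(label for _, label in rows)
    n_features = len(rows[0][0])
    ones = defaultdict(lambda: [0] * n_features)
    for features, label in rows:
        for i, value in enumerate(features):
            ones[label][i] += value
    priors = {y: count / len(rows) for y, count in class_counts.items()}
    # Add-one smoothing (an assumption here) keeps probabilities away from 0 and 1.
    likelihoods = {
        y: [(ones[y][i] + 1) / (class_counts[y] + 2) for i in range(n_features)]
        for y in class_counts
    }
    return priors, likelihoods


def predict(features, priors, likelihoods):
    """Return the class maximising P(y) * prod_i P(x_i | y), plus all class scores."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(features):
            p_one = likelihoods[y][i]
            score *= p_one if value == 1 else 1.0 - p_one
        scores[y] = score
    return max(scores, key=scores.get), scores


priors, likelihoods = fit(train)
label, scores = predict((1, 0), priors, likelihoods)
print(label, scores)
```

The scores are unnormalised because the denominator P(x_1, \dots, x_n) is the same for every class and does not change which class wins.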

Why we chose the Naïve Bayes Supervised Learning Algorithm

Naive Bayes classifiers work quite well in many real-world situations, notably document classification and spam filtering. They require only a small amount of training data to estimate the necessary parameters.

Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
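The decoupling described above can be sketched as follows: for a Gaussian model, each feature gets its own mean and variance per class, estimated column by column. The random data, sample counts and function name are hypothetical; the comparison at the end counts how many parameters a full-covariance Gaussian per class would need instead.

```python
import random
from statistics import fmean, pvariance

random.seed(0)

# Hypothetical data: 200 samples, 50 features, two alternating classes.
n_samples, n_features = 200, 50
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]


def fit_per_feature(X, y):
    """Estimate one mean and one variance per (class, feature), each independently."""
    params = {}
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [fmean(col) for col in columns]
        variances = [pvariance(col) for col in columns]
        params[cls] = (means, variances)
    return params


params = fit_per_feature(X, y)

# Parameters under the naive assumption: a mean and a variance per feature per class.
n_naive = sum(len(means) + len(variances) for means, variances in params.values())

# Parameters for a full-covariance Gaussian per class: a mean vector plus a
# symmetric covariance matrix with n(n + 1) / 2 free entries.
n_full = len(params) * (n_features + n_features * (n_features + 1) // 2)

print(n_naive, n_full)
```

With 50 features and two classes, the naive model needs 200 numbers while the full-covariance alternative needs 2650, which is one way to see why little training data is enough.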

Using Naïve Bayes Supervised Algorithm to detect irrelevant responses

In Stage 3, after every response is recorded, it appears on a dashboard where the admin marks it as either relevant or irrelevant.

Over time, this builds up a set of recorded responses that are all labeled as relevant or irrelevant, which serves as training data.

When a new response is recorded, the Gaussian Naïve Bayes algorithm compares it against the previously labeled responses and marks it as either relevant or irrelevant.
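The flow above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not TheBlueyed's production code: the example responses are invented, and the use of TfidfVectorizer to turn text into numeric features is an assumption, since the article does not say how responses are converted into features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical admin-labeled responses (invented for illustration).
responses = [
    "I led the migration of our billing service to a new database",
    "I designed and shipped the onboarding flow for our mobile app",
    "I managed a team of five engineers building a payments API",
    "asdf asdf asdf asdf",
    "click here to win a free prize now",
    "buy cheap followers visit my site",
]
labels = ["relevant"] * 3 + ["irrelevant"] * 3

# Assumption: TF-IDF word weights are used as features.
# GaussianNB expects dense arrays, hence the call to toarray().
vectorizer = TfidfVectorizer()
features_train = vectorizer.fit_transform(responses).toarray()

clf = GaussianNB()
clf.fit(features_train, labels)

# A newly recorded response is transformed with the same vocabulary and classified.
new_responses = [
    "I built a reporting service for our engineers",
    "win a free prize click my site",
]
features_new = vectorizer.transform(new_responses).toarray()
predictions = clf.predict(features_new)

for text, predicted in zip(new_responses, predictions):
    print(predicted, "<-", text)
```

For word-count style features, MultinomialNB is a common alternative to GaussianNB; the Gaussian variant is used here only because it is the one the article names.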

Accuracy of the Algorithm

On average, the accuracy score is 0.97.
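For a classifier, the accuracy score is the fraction of responses whose predicted label matches the admin's label, which is also what scikit-learn's score method reports for classifiers. The small sketch below shows the arithmetic on made-up labels; the 60/40 split and the three errors are chosen only so that the result comes out at 0.97.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: 100 responses, of which 3 are misclassified.
actual = ["relevant"] * 60 + ["irrelevant"] * 40
predicted = list(actual)
predicted[0] = "irrelevant"   # a relevant response wrongly filtered out
predicted[1] = "irrelevant"   # another relevant response wrongly filtered out
predicted[99] = "relevant"    # an irrelevant response that slipped through

score = accuracy(predicted, actual)
print(score)
```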

Combating Information Loss

With an accuracy of 0.97, about 3% of responses are misclassified, which amounts to a 3% information loss. This loss shrinks as more entries are labeled, because the algorithm predicts better with more training data.

Meanwhile, to combat information loss, we do not delete any response, and we run a manual verification regularly.


Since employing the Naïve Bayes Supervised Learning Algorithm to detect irrelevant responses, TheBlueyed has saved 600 hours of work per month, the equivalent of freeing 4 employees to focus on other important things.

The Code

import sys
from time import time

from sklearn.naive_bayes import GaussianNB

# email_preprocess is a local helper module that loads and vectorizes the data
from email_preprocess import preprocess

# features_train and features_test are the features for the training and testing
# datasets, respectively; labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

clf = GaussianNB()

# Train the model on the training data, then predict labels for the test data
pred = clf.fit(features_train, labels_train).predict(features_test)

# Calculate accuracy: the fraction of test items whose predicted label is correct
accuracy = clf.score(features_test, labels_test)