Machine Learning Classification Algorithms


By Kshitij Makwana and Satyapriya Chaudhari

Let me start by asking a very basic question: what is Machine Learning? Machine learning is the process of teaching a computer system algorithms that can improve themselves with experience.

A very technical definition would be,

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience." -  Tom Mitchell, 1997

Just like humans, the system will be able to perform tasks ranging from simple classification to complex mathematical computations like regression. This involves building mathematical models that are used for classification or regression.

To ‘train’ these mathematical models, you need a set of training data. This is the dataset over which the system builds the model. This article will cover all your Machine Learning Classification needs, starting with the very basics.

The mathematical models are divided into two categories, depending on their training data - supervised and unsupervised learning models.

Fig. 1. Machine Learning Algorithms

Supervised Learning

When building supervised learning models, the training data used contains the required answers. These required answers are called labels. For example, you show a picture of a dog and also label it as a dog.

So, with enough pictures of a dog, the algorithm will be able to classify an image of a dog correctly. Supervised learning models can also be used to predict continuous numeric values such as the price of the stock of a certain company. These models are known as regression models.

In this case, the labels would be the past prices of the stock, so the algorithm would follow that trend. A few popular algorithms include:

  • Linear Regression
  • Support Vector Classifiers
  • Decision Trees
  • Random Forests

Unsupervised Learning

In unsupervised learning, as the name suggests, the dataset used for training does not contain the required answers. Instead, the algorithm uses techniques such as clustering and dimensionality reduction to train.

A major application of unsupervised learning is anomaly detection, which uses a clustering algorithm to find major outliers in the data. Such techniques are used in credit card fraud detection.


Types of Supervised Models

Supervised models are trained on a labeled dataset. The label can be either continuous or categorical.

Regression models

Regression is used when the target is a continuous value, such as the cost of a house, given features such as the location, the area covered, historic prices, etc. Popular regression models are:

  • Linear Regression
  • Lasso Regression
  • Ridge Regression

Classification models

Classification is used for data that is separated into categories with each category represented by a label. The training data must contain the labels and must have sufficient observations of each label so that the accuracy of the model is respectable. Some popular classification models include:

  • Support Vector Classifiers
  • Decision Trees
  • Random Forests Classifiers

There are also various evaluation methods to measure the accuracy of these models. We will discuss these models, the evaluation methods, and a technique for improving the models, called hyperparameter tuning, in greater detail.


Types of Classification

Based on the number of classes and the number of labels per observation, there are three types of classification.

Binary Classification

This type of classification has only two categories. Usually, they are boolean values - 1 or 0, True or False, High or Low. Some examples where such a classification could be used are cancer detection and email spam detection, where the labels would be positive or negative for cancer, and spam or not spam for spam detection.

Let us take an example. We are using a breast cancer detection dataset that can be downloaded from here.

id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
Fig. 2. Scatter Plot - Texture Mean vs. Radius Mean

Here you can see the two ‘classes’ - ‘M’ stands for malignant and ‘B’ stands for benign. As you can see, the classes are well divided and are easily differentiable to the naked eye for these two features. However, this will not be true for all pairs of features.

Models that can be used for such a classification are:

  • Logistic Regression
  • Support Vector Classifiers

You can also use Decision Trees, Random Forests and other algorithms, but Logistic Regression and Support Vector Classifiers are inherently binary classifiers and are commonly the first choice for binary classification.

Multi-class Classification

Multi-class classifiers or multinomial classifiers can distinguish between more than two classes. A widely known dataset for multiclass classification is the penguins dataset, which can be loaded from the seaborn library.

  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE

We can produce a scatter plot of bill_length_mm vs flipper_length_mm parameters to see a clear variation of the three classes.
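A minimal sketch of how such a plot can be produced with seaborn (the column names are those of the seaborn penguins dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# Load the penguins dataset bundled with seaborn
penguins = sns.load_dataset('penguins')

# Scatter plot of bill length vs flipper length, coloured by species
sns.scatterplot(data=penguins, x='bill_length_mm', y='flipper_length_mm', hue='species')
plt.show()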

Fig. 3. Scatter Plot - Flipper Length vs. Bill Length

Algorithms such as Random Forests and Naive Bayes can easily build a multiclass classifier model. Support Vector Classifiers and Logistic Regression are binary by nature, but they can be extended to multiclass problems with strategies such as one-vs-rest.

Multi-label Classification

This type of classification occurs when a single observation can carry multiple labels. For example, a single image might contain a car, a truck and a human. The algorithm must be able to classify each of them separately, so it has to be trained on many labels and should report True for car, truck and human, and False for every other label it has been trained on.


Classifier Models

Before we dig into any classification model in detail, we need to prepare the data for training the algorithm. The first step is to pre-process and clean the data. For a detailed understanding of Data Cleaning, you can check out this article.
The cleaning we need for this dataset is to change the string names of the penguin species to integer values so the algorithm can work with them. We also need to drop the observations having NaN values, as sketched below. For more techniques on how to handle missing and NaN values read this article.
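A minimal sketch of this cleaning step with pandas, assuming the species-to-integer mapping implied by the table below (Adelie = 1; the codes for the other two species are assumptions):

# Map species names to integer codes and drop rows with missing values
penguins['species'] = penguins['species'].map({'Adelie': 1, 'Chinstrap': 2, 'Gentoo': 3})
penguins = penguins.dropna()
print(penguins.head())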

  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 1 Torgersen 39.1 18.7 181.0 3750.0 MALE
1 1 Torgersen 39.5 17.4 186.0 3800.0 FEMALE
2 1 Torgersen 40.3 18.0 195.0 3250.0 FEMALE
4 1 Torgersen 36.7 19.3 193.0 3450.0 FEMALE
5 1 Torgersen 39.3 20.6 190.0 3650.0 MALE

After cleaning the data, you need to separate the feature variables from the label variable. Feature variables are the columns that contain a measurable piece of data that is used to build the model.

After that, you need to split the data into training and test datasets. As the name suggests, the training dataset is used to train the algorithm while we check the accuracy of the model on the test dataset. We can do this split using the sklearn.model_selection.train_test_split() function. Usually, we do a 75-25 split of the dataset. 75% of the main data is the training data and the rest of it is the test data.

We pass the random state argument to the function as it will result in a reproducible train test split.
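A minimal sketch of these two steps, assuming the four numeric measurement columns are the features (consistent with the feature importances reported later) and species is the label; the random_state value is arbitrary:

from sklearn.model_selection import train_test_split

# Separate feature variables from the label variable
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[features]
y = penguins['species']

# 75-25 train-test split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)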


K - nearest neighbors

K-nearest neighbors is an instance-based or memory-based algorithm. What this means is that the algorithm memorizes the dataset during training and uses it to predict classes later on. This makes training fast, but predictions take time because the algorithm has to search through the stored training points. The ‘k’ in k-NN stands for the number of nearest neighbors it looks at to make the prediction.

For simplicity, we will first look at an example where k=1. This means that while making a prediction, the algorithm will look for a point in the training data which is closest to our test point and assigns the class of that training point.

Fig. 4. Scatter Plot - Flipper Length vs. Bill Length

Here I have plotted all the points of our training data and built the model to obtain the regions assigned to each class, using a value of ‘k’ equal to 1. To assign these regions, the algorithm searches for the training point closest to each point on the plot and assigns that training point's class. The boundary that separates these regions is called the decision boundary.

For higher values of k, such as k=2, it searches for the two closest points and takes a majority vote of their classes. If the two closest points belong to classes 1 and 2, the tie is broken by assigning one of the two classes at random.

However, if the value of k is 3 and the 3 closest points belong to classes 1, 2 and 1, it will choose the majority class and assign class 1 to the test point. Higher values of k lead to smoother decision boundaries.

Now let us see how to train the dataset. Import the KNeighborsClassifier from sklearn, create the classifier and train the model.
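A minimal sketch, using the parameter values discussed below (the exact values behind the reported score are assumptions):

from sklearn.neighbors import KNeighborsClassifier

# Create the classifier and train it on the penguin training data
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
knn.fit(X_train, y_train)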

The parameters passed to KNeighborsClassifier are used to improve the model. We will discuss them in detail later.

Now it is time to test the model.
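A one-line check of the accuracy on the held-out test set:

print(knn.score(X_test, y_test))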

0.9761904761904762

A 97% accuracy is pretty much what we desire. We can improve this accuracy by changing the model parameters. We will talk about this later in Hyperparameter Tuning.

The n_neighbors parameter is the value of k - the number of closest neighbours the algorithm considers. A higher value might improve accuracy.

The weights parameter assigns different weights to the k nearest neighbors. We have chosen the ‘uniform’ value, which weights all neighbors equally. Another option is ‘distance’, which gives higher weightage to the neighbors closer to the test point.

The metric parameter chooses the type of distance. We have chosen the Euclidean distance; other possible values include ‘manhattan’ and ‘minkowski’.

The complete documentation for K-Nearest Neighbors can be found here.


Support Vector Classifier (SVC)

SVC uses the Support Vector Machine (SVM) algorithm to perform binary classification. The simplest form of the SVM is the Linear SVM which uses a linear decision boundary to perform classification.

Linear SVM uses the signum function to classify data points. If the value inside the function is positive it assigns it the value +1. If the value is negative, the value assigned is -1.

$$f(x, w, b) = \operatorname{sign}(w \cdot x + b)$$

Here x is the vector of feature variables, w is the vector of weights assigned to those variables and b is the bias term, the constant that is added.

The decision boundary is chosen such that the perpendicular distance to the closest points of both the classes is equal and maximum.

The equation of the decision boundary would be

$$w \cdot x + b = 0$$

Fig. 5. Linear Decision Boundary

The above plot shows the decision boundary and the classification for the breast cancer dataset for two features - ‘radius_mean’ and ‘texture_mean’.

Let us see how to implement this using scikit learn.
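A minimal sketch, assuming X_train and y_train here refer to a 75-25 split of the breast cancer data prepared in the same way as before (the plots above use only the 'radius_mean' and 'texture_mean' features):

from sklearn.svm import SVC

# Linear Support Vector Classifier; the fitted estimator's repr is shown below
svc = SVC(C=1, kernel='linear')
svc.fit(X_train, y_train)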

SVC(C=1, kernel='linear')

The parameter C is called the regularization parameter. The value of C decides how well the model fits the training data. A higher value of C implies less regularization and fits the training data with as much accuracy as possible. This might sometimes lead to overfitting which is not desired.

A smaller value of C implies more regularization and leaves some wiggle room for the model to find a better decision boundary and could lead to a lower training accuracy but a higher test accuracy.

The plots below show the effect of varying the value of C.

Fig. 6. Decision Boundary with C = 0.0005
Fig. 7. Decision Boundary with C = 100

You can see how the decision boundary changes for different values of C.

The kernel parameter decides the function used by the model. You can also use non-linear kernels like ‘rbf’ to create non-linear decision boundaries. Essentially,  the kernel changes the dimensionality of the data. The kernel takes the current feature space and transforms it into a different feature space.

The ‘rbf’ or the radial basis function kernel uses this expression to transform the feature space -

$$K(x_1, x_2) = \exp\left(-\gamma \, \lVert x_1 - x_2 \rVert^2\right)$$

Here K(x1, x2) is the kernel value computed for the feature vectors x1 and x2, and γ (gamma) is a hyperparameter that regulates the width of the kernel.
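A sketch of the corresponding kernelized classifier, with the parameter values shown in the output below:

from sklearn.svm import SVC

# Non-linear SVC with an RBF kernel; gamma controls the width of the kernel
svc_rbf = SVC(C=1, kernel='rbf', gamma=0.01)
svc_rbf.fit(X_train, y_train)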

SVC(C=1, gamma=0.01)
Fig. 8. Non-linear Decision Boundary

Let us test this model on the test data.
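Scoring the kernelized model on the cancer test set:

print(svc_rbf.score(X_test, y_test))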

0.9020979020979021

Decision Tree

Decision Trees are the easiest machine learning classification algorithm to understand. They are the most human-like algorithm, classifying objects much the way we do: they ask simple yes-no questions and classify objects based on the answers. Let us look at a decision tree to understand this better.

Fig. 9. Decision Tree

This decision tree is also a good way to determine which feature is best for performing the classification. In our case, the feature chosen at the root of the tree does most of the work in separating the species. As we move down the tree, each node contains a subset of the samples that is divided further into more nodes. To split the tree, we use splitting measures like the Gini index.

The first line in each node is the question the algorithm is asking. Depending on whether the answer is true or false, it further performs the classification. As simple as that. Each node also displays the Gini impurity of the node, the total number of samples in that node and the number of samples of each class present in the node.

The Gini impurity is a representation of how pure the node is. If the node contains samples belonging to one class only, then the value of Gini impurity is zero. This means the node is pure.

In general, the Gini impurity is represented by

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$

where Gi is the Gini impurity of the ith node and pi,k is the ratio of the number of samples of class k in the ith node to the total number of samples in the ith node. For example, in the root node the impurity is computed from the proportions of the three species among all the training samples.

Let us now look at how a Decision Tree is implemented in Python.
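A minimal sketch for the penguin data (default parameters are assumed; a tree like the one in Fig. 9 can be drawn with sklearn.tree.plot_tree):

from sklearn.tree import DecisionTreeClassifier

# Train a decision tree on the penguin training split and score it on the test split
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))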

0.9404761904761905

Decision Trees are usually preferred over k-NN because they work for all types of feature variables - categorical, continuous, binary etc. You can read more about them here.

The problem with Decision Trees is that even the most tuned trees tend to overfit the training data.

We can solve this problem by using an algorithm that builds an ensemble of trees, known as the Random Forest Classifier.


Random Forest Classifier

This algorithm works on the concept of ‘The Wisdom of the Crowd’.

A thousand random people will collectively give a more accurate aggregated result than a single expert. A Random Forest Classifier builds an ensemble of decision trees and predicts the aggregated result of all the trees. Each decision tree uses a random subset of the features from the data to perform classification.

Ensemble learning is one of the most widely used techniques in Machine Learning and produces models with the highest accuracies.

Random Forests use Bagging in their algorithm. Bagging involves drawing random subsets of the dataset with replacement and training a separate decision tree on each subset. The n_estimators parameter decides the number of decision trees in the Random Forest.

We can also find out which features matter more for the classification, using the feature_importances_ attribute of the trained model, as sketched below.
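A sketch of training a Random Forest on the penguin data and reading the importances (the n_estimators value is an assumption):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# feature_importances_ holds one score per feature; the scores sum to 1
for name, importance in zip(X_train.columns, forest.feature_importances_):
    print(name, importance)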

bill_length_mm 0.5186342825858854
bill_depth_mm 0.3420978480595717
flipper_length_mm 0.032320642858410395
body_mass_g 0.10694722649613257

We can improve the accuracy of these trees by Boosting them. We can use  Gradient Boosting to do so.

The basic concept of boosting is that each subsequent tree learns from its predecessors: the trees are not independent, and each one relies on the previous ones to improve the overall accuracy.
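A sketch using scikit-learn's GradientBoostingClassifier on the penguin data (default parameters assumed):

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))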

0.9761904761904762

Hyperparameter Tuning

We can improve our model by tuning the fine balance between all the parameters in our algorithm. Many techniques exist, but by far the most popular are Cross-Validation and Grid Search.

Cross-Validation

Before training the model, we perform train test splits using a random seed. There are many possible combinations of training and test datasets. Generally, we consider only one combination.

Using the cross-validation method, we split the training dataset into a further smaller training set and a validation set. This process is repeated several times, and what we get is a much more reliable estimate of the model's accuracy.

We use k-fold cross-validation. It randomly divides the dataset into k distinct subsets called folds, then trains and evaluates the model k times, each time validating on a different fold and training on the remaining ones.

Now we will try to improve on the kernelized SVM model we had built before.
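A sketch using cross_val_score on the cancer training data (cv=5 is inferred from the five scores reported below; the SVC parameters match the kernelized model above):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X_train / y_train refer to the cancer training split here
scores = cross_val_score(SVC(C=1, kernel='rbf', gamma=0.01), X_train, y_train, cv=5)
print(scores)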

array([0.93023256, 0.89411765, 0.87058824, 0.89411765, 0.91764706])

Using the kernelized SVM, we had scored an accuracy of 0.902. With cross-validation we get a better picture of the model's performance, with a best fold score of 0.9302.

The GridSearchCV class in scikit-learn provides a great way to fine-tune your parameters: it finds the combination of parameter values that gives the highest score.

We just need to specify which parameters we want to experiment with and the list of values to try for each. GridSearchCV will fit the model for every combination of these values and report the one with the best score.
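A sketch of the grid search whose result is shown below (here the estimator is a Random Forest, and X_train, y_train are assumed to be the penguin training split from earlier):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'max_depth': [3, 5, 8], 'n_estimators': [10, 15, 20]}
grid = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid)
grid.fit(X_train, y_train)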

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [3, 5, 8], 
             'n_estimators': [10, 15, 20]})

You can find out which parameters give the best score from the best_params_ attribute.

{'max_depth': 3, 'n_estimators': 20}

You can also find the best cross-validated score from the best_score_ attribute.

0.9800000000000001

Performance Evaluation

There are many ways to evaluate model performance. We have just talked about model accuracy scores. However, more sophisticated techniques such as confusion matrix, precision-recall and ROC-AUC can give a better rundown of your model. Let us have a look at them.

Confusion matrix

A confusion matrix for a binary classifier is a matrix containing the counts of true and false positives and negatives, and it gives you an idea of which class is difficult for the algorithm to classify.

First, let us look at the terms. In the cancer dataset, a positive class would be malignant and a negative class would be benign.

  • True Positive - The algorithm classifies a malignant observation correctly.
  • False Positive - The algorithm classifies cancer as malignant when it is benign.
  • True Negative - The algorithm classifies a benign class correctly.
  • False Negative - The algorithm classifies a malignant observation as benign.
                     Actually Positive        Actually Negative
Predicted Positive   True Positives (TPs)     False Positives (FPs)
Predicted Negative   False Negatives (FNs)    True Negatives (TNs)

This is the layout of the confusion matrix. The correct predictions (true positives and true negatives) lie on its diagonal. It is used to calculate the precision and recall scores.

Precision tells us about the accuracy of the positive predictions by our algorithm.

The recall is the ratio of the positives that are correctly detected by our model. It is also known as the true positive rate (TPR).
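In terms of the confusion-matrix counts, the two quantities are:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$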

array([[119,  40],
       [ 15, 252]])

This is the confusion matrix on the cancer dataset using the gradient boosted classifier.
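A minimal sketch of how such a matrix can be computed, assuming a gradient boosted classifier trained on the cancer training split with the malignant class encoded as 1 (the exact setup behind the counts above is not shown in the text):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix

# Gradient boosted classifier fitted on the cancer training split
gbc_cancer = GradientBoostingClassifier(random_state=0)
gbc_cancer.fit(X_train, y_train)

y_pred = gbc_cancer.predict(X_test)
print(confusion_matrix(y_test, y_pred))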

Here we can see the number of true positives and true negatives and calculate the precision and recall.

Precision-Recall

Let us look at how precision and recall can affect model performance.
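A sketch computing both scores from the predictions of the gradient boosted classifier above (malignant encoded as 1, the positive class):

from sklearn.metrics import precision_score, recall_score

print('Precision: ', precision_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))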

Precision:  0.8880597014925373
Recall:  0.7484276729559748

We can also combine our precision and recall scores in one entity known as the F1 score. It is the harmonic mean of precision and recall scores.
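In symbols, the F1 score is:

$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In scikit-learn it is available as sklearn.metrics.f1_score.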

0.8122866894197952

High precision and recall scores result in a higher F1 score. Unfortunately, there is a trade-off: we cannot have both very high precision and very high recall at the same time.

However, there are cases in which you prefer a higher precision or a higher recall. In our cancer dataset, we would prefer more false positives than false negatives.  So we need a higher recall. We need to build our algorithms accordingly.

We can also plot a precision-recall curve.
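A sketch of how the curve can be generated, assuming the classifier exposes decision_function (as the gradient boosted classifier above does) and the positive class is encoded as 1:

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Continuous scores for the positive class
scores = gbc_cancer.decision_function(X_test)
precisions, recalls, thresholds = precision_recall_curve(y_test, scores)

# precision and recall have one more entry than thresholds, so drop the last value
plt.plot(thresholds, precisions[:-1], label='Precision')
plt.plot(thresholds, recalls[:-1], label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.show()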

Fig. 10. Precision & Recall

Now we can pick the threshold that gives us the combination of precision and recall we need. Alternatively, we can plot precision against recall directly and read off the trade-off we want.

ROC - AUC

The receiver operating characteristic (ROC) is a great method of evaluating your binary classifier. The ROC is a plot of the True Positive Rate (Recall) against the False Positive Rate (FPR). The FPR is the ratio of negative observations that are falsely predicted as positive.

We can compute the TPR and the FPR using the roc_curve() function.
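A sketch using roc_curve() and roc_auc_score(), reusing the scores computed for the precision-recall curve:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label='Classifier')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()

print(roc_auc_score(y_test, scores))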

Fig. 11. ROC Curve

The dotted line represents the ROC of a random classifier. The Area Under Curve (AUC) of a good classifier is close to one.

0.846123948837538

Using SVC for Trading

We will be using Support Vector Machines to implement a simple trading strategy in Python. The target variable is a signal to either buy or sell the stock, predicted from the feature variables.

We will be using S&P 500 data to train and test the model. We can download the data using the yfinance library. You can install the library with the command ‘pip install yfinance’. After installation, load the library and download the dataset.
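A sketch of the download step (the ticker symbol '^GSPC' for the S&P 500 index and the date range are assumptions):

import yfinance as yf
import matplotlib.pyplot as plt

# Download daily S&P 500 data from Yahoo Finance
df = yf.download('^GSPC', start='2010-01-01', end='2021-01-01')

df['Close'].plot(title='Close Price of S&P 500')
plt.show()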

Fig. 12. Close Price of S&P 500

Now we must decide the target variable and split the dataset accordingly. For our case, we will use the closing price of the stock to determine whether to buy or sell. If the next trading day's close price is greater than today's close price, then we will buy the S&P 500 index, else we will sell it. We will store +1 for the buy signal and -1 for the sell signal.

X is a dataset that holds the predictor variables, which are used to predict the target variable ‘y’. X consists of variables such as 'Open - Close' and 'High - Low'. These can be understood as indicators based on which the algorithm will predict the next day's signal.
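A sketch of building the target signal and the two indicator features described above:

import numpy as np

# Target: +1 (buy) if the next day's close is higher than today's close, else -1 (sell)
y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)

# Predictor variables
df['Open-Close'] = df['Open'] - df['Close']
df['High-Low'] = df['High'] - df['Low']
X = df[['Open-Close', 'High-Low']]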

Now we can split the train and test data and train a simple SVC on this data.
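A sketch of the split and the classifier (for time-series data a chronological split, i.e. shuffle=False, is the safer choice):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

model = SVC()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))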

0.5026455026455027

Summary

Machine Learning models can be classified into supervised and unsupervised depending on the presence of target variables. Supervised models can further be divided into Regression and Classification models based on the type of target variable.

Various types of classification models include k-nearest neighbors, Support Vector Machines and Decision Trees, which can be improved by using Ensemble Learning. This leads to Random Forest Classifiers, which are made up of an ensemble of decision trees, and to boosted models in which the trees learn from their predecessors to improve model accuracy.

You can also improve model accuracy by using hyperparameter tuning techniques such as GridSearchCV and k-fold Cross-Validation. We can evaluate the models using methods such as Confusion Matrices, finding a good Precision-Recall tradeoff and using the Area under the ROC curve. We have also looked at how an SVC can be used in a simple trading strategy.

Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.
