By Ishan Shah
What is cross validation in machine learning?
Cross validation in machine learning is a technique that provides a reliable measure of the performance of a machine learning model, one closer to what you can expect when the model is used on a future, unseen dataset. In this article, you will learn to:
- Determine whether a machine learning model is good at predicting buy and/or sell signals
- Demonstrate the performance of your machine learning trading model in different stress scenarios
- Perform cross validation on a machine learning trading model from start to finish
Importing the libraries
import quantrautil as q
import numpy as np
from sklearn import tree
- quantrautil - this will be used to fetch the price data of the AAPL stock from Yahoo Finance.
- numpy - to perform the data manipulation on the AAPL stock price series and compute the input features and output.
- tree from sklearn - sklearn provides tools and implementations of many machine learning models; 'tree' will be used to create the decision tree classifier model.
Fetching the data
The next step is to import the price data of the AAPL stock using quantrautil. The get_data function from quantrautil is used to fetch the AAPL data for 19 years, from 1 Jan 2000 to 31 Dec 2018, as shown below. The data is stored in the dataframe aapl.
aapl = q.get_data('aapl', '2000-1-1', '2019-1-1')
print(aapl.tail())
[*********************100%***********************]  1 of 1 downloaded
                  Open        High         Low       Close   Adj Close  \
Date
2018-12-24  148.149994  151.550003  146.589996  146.830002  146.830002
2018-12-26  148.300003  157.229996  146.720001  157.169998  157.169998
2018-12-27  155.839996  156.770004  150.070007  156.149994  156.149994
2018-12-28  157.500000  158.520004  154.550003  156.229996  156.229996
2018-12-31  158.529999  159.360001  156.479996  157.740005  157.740005

              Volume
Date
2018-12-24  37169200
2018-12-26  58582500
2018-12-27  53117100
2018-12-28  42291400
2018-12-31  35003500
Creating input and output dataset
In this step, I will create the input and output variables.
- Input variable: I have used '(Open-Close)/Open', '(High-Low)/Low', the standard deviation of the last 5 days' returns (std_5), and the average of the last 5 days' returns (ret_5)
- Output variable: If tomorrow's close price is greater than today's close price, the output variable is set to 1, and otherwise to -1. 1 indicates buying the stock and -1 indicates selling it.
# Features construction
aapl['Open-Close'] = (aapl.Open - aapl.Close) / aapl.Open
aapl['High-Low'] = (aapl.High - aapl.Low) / aapl.Low
aapl['percent_change'] = aapl['Adj Close'].pct_change()
aapl['std_5'] = aapl['percent_change'].rolling(5).std()
aapl['ret_5'] = aapl['percent_change'].rolling(5).mean()
aapl.dropna(inplace=True)

# X is the input variable
X = aapl[['Open-Close', 'High-Low', 'std_5', 'ret_5']]

# y is the target or output variable
y = np.where(aapl['Adj Close'].shift(-1) > aapl['Adj Close'], 1, -1)
Training the machine learning model
All set with the data! Let's train a decision tree classifier model. A DecisionTreeClassifier from tree is stored in the variable clf, and its fit method is then called with the X and y datasets as parameters so that the classifier model can learn the relationship between X and y.
clf = tree.DecisionTreeClassifier(random_state=5)
model = clf.fit(X, y)
Cross validation of the machine learning model
If cross validation is done on the same data from which the model learned, it is a no-brainer that the performance of the model is bound to look spectacular.
from sklearn.metrics import accuracy_score

print('Correct Prediction: ', accuracy_score(y, model.predict(X), normalize=False))
# X.shape is a (rows, columns) tuple; the row count is the number of predictions
print('Total Prediction: ', X.shape[0])
Correct Prediction:  4775
Total Prediction:  4775
print(accuracy_score(y, model.predict(X), normalize=True)*100)
How do you overcome this problem of using the same data for training and testing?
One of the easiest and most widely used ways is to partition the data into two parts: one part (the training dataset) is used to train the model and the other part (the testing dataset) is used to test it.
# Total dataset length (aapl.shape is a tuple, so take the row count)
dataset_length = aapl.shape[0]

# Training dataset length
split = int(dataset_length * 0.75)
split
# Splitting X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Print the size of the train and test datasets
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
(3581, 4) (1194, 4)
(3581,) (1194,)
# Create the model on train dataset model = clf.fit(X_train, y_train)
# Calculate the accuracy accuracy_score(y_test, model.predict(X_test), normalize=True)*100
K-Fold Cross Validation Technique
Don't worry! The K-fold cross validation technique, one of the most popular methods, helps to overcome these problems. This method splits your dataset into K equal or close-to-equal parts. Each of these parts is called a "fold". For example, you can divide your dataset into 4 equal parts, namely P1, P2, P3 and P4. The first model, M1, is trained on P2, P3 and P4 and tested on P1. The second model is trained on P1, P3 and P4 and tested on P2, and so on. In other words, model i is trained on the union of all folds except the ith, and its performance is tested on the ith fold.

When this process is completed, you will end up with four accuracy values, one for each model. You can then compute the mean and standard deviation of all the accuracy scores to get an idea of how accurate you can expect the model to be. Here are some questions that you might have.
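The fold assignment just described can be sketched on a toy set of sample indices; the dataset size of 12 and K = 4 here are illustrative choices, not values from the article:

```python
import numpy as np

# Toy "dataset" of 12 sample indices split into K = 4 equal folds (P1..P4)
indices = np.arange(12)
k = 4
folds = np.array_split(indices, k)

# Model i is trained on every fold except the ith and tested on the ith
splits = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))

# Model 1 (i = 0): trained on P2 + P3 + P4, tested on P1
print("test fold P1:", splits[0][1])  # -> test fold P1: [0 1 2]
```

Each index appears in exactly one test fold, which is why averaging the K scores uses every data point for validation exactly once.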
How do you select the number of folds?
The number of folds should leave each validation partition large enough to provide a fair estimate of the model's performance on it, while not being so small (say, 2) that we have too few trained models to perform a meaningful cross validation.
Why is this better than the original method of a single train and test split?
As discussed above, by choosing a different length for the train and test split, the model's performance can vary quite a bit depending on the specific data points that happen to end up in the training or testing dataset. K-fold gives a more stable estimate of how the model is likely to perform on average, instead of relying completely on a single model trained on a single training dataset.
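The sensitivity to the split point can be seen directly by re-evaluating the same classifier at several train/test boundaries. The sketch below uses synthetic data in place of the AAPL features; the 500-sample size, the random seed and the noisy labelling rule are all assumptions made for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the feature matrix and labels built earlier;
# labels depend noisily on the first feature so accuracy sits above chance
rng = np.random.RandomState(0)
X_syn = rng.randn(500, 4)
y_syn = np.where(X_syn[:, 0] + rng.randn(500) > 0, 1, -1)

# Evaluate the same model at several train/test split points
accuracies = []
for frac in (0.6, 0.7, 0.8, 0.9):
    split = int(len(X_syn) * frac)
    tree_clf = DecisionTreeClassifier(random_state=5)
    tree_clf.fit(X_syn[:split], y_syn[:split])
    acc = accuracy_score(y_syn[split:], tree_clf.predict(X_syn[split:])) * 100
    accuracies.append(acc)
    print("train fraction %.1f: accuracy %.1f%%" % (frac, acc))
```

The accuracy typically moves by several percentage points as the boundary shifts, which is exactly the instability K-fold averages away.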
How can you perform cross validation of the model on a dataset which is prior to the dataset used to train the model? Is it not historically accurate?
Suppose you train the model on data from January 2010 to December 2018 and test it on data from January 2008 to December 2009. Rightly so, the performance obtained from the model will not be historically accurate; this is one of the limitations of the method. On the other hand, this method can help you assess how the model would have performed in stress scenarios such as 2008. For example, when investors ask how the model would perform if scenarios such as the dot-com bubble, the housing bubble, or QE tapering occurred again, you can show them the out-of-sample results of the model from when such scenarios actually occurred. That should approximate the likely performance if such scenarios occur again.
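If you instead want every fold to respect chronology, so that the training window always precedes the test window, scikit-learn also provides TimeSeriesSplit. This is an alternative not used in the article, sketched here on a toy range of 20 samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Walk-forward splits: each test fold lies strictly after its training window
tscv = TimeSeriesSplit(n_splits=4)
X_toy = np.arange(20).reshape(-1, 1)  # 20 dummy samples in time order

splits = list(tscv.split(X_toy))
for train_index, test_index in splits:
    # The training window always ends right before the test window begins
    print("TRAIN:", train_index[0], "-", train_index[-1],
          "TEST:", test_index[0], "-", test_index[-1])
```

The trade-off is that early folds train on very little data, but no model is ever tested on data from before its training period.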
Code K-fold in Python
To code this, the KFold class from the sklearn.model_selection module is used. You pass the number of splits required and whether to shuffle the data points (shuffle=True) or not (shuffle=False), and store the result in a variable, say kf. Then call the split method on kf with X as the input. The split method splits the indices of X and returns an iterator object. The iterator is consumed in a for loop, printing the integer indices of the train and test sets.
from sklearn.model_selection import KFold

kf = KFold(n_splits=4, shuffle=False)
<generator object _BaseKFold.split at 0x0000000009936F68>
print("Train: ", "TEST:")
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
Train:  TEST:
[1194 1195 1196 ... 4772 4773 4774] [   0    1    2 ... 1191 1192 1193]
[   0    1    2 ... 4772 4773 4774] [1194 1195 1196 ... 2385 2386 2387]
[   0    1    2 ... 4772 4773 4774] [2388 2389 2390 ... 3579 3580 3581]
[   0    1    2 ... 3579 3580 3581] [3582 3583 3584 ... 4772 4773 4774]
# Initialize the accuracy of the models to a blank list.
# The accuracy of each model will be appended to this list
accuracy_model = []

# Iterate over each train-test split
for train_index, test_index in kf.split(X):
    # Split train-test
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train the model
    model = clf.fit(X_train, y_train)
    # Append to accuracy_model the accuracy of the model
    accuracy_model.append(accuracy_score(y_test, model.predict(X_test), normalize=True) * 100)

# Print the accuracy
print(accuracy_model)
[50.502512562814076, 49.413735343383586, 51.75879396984925, 49.79044425817267]
Stability of the model
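A natural way to judge stability is to compute the mean and standard deviation of the per-fold accuracies, as mentioned above. The sketch below hard-codes the four scores printed by the K-fold loop:

```python
import numpy as np

# Per-fold accuracies from the K-fold loop above
accuracy_model = [50.502512562814076, 49.413735343383586,
                  51.75879396984925, 49.79044425817267]

mean_acc = np.mean(accuracy_model)
std_acc = np.std(accuracy_model)
print("Mean accuracy: %.2f%%, standard deviation: %.2f%%" % (mean_acc, std_acc))
```

A mean barely above 50% with a standard deviation under one percentage point tells you the model is consistently close to a coin flip across all four folds.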
Confusion Matrix
With the above accuracy scores, you got an idea about the overall accuracy of the model. But how accurate is the model at predicting each label, such as Buy and Sell? This can be determined by using the confusion matrix. In the above example, the confusion matrix will tell you the number of times the actual value was 'buy' and the prediction was also 'buy', the actual value was 'buy' but the prediction was 'sell', and so on.
# Import pandas for creating a dataframe
import pandas as pd

# To calculate the confusion matrix
from sklearn.metrics import confusion_matrix

# To plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn

# Initialize the array to zero; it will accumulate the confusion matrices
array = [[0, 0], [0, 0]]

# For each train-test split: train, predict and compute the confusion matrix
for train_index, test_index in kf.split(X):
    # Train-test split
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train the model
    model = clf.fit(X_train, y_train)
    # Calculate the confusion matrix; pass labels=[1, -1] so that the first
    # row/column is Buy (1) and the second is Sell (-1), matching the
    # 'Buy'/'Sell' labels of the dataframe below
    c = confusion_matrix(y_test, model.predict(X_test), labels=[1, -1])
    # Add the fold's matrix to the running total
    array = array + c

# Create a pandas dataframe that stores the output of the confusion matrix
df = pd.DataFrame(array, index=['Buy', 'Sell'], columns=['Buy', 'Sell'])

# Plot the heatmap
sn.heatmap(df, annot=True, cmap='Greens', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
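Once the aggregated matrix is available, the per-label accuracy this section asks about can be read off its rows: the diagonal count divided by the row total. A minimal sketch, where the matrix counts are made up for illustration and are not the article's actual results:

```python
import numpy as np

# Illustrative aggregated confusion matrix (rows = actual, columns = predicted);
# the counts below are hypothetical, not the article's results
cm = np.array([[1250, 1180],    # actual Buy:  predicted Buy, predicted Sell
               [1190, 1155]])   # actual Sell: predicted Buy, predicted Sell

# Per-label accuracy (recall): the diagonal count divided by that row's total
per_label = cm.diagonal() / cm.sum(axis=1)
print("Buy accuracy:  %.1f%%" % (per_label[0] * 100))
print("Sell accuracy: %.1f%%" % (per_label[1] * 100))
```

Comparing the two numbers shows whether the model is systematically better at one signal than the other, which the single overall accuracy figure hides.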