By **Lamarcus Coleman**

**Introduction**

In this post we're going to learn how we can address a key concern of linear models, the assumption of linearity. We'll take a look at Linear Regression, a foundational statistical learning technique, learn what's happening under the hood of the model, cover some things that we want to be aware of, and then learn more about some of the weaknesses of the model. We'll then introduce polynomial regression as a solution to a key weakness of linear models, focusing on Linear Regression in this post.

**Let's Review Linear Regression**

Our goal is to estimate $f(x)$, or the function that accurately describes the relationship between our independent and dependent variables. Establishing this relationship between indicators and trades is the key feature that is common to all machine learning based trading models. In parametric models we save some time by making some assumptions about the structure of our function, $f(x)$. In Linear Regression, we assume that $f(x)$ is linear. You may recall from algebra that the equation for a line is $y = mx + b$, where $y$ is our response, $m$ is the slope, or derivative in calculus terms, and $b$ is our intercept, or the value of $y$ when $x$, our explanatory variable, is equal to $0$.

The Simple Linear Regression equation is below:

$$Y_i = B_0 + B_1 X_i + E_i$$

Here the coefficients of the line simply take a different form and we add an error term, $E_i$. The error term accounts for noise that we can't model, or randomness.

Though libraries and packages make our lives easier by abstracting away a lot of the computations, there are quite a few things happening when we build a machine learning based Linear Regression model. Our objective is to approximate our coefficients, $B_0$, or our intercept, and $B_1$, our slope. Once we have values for our coefficients, we can plug them into our linear equation to predict the value of our response given some value for $X$. But how do we find these coefficients?

The equation for $B_1$ is below:

$$B_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

where $\bar{y}$ is the mean of $y$ and $\bar{x}$ is the mean of $x$.

The equation for $B_0$ is below:

$$B_0 = \bar{y} - B_1 \bar{x}$$

where $\bar{y}$ is the mean of $y$, $B_1$ is our predicted slope, and $\bar{x}$ is the mean of our explanatory variable.

We then test whether our slope is significant. Our null hypothesis, or $H_0$, is $B_1 = 0$, and our alternate hypothesis, or $H_A$, is $B_1 \neq 0$. Essentially, the null is stating that our slope is 0, or that there is no relationship between our variables. To test this, we calculate how many standard errors from 0 our $B_1$ is:

$$t = \frac{B_1 - 0}{SE(B_1)}$$

In the above equation, $B_1$ is our predicted slope and $SE(B_1)$ is the standard error of our slope. The standard error is a way of measuring the deviation of our slope. The equation for the standard error of $B_1$ is below:

$$SE(B_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

where $\sigma^2 = Var(E)$, or the variance of our error term. We use our data to estimate this variable by calculating our RSE, or residual standard error, by the equation below:

$$RSE = \sqrt{\frac{RSS}{n - 2}}$$

where $n$ is the number of observations and $RSS$ is our residual sum of squares. Our RSS can be found by the following equation:

$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value of the response and $\hat{y}_i$ is our prediction for the $i$th observation.

We use our t-statistic to decide whether to reject or fail to reject the null hypothesis, $H_0$. Notice that I didn't say that we would either reject or accept the null. This is because failing to reject the null does not necessarily mean that we accept the null. We just weren't able to reject it at some significance level, or within some confidence interval.

We make this decision by comparing our p-value to a significance level, $\alpha$. Our p-value tells us, assuming the null hypothesis is true, how likely we are to observe a value of our test statistic at least as extreme as the one that we observed. It is not the probability that the null hypothesis is true. It is also the lowest significance level at which we can reject the null hypothesis, $H_0$.

Everything above describes Simple Linear Regression. With more than one predictor, we have a coefficient like $B_1$ for each of our $x$ terms. This simply expresses the relationship between that specific $x$ and our response $y$. Our test statistic and distribution also change from the t-statistic and t-distribution to the F-statistic and F-distribution.
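To make the formulas above concrete, here is a minimal NumPy sketch of the Simple Linear Regression calculations. The data is synthetic and the seed, sample size, and variable names are assumptions chosen for illustration, so treat it as a sketch rather than the exact computation a library performs.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=200)

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the least squares formulas
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residual sum of squares and residual standard error
y_hat = b0 + b1 * x
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard error of the slope and the t-statistic for H0: B1 = 0
se_b1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))
t_stat = (b1 - 0) / se_b1

print('B0:', b0, 'B1:', b1)
print('RSE:', rse, 'SE(B1):', se_b1, 't:', t_stat)
```

Because the synthetic slope is far from 0 relative to its standard error, the t-statistic here is large, which is the situation in which we would reject the null.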

#### Let's Build a Model

    #data analysis and manipulation
    import numpy as np
    import pandas as pd
    #data collection
    import pandas_datareader as pdr
    #data visualization
    import matplotlib.pyplot as plt
    import seaborn as sns

Let's import our data.

    #setting our testing period
    start='2016-01-01'
    end='2018-01-01'
    pnb=pdr.get_data_yahoo('PNB.NS',start, end)

    pnb.head()

Let's see how PNB performed over our sample period.

    plt.figure(figsize=(10,6))
    plt.plot(pnb['Close'])
    plt.title('PNB 2016-2018 Performance')
    plt.show()

    #making a copy of our data frame
    PNB=pnb.copy()

    #creating our predictor variables
    #Lag 1 Predictor
    PNB['Lag 1']=PNB['Close'].shift(1)
    #Lag 2 Predictor
    PNB['Lag 2']=PNB['Close'].shift(2)
    #Higher High Predictor
    PNB['Higher High']=np.where(PNB['High'] > PNB['High'].shift(1),1,-1)
    #Lower Low Predictor
    PNB['Lower Low']=np.where(PNB['Low'] < PNB['Low'].shift(1),1,-1)
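To see what these transformations produce, here is a small sketch that applies the same four steps to a made-up frame. The `toy` frame and its prices are hypothetical and are not PNB data.

```python
import numpy as np
import pandas as pd

# Toy prices (made up for illustration, not PNB data)
toy = pd.DataFrame({
    'High':  [10.0, 11.0, 10.5, 12.0],
    'Low':   [9.0, 9.5, 9.2, 9.8],
    'Close': [9.5, 10.8, 9.9, 11.5],
})

# Same transformations as the predictors in the post, applied to the toy frame
toy['Lag 1'] = toy['Close'].shift(1)
toy['Lag 2'] = toy['Close'].shift(2)
toy['Higher High'] = np.where(toy['High'] > toy['High'].shift(1), 1, -1)
toy['Lower Low'] = np.where(toy['Low'] < toy['Low'].shift(1), 1, -1)

print(toy)
```

The lag columns start with NaN values because there is no earlier close to shift in, which is why the NaN values are filled before fitting later on. The first row of each flag column is -1 because a comparison against NaN evaluates to False.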

Now let's review our PNB dataframe.

PNB.head()

Now let's import our Linear Regression Model from sci-kit learn.

from sklearn.linear_model import LinearRegression

Now that we have our model, let's import our `train_test_split` function from sklearn.

from sklearn.model_selection import train_test_split

Now we're ready to create training and testing sets for our data. But first, let's initialise our X and y variables. Remember X denotes our predictor variables, and y denotes our response or what we're actually trying to predict.

    #creating our predictor variables
    X=PNB.drop(['Open','High','Low','Close','Volume','Adj Close'],axis=1)
    #initializing our response variable
    y=PNB['Close']

Now we're ready to split our data into training and testing sets.

X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=.20, random_state=101)
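As a quick illustration of what this split does, here is a sketch on ten made-up observations using the same `test_size` and `random_state` arguments. The `X_demo` and `y_demo` names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten ordered observations standing in for a price series (made-up values)
X_demo = pd.DataFrame({'Lag 1': np.arange(10.0)})
y_demo = pd.Series(np.arange(10.0) + 1.0)

# Same arguments as in the post: 20% test set, fixed seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=.20, random_state=101)

# Repeating the call with the same seed reproduces the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=.20, random_state=101)

print('train rows:', len(Xa_train), 'test rows:', len(Xa_test))
print('test index:', list(Xa_test.index))
```

Because `train_test_split` shuffles by default, the test rows are drawn from throughout the sample period rather than only from its end, and fixing `random_state` makes that draw repeatable.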

lm=LinearRegression()

Now we can pass our training sets into our model. We must fill the NaN values in our X_train set first.

    #fitting our model to our training data
    lm.fit(X_train.fillna(0),y_train)

    #making predictions
    predictions=lm.predict(X_test.fillna(0))

Now that we have our predictions, we can check to see how well our model performed. There are a variety of different things that we would like to view to assess our model's performance. Our model's $R^2$, or R-Squared, tells us the percentage of variance in our response that can be explained by our predictors. We would also like to look at our model's errors. Our errors tell us how much our model's predictions deviated from the actual response values. Our objective is to create a model that achieves the lowest possible error.

Let's first check our $B_1$ for each $x_i$, or the slopes for each of our features.

    #checking our model's coefficients
    lm.coef_

Now let's get our $R^2$ value. We will import metrics from sklearn.

    from sklearn import metrics

    #getting our R-Squared value
    #note: explained_variance_score equals R-Squared only when the residuals have zero mean;
    #metrics.r2_score computes R-Squared directly
    print('R-Squared:',metrics.explained_variance_score(y_test,predictions))

    #printing our errors
    print('MSE:',metrics.mean_squared_error(y_test,predictions))
    print('MAE:',metrics.mean_absolute_error(y_test,predictions))
    print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,predictions)))

MSE: 11.1707624541 MAE: 2.48608751953 RMSE: 3.3422690577
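To show what these metrics measure, here is a short sketch that computes them by hand with NumPy. The `actual` and `predicted` arrays are made-up values for illustration and are not the model's output.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
actual = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
predicted = np.array([101.0, 101.0, 102.0, 104.0, 108.0])

errors = actual - predicted

mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
rmse = np.sqrt(mse)              # root mean squared error

# R-Squared: 1 - RSS / TSS
rss = np.sum(errors ** 2)
tss = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - rss / tss

print('MSE:', mse, 'MAE:', mae, 'RMSE:', rmse, 'R-Squared:', r_squared)
```

RMSE is on the same scale as the response, which makes it the easiest of the three errors to compare against the price itself.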

Let's now plot our actual response and our predicted responses. We can also calculate our residuals or errors.

    #passing x and y as keywords; kind='reg' draws the regression fit
    sns.jointplot(x=predictions,y=y_test,kind='reg')
    plt.xlabel('Predictions')
    plt.ylabel('Actual')

The above plot shows that our predictions and actual responses are highly correlated. It also shows a very small p-value. Let's now plot our residuals.

residuals=y_test-predictions

    #plotting residuals
    sns.distplot(residuals,bins=100)

    sns.jointplot(x=residuals,y=predictions)

In Simple Linear Regression, we would plot our residuals against $X$, or our predictor variable. Here, because we have multiple predictors, we plot our residuals, or errors, against our predictions.

**Polynomial Regression**

We can get our $B_0$ and $B_1$ from the `.intercept_` and `.coef_` attributes of our lm model.

Checking our $B_1$

    lm.coef_

array([ 0.93707237, 0.01904699, 0.27980978, -1.93943157])

Checking our $B_0$

lm.intercept_

6.0817053905353902
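To see how the intercept and the coefficients combine into the linear equation, here is a minimal sketch on synthetic data. The `demo_lm` model, the features, and the true coefficients are assumptions for illustration and are not the PNB model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and response (assumed for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# Build the linear equation by hand: y_hat = B0 + B1*x1 + B2*x2 + B3*x3
manual = demo_lm.intercept_ + X_demo @ demo_lm.coef_

# The hand-built equation reproduces the model's own predictions
print(np.allclose(manual, demo_lm.predict(X_demo)))
```

The same arithmetic applies to our lm model: each prediction is the intercept plus each feature multiplied by its coefficient.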

We'll create a dataframe to better visualize our coefficients.

coefficients=pd.DataFrame(lm.coef_,index=X_train.columns,columns=['Coefficients'])

`coefficients`

Now we can clearly see our $B_1$s for each feature. Let's visualise our features so that we have an idea of which ones we would like to transform.

    sns.jointplot(x=PNB['Lag 1'], y=PNB['Close'], kind='reg')

Our Lower Low feature has the lowest $B_1$ coefficient. Let's visualise it.

    plt.figure(figsize=(10,6))
    sns.boxplot(x='Lower Low',y='Close',hue='Lower Low',data=PNB)

We can see a significant drop between the coefficient of our Lag 1 feature and our Lag 2 feature. Let's visualise our Lag 2 feature.

    #lmplot creates its own figure, so no plt.figure call is needed
    sns.lmplot(x='Lag 2', y='Close', data=PNB)

    plt.figure(figsize=(10,6))
    sns.boxplot(x='Higher High',y='Close',data=PNB)

PNB.head()

Now let's add our $Lag2^{2}$ variable.

PNB['Lag 2 Squared']=PNB['Lag 2']**2
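To illustrate why a squared term can help, here is a sketch on synthetic data with a curved relationship. The data and names are assumptions for illustration, and unlike the post, which replaces Lag 2 with its square, this sketch keeps both the original feature and its square.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic curved relationship (an assumption, not PNB data)
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=300)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(0, 0.5, size=300)

# Model 1: x only
X_linear = x.reshape(-1, 1)
linear_fit = LinearRegression().fit(X_linear, y)
mse_linear = mean_squared_error(y, linear_fit.predict(X_linear))

# Model 2: x and x squared, still fitted with LinearRegression
X_poly = np.column_stack([x, x ** 2])
poly_fit = LinearRegression().fit(X_poly, y)
mse_poly = mean_squared_error(y, poly_fit.predict(X_poly))

print('Linear MSE:', mse_linear)
print('Polynomial MSE:', mse_poly)
```

The model is still linear in its coefficients, so the same LinearRegression object can fit it. Note that error measured on the data used for fitting can only fall when a feature is added, so comparing errors on held-out test data, as we do in this post, is the more informative check.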

Let's recheck our dataframe.

PNB.head()

Okay now let's rebuild our model and see if we were able to reduce our MSE.

polynomial_model=LinearRegression()

We can now set our X and y variables.

    #dropping all columns except our features and storing in X
    X_2=PNB.drop(['Open','High','Low','Close','Adj Close','Volume','Lag 2'],axis=1)
    #Initializing our Response
    y_2=PNB['Close']

X_train_2, X_test_2, y_train_2, y_test_2=train_test_split(X_2,y_2,test_size=0.2,random_state=101)

Now that we have our training and testing data, we can use it to fit our model and generate predictions.

    #fitting our model
    polynomial_model.fit(X_train_2.fillna(0),y_train_2)

We've just fitted our polynomial_model to our training data. Now let's use it to make predictions using our testing data.

    #filling NaN values in the test set, as we did when fitting
    predictions_2=polynomial_model.predict(X_test_2.fillna(0))

We've just made our predictions. We can now calculate our coefficients, residuals, and our errors. Let's start by creating a dataframe to hold the coefficients from our polynomial model.

polynomial_df=pd.DataFrame(polynomial_model.coef_,index=X_train_2.columns,columns=['Polynomial Coefficients'])

Now let's view our polynomial df.

polynomial_df.head()

Now let's create our residuals and plot them. Then we will recalculate our model's metrics.

    #creating the residuals of our polynomial model
    residuals_2=y_test_2-predictions_2

    sns.distplot(residuals_2,bins=100)

    R_Squared_2=metrics.explained_variance_score(y_test_2,predictions_2)

    #printing the R-Squared of our polynomial model
    print('Polynomial Model R-Squared:',R_Squared_2)

Polynomial Model R-Squared: 0.988328254058

Now let's calculate our errors.

    #Calculating MSE
    MSE_2=metrics.mean_squared_error(y_test_2,predictions_2)
    #Calculating MAE
    MAE_2=metrics.mean_absolute_error(y_test_2,predictions_2)
    #Calculating RMSE
    RMSE_2=np.sqrt(MSE_2)
    #Printing out Errors
    print('Polynomial MSE:',MSE_2)
    print('Polynomial MAE:', MAE_2)
    print('Polynomial RMSE:',RMSE_2)

**Okay. Let's Review!**

We used the `.coef_` and `.intercept_` attributes to get our coefficients and create our linear equation.


*Disclaimer: All investments and trading in the stock market involve risk. Any decisions to place trades in the financial markets, including trading in stock or options or other financial instruments is a personal decision that should only be made after thorough research, including a personal risk and financial assessment and the engagement of professional assistance to the extent you believe necessary. The trading strategies or related information mentioned in this article is for informational purposes only.*
