Multiple Variable Regression


In this guide we’re going to take what we know about single variable linear regression and expand it to multiple variables. The two concepts are nearly identical, but it’s rare that a dependent variable is explained by only one variable. Because of this, multiple regression attempts to explain a dependent variable using more than one feature variable.

We recently explored the linear relationship between final exam scores and final grades. In this section we’re going to continue our analysis of final exam scores. But this time final exam scores are dependent on a multitude of features. 

Before moving forward it’s important to address the mathematical difference between single and multiple variable regression lines. While similar, multiple variable regression is expressed by the following formula:

y = α + β₁x₁ + β₂x₂ + … + βₙxₙ

Just like in simple linear regression, “y” is the dependent variable, and the “x” terms represent the feature variables. In multivariable regression we still use the constant alpha, but it no longer represents the y-intercept of a single line; it’s simply the value of the dependent variable when every feature is zero. Most importantly, in multivariable regression each beta is a leading coefficient for one feature, and it provides information on that feature’s relationship with the dependent variable. If that doesn’t make sense, we’ll be going through an example which should help clear up any confusion.

Unlike our last section, we’re only covering this topic using Scikit-learn, but at the end of this section I have included the code for StatsModels if you are interested in exploring that method. So, as usual, we’re going to start by importing our libraries, and like the last section we’re going to be using NumPy, Pandas, Matplotlib, and Scikit-learn. Once the data frame is imported we can move forward with segmenting the data. As I said, we’re going to be using final exam scores as the dependent variable, and the feature variables will be exams one, two, and three, along with quiz scores.
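Here’s a minimal sketch of that setup. The file name grades.csv and the column names (exam_1, exam_2, exam_3, quiz, final_exam) are assumptions for illustration, so adjust them to match your own data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the grade data; the file and column names below are
# placeholders -- swap in whatever your dataset actually uses.
df = pd.read_csv('grades.csv')

# Feature variables: exams one through three plus quiz scores.
features = df[['exam_1', 'exam_2', 'exam_3', 'quiz']]

# Dependent variable: final exam scores.
dependent = df['final_exam']
```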


The train/test split and regression model for multiple variables are structurally identical to the simple regression model, so I’d like you to take the time to set this up on your own. For guidance: you’ll need a training and testing set for both variables; adjust the test set to encompass twenty percent of the data; create a regressor; fit the model; and finish by deriving the coefficient of determination.
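If you want to check your work, here’s one way that setup might look, continuing with the features and dependent variables defined above:

```python
# Hold out twenty percent of the data for testing.
features_train, features_test, dependent_train, dependent_test = train_test_split(
    features, dependent, test_size=0.2, random_state=0)

# Create the regressor and fit it to the training data.
regressor = LinearRegression()
regressor.fit(features_train, dependent_train)

# Coefficient of determination (r-squared) on the test set.
r_squared = regressor.score(features_test, dependent_test)
print(r_squared)
```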

From here we can run the code and check the r-squared value.


As you can see, we have a pretty solid r-squared value, suggesting our features have a reasonably strong correlation with final exam grades. But we still don’t know how much each feature variable affects the prediction of final exam grades. To answer that, we can analyze the leading coefficients.
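The fitted coefficients live on the model’s coef_ attribute, one per feature column:

```python
# Pair each feature with its fitted coefficient; the order
# matches the order of the feature columns.
for name, coef in zip(features.columns, regressor.coef_):
    print(name, coef)
```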

The coefficients tell us that exam three has the largest influence when predicting final exam grades, followed by exam one, exam two, and finally quiz scores. The last bit of information we need is the intercept; then we’ll have everything required to make predictions using the equation of the regression line.
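The intercept is exposed the same way, on the fitted model:

```python
# The constant term (alpha) in the fitted regression equation.
print(regressor.intercept_)
```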

Now that we've determined the intercept, we can pass it into the equation for a multiple variable regression line along with the coefficients. We can then manually predict final exam grades by plugging in exam one, two, three, and quiz scores.
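As a sketch, with some hypothetical scores, the manual calculation is just the intercept plus the dot product of the coefficients and the feature values:

```python
# Hypothetical scores for exams one, two, three, and the quiz.
scores = [80, 85, 75, 90]

# y = alpha + beta_1*x_1 + beta_2*x_2 + beta_3*x_3 + beta_4*x_4
manual_prediction = regressor.intercept_ + np.dot(regressor.coef_, scores)
print(manual_prediction)
```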

Fortunately for us, we can also use the predict function in Scikit-learn to do the math for us. Let’s say a student scored an 83 on their first exam, a 90 on exam two, a 78 on exam three, and a 91 for their quiz score. Using the sample grades we can create an array, apply NumPy’s newaxis to reshape it into a two-dimensional matrix, then pass it into the predict function. When we run this, what’s returned is a predicted final exam grade of 86.66.
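Here’s one way that might look; the 86.66 figure comes from this guide’s own data, so your output will differ with a different dataset:

```python
# Sample grades: exam one, exam two, exam three, quiz score.
sample = np.array([83, 90, 78, 91])

# predict expects a 2-D input, so use newaxis to make a 1 x 4 matrix.
sample = sample[np.newaxis, :]

# With the guide's data this prints a predicted final grade of about 86.66.
print(regressor.predict(sample))
```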


We can also use the predict function to compare actual final exam grades with predicted final exam grades. To do that, we’ll pass the entire features_test data set into the predict function. Then, for a visual representation, we can create a Pandas dataframe with two columns: the first consisting of the actual results (we’ll use the dependent_test data for that), and the second consisting of the exam_prediction data we obtain through the predict function. Or we can make a little bar graph comparing the two.
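A sketch of both the comparison dataframe and the bar graph, reusing the features_test and dependent_test sets from earlier:

```python
# Predict final exam grades for the entire test set.
exam_prediction = regressor.predict(features_test)

# Two-column comparison: actual grades next to predicted grades.
comparison = pd.DataFrame({'Actual': dependent_test.values,
                           'Predicted': exam_prediction})
print(comparison)

# A little bar graph comparing the two.
comparison.plot(kind='bar')
plt.show()
```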

This isn’t really providing us with any critical data, but it does offer a nice visualization of how well the model performs. In fact, the majority of the predicted grades are within just a few percentage points of the actual grades.