# Regression

Regression can be used for estimating relationships between variables and/or for prediction purposes. There are several approaches to conducting a regression, including linear and non-linear. To introduce the topic, we will discuss **simple linear** regression, **multiple linear** regression, and **logistic** regression.

## Simple Linear Regression

Linear regression is a tool used in statistics to find a line to model past data. It is sometimes referred to as finding the line of best fit. The basic model for a simple linear regression, or a regression model containing only one input variable, is:

where:- is the intercept parameter
- signifies the independent, explanatory, input, or regressor variable
- is the slope parameter
- signifies the dependent, output, or response variable
- and the final epsilon term represents the random error

As an example, let’s look at the **mtcars** dataset in R. This dataset contains information about 11 different characteristics on 32 different cars. Let’s first save this example dataset in R as `cars`

and then use `names()`

to view the names of the 11 different variables.

```
cars <- mtcars
names(cars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
```

*****Note**: More information about this dataset can be found using `help('mtcars')`

Suppose we are interested in a model where the displacement of a vehicle’s engine is used to explain the variation in fuel efficiency. In other words, we're interested in using the size of the engine to explain how much fuel it needs to operate. Then our assumed regression model would be:

Before we continue investigating this model, there are four ‘**LINE**’ assumptions we need to discuss:

- Checking for
**L**linearity: A linear relationship needs to exist between the**response**and the**regressor**. Check by looking at a scatterplot of x versus y. - Checking for
**I**ndependence: The individual**errors of the model**must be independent of each other. Though it is unlikely to obtain the actual errors, we can check this assumption by looking at a scatterplot of the residuals versus the order in which the observations were taken. - Checking for
**N**ormality: Here we assume that the**errors****of the model**are normally distributed. This can be investigated by looking at a Normal probability plot or a Normal Q-Q plot of the residuals in addition to a histogram of the residuals. - Checking for
**E**qual variance: We assume that the**errors of the model**have homogeneous variance. No matter what value the regressor takes within our modeling space, the errors of the model are equally spread out. For this assumption, we can view a scatterplot of the residuals against the predicted (or fitted) values. If met, you should see a uniform spread of all points in the direction of the residuals.

Notice that the first assumption is related to a relationship between the independent and dependent variables, while the remaining three assumptions are all related to the errors. Therefore, assuming a linear relationship exists should be verified before fitting a model to the data and estimating the parameters. If the linear assumption makes sense, then the remaining three assumptions concerning the errors need to be checked. This requires estimating the two parameters (intercept and slope) in order to come up with the residuals (differences between observed and predicted response), which give us some idea as to the characteristics of the errors given we have correctly assumed the true model. If the model meets these assumptions, we can then proceed to investigate the usefulness of this model.

### Linearity Assumption

First, let's check if assuming a linear relationship makes sense:

```
cars <- mtcars
plot(cars$mpg, cars$disp, main = "MPG v.s. Displacement",
ylab = "Displacement", xlab = "MPG")
```

Though there appears to be a possible non-linear decay of mpg as displacement increases, a negative linear relationship is also plausible.

As stated earlier, to check the next three assumptions we need residuals which means we need to actually fit the model. To fit the linear model, or “best fit line”, we use the `lm()`

function.

`outputlm <- lm(formula, data)`

Attribute meanings:

*formula*: specifies the mathematical relationship between y and x. In our example, the variables are mpg and disp, respectively*data*: specifies the name of the dataset which contains the variables in the formula

Therefore, for our example:

`carslm <- lm(mpg ~ disp, cars)`

Notice that we saved the output of the result as *carslm*. We did this in order to save all the information of the model into a single R object. In doing so, we can index into this object to retrieve certain information such as parameter estimates, predicted values, and residuals. In R documentation *carslm* would be referred to as an ‘lm’ or ‘linear model’ object, which means we can treat *carslm* as a **list** structure. We can see this by using the `class()`

and `typeof()`

functions in R.

` class(carslm)`

## [1] "lm"

typeof(carslm)

## [1] "list"

Recall from the Basic Syntax page, that we can index into the elements of a list by either using double brackets if we know which position our information resides or by using the *$* operator if we know the names of the list elements. To view this information of the list elements we can use:

```
names(carslm)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
```

Before making the plots to check our remaining assumptions, let’s retrieve and save the necessary information in our R session environment.

` resids <- carslm$residuals # or carslm[[2]] would also work`

y_hat <- carslm$fitted.values

### Independent Assumption

Recall, to check if the errors are independent we need to view a scatterplot of the residuals versus anything that could influence this assumption, like the order in which the observations were taken. We don’t really know the order in which they were taken (it’s not provided in the dataset), but for the sake of this example let’s assume the row order corresponds to when the observations were collected.

`plot(y_hat, type = 'b') # type='b' used to connect observations`

Based upon this plot, there doesn't appear to be any clear issues with independence. It’s important to note if we can assume a simple random sample was taken that this assumption will likely be met.

### Normality Assumption

Next, we need to check the normality of the errors by viewing the distribution of the residuals. We can do this by looking at both the histogram and Normal Q-Q plot of the residuals.

`hist(resids)`

```
qqnorm(resids)
qqline(resids, col = 'red') # creates red reference line
```

We would like to see the histogram depict a symmetric, bell-shaped curve centered around zero and the points of the Normal Q-Q plot follow the diagonal reference line as closely as possible. The residuals appear to be centered around zero, but both plots suggest that the distribution of residuals is skewed towards the positive end. This is likely related to a questionable linearity assumption of our model (recall the non-linear decay?) or even to a somewhat small sample size, which suggest that the Normality assumption may be in question depending upon how knowledgeable we are of the data.

### Equal Variance Assumption

Finally, we can check for equal variance of the errors by looking at the residuals versus predicted values.

`plot(y_hat, resids); abline(h = 0, col = 'red')`

We would like to see a **uniform scatter of points** in equal amounts above and below the horizontal reference of zero. However, we can see that there is a slight negative and then positive trend occurring around a prediction of 20 mpg. Also likely related to our initial (perhaps incorrect) assumption that a simple linear model using engine displacement alone is sufficient in explaining the variation in fuel efficiency.

### Inference and Prediction

If we deem all conditions met, we can continue with the model. One concept to investigate is the statistical significance of the regressor in explaining at least some of the variation in the response. In doing so, we are conducting what is called **inference**. To carry this out in R, we use `summary()`

on our model object.

```
summary(carslm)
##
## Call:
## lm(formula = mpg ~ disp, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ===
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
```

Looking at the *Coefficients* section of the output, we can see that engine displacement is highly significant (p-value = 9.38e-10). However, keep in mind that we can confidently interpret these results in this manner only if all of the ‘LINE’ assumptions were found to have been met.

**Prediction** is another idea to consider. For example, say we are interested in predicting the fuel efficiency of a vehicle with an engine displacement of 192 cubic inches. This value is not contained in the dataset, so we would need to estimate a response for this particular displacement value. One approach is to use the `predict.lm()`

function.

`predict.lm(object, newdata)`

Attribute meanings:

*object*: is an ‘lm’ object*newdata*: is a dataframe consisting of the x value(s) you wish to predict for

```
predict.lm(carslm, data.frame(disp = 192))
## 1
## 21.68655
```

Another approach is to use the estimated coefficients from *carslm* and simply calculate the result using the **prediction equation**

```
b0 <- carslm$coefficients[[1]]
b1 <- carslm$coefficients[[2]]
b0 + b1 * 192
## [1] 21.68655
```

## Multiple Linear Regression

Multiple linear regression is an extension of the simple case in that now the model may contain multiple regressors.

The idea is that with each regressor that is added, more error is accounted for. However, there is a point where too many regressors can be added to the model. This is known as **overfitting**. When we’ve overfit a model, we are making the equation more complex than it needs to be, which can have negative consequences if conducting inference.

The *mtcars* dataset will again be used as an example dataset. This time, we will investigate using number of engine cylinders *cyl*, vehicle weight *wt*, and size of carburetor *carb* to explain the variation in horsepower *hp*.

The initial model we might consider could be the one containing all three regressors:

or

Similar to the simple linear model case, we use the `lm()`

function to produce our model object. However, notice that we change the model argument in lm to now include three regressors. That is, the formula argument now contains *cyl + wt + carb* to be regressed against *hp*.

```
cars <- mtcars
lm_CylWtCarb <- lm(hp ~ cyl + wt + carb, data = cars)
```

Also, as in the simple model case, we need to check our assumptions:

```
# Linear relationship of hp with each of the three regressors
hp <- cars$hp
cyl <- cars$cyl
wt <- cars$wt
carb <- cars$carb
par(mfcol = c(1,3)) # create a 1-by-3 array to hold all 3 plots together
plot(cyl, hp)
plot(wt, hp)
plot(carb, hp)
```

*****Note**: For checking if errors are independent we’ll assume independence as before since a random sample was taken.

```
# Normality of Errors
resids <- lm_CylWtCarb$residuals
par(mfcol = c(1,2))
hist(resids)
qqnorm(resids); qqline(resids, col = 'red')
```

```
# Equal Variance of Errors
hp_hat <- lm_CylWtCarb$fitted.values
plot(hp_hat, resids); abline(h = 0, col = 'red')
```

Next, let’s assume we are interested in seeing if all three factors (i.e., regressor variables) are statistically significant when in the same model.

```
summary(lm_CylWtCarb)
##
## Call:
## lm(formula = hp ~ cyl + wt + carb, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.002 -20.308 1.087 21.443 53.590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.6395 20.0553 -2.425 0.02200 *
## cyl 23.1827 5.1573 4.495 0.00011 ***
## wt 0.1441 8.8500 0.016 0.98712
## carb 18.2828 3.9278 4.655 7.13e-05 ***
## ===
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.01 on 28 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8084
## F-statistic: 44.61 on 3 and 28 DF, p-value: 8.546e-11
```

We can see that both *cyl* and *carb* are significant to at least an α-level of 0.001; however, the weight of a vehicle, *wt*, does not appear to explain a significant amount of the variation in horsepower when in a model containing the other two regressors (p−value = 0.98712).

Therefore, we should consider fitting a *reduced* model. That is, where *wt* is not included.

```
lm_CylCarb <- lm(hp ~ cyl + carb, cars)
summary(lm_CylCarb)
##
## Call:
## lm(formula = hp ~ cyl + carb, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.04 -20.23 1.04 21.61 53.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.558 19.081 -2.545 0.0165 *
## cyl 23.244 3.489 6.662 2.64e-07 ***
## carb 18.285 3.858 4.740 5.23e-05 ***
## ===
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.49 on 29 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8151
## F-statistic: 69.31 on 2 and 29 DF, p-value: 8.959e-12
```

Looking at the output, both *cyl* and *carb* are statistically significant with p-values < 0.0001. If we compare other model statistics (e.g., both the residual standard error is lower and Adjusted R-squared is higher for the reduced model) we also see that removing *wt* improves, or at least appears not to degrade, the quality of the fit. Also notice that the parameter estimates for the remaining two regressors barely change ( from 23.183 to 23.244 and from 18.283 to 18.285), which is further evidence that we were likely overfitting the model when we unnecessarily included the weight variable.

****Note:** This does NOT mean that weight alone isn’t significant in explaining the variation in horsepower for vehicles in this dataset! Actually, if you fit the simple linear regression model containing only *wt* you would see that it is highly significant. Instead, the result above illustrates that the combination of knowing how many cylinders and what size of carburetor an engine has makes knowing the weight of the car insignificant when we consider explaining the variation in horsepower (…at least statistically speaking, whether or not it is of practical significance is another story).

In addition to the standard output summary provides, there are many other statistical tools we can implement to further compare different multiple linear regression models for this dataset. Model diagnostics, variable selection methods, consideration for multicoliniarity issues, use of information criteria, and cross validation are just a few of these other tools. Also, keep in mind that if we actually favored this model over the initial one, we would still need to check the model assumptions because this reduced model is another model. Can you verify that the reduced model

meets all the ‘LINE’ assumptions? Try it on your own!

## Logistic Regression

Recall the ‘LINE’ assumptions.

**L**inearity relationship**I**ndependent errors**N**ormally distributed errors**E**qual variance of errors

However, some relationships by their very nature violate these assumptions. In these cases it is preferable to apply an appropriate modeling technique that acknowledges the true nature of the response. For example, consider a survey about whether a political figure will win an upcoming election.

The outcome (i.e., the response variable y) is not strictly quantitative. Rather, it is binary. That is, the outcome is one of only two options, “win” or “lose.” However, we can look at this as 1 or 0 and consider the result a probability of success or failure.

In these types of situations we use what is called a Generalized Linear Model (GLM). In R, GLMs are fit using the function `glm()`

.

`glm(formula, family = gaussian, data,...)`

Attribute meanings:

*formula*: a symbolic description of the model to be fit*family*: name of the assumed error distribution and link function*data*: dataframe, list, or environment containing the variables in the model

**Note:** For details about other arguments of `glm()`

type `help('glm')`

in your R console.

Example:

Let’s consider an experiment that investigates the survival rate of 80 tadpoles when exposed to different levels of ultraviolet light:

0% (low), 1%, 9%, 16%, 23%, 31%, 40%, 57%, 78%, and 100% (high).

The survival status is recorded by the variable `Survived`

, where (0 = died, 1 = survived).

First, we will define the two variables and then create the dataset `Exposure`

.

```
UV.Level<-c(0.01,1,0.23,0.00,0.16,0.16,0.09,0.23,0.01,0.00,0.57,0.16,1.00,0.01,1,0.4,0.78,0.23,0.4,0)
Survived <- c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0)
Exposure <- data.frame(UV.Level, Survived)
```

Knowing that `Survival`

is binary data, let’s create a plot to better visualize what this means:

`plot(Survived ~ UV.Level, data = Exposure)`

From a visual inspection of the plot it appears as though there are deaths resulting from a wide variety of UV levels while there are only a couple that survived at lower levels of UV exposure. Therefore, this dataset suggests that UV level isn’t likely to have a significant relationship with the probability of survival.

Next, let’s use the `glm()`

function to model the probability of survival as a function of UV level and use the resulting inference to guide our decision.

```
glmFit <- glm(Survived ~ UV.Level, data = Exposure, family = binomial)
summary(glmFit)
##
## Call:
## glm(formula = Survived ~ UV.Level, family = binomial, data = Exposure)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4636 -0.4634 -0.4608 -0.4518 2.1482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17657 1.01039 -2.154 0.0312 *
## UV.Level -0.06476 2.16039 -0.030 0.9761
## ===
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13.003 on 19 degrees of freedom
## Residual deviance: 13.002 on 18 degrees of freedom
## AIC: 17.002
##
## Number of Fisher Scoring iterations: 4
```

The results seem to support our initial visual inspection of the data. With the parameter estimate for UV level having a p-value = 0.9761, the effect appears insignificant in explaining the variation in the probability of survival.