Lab 8.3.3 - Random Forests and Bagging

Objective: Learn about random forests, bagging, and boosting

Data set: Housing Values in Suburbs of Boston (MASS package).

  • Clear global environment, remove variables from workspace, load packages

  • The MASS package contains the Boston data; the randomForest package performs random forests and bagging

rm(list=ls())

# Install any packages in the vector that are missing, then load them all
installIfAbsentAndLoad <- function(neededVector) {
  for (thispackage in neededVector) {
    if (!require(thispackage, character.only = TRUE)) {
      install.packages(thispackage)
    }
    require(thispackage, character.only = TRUE)
  }
}


needed <- c("randomForest", "MASS") # MASS package contains Boston data
installIfAbsentAndLoad(needed)

OVERVIEW

Bagging

  • Can be used for regression or classification to reduce the variance of a learning method (the model fit)

  • Known as bootstrap aggregation: the bootstrap method resamples the data with replacement (see page 187)

  • Take repeated bootstrap samples from the same training data set.

  • Train the learning method on each bootstrapped training set

  • Average all predictions (regression) or take the majority vote among the predicted classes (classification)

(Note: for bagging, m = p, where m is the number of predictors considered at each split and p is the total number of predictors)
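
The bagging procedure can also be sketched by hand. Below is a minimal, illustrative sketch (not part of the book's lab) that uses the rpart package to grow one tree per bootstrap sample and then averages the predictions; the function name bag.sketch and its arguments are made up for illustration.

library(rpart)  # simple regression trees, used here only for the sketch

bag.sketch <- function(data, newdata, B = 100) {
  preds <- sapply(seq_len(B), function(b) {
    boot.rows <- sample(nrow(data), replace = TRUE)    # bootstrap resample (with replacement)
    fit <- rpart(medv ~ ., data = data[boot.rows, ])   # tree grown on the bootstrap sample
    predict(fit, newdata)                              # that tree's predictions on new data
  })
  rowMeans(preds)  # average the B sets of predictions (regression)
}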

Random Forests

  • Strong predictors cause the predictions from bagged trees to be highly correlated.
  • To prevent this, randomly select only m predictors as split candidates at each decision tree split (the criterion for how the data set is partitioned)
  • Typically m = sqrt(p), where p is the number of predictors

Goals of Lab:

  1. Practice using the random forest and bagging methods on an actual data set in R

  2. Compare the results from the methods

  3. Become confident coding tree-based methods

set.seed(1)#Set the seed
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
  • Set training and test sets
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"] #Assess model performance at predicting medv

The randomForest() call below and its arguments:

  1. The randomForest() function can also perform bagging; the only difference from a random forest is the value of m (mtry). For bagging, m = p.

  2. data = the data frame containing the variables in the model

  3. subset = which rows are to be used for training

  4. mtry = 13 is the number of predictors randomly sampled at each split (all 13 predictors here, so this is bagging)

  5. importance = TRUE assesses the importance of the predictors

bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,      subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.33119
##                     % Var explained: 85.26
  • predict() generates predicted values of the response variable from the model fit
  • predict(fitted model, newdata = rows of the original data set not in the training set)
yhat.bag <- predict(bag.boston,newdata=Boston[-train,]) 
plot(yhat.bag, boston.test)#Shows yhat.bag (predictions for bagging) against boston test set
abline(0,1)

mean((yhat.bag-boston.test)^2) #Calculates the Mean Squared Error for Test Set
## [1] 23.4579
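
For comparison, a single regression tree (as fit in Lab 8.3.2 with the tree package) typically produces a noticeably higher test MSE than the bagged model above. A quick sketch of that comparison, assuming the tree package is installed (it is not loaded by this lab's setup code):

library(tree)
tree.boston <- tree(medv~., data=Boston, subset=train)      # single regression tree
yhat.tree <- predict(tree.boston, newdata=Boston[-train,])
mean((yhat.tree-boston.test)^2)  # single-tree test MSE, for comparison with bagging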

How do we change the number of trees to grow?

  • We specify ntree = the number of trees to grow
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag <- predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 22.99145

What happens if we change the number of trees?

  • More trees can improve the model, but after a point they mainly increase computation time.
  • Ideally we want enough trees to obtain out-of-bag predictions for all of our observations

  • By default, randomForest() uses mtry = p/3 for regression trees and mtry = sqrt(p) for classification trees.
  • The default here is mtry = 4, since 13/3 is approximately 4

rf.boston <- randomForest(medv~.,data=Boston,subset=train,importance=TRUE)
rf.boston$mtry
## [1] 4
  • The book changes mtry (m) from 13 to 6 for the random forest (the choice of 6 appears arbitrary)
rf.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf <- predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2) # R.F MSE lower than Bagging MSE --> yielded improvement
## [1] 19.96771
importance(rf.boston) #View importance of each variable
##            %IncMSE IncNodePurity
## crim    15.9891966      991.3865
## zn       1.9621970      105.6121
## indus    4.4896898      604.7461
## chas     0.6351599       35.7491
## nox     11.5829966      758.2856
## rm      35.0766811     7875.7308
## age     11.5101992      597.7134
## dis      9.6764119      733.9281
## rad      3.6034208      120.7903
## tax      9.0904298      369.0995
## ptratio  8.5840527      827.5319
## black    8.6393825      265.6180
## lstat   28.8054240     5777.8328
  • %IncMSE is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is excluded from the model

  • IncNodePurity is the total decrease in node impurity resulting from splits on a given predictor, averaged over all trees

(Note: each bagged tree uses around 2/3 of the observations. The remaining 1/3 not used to fit a given tree are called its out-of-bag (OOB) observations)
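
As an aside (not part of the book's lab), the OOB machinery can be inspected directly from the fitted object: calling predict() on a randomForest fit with no newdata returns the OOB predictions, and the $mse component stores the OOB MSE as trees are added.

oob.pred <- predict(rf.boston)                    # OOB predictions for the training rows
mean((oob.pred - Boston[train, "medv"])^2)        # OOB estimate of the MSE
plot(rf.boston$mse, type="l", xlab="Number of trees", ylab="OOB MSE")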

varImpPlot(rf.boston)#Visualizes measures for each variable

(Note: a higher %IncMSE means the variable is more important)

  • rm and lstat are the most important variables
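
To see how sensitive the forest is to the choice of mtry, one could sweep over all possible values and record the test MSE for each. This loop is a sketch that is not part of the book's lab (and takes a little while to run):

mtry.values <- 1:13
test.mse <- sapply(mtry.values, function(m) {
  fit <- randomForest(medv~., data=Boston, subset=train, mtry=m)
  mean((predict(fit, newdata=Boston[-train,]) - boston.test)^2)
})
plot(mtry.values, test.mse, type="b", xlab="mtry", ylab="Test MSE")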

Lab 8.3.4 - Boosting

OVERVIEW

  • Boosting is another approach for improving the predictions resulting from a decision tree
  • Boosting works in a similar way to bagging, except that the trees are grown sequentially:

    • Each tree is grown using information from previously grown trees

    • Boosting doesn’t involve the bootstrap; instead, each tree is fit on a modified version of the original data set

    • The model learns slowly, which helps prevent overfitting

    • Each tree is fit to the current residuals, rather than to the outcome Y (a sketch of this idea follows below)
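
A minimal sketch of this sequential, residual-fitting idea for regression (Algorithm 8.2 in the book), again using rpart to grow each small tree. The function name boost.sketch and its arguments are illustrative only; the lab itself uses gbm() below.

library(rpart)

boost.sketch <- function(X, y, n.trees = 100, shrinkage = 0.01, depth = 2) {
  r <- y                             # start with residuals equal to the outcome
  fhat <- rep(0, length(y))          # current ensemble prediction
  for (b in seq_len(n.trees)) {
    fit <- rpart(r ~ ., data = data.frame(X, r = r),
                 control = rpart.control(maxdepth = depth))  # small tree fit to residuals
    update <- shrinkage * predict(fit, data.frame(X))
    fhat <- fhat + update            # add a shrunken version of the new tree
    r <- r - update                  # update the residuals
  }
  fhat                               # fitted values of the boosted ensemble
}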

needed <- c("gbm", "MASS") #gbm fits boosted regression trees, MASS contains Boston data
installIfAbsentAndLoad(needed)
  • Set the seed, training and test sets
# from lab 8.3.2
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"]
  • distribution = "gaussian" since this is a regression problem

  • distribution = "bernoulli" if it is a classification problem

  • n.trees: the number of trees we want

  • interaction.depth limits the depth of each tree

  • summary() provides a relative influence plot and statistics

boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)  #provides relative influence plot and statistics

##             var     rel.inf
## rm           rm 48.13967682
## lstat     lstat 28.93851185
## crim       crim  4.49413146
## dis         dis  4.23182696
## age         age  3.14221169
## nox         nox  2.88094283
## black     black  2.83238772
## ptratio ptratio  1.93050932
## tax         tax  1.80427054
## rad         rad  0.77569461
## indus     indus  0.73110525
## zn           zn  0.07442923
## chas       chas  0.02430170
  • rm and lstat are the most important variables

  • Produce partial dependence plots

par(mfrow=c(1,2))
plot(boost.boston,i="rm")

plot(boost.boston,i="lstat")

  • Use the boosted model to predict medv on the test set
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 19.37033
  • Refit with the shrinkage parameter set explicitly (shrinkage = 0.2)
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 18.68911
  • Using shrinkage = 0.2 leads to a slightly lower test MSE
  • Another tuning parameter is the number of trees; a cross-validation sketch for choosing it follows below
Key functions in these labs: randomForest() and gbm() (Generalized Boosted Regression Modeling)
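
One built-in way to choose the number of trees is cross-validation inside gbm(): fitting with cv.folds set performs k-fold CV, and gbm.perf() returns the estimated optimal number of trees. A sketch, assuming the same training data and tuning values used above:

boost.cv <- gbm(medv~., data=Boston[train,], distribution="gaussian",
                n.trees=5000, interaction.depth=4, shrinkage=0.2, cv.folds=5)
best.ntrees <- gbm.perf(boost.cv, method="cv")           # CV-estimated optimal number of trees
yhat.cv <- predict(boost.cv, newdata=Boston[-train,], n.trees=best.ntrees)
mean((yhat.cv-boston.test)^2)                            # test MSE at the CV-chosen number of trees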