Lab 8.3.3 - Random Forests and Bagging

Objective: Learn about random forests, bagging, and boosting

Data set: Housing Values in Suburbs of Boston (MASS package).

  • Clear global environment, remove variables from workspace, load packages

  • The MASS package contains the Boston data; the randomForest package performs random forests and bagging

rm(list=ls())

# Install any packages in the vector that are missing, then load them all
installIfAbsentAndLoad <- function(neededVector) {
  for (thispackage in neededVector) {
    if (!require(thispackage, character.only = TRUE)) {
      install.packages(thispackage)
    }
    require(thispackage, character.only = TRUE)
  }
}


needed <- c("randomForest", "MASS") # MASS package contains Boston data
installIfAbsentAndLoad(needed)

OVERVIEW

Bagging

  • Can be used for regression or classification to reduce the variance of a learning method (the model fit)

  • Known as bootstrap aggregation: the bootstrap method resamples the data with replacement (see page 187)

  • Take repeated bootstrap samples from the same training data set.

  • Train the learning method on each bootstrapped training set

  • Average all predictions (regression) or take the majority vote among the predicted classes (classification)

(Note: for bagging, m = p, where m is the number of predictors considered at each split and p is the total number of predictors)
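
The bagging procedure can also be sketched by hand. Below is a minimal, illustrative sketch (not part of the book's lab) that uses the rpart package to grow one tree per bootstrap sample and then averages the predictions; the function name bag.sketch and its arguments are made up for illustration.

library(rpart)  # simple regression trees, used here only for the sketch

bag.sketch <- function(data, newdata, B = 100) {
  preds <- sapply(seq_len(B), function(b) {
    boot.rows <- sample(nrow(data), replace = TRUE)    # bootstrap resample (with replacement)
    fit <- rpart(medv ~ ., data = data[boot.rows, ])   # tree grown on the bootstrap sample
    predict(fit, newdata)                              # that tree's predictions on new data
  })
  rowMeans(preds)  # average the B sets of predictions (regression)
}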

Random Forests

  • Strong predictors cause the predictions from bagged trees to be highly correlated.
  • To prevent this, randomly select only m predictors as split candidates at each decision tree split (the criterion for how the data set is partitioned)
  • Typically m = sqrt(p), where p is the number of predictors

Goals of Lab:

  1. Practice using the random forest and bagging methods on an actual data set in R

  2. Compare the results from the methods

  3. Become confident coding tree-based methods

set.seed(1)#Set the seed
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
  • Set training and test sets
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"] #Assess model performance at predicting medv

The randomForest() call below and its arguments:

  1. The randomForest() function can also perform bagging; the only difference from a random forest is the value of m (mtry). For bagging, m = p.

  2. data = the data frame containing the variables in the model

  3. subset = which rows are to be used for training

  4. mtry = 13 is the number of predictors randomly sampled at each split (all 13 predictors here, so this is bagging)

  5. importance = TRUE assesses the importance of the predictors

bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,      subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.33119
##                     % Var explained: 85.26
  • predict() generates predicted values of the response variable from the model fit
  • predict(fitted model, newdata = rows of the original data set not in the training set)
yhat.bag <- predict(bag.boston,newdata=Boston[-train,]) 
plot(yhat.bag, boston.test)#Shows yhat.bag (predictions for bagging) against boston test set
abline(0,1)

mean((yhat.bag-boston.test)^2) #Calculates the Mean Squared Error for Test Set
## [1] 23.4579
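
For comparison, a single regression tree (as fit in Lab 8.3.2 with the tree package) typically produces a noticeably higher test MSE than the bagged model above. A quick sketch of that comparison, assuming the tree package is installed (it is not loaded by this lab's setup code):

library(tree)
tree.boston <- tree(medv~., data=Boston, subset=train)      # single regression tree
yhat.tree <- predict(tree.boston, newdata=Boston[-train,])
mean((yhat.tree-boston.test)^2)  # single-tree test MSE, for comparison with bagging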

How do we change the number of trees to grow?

  • We specify ntree = the number of trees to grow
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag <- predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 22.99145

What happens if we change the number of trees?

  • More trees can improve the model, but after a point they mainly increase computation time.
  • Ideally we want enough trees to obtain out-of-bag predictions for all of our observations

  • By default, randomForest() uses mtry = p/3 for regression trees and mtry = sqrt(p) for classification trees.
  • The default here is mtry = 4, since 13/3 is approximately 4

rf.boston <- randomForest(medv~.,data=Boston,subset=train,importance=TRUE)
rf.boston$mtry
## [1] 4
  • The book changes mtry (m) from 13 to 6 for the random forest (the choice of 6 appears arbitrary)
rf.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf <- predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2) # R.F MSE lower than Bagging MSE --> yielded improvement
## [1] 19.96771
importance(rf.boston) #View importance of each variable
##            %IncMSE IncNodePurity
## crim    15.9891966      991.3865
## zn       1.9621970      105.6121
## indus    4.4896898      604.7461
## chas     0.6351599       35.7491
## nox     11.5829966      758.2856
## rm      35.0766811     7875.7308
## age     11.5101992      597.7134
## dis      9.6764119      733.9281
## rad      3.6034208      120.7903
## tax      9.0904298      369.0995
## ptratio  8.5840527      827.5319
## black    8.6393825      265.6180
## lstat   28.8054240     5777.8328
  • %IncMSE is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is excluded from the model

  • IncNodePurity is the total decrease in node impurity resulting from splits on a given predictor, averaged over all trees

(Note: each bagged tree uses around 2/3 of the observations. The remaining 1/3 not used to fit a given tree are called its out-of-bag (OOB) observations)
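
As an aside (not part of the book's lab), the OOB machinery can be inspected directly from the fitted object: calling predict() on a randomForest fit with no newdata returns the OOB predictions, and the $mse component stores the OOB MSE as trees are added.

oob.pred <- predict(rf.boston)                    # OOB predictions for the training rows
mean((oob.pred - Boston[train, "medv"])^2)        # OOB estimate of the MSE
plot(rf.boston$mse, type="l", xlab="Number of trees", ylab="OOB MSE")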

varImpPlot(rf.boston)#Visualizes measures for each variable

(Note: a higher %IncMSE means the variable is more important)

  • rm and lstat are the most important variables
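
To see how sensitive the forest is to the choice of mtry, one could sweep over all possible values and record the test MSE for each. This loop is a sketch that is not part of the book's lab (and takes a little while to run):

mtry.values <- 1:13
test.mse <- sapply(mtry.values, function(m) {
  fit <- randomForest(medv~., data=Boston, subset=train, mtry=m)
  mean((predict(fit, newdata=Boston[-train,]) - boston.test)^2)
})
plot(mtry.values, test.mse, type="b", xlab="mtry", ylab="Test MSE")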

Lab 8.3.4 - Boosting

OVERVIEW

  • Boosting is another approach for improving the predictions resulting from a decision tree
  • Boosting works in a similar way to bagging, except that the trees are grown sequentially:

    • Each tree is grown using information from previously grown trees

    • Boosting doesn’t involve the bootstrap; instead, each tree is fit on a modified version of the original data set

    • The model learns slowly, which helps prevent overfitting

    • Each tree is fit to the current residuals, rather than to the outcome Y (a sketch of this idea follows below)
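
A minimal sketch of this sequential, residual-fitting idea for regression (Algorithm 8.2 in the book), again using rpart to grow each small tree. The function name boost.sketch and its arguments are illustrative only; the lab itself uses gbm() below.

library(rpart)

boost.sketch <- function(X, y, n.trees = 100, shrinkage = 0.01, depth = 2) {
  r <- y                             # start with residuals equal to the outcome
  fhat <- rep(0, length(y))          # current ensemble prediction
  for (b in seq_len(n.trees)) {
    fit <- rpart(r ~ ., data = data.frame(X, r = r),
                 control = rpart.control(maxdepth = depth))  # small tree fit to residuals
    update <- shrinkage * predict(fit, data.frame(X))
    fhat <- fhat + update            # add a shrunken version of the new tree
    r <- r - update                  # update the residuals
  }
  fhat                               # fitted values of the boosted ensemble
}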

needed <- c("gbm", "MASS") #gbm fits boosted regression trees, MASS contains Boston data
installIfAbsentAndLoad(needed)
  • Set the seed, training and test sets
# from lab 8.3.2
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"]
  • distribution = "gaussian" since this is a regression problem

  • distribution = "bernoulli" if it is a classification problem

  • n.trees: the number of trees we want

  • interaction.depth limits the depth of each tree

  • summary() provides a relative influence plot and statistics

boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)  #provides relative influence plot and statistics

##             var     rel.inf
## rm           rm 48.13967682
## lstat     lstat 28.93851185
## crim       crim  4.49413146
## dis         dis  4.23182696
## age         age  3.14221169
## nox         nox  2.88094283
## black     black  2.83238772
## ptratio ptratio  1.93050932
## tax         tax  1.80427054
## rad         rad  0.77569461
## indus     indus  0.73110525
## zn           zn  0.07442923
## chas       chas  0.02430170
  • rm and lstat are the most important variables

  • Produce partial dependence plots

par(mfrow=c(1,2))
plot(boost.boston,i="rm")

plot(boost.boston,i="lstat")

  • Use the boosted model to predict medv on the test set
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 19.37033
  • Refit with the shrinkage parameter set explicitly (shrinkage = 0.2)
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 18.68911
  • Using shrinkage = 0.2 leads to a slightly lower test MSE
  • Another tuning parameter is the number of trees; a cross-validation sketch for choosing it follows below
Key functions in these labs: randomForest() and gbm() (Generalized Boosted Regression Modeling)
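
One built-in way to choose the number of trees is cross-validation inside gbm(): fitting with cv.folds set performs k-fold CV, and gbm.perf() returns the estimated optimal number of trees. A sketch, assuming the same training data and tuning values used above:

boost.cv <- gbm(medv~., data=Boston[train,], distribution="gaussian",
                n.trees=5000, interaction.depth=4, shrinkage=0.2, cv.folds=5)
best.ntrees <- gbm.perf(boost.cv, method="cv")           # CV-estimated optimal number of trees
yhat.cv <- predict(boost.cv, newdata=Boston[-train,], n.trees=best.ntrees)
mean((yhat.cv-boston.test)^2)                            # test MSE at the CV-chosen number of trees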