Clear the global environment (remove all variables from the workspace) and load the needed packages.
The MASS package contains the Boston data; the randomForest package performs random forests and bagging.
rm(list=ls())
installIfAbsentAndLoad <- function(neededVector) {
  for(thispackage in neededVector) {
    if( ! require(thispackage, character.only = T))   # require() returns FALSE if the package is absent
      install.packages(thispackage)
    require(thispackage, character.only = T)
  }
}
needed <- c("randomForest", "MASS") # MASS package contains Boston data
installIfAbsentAndLoad(needed)
Bagging
Can be used for regression or classification to reduce the variance of a learning method (the model fit)
Known as bootstrap aggregation: the bootstrap resamples the data with replacement (see page 187), as illustrated in the sketch below
Take repeated samples from the same training data set
Train the learning method on each bootstrapped training set
Average all of the predictions (regression) or take the majority-vote class (classification)
(Note: for bagging m = p, where m is the number of predictors considered at each split and p is the total number of predictors)
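A quick illustration of a single bootstrap resample (a sketch, not part of the original lab): rows are drawn with replacement, so some appear more than once and roughly a third are left out entirely.
boot.idx <- sample(1:nrow(Boston), nrow(Boston), replace = TRUE) # resample row indices with replacement
length(unique(boot.idx)) / nrow(Boston) # fraction of distinct rows in the resample, typically about 0.63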
Random Forests
Goals of Lab:
Practice using the random forest and bagging methods on an actual data set in R
Compare the results from the two methods
Become confident coding tree-based methods
set.seed(1)#Set the seed
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"] #Assess model performance at predicting medv
Formula describing the fitted model: medv~. (predict medv using all of the other variables)
The randomForest() function performs bagging when mtry = p; the only difference between bagging and a random forest is the value of m (mtry)
data = the data frame containing the variables in the model
subset = which rows to use for training
mtry = 13 is the number of predictors randomly sampled as candidates at each split (all 13 here, i.e. bagging)
importance = TRUE assesses the importance of the predictors
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.33119
## % Var explained: 85.26
yhat.bag <- predict(bag.boston,newdata=Boston[-train,])
plot(yhat.bag, boston.test)#Shows yhat.bag (predictions for bagging) against boston test set
abline(0,1)
mean((yhat.bag-boston.test)^2) #Calculates the Mean Squared Error for Test Set
## [1] 23.4579
How do we change the number of trees to grow? Use the ntree argument:
bag.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=13,ntree=25)
yhat.bag <- predict(bag.boston,newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
## [1] 22.99145
What happens if we change the number of trees?
Ideally we want enough trees that every observation gets predicted well and the test error levels off, as the sketch below illustrates
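One way to see this (a sketch, not part of the original lab) is to recompute the test MSE as trees are added, using the predict.all = TRUE option of predict() to get the per-tree predictions from the bagged model just fit (here 25 trees):
pred.all <- predict(bag.boston, newdata = Boston[-train,], predict.all = TRUE) # keep each tree's prediction
ntrees <- ncol(pred.all$individual)
test.mse <- sapply(1:ntrees, function(b)
  mean((rowMeans(pred.all$individual[, 1:b, drop = FALSE]) - boston.test)^2)) # test MSE using only the first b trees
plot(1:ntrees, test.mse, type = "b", xlab = "Number of trees", ylab = "Test MSE") # error levels off as trees are added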
The default for a regression forest is mtry = p/3; here 13/3 is approximately 4, so mtry = 4
rf.boston <- randomForest(medv~.,data=Boston,subset=train,importance=TRUE)
rf.boston$mtry
## [1] 4
rf.boston <- randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf <- predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2) # R.F MSE lower than Bagging MSE --> yielded improvement
## [1] 19.96771
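To see how the test MSE varies between a random forest and bagging, one could loop over every value of mtry (a rough sketch, not part of the original lab; ntree is lowered to 200 to keep it quick):
mtry.mse <- sapply(1:13, function(m) {
  fit <- randomForest(medv~., data = Boston, subset = train, mtry = m, ntree = 200)
  mean((predict(fit, newdata = Boston[-train,]) - boston.test)^2) # test MSE for this mtry
})
plot(1:13, mtry.mse, type = "b", xlab = "mtry", ylab = "Test MSE") # mtry = 13 corresponds to bagging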
importance(rf.boston) #View importance of each variable
## %IncMSE IncNodePurity
## crim 15.9891966 991.3865
## zn 1.9621970 105.6121
## indus 4.4896898 604.7461
## chas 0.6351599 35.7491
## nox 11.5829966 758.2856
## rm 35.0766811 7875.7308
## age 11.5101992 597.7134
## dis 9.6764119 733.9281
## rad 3.6034208 120.7903
## tax 9.0904298 369.0995
## ptratio 8.5840527 827.5319
## black 8.6393825 265.6180
## lstat 28.8054240 5777.8328
%IncMSE is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is excluded from the model
IncNodePurity is the total decrease in node impurity resulting from splits on that predictor, averaged over all trees
(Note: each bagged tree uses around 2/3 of the observations; the remaining 1/3 not used to fit that tree are called the out-of-bag (OOB) observations, and the OOB predictions can be pulled straight from the fitted forest, as sketched below)
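For a regression forest the fitted object stores its out-of-bag predictions in the $predicted component (see ?randomForest), so the OOB MSE can be reproduced directly; a one-line sketch, not part of the original lab:
mean((rf.boston$predicted - Boston[train, "medv"])^2) # OOB MSE, the quantity printed as "Mean of squared residuals"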
varImpPlot(rf.boston)#Visualizes measures for each variable
(Note: a higher %IncMSE means the variable is more important)
Boosting
Boosting works in a similar way to bagging, except that the trees are grown sequentially:
Each tree is grown using information from the previously grown trees
Boosting does not involve the bootstrap; each tree is fit to a modified version of the original data set
The model learns slowly (small trees plus shrinkage) to help prevent overfitting
Each tree is fit to the current residuals rather than to the outcome Y (see the hand-rolled sketch below)
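A hand-rolled sketch of this idea (not the gbm implementation; it assumes the rpart package is installed, which is not loaded elsewhere in this lab): fit a small tree to the current residuals, add a shrunken copy of its predictions to the model, and repeat.
library(rpart) # small regression trees as the weak learner
lambda <- 0.01 # shrinkage parameter: learn slowly
B <- 1000 # number of trees
boost.df <- Boston[train,] # working copy; medv will hold the current residuals
fhat <- rep(0, nrow(boost.df)) # the boosted model, starts at 0
for (b in 1:B) {
  fit <- rpart(medv~., data = boost.df, maxdepth = 2) # fit a small tree to the current residuals
  pred <- predict(fit)
  fhat <- fhat + lambda * pred # add a shrunken copy of the new tree to the model
  boost.df$medv <- boost.df$medv - lambda * pred # update the residuals
}
mean((Boston[train, "medv"] - fhat)^2) # training MSE of the hand-rolled boosted model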
needed <- c("gbm", "MASS") #gbm fits boosted regression trees, MASS contains Boston data
installIfAbsentAndLoad(needed)
# from lab 8.3.2
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.test <- Boston[-train, "medv"]
distribution = "gaussian" since this is a regression problem
distribution = "bernoulli" if it were a (binary) classification problem
n.trees: the number of trees to grow
interaction.depth: limits the depth of each tree
summary(): provides a relative influence plot and statistics
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston) #provides relative influence plot and statistics
## var rel.inf
## rm rm 48.13967682
## lstat lstat 28.93851185
## crim crim 4.49413146
## dis dis 4.23182696
## age age 3.14221169
## nox nox 2.88094283
## black black 2.83238772
## ptratio ptratio 1.93050932
## tax tax 1.80427054
## rad rad 0.77569461
## indus indus 0.73110525
## zn zn 0.07442923
## chas chas 0.02430170
rm and lstat are the most important variables
Produce partial dependence plots
par(mfrow=c(1,2))
plot(boost.boston,i="rm")
plot(boost.boston,i="lstat")
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 19.37033
boost.boston <- gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost <- predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
## [1] 18.68911
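A possible extension (a sketch, not part of the original lab): gbm can also choose the number of trees by cross-validation through its cv.folds argument, with gbm.perf() returning the CV-chosen iteration.
boost.cv <- gbm(medv~., data = Boston[train,], distribution = "gaussian", n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, cv.folds = 5, verbose = F) # 5-fold CV inside gbm
best.iter <- gbm.perf(boost.cv, method = "cv") # CV-chosen number of trees
yhat.cv <- predict(boost.cv, newdata = Boston[-train,], n.trees = best.iter)
mean((yhat.cv - boston.test)^2) # test MSE at the CV-chosen number of trees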