adding BRTs to the statistical toolbox

I have been reading up on boosted regression trees (BRTs) and am convinced that they are an analytical approach worth learning. The basic principle of the BRT technique, at least as I understand it, is to fit a series of binary decision trees, each of which is fitted to the residual variation remaining in the data after all the previous trees in the series have been applied. Individual trees have the benefit of being easy to understand and interpret, and they inherently incorporate interactions between predictor variables. A downside of an individual tree is that the data are split recursively into discrete groups, so the tree cannot represent “smooth” relationships between the predictor and response variables, which often leads to poor fits and poor predictive ability. When you ‘boost’ and create a series of hundreds to thousands of trees (a general recommendation is to shoot for >1000 trees), the results are not as easy to interpret, but because you have so many splits in the data, the relationships between the predictor variables and the response variable begin to resemble smooth functions with potentially very complex shapes.

BRTs have been compared to generalized additive models (GAMs) and have been found to be less prone to overfitting the training data, and thus generally better at prediction. BRTs can be used for general regression-type problems but have recently gained popularity as a new and powerful tool for Species Distribution Modeling (SDM), especially when presence/absence data are available. Elith et al. have published a very nice guide to BRTs for ecologists, along with a set of R code included in their ‘Dismo’ package and a hands-on tutorial with accompanying datasets (see links below). The R package includes several very nice and easy-to-use functions that facilitate fitting the models and interpreting them.

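
To make the "each tree is fitted to the residuals of the previous trees" idea concrete, here is a minimal sketch in plain Python. It is not the Dismo code and not how any particular package implements boosting: it assumes a single predictor, squared-error loss, and one-split trees ("stumps"), and the function names (fit_stump, boost, predict, mse) are made up for this illustration.

```python
import math

def fit_stump(x, r):
    """Find the single split on x that best reduces squared error in r."""
    best = None
    for t in sorted(set(x))[:-1]:  # drop the largest value so both sides are non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left mean, right mean)

def boost(x, y, n_trees, learning_rate):
    """Fit n_trees stumps in sequence, each to the current residuals."""
    base = sum(y) / len(y)  # start from the overall mean
    pred, stumps = [base] * len(y), []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        # Shrink each tree's contribution by the learning rate.
        pred = [p + learning_rate * (lm if xi <= t else rm) for xi, p in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if xi <= t else rm for t, lm, rm in stumps)

def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy data (made up for this sketch): a smooth curve that one split cannot follow.
x = [i / 10 for i in range(60)]
y = [math.sin(xi) for xi in x]

single = boost(x, y, n_trees=1, learning_rate=1.0)     # one ordinary stump
boosted = boost(x, y, n_trees=200, learning_rate=0.1)  # many small steps
print("training MSE, one tree: ", round(mse(single, x, y), 4))
print("training MSE, 200 trees:", round(mse(boosted, x, y), 4))
```

The single stump can only produce a two-level step function, while the sum of many shrunken stumps traces the curve much more closely, which is the "smoothness from many splits" effect described above.
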
After working through the tutorial and playing with BRTs applied to different datasets, I do see a couple of downsides. The first is that BRTs appear to require large amounts of data; they did not perform well when I applied them to relatively small (but still large by field ecology standards) datasets. Another problem is that the results depend on several settings that the user must choose before fitting, including step size, learning rate, bag fraction, and number of trees. As far as I can tell, the only way to determine which settings you should be using is to run the model several times under different combinations of settings and look for the combination that gives the best fit when the model is applied to test data. Despite these limitations, I see a lot of potential in BRTs, especially for use in SDMs, and will definitely be adding them to my statistical toolbox.
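
The "try several combinations of settings and compare them on test data" approach can be sketched as a small grid search. Again this is only an illustration in plain Python under stated assumptions, not the Dismo workflow: the data are simulated, the trees are one-split stumps, the "bag fraction" is implemented as a random subsample of the training rows for each tree, and all function and variable names are hypothetical.

```python
import itertools
import math
import random

def fit_stump(x, r):
    """Best single split of residuals r on x, by squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(x, y, n_trees, learning_rate, bag_fraction, rng):
    """Stochastic boosting: each stump sees only a random 'bag' of the rows."""
    n = len(y)
    base = sum(y) / n
    pred, stumps = [base] * n, []
    for _ in range(n_trees):
        bag = rng.sample(range(n), max(2, int(bag_fraction * n)))
        t, lm, rm = fit_stump([x[i] for i in bag], [y[i] - pred[i] for i in bag])
        stumps.append((t, lm, rm))
        pred = [p + learning_rate * (lm if xi <= t else rm) for xi, p in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if xi <= t else rm for t, lm, rm in stumps)

def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Simulated data (an assumption of this sketch), split into training and test sets.
rng = random.Random(1)
x_all = [rng.uniform(0, 6) for _ in range(100)]
y_all = [math.sin(xi) + rng.gauss(0, 0.2) for xi in x_all]
x_train, y_train = x_all[:60], y_all[:60]
x_test, y_test = x_all[60:], y_all[60:]

# Every combination of learning rate, number of trees, and bag fraction.
grid = itertools.product((0.5, 0.1, 0.01), (25, 100), (0.5, 0.75))
results = {}
for learning_rate, n_trees, bag_fraction in grid:
    model = boost(x_train, y_train, n_trees, learning_rate, bag_fraction, random.Random(0))
    results[(learning_rate, n_trees, bag_fraction)] = mse(model, x_test, y_test)

best_settings = min(results, key=results.get)
print("best (learning_rate, n_trees, bag_fraction):", best_settings)
print("held-out MSE:", round(results[best_settings], 4))
```

With real data you would probably want cross-validation rather than a single train/test split, since the winning combination from one split can be partly luck, especially with the small datasets mentioned above.
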

Some useful links for getting started with Boosted Regression Trees:
-A working guide to boosted regression trees by Elith et al. 2008
-The Dismo R package
-A tutorial on BRTs based on examples in Elith et al. 2008 and using datasets included in the Dismo R package.
-A thesis by Shane Abeare comparing BRTs to GAMs and GLMs

–Ken Feeley


