Giles Hooker – Ensembles of Trees and CLT’s: Inference and Machine Learning
Abstract: This talk develops methods of statistical inference based around
ensembles of decision trees: bagging, random forests, and boosting. Recent
results have shown that when the bootstrap procedure in bagging methods is
replaced by sub-sampling, predictions from these methods can be analyzed
using the theory of U-statistics, which have a limiting normal distribution.
Moreover, the limiting variance can be estimated within the sub-sampling
structure.
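As a toy illustration of the subsample-and-aggregate idea (this is a minimal sketch, not the estimator developed in the talk; all function names, the group structure, and the crude variance formula are assumptions for illustration), the snippet below builds one-split regression stumps on small subsamples, organizes them into groups that share an anchor observation, and reads off a between-group variance for the ensemble prediction:

```python
# Toy sketch of a subsampled tree ensemble with a crude variance estimate.
# NOT the U-statistic variance estimator from the talk; names are made up.
import random
import statistics

random.seed(0)

# Simulated data: y = 2*x + noise
X = [i / 100 for i in range(200)]
Y = [2 * x + random.gauss(0, 0.1) for x in X]

def fit_stump(xs, ys):
    """Fit a one-split regression stump by scanning candidate splits."""
    best = None
    for split in xs:
        left = [y for x, y in zip(xs, ys) if x <= split]
        right = [y for x, y in zip(xs, ys) if x > split]
        if not left or not right:
            continue
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x <= s else mr

def subsampled_ensemble(X, Y, x0, n_groups=25, trees_per_group=5, k=30):
    """Build stumps on subsamples of size k, grouped so each group shares
    one anchor index; return the prediction at x0 and a crude variance
    estimate from the between-group spread of group means."""
    group_means = []
    for _ in range(n_groups):
        anchor = random.randrange(len(X))
        preds = []
        for _ in range(trees_per_group):
            idx = [anchor] + random.sample(range(len(X)), k - 1)
            tree = fit_stump([X[i] for i in idx], [Y[i] for i in idx])
            preds.append(tree(x0))
        group_means.append(statistics.mean(preds))
    pred = statistics.mean(group_means)
    var_hat = statistics.variance(group_means) / n_groups
    return pred, var_hat

pred, var_hat = subsampled_ensemble(X, Y, x0=0.5)
```

Because every tree sees only a small subsample, the ensemble prediction behaves like an (incomplete) U-statistic, which is what makes a normal approximation with an internally estimated variance plausible.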
Using this result, we can compare the predictions made by a model learned with
a feature of interest to those made by a model learned without it, and ask
whether the differences could have arisen by chance. By
evaluating the model at a structured set of points we can also ask whether it
differs significantly from an additive model. We demonstrate these results in an
application to citizen-science data collected by Cornell’s Laboratory of
Ornithology.
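The logic of the feature-significance comparison can be sketched as follows (an illustrative toy, not the talk's exact procedure: linear fits stand in for trees, the subsamples are drawn independently, and the plug-in variance is a naive stand-in for the U-statistic estimator):

```python
# Toy sketch: compare a subsampled ensemble that uses a feature with one
# that ignores it, via a z-test at a query point.  Illustrative only.
import math
import random
import statistics

random.seed(1)

n = 300
X = [random.uniform(0, 1) for _ in range(n)]
Y = [1.5 * x + random.gauss(0, 0.2) for x in X]

def fit_with_feature(xs, ys):
    # Simple least-squares line standing in for a tree that uses x.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x, a=my - b * mx, b=b: a + b * x

def fit_without_feature(xs, ys):
    # "Model learned without the feature": just the subsample mean.
    m = statistics.mean(ys)
    return lambda x, m=m: m

def ensemble_prediction(fit, x0, n_sub=50, k=40):
    preds = []
    for _ in range(n_sub):
        idx = random.sample(range(n), k)
        model = fit([X[i] for i in idx], [Y[i] for i in idx])
        preds.append(model(x0))
    # Naive plug-in variance of the ensemble mean (subsamples overlap,
    # so this is cruder than the estimator discussed in the talk).
    return statistics.mean(preds), statistics.variance(preds) / n_sub

x0 = 0.9
p_full, v_full = ensemble_prediction(fit_with_feature, x0)
p_red, v_red = ensemble_prediction(fit_without_feature, x0)

z = (p_full - p_red) / math.sqrt(v_full + v_red)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Here x genuinely drives y, so the two ensembles disagree at x0 by far more than the estimated noise level, and the z-statistic is large.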
We will also examine recent developments that extend these distributional
results to boosting-type estimators. Boosting allows trees to be incorporated
into more structured regression models, such as additive or varying-coefficient
models, and often outperforms bagging by reducing bias.
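For readers unfamiliar with boosting, here is a minimal gradient-boosting sketch with one-split stumps (purely illustrative; it is not one of the estimators analyzed in the talk, and the learning rate and round count are arbitrary): each round fits a stump to the current residuals and adds a shrunken copy to the ensemble, progressively reducing bias.

```python
# Minimal gradient boosting with regression stumps (illustrative only).
import random
import statistics

random.seed(2)
X = [i / 40 for i in range(80)]
Y = [x * x + random.gauss(0, 0.05) for x in X]

def fit_stump(xs, ys):
    """One-split regression stump chosen by minimizing squared error."""
    best = None
    for s in xs:
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        if not left or not right:
            continue
        ml, mr = statistics.mean(left), statistics.mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x <= s else mr

def boost(xs, ys, rounds=80, lr=0.1):
    f0 = statistics.mean(ys)
    stumps = []
    resid = [y - f0 for y in ys]
    for _ in range(rounds):
        stump = fit_stump(xs, resid)       # fit the current residuals
        stumps.append(stump)
        resid = [r - lr * stump(x) for x, r in zip(xs, resid)]
    return lambda x: f0 + lr * sum(t(x) for t in stumps)

model = boost(X, Y)
mse = statistics.mean((model(x) - x * x) ** 2 for x in X)
```

A single stump is badly biased for a smooth target like x², but the boosted sum of many shrunken stumps tracks it closely, which is the bias-reduction advantage mentioned above.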
Bio: Giles Hooker is Associate Professor of Statistics and Data Science at
Cornell University. His work has focused on statistical methods using dynamical
systems models, inference with machine learning models, functional data
analysis, and robust statistics. He is the author of “Dynamic Data Analysis:
Modeling Data with Differential Equations” and “Functional Data Analysis with R
and MATLAB”. Much of his work has been inspired by collaborations, particularly
in ecology and citizen science.
Please email Meg Tully ([email protected]) for Zoom Meeting information