(this is the first in a possibly irregular series of posts about papers that catch my eye. I don’t intend to only cover MEE papers, but I had to start somewhere)
A perennial worry for anyone building models for the real world is whether they actually represent the real world. If the whole process of finding and fitting a model has been done well, the model will represent the data. But the data are only part of the real world. How can we be sure our model will extrapolate beyond the data?
If we have some extra data, we can check this by seeing whether the model fits well to the new data. If we don’t, then we can cheat by splitting our data into two sets: we use one to fit the model and the second to see how well the model predicts new data. This approach is used a lot, for example when fitting species distribution models (SDMs), but it’s not really correct. The problem is that the data were all collected at the same time, so any peculiarities of the data (e.g. because they were collected in a single year) will remain. A model that fits well to the data might not fit to next year’s data. We talk about this as the model over-fitting: it fits to the peculiarities of the data, rather than summarising the underlying biology.
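The split itself is simple to do. Here is a minimal sketch in Python (the function and variable names are my own, purely for illustration, not anything from the paper):

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly partition records into a fitting set and a hold-out set."""
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# e.g. 100 presence/absence records, here just stand-in indices
records = list(range(100))
train, test = train_test_split(records)
```

Note that this produces a random split, which is exactly what the rest of the post argues can be misleading when the records are not independent of each other.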
Over the last few years there has been a lot of species distribution modelling going on. Most of this involves throwing the data into a black box along with some data from WorldClim and seeing what it spits out. Inside a lot of these black boxes are machine learning methods like neural networks and random forests. One of the reasons for using these methods is that they perform well under cross-validation: i.e. when 30% (say) of the data are randomly withheld from the fitting of the model, and the model is then used to predict those data. But these data are correlated with the data used in the fitting, so this is not an independent test of the model. So, how well do these methods work with independent data?
Well, guess what? Someone has checked. In the most recent issue of MEE, Seth Wenger of Trout Unlimited and Julian Olden of the University of Washington report their findings. Their data are the distributions of brook and brown trout in the western US. What they do is fit distribution models to their data, and try two types of cross-validation: first the traditional sort, removing data at random, and second removing data in spatial bands.
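The difference between the two schemes can be sketched like this (a toy illustration, not the authors’ code: here I simply cut the sorted longitudes into contiguous slices to make the “bands”):

```python
import random

def random_folds(points, k=5, seed=1):
    """Standard cross-validation: assign each point to a fold at random."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for p in points:
        folds[rng.randrange(k)].append(p)
    return folds

def banded_folds(points, k=5):
    """Spatially blocked cross-validation: sort by longitude and cut into k
    contiguous bands, so held-out points are geographically separated
    from the fitting data."""
    ordered = sorted(points, key=lambda p: p[0])  # p = (lon, lat)
    size = -(-len(ordered) // k)  # ceiling division
    return [ordered[i * size:(i + 1) * size] for i in range(k)]

# toy sample sites: (longitude, latitude)
sites = [(-120 + 0.5 * i, 45 + (i % 7)) for i in range(40)]
bands = banded_folds(sites, k=4)
# every site in bands[0] lies west of every site in bands[3]
```

With the random folds, each held-out point usually has near neighbours in the fitting set; with the banded folds it doesn’t, which is what makes the banded test a harder (and more honest) check of transferability.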
They then compared the measures of how well the models did. With the standard cross-validation, random forests did best, followed by neural nets, and standard generalized linear mixed models (GLMMs, i.e. fitting straight lines and quadratic curves) were the worst. But when the fitted models were used to predict the data in the bands, the situation was reversed: the GLMMs did best.
This suggests that random forests and neural nets are over-fitting the data: the reason they do so well is that the data used to test the model are too close to the data used to fit it. Another reason to think that they are over-fitting is that the fitted curves just don’t look sensible.
Do you really believe such a complicated curve for random forests (at the top)? This is not peculiar to this study: I’ve seen horrible plots like this in other studies (MAXENT also produces uninterpretable curves).
So, what’s the take-home message? Just that simpler models seem to do better than machine learning models, which end up being just too damn complicated. I like this result, largely because I generally use these simpler models, but also because the simpler models are easier to understand. Anything that tells us we can make life simpler is always attractive.
Wenger, S., & Olden, J. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods in Ecology and Evolution, 3(2), 260–267. DOI: 10.1111/j.2041-210X.2011.00170.x