A New Modelling Strategy for Conservation Practice? Ensembles of Small Models (ESMs) for Modelling Rare Species

Post provided by FRANK BREINER, ARIEL BERGAMINI, MICHAEL NOBIS and ANTOINE GUISAN

Rare Species and their Protection

Erythronium dens-canis L. – a rare and threatened species used for modelling in Switzerland. ©Michael Nobis

Rare species can be important for ecosystem functioning and there is also a high intrinsic interest to protect them as they are often the most original and unique components of local biodiversity. However, rare species are usually those most threatened with extinction.

To help prioritize conservation efforts, the International Union for Conservation of Nature (IUCN) has published criteria to categorize the threat status of species; the resulting assessments are published in Red Lists. Change in a species’ geographical distribution is one of several criteria used to assign a threat status. For rare species, however, the exact distribution is often inadequately known. In conservation science, Species Distribution Models (SDMs) have repeatedly been used to estimate the potential distribution of rare or insufficiently sampled species.

Ensembles of Small Models (ESMs)

Falcaria vulgaris Bernh. - Another rare species that we used for modelling. ©Ariel Bergamini

SDMs have a serious limitation though: they rely on a sufficient number of occurrences to provide reliable predictions. This means they can be difficult to build for rare and under-sampled species (because they overfit), and the resulting spatial predictions can prove unreliable. So we have a bit of a ‘catch-22’: rare species are the ones for which conservationists most need models to compensate for insufficient sampling, but existing models are unreliable for them because of the small sample sizes.

In 2010 a promising new strategy was introduced that could overcome these limitations. The authors of this novel strategy used very simple bivariate models (only two predictors at a time per model) and averaged all possible combinations of bivariate models into an ensemble, weighted by cross-validated AUC score as a measure of model performance. These ensembles of small models were shown to perform very well for a single species.
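
To make the idea concrete, here is a minimal sketch in Python of the strategy described above: one small model per pair of predictors, each weighted by its cross-validated AUC when the predictions are averaged. It is an illustration under stated assumptions, not the implementation used in the paper. The choice of logistic regression as the small model, the pooled k-fold cross-validation, the synthetic data and all function names (fit_esm, predict_esm, make_demo_data and so on) are ours.

```python
import itertools
import math
import random


def predict_one(w, x):
    """Logistic prediction for one site; w[0] is the intercept."""
    z = w[0] + sum(wj * v for wj, v in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=300, lr=0.1):
    """Fit a small logistic regression by batch gradient descent (assumed model type)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_one(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / len(y) for wj, g in zip(w, grad)]
    return w


def auc(y, scores):
    """Share of presence/absence pairs ranked correctly (ties count half).

    Needs at least one presence (1) and one absence (0) in y.
    """
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cv_auc(X, y, k=5):
    """AUC of held-out predictions pooled over k folds (a simplification)."""
    held_out = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(y), k):
            held_out[i] = predict_one(w, X[i])
    return auc(y, held_out)


def fit_esm(X, y):
    """Fit one bivariate model per predictor pair and record its cross-validated AUC."""
    members = []
    for a, b in itertools.combinations(range(len(X[0])), 2):
        X_ab = [[row[a], row[b]] for row in X]
        members.append(((a, b), fit_logistic(X_ab, y), cv_auc(X_ab, y)))
    return members


def predict_esm(members, x):
    """AUC-weighted average of the bivariate predictions for one site."""
    total = sum(score for _, _, score in members)
    return sum(s * predict_one(w, [x[a], x[b]]) for (a, b), w, s in members) / total


def make_demo_data(n_sites=30, n_predictors=4, seed=0):
    """Synthetic sites: presence depends on predictors 0 and 1 only (made-up data)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_sites)]
    y = [1 if row[0] + row[1] + rng.gauss(0, 0.5) > 0 else 0 for row in X]
    return X, y


X, y = make_demo_data()
members = fit_esm(X, y)
for pair, _, score in members:
    print(pair, round(score, 2))  # predictor pair and its cross-validated AUC
print(round(predict_esm(members, [1.0, 1.0, 0.0, 0.0]), 2))  # ensemble suitability
```

With four predictors the ensemble contains six bivariate models. Each one estimates only three parameters, which is what keeps the individual fits stable when occurrences are scarce, while the ensemble as a whole still draws on every predictor.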

In our paper we tested the ESM strategy thoroughly and found that ESMs always performed better than standard SDMs when sample size was small. The fewer occurrences available for modelling, the greater the gain in performance, although there is likely some absolute minimum (still to be defined) below which even small models cannot be built.

ESMs as a Tool for Conservation Management

ESMs can help improve the reliability and accuracy of SDMs for many applications in conservation practice. Here we review some of the most promising applications of ESMs.

  1. Prospective sampling

An important application of ESMs is to predict the potential habitat of a species and use this information to detect new occurrences in the field. The data collected with this field-based approach, called prospective sampling, can then be incorporated into an updated SDM. This process can be repeated iteratively until the ‘true distribution’ is known.

Prospective sampling is a suitable way to better understand the distribution of rare and under-sampled species, which could in turn be used to assign a Red List status. However, the approach has been difficult to initiate when too few observations are available for the standard SDM approach. ESMs offer a solution to this problem and would allow researchers to initiate the iterative procedure. ESMs could then be replaced by standard SDMs once enough observations have been gathered.

A habitat suitability map used for the prospective sampling approach to predict the potential distribution of Leucanthemum halleri (Vitman) Ducommun in the Swiss Alps (orange-black dots show new-found occurrences which were unknown in the database before)

  2. Climate change impact on communities

Climate change has a strong effect on the loss of biodiversity. ESMs could be used to assess climate change impacts on the distribution of rare species, allowing the species most vulnerable to climate change to be identified. This would be particularly important in the context of modelling the response of future communities to climate change, which is currently hampered by our inability to model all or most species in biological communities. Current community modelling efforts usually only include species observed frequently enough to be modelled with standard SDMs. Using ESMs instead of standard SDMs would allow researchers to include all species, including those with low frequencies, and increase the accuracy of the predicted communities.

  3. Invasive species risk assessment

When non-native species start to colonise a new range, little is known about where the risk of them competing with rare species is highest. Standard SDMs calibrated at the early stages of invasions are likely to be less accurate due to data limitations, which would result in high prediction uncertainty. In such cases ESMs could be used at the early stages of invasions, when sample size is limited, to improve risk assessments for exotic species prone to becoming invasive.

  4. Translocation of species

In conservation practice rare species are often translocated to new habitats or used for recolonisations to increase their potential for long-term survival. For such assisted migration it is essential to find a habitat that is suitable for the species, to increase the chances of success. Suitable habitat for rare species could best be identified using ESMs.

We believe that the contribution of ESMs to nature conservation is important – as shown by ESMs being granted the MCED-award 2015 – and many applications are possible. However, ESM applications likely require more thorough testing and some useful improvements may be developed. In particular, the ESM approach may be implemented using existing statistical frameworks, such as tuning parameters in random forest (RF) or boosted regression trees (BRT) to force them to fit small models (e.g. bivariate) before ensembling, or using generalized or additive models (GLM/GAM) with a multi-model inference framework.

To find out more about Ensembles of Small Models, read our Methods in Ecology and Evolution article ‘Overcoming limitations of modelling rare species by using ensembles of small models’.

This article is part of our Virtual Issue on Endangered Species. All articles in this Virtual Issue are freely available for a limited time.

2 thoughts on “A New Modelling Strategy for Conservation Practice? Ensembles of Small Models (ESMs) for Modelling Rare Species”

  1. I apologise if I’m missing something here, but it seems tautological that the ensemble forecasts will exhibit better performance (here defined by AUC scores). This is because AUC is used in the weighting process for the models in the ensemble, meaning that a composite model output that has stronger weights for components with higher AUC scores will, by definition, outcompete an alternative approach which is not necessarily trying to maximise AUC (maybe other metrics like likelihood, AIC, least-squares, or entropy).

    We could, for example, maximise our AUC further by simply giving all the weight to the component with the highest cross-validation AUC. Of course this would be argued against on the grounds that the one model is too simplistic in its representation of the ecological characteristics of the species in question: we would be told that if you run lots of models then, by chance, we expect some to have high AUC scores. However, it is not clear how taking an average of a lot of models (weighted or otherwise), none of which individually represents the species’ ecological characteristics, will necessarily result in an output that does.

    I think that it is also important to note that when you’re making a ‘super-model’ by combining other model outputs together, you can’t claim parsimony because of the simplicity of the components. Each of these components has parameters that need to be estimated, and the ‘super-model’ output is still a function of all the covariates. In other words, a ‘super-model’ of smaller models glued together is often *at least* as complex as one regression model that included all these covariates in the first place.

    • Dear Joe,
      Thank you for initiating the debate on our new post. Here are some tentative answers, and new questions!
      We acknowledge that using the same evaluation metric (e.g. AUC) might favour ensemble models if the weighted average of sub-models is based on the same index as that used to evaluate the ensemble predictions through repeated split-sample validation. This is an important comment, since this situation arises particularly when working with small datasets, where a fully independent test dataset cannot easily be set aside for external evaluation, as is typically the case with rare and endangered species. However, we limited this problem in our study by: (1) evaluating model performance (through repeated split-sample validation) not only with AUC, but also with TSS, Sensitivity, Specificity, and the Boyce index (Hirzel et al. 2006) based on presence-only data, and the results were consistent across these metrics; in particular, the relative importance was greater for ESMs when evaluated with the Boyce index compared to AUC (table 2 and fig. 3 and 4 of the paper), and the number of species for which ESMs were superior to standard SDMs was greater when evaluation was based on the Boyce index compared to AUC (table 3 of the paper); (2) using an ensemble of different modelling techniques on both sides (‘standard SDMs’ and ‘Ensemble of small models, ESMs’), weighted in both cases by a cross-validated AUC of sub-models; therefore, if using AUC to both weight and evaluate the ensemble was a problem, it should affect both sides equally; (3) showing that ESMs built with a single technique still outperform the predictions based on standard SDMs fitted with an ensemble of several techniques; and finally (4) also evaluating model performance on fully independent data for the ‘rare’ group of species, with the same or even greater relative performance of ESMs compared to standard SDMs. We therefore think that our results show a real improvement by ESMs, but we may have overlooked other limitations.
      AIC, proposed as a way to weight the submodels in the ensemble, seems to us rather to provide a measure of model fit, as typically used in traditional multi-model inference and model averaging in GLMs or GAMs (see Burnham and Anderson 2002; e.g. fitted with the MuMIn R package). It does not as such provide a measure of predictive performance, and may be more difficult to calculate for GBM, Random Forest or other modelling techniques. Indices measuring predictive performance, like AUC, TSS, Boyce and other such metrics, have the advantage of being post-modelling metrics that can be calculated for all techniques on independent data. We agree that complementary metrics could also be used, and we are interested in any suggestions, but the metric used for weighting the ensemble should preferably be based on internal cross-validation, to reduce overfitting and favour transferability of the ensemble model to independent data.
      On the issue that ESMs are not simple but rather complex models, we fully agree and indeed did not intend to present them as simple models, at least if complexity is defined as the number of parameters used in a model (but other definitions may be used; see Merow et al. 2014). We did not use the term ‘super-model’ and did not really discuss issues of simplicity and complexity in our paper, but we acknowledge that this is an important issue to be addressed. ESMs were initially conceived to avoid over-fitting in each small model, while still integrating enough variables to account for as many dimensions as possible of a species’ ecological niche. But this comes at the cost of developing more ‘complex’ ensemble models. Indeed, some recent and also more complex modelling techniques, like random forest (RF, based on bagging) and boosted regression trees (BRT), also use ensembles of sub-models and often prove superior to non-ensemble techniques, even on independent data (e.g. Elith et al. 2006 for GBM, Prasad et al. 2006 for RF). Two interesting questions in this regard would be: 1) to establish a formal link between the ESM approach proposed here and these alternative statistical approaches, and assess whether there would be a way to implement the ESM approach by simply tuning some parameters of RF or BRT; and 2) to assess the effect of increasing complexity in these ‘ESM-like’ models compared to standard ensembles of SDMs, as the latter usually include fewer models in the final ensemble than the former. In this regard, a valid question may be: “what defines complexity in these two types of ensemble models?” Is it the number of predictors used or the sum of the complexities of the responses allowed for the different predictors (e.g. degrees of freedom in GAMs)?
      Answering these questions would certainly help to better assess the future role of ensembles of small models in predicting rare species (or rare events in general), and whether this approach could be better embedded in existing statistical frameworks (like RF, BRT, multi-model inference, etc.).
      Antoine Guisan, Frank T. Breiner, Ariel Bergamini, Michael P. Nobis

      References cited:
      Burnham, K.P. & Anderson, D.R. (2002) Model selection and multi model inference: a practical information-theoretic approach. Springer-Verlag, New York.
      Elith, J., Graham, C.H., Anderson, R.P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R.J., Huettmann, F., Leathwick, J.R., Lehmann, A., Li, J., Lohmann, L.G., Loiselle, B.A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J.M., Peterson, A.T., Phillips, S.J., Richardson, K., Scachetti-Pereira, R., Schapire, R.E., Soberon, J., Williams, S., Wisz, M.S. & Zimmermann, N.E. (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29, 129-151.
      Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C. & Guisan, A. (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecological Modelling, 199, 142-152.
      Merow, C., Smith, M.J., Edwards, T.C., Guisan, A., McMahon, S.M., Normand, S., Thuiller, W., Wüest, R.O., Zimmermann, N.E. & Elith, J. (2014) What do we gain from simplicity versus complexity in species distribution models? Ecography, 37, 1267-1281.
      Prasad, A.M., Iverson, L.R. & Liaw, A. (2006) Newer classification and regression tree techniques: Bagging and random forests for ecological prediction. Ecosystems, 9, 181-199.
