Code-Based Methods and the Problem of Accessibility

Post provided by Jamie M. Kass, Matthew E. Aiello-Lammens, Bruno Vilela, Robert Muscarella, Cory Merow and Robert P. Anderson

The namesake of our software and founder of the field of biogeography, Alfred Russel Wallace. Photo ©G. W. Beccaloni

In ecology, new methods are increasingly accompanied by code, and sometimes even by full command-line software packages (usually in R). This is great, as it makes analyses more reproducible and transparent, which is essential for the development of open science. In an ideal world, code would have informative annotation and generalized functions for multipurpose use, and would be written in a legible and consistent manner. After all, the code may be used by ecologists with a wide range of programming experience.

In reality, code is often poorly commented (or not commented at all!), hard to reuse for other projects, and difficult to interpret. To add to that, most code isn’t actively maintained, so users are on their own if they try to adapt it for new purposes. Further, ecologists with little or no programming knowledge are unlikely to benefit from methods that exist only as poorly documented code. In a positive development, some new methods are accessible through software with graphical user interfaces (GUIs), which programmers develop with significant time and effort. But too often these end up as tools with flashy controls and insufficient instruction manuals.

How Wallace Came to Be

We struggled with this problem in 2015 when we were thinking about submitting to the GBIF Ebbe Nielsen Challenge, with its call for “bringing biodiversity information to life”. As developers of R packages for species distribution modeling (SDM) applications, we wanted to make our methods more accessible to a broader spectrum of biodiversity data users. Our aim was to develop a GUI-based application to access R package tools that leads users through an entire analysis workflow (without having to switch between multiple applications to handle spatial data or spreadsheets). We also wanted to avoid the pitfalls of ‘black box’ applications by encouraging experimentation and decision-making.

We were heavy users of Maxent (our team includes a co-author of the original paper), which features easy-to-use software that masks its underlying complexity. Maxent has been criticized for enabling “quick and dirty” analyses without much understanding of the model or its outputs, despite research demonstrating the need for more detailed analyses. Because of its widespread use (the original article has been cited over 8000 times), Maxent was a poster child for ‘black box’ applications in ecology. Its code has since become open source (Phillips et al. 2017), and the algorithm is much better understood now. However, we thought there were many lessons to be learned from its initial implementation, and knew that greater flexibility and more in-depth analyses were necessary to produce good models.

After many fascinating hours of debate, trial and error, and more than a few pots of coffee, we came up with Wallace v0.1.0. This version of the software stitched together functions from a collection of R packages to create a step-by-step SDM analysis accessible through a graphical user interface (via package shiny) with an interactive map (via package leaflet). To encourage best practices, we included extensive guidance text at every step, with references to the literature. This submission became a finalist in the competition, and we went on to develop Wallace v0.2.0 for round 2.

Figure 2. The Wallace v1.0.0 interface: (1) navigation bar with component tabs, (2) toolbar with component name and module selection, (2a) selected module name and featured R package/s, (2b) control panel for selected module, (3) visualization space, (3a) log window, (3b) interactive map, results, and guidance text.

We set out to construct the next version in a modular way: different parts of the analysis would be discrete pieces of code. This would allow other developers to add modules and increase functionality in the future. We also added alternative modules for existing steps (e.g., the algorithm BIOCLIM in addition to Maxent). To address the reproducibility shortfall in GUI apps, we included the option to download an R Markdown script that reproduces the current analysis. The development of this beta version, along with encouraging words from colleagues, provided us with the inspiration to forge ahead.
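To make the modular idea concrete, here is a minimal sketch of how interchangeable modules and a replayable session log could fit together. It is written in Python for illustration only and is not Wallace's actual code (Wallace is built in R with shiny and exports R Markdown). Every name in it (Module, Session, range_envelope, mean_window) is hypothetical, and the two "models" are toy stand-ins rather than BIOCLIM or Maxent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Module:
    """One interchangeable implementation of an analysis step (hypothetical)."""
    component: str            # analysis step this module belongs to, e.g. "model"
    name: str                 # module label shown to the user
    run: Callable[..., dict]  # run(state, **params) -> updated state
    template: str             # script line that reproduces the call


@dataclass
class Session:
    """Holds registered modules and a log of every step the user ran."""
    registry: Dict[str, Dict[str, Module]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def register(self, module: Module) -> None:
        # New modules slot in under their component without touching others.
        self.registry.setdefault(module.component, {})[module.name] = module

    def run(self, component: str, name: str, state: dict, **params) -> dict:
        module = self.registry[component][name]
        new_state = module.run(state, **params)
        # Record the call so the whole session can be replayed as a script.
        self.history.append(module.template.format(**params))
        return new_state

    def export_script(self) -> str:
        return "\n".join(self.history)


def range_envelope(state: dict, margin: float) -> dict:
    """Toy model: observed min..max of one variable, widened by a margin."""
    values = state["temperature_at_occurrences"]
    return {**state, "suitable_range": (min(values) - margin, max(values) + margin)}


def mean_window(state: dict, width: float) -> dict:
    """Toy alternative: a fixed-width window centred on the mean."""
    values = state["temperature_at_occurrences"]
    centre = sum(values) / len(values)
    return {**state, "suitable_range": (centre - width / 2, centre + width / 2)}


session = Session()
session.register(Module("model", "range_envelope", range_envelope,
                        "result = range_envelope(state, margin={margin})"))
session.register(Module("model", "mean_window", mean_window,
                        "result = mean_window(state, width={width})"))

state = {"temperature_at_occurrences": [12.0, 15.0, 18.0]}
result = session.run("model", "range_envelope", state, margin=1.0)
print(result["suitable_range"])  # (11.0, 19.0)
print(session.export_script())   # result = range_envelope(state, margin=1.0)
```

The point of the design is that adding an alternative for an existing step only requires registering one more module, and the session log doubles as the reproducible script.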

What Makes Wallace Effective as Modern Scientific Software?

To help us focus on our next steps, we began writing a manuscript with one year of funding from the U.S. National Science Foundation (NSF). We initially just wanted to write up what we had already done, but it soon became clear that we could use this process to think about the future of Wallace. We asked ourselves, what was our vision for an ideal GUI-based ecological modeling software? In searching for our answer, we set down the characteristics that we felt distinguished Wallace from similar projects at the time, and came up with six qualities:

  • Open – the code is open-source (GNU GPL 3.0), and the linked data are freely available
  • Expandable – our modular approach makes it easier for community members to modify and add to Wallace to meet their needs
  • Flexible – there are multiple data upload and download options, and different choices for most steps (or ‘components’)
  • Interactive – visualization tools and user participation encourage experimentation and engagement
  • Instructive – extensive guidance text allows users to construct and interpret models wisely, and exposes them to the relevant literature
  • Reproducible – there’s an option to download an executable R Markdown script to rerun the analysis that can be shared, examined, and included with research products

Wallace is more than simply a shell for a species distribution modeling analysis. It’s a tool for carrying out some of the most fundamental tasks in biogeographic analyses, and it has the potential to become much more. For example, using the spocc package, users can download species occurrence data from different open databases like GBIF or VertNet. Using the dismo package, users can acquire climatic data and find the corresponding values for each occurrence location. Importantly, Wallace allows the user to perform basic GIS tasks, such as selecting points, drawing polygons, and visualizing occurrence localities on a map, without relying on other software. It also makes model evaluation with different cross-validation procedures more accessible with the ENMeval package.
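As a rough illustration of what a spatial cross-validation procedure involves, the sketch below partitions occurrence points into four geographic blocks and builds leave-one-block-out folds. It is a simplified Python stand-in that we assume for illustration, not ENMeval's implementation: the function names and coordinates are hypothetical, and the blocks are cut at the overall median longitude and latitude, so they need not hold equal numbers of points.

```python
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def spatial_blocks(points: List[Point]) -> List[int]:
    """Assign each point to one of four blocks (0-3) split at the median
    longitude and median latitude of all points."""
    lon_split = median(lon for lon, _ in points)
    lat_split = median(lat for _, lat in points)
    # 0 = south-west, 1 = south-east, 2 = north-west, 3 = north-east
    return [2 * (lat > lat_split) + (lon > lon_split) for lon, lat in points]


def leave_one_block_out(blocks: List[int]) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs, one per non-empty block."""
    folds = []
    for held_out in sorted(set(blocks)):
        train = [i for i, b in enumerate(blocks) if b != held_out]
        test = [i for i, b in enumerate(blocks) if b == held_out]
        folds.append((train, test))
    return folds


# Made-up occurrence coordinates, for illustration only.
occurrences = [(-75.0, 4.0), (-74.0, 5.0), (-70.0, 4.5),
               (-69.0, 10.0), (-76.0, 11.0), (-68.0, 12.0)]

blocks = spatial_blocks(occurrences)
print(blocks)  # [0, 0, 1, 3, 2, 3]

for train, test in leave_one_block_out(blocks):
    print("train:", train, "test:", test)
```

Holding out whole geographic blocks, rather than random points, keeps test localities spatially separated from training localities, which is one reason to compare several partitioning schemes when evaluating a model.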

These are just some qualities that make Wallace effective as a GUI-based ecological modeling app, but we think that they are also important characteristics for scientific software in any field. We hope that Wallace will act not only as a tool for species distribution modeling, but also as a model of scientific software to come.

Implications for the Future

We’d like to see Wallace and other GUI-based apps that use R packages become a vital bridge between the cutting edge of ecological methods and everyone who would like to use them. In addition, they can serve as teaching tools, champions of reproducibility, and hubs of collaborative energy. At present, there aren’t many flexible options for those in the ecology/evolution community who want to run recently published analyses but have little coding experience (but see, for example, BCCVL). We believe that deep knowledge of R should not be a prerequisite to doing good, reproducible science, and tools like Wallace could help fill the gap.

With substantial support again from the NSF and now also NASA (the latter via a project led by Mary Blair of the Center for Biodiversity Conservation at the American Museum of Natural History), we’re expanding Wallace in exciting new directions. We plan to add a range of new modules featuring more modeling techniques, more connections to different kinds of occurrence and environmental data, greater capabilities for environmental-space visualization and analyses, and more downstream analyses that use SDM predictions (with an emphasis on applying analyses to biodiversity conservation by calculating biodiversity indicators).

To do this, we’ve been working with external partners who have developed great R packages and functions for ecological modeling analyses. Since no single group has all the answers for any particular class of methods, we’re reaching out to others to work together and benefit from a diversity of ideas. These partners, many from outside the US, have begun visiting our main base of operations at the City College of New York, City University of New York, to engage in short hackathon-style development sessions, and to take part in similar activities at Yale led by Cory Merow.

New features in the works include obtaining paleoecological occurrence and climatic data, and running analyses that use multiple models (such as niche overlap). The success of this approach is showing how open scientific software development can be done by a collective, rather than a single laboratory, for the benefit of a varied user community. Through these trials over the past several months, we’ve steadily been streamlining the module addition process, and we expect independent contributions to become progressively easier in the future. We’re also planning to release educational materials and vignettes that are inspired by our interactions with our partners, colleagues, and user community.

As we move forward, we remain steadfast in our goal of developing a tool for an extremely broad cross-section of ecologists that’s also a model for good scientific software. We’d like to invite you, our colleagues, to contribute to Wallace and make ecological modeling methods more accessible to everyone.

To find out more about Wallace, read our Methods in Ecology and Evolution article ‘Wallace: A flexible platform for reproducible modeling of species niches and distributions built for community expansion’. This article is freely available – no subscription required.
