Being Certain about Uncertainty: Can We Trust Data from Citizen Science Programs?


Citizen Science: A Growing Field

Thousands of volunteers around the world work on Citizen Science projects. ©GlacierNPS

As you read this, thousands of volunteers of all ages and backgrounds are collecting information for over 1,100 citizen science projects worldwide. These projects cover a broad range of topics: from volunteers collecting samples of the microbes in their digestive tracts, to tourists providing images of endangered species (such as tigers) that are often costly to survey.

The popularity of citizen science initiatives has been increasing exponentially in the past decade, and the wealth of knowledge being contributed is overwhelming. For example, almost 300,000 participants have submitted around 300 million bird observations from 252 countries worldwide to the eBird program since 2002. Amazingly, rates of submissions have exceeded 9.5 million observations in a single month!

The combined effort of millions of citizen scientists generates an impressive quantity of valuable information. Most importantly, that information is being collected across spatial and temporal scales previously unimaginable for biological monitoring programmes. So it should come as no surprise that the involvement of citizen scientists in research has proven to be highly valuable for these monitoring studies. In fact, over 70% of all published papers using citizen science data are in biology-related fields.

The use and application of citizen science data has only begun to reach its full potential. The combination of multiple sampling designs with differences in protocol structure in many volunteer-based programmes often presents analytical challenges. However, if scientists can incorporate and account for these factors, many believe that citizen science data can be significantly more useful for informing research at local and global scales.

At the same time, the inclusion of specific and unified sampling protocols (e.g. the Breeding Bird Surveys in the UK and the United States), along with clear objectives, can improve the overall quality of the information being gathered. These improvements have been found to be strong indicators of the return on investment in collecting and analysing citizen science data to inform management and policy.

Collecting and Analysing Citizen Science Data

To determine how best to collect and analyse citizen science data, we must first answer two questions:

  1. Are citizen science data that different from data collected by trained technicians and scientists?
  2. What are the ways we can reduce uncertainty in the biological information we’re able to get from citizen science programmes?

Because of the wide breadth of citizen science projects, it’s impossible to properly address these questions in a way that is applicable to all of them. We chose to focus on the work published in our recent paper, ‘Uncertainty in biological monitoring: a framework for data collection and analysis to account for multiple sources of sampling bias’, which deals with issues of data quality where the focus of inference is the probability that an event will occur (e.g. the probability of finding a rare species or detecting an emerging disease).

The answer to the first question is: “not really”. Recent evidence suggests that there is not that much of a difference between information collected by average citizens and that collected by technicians and scientists. For example, Danielsen et al. (2014) analysed data collected by trained and untrained individuals on the status and trends of an impressive 63 vertebrate taxa in 34 tropical forest sites across four countries and the results were indistinguishable. The community members collecting the data (a.k.a. citizen scientists) produced similar results to the scientists.

To answer the second question, we have to take a close look at the two main sources of sampling bias for any biological monitoring programme where the main objective is to collect information needed to estimate the probability that an event will occur:

  1. The probability of missing an event that has actually occurred
  2. The probability of inadvertently reporting an event that has not occurred

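We illustrate these principles using a hypothetical example, where we want to estimate the probability that an individual deer has chronic wasting disease (disease), given that the sample that was taken from this individual tested positive for the disease (+test):

$$P(\text{disease} \mid +\text{test}) = \frac{P(+\text{test} \mid \text{disease})\,P(\text{disease})}{P(+\text{test} \mid \text{disease})\,P(\text{disease}) + P(+\text{test} \mid \text{no disease})\,P(\text{no disease})}$$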

This equation might seem abstract and irrelevant for our everyday lives, but this probability structure of the occurrence or absence of an event (otherwise known as Bayes’ theorem) is the basis of all our inferences in the medical and biological sciences!

The numerator of this probability is the probability of testing positive for the disease, given that the individual has the disease (i.e. the sensitivity of the medical test), multiplied by the prevalence of the disease in the population. The denominator is the sum of that product and the probability of testing positive when the individual does not have the disease (i.e. 1 − specificity of the medical test) multiplied by the proportion of healthy individuals in the population. Failing to correct for both of these error probabilities, even when they are small, has been shown to greatly bias our inferences: the predicted occurrence of a rare event could be biased by as much as 70% in some cases.
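The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sensitivity, specificity and prevalence values are hypothetical, and are not taken from the paper or from any real test for chronic wasting disease.

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """Return P(disease | +test) using Bayes' theorem."""
    # P(+test | disease) * P(disease)
    true_positive = sensitivity * prevalence
    # P(+test | no disease) * P(no disease)
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical values: a fairly accurate test applied to a rare disease.
posterior = posterior_given_positive(
    sensitivity=0.95, specificity=0.95, prevalence=0.01
)
print(round(posterior, 3))  # 0.161
```

Even with a test that is right 95% of the time in both directions, a positive result for a rare disease corresponds to only about a 16% chance that the animal is actually infected under these assumed values. This is why a small false-positive probability matters so much when the event of interest is rare.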

Correcting for Bias: False-Negatives

Mourning Doves are one of the 16 focal species for CUBS. ©CUBS

We now know that, in some cases, data collected by trained and untrained individuals are similar, but that we must correct for bias in both types of data. So how can we apply this probability structure to reduce uncertainty in inference based on biological information? First, we need repeated visits to estimate the probability of detecting an event when it happens. This detection probability is known as sensitivity in the medical field, and its complement is the false-negative probability in the biological sciences. This sampling approach is the backbone of numerous statistical advances in what are commonly known as occupancy models.

Surprisingly, few citizen science programmes have applied this type of sampling design. The North American Amphibian Monitoring Program (NAAMP) of the U.S. Geological Survey, where volunteers survey routes 3-4 times per year from spring through summer to record the presence or absence of frog species, is one of the few that do. Another good example is the Celebrate Urban Birds Program (CUBS) of the Cornell Lab of Ornithology, where school groups and other volunteers visit a site three times in a week, any week of the year, and record the presence or absence of 16 focal bird species in green spaces in cities.
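To see why repeated visits help, here is a minimal sketch of the basic occupancy-model likelihood for a single site, assuming no false positives. The function names, the occupancy probability `psi` and the per-visit detection probability `p_detect` are illustrative assumptions, not values or code from NAAMP, CUBS or the paper.

```python
def prob_detected_at_least_once(p_detect, n_visits):
    """Chance that a species present at a site is detected on at least one visit."""
    return 1.0 - (1.0 - p_detect) ** n_visits


def site_likelihood(history, psi, p_detect):
    """Likelihood of a detection history (list of 0/1) with no false positives.

    psi is the probability that the site is occupied; p_detect is the
    per-visit detection probability when the site is occupied.
    """
    detections = sum(history)
    visits = len(history)
    occupied = psi * p_detect**detections * (1.0 - p_detect) ** (visits - detections)
    if detections > 0:
        # Any detection means the site must be occupied.
        return occupied
    # All zeros: either occupied but missed every time, or truly unoccupied.
    return occupied + (1.0 - psi)


print(prob_detected_at_least_once(0.5, 3))      # 0.875
print(site_likelihood([0, 0, 0], psi=0.6, p_detect=0.5))  # 0.475
```

With a single visit, an all-zero history cannot separate "absent" from "present but missed". With several visits, the pattern of detections and non-detections across sites carries the information needed to estimate both quantities.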

Correcting for Bias: False-Positives

At this point, we are able to correct for false-negative probabilities using a repeated visit sampling design. However, to correct for false-positive probabilities (i.e. 1 − specificity), we need to collect additional information. The nature of that information depends on the statistical model that will be used to make inferences. These models fall under three general categories:

  • Information collected in the field that is validated or can be safely assumed to be true detections of an event (Site Confirmation Model)
  • Independent sources of information on both false-positive and false-negative probabilities (Calibration Model)
  • Different types of information collected during sampling that can be classified as true absences, true presences, false-positives, and a combination of false absences and presences (Observation Confirmation Model)

Which approach is most useful for citizen science data when we have repeated visits as part of our sampling framework? Obtaining field data that can be accurately classified as “true” is likely to be too costly, or simply not feasible, for millions of observations. This rules out the Site Confirmation and Observation Confirmation models.

The Calibration Model: A Promising Way to Account for Uncertainty in Monitoring

This makes the Calibration Model the most promising approach, and the steps to apply this model to citizen science data are the focus of our recent paper in Methods in Ecology and Evolution. We developed a computationally efficient and flexible model that can accommodate repeated visit data to estimate false-negative probabilities. It also incorporates an independent test data-step to estimate false-positive probabilities. This model proved to be very accurate across a range of simulated scenarios, and it can accommodate large volumes of both field observations and independent test data.
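The general idea of combining repeated visits with independent test data can be sketched as follows. This is not the model from the paper: it is a simplified mixture likelihood in which the false-positive probability `p10` is estimated from hypothetical test trials and then plugged in, and all names and numbers are assumptions made for illustration.

```python
from math import comb


def estimate_false_positive_rate(false_reports, absent_trials):
    """Share of test trials with the species absent where it was still reported."""
    return false_reports / absent_trials


def binom_pmf(y, k, p):
    """Binomial probability of y successes in k trials."""
    return comb(k, y) * p**y * (1.0 - p) ** (k - y)


def site_likelihood_fp(y, k, psi, p11, p10):
    """Likelihood of y reported detections in k visits, allowing false positives.

    psi: occupancy probability; p11: per-visit detection probability at an
    occupied site; p10: per-visit false-positive probability at an unoccupied site.
    """
    return psi * binom_pmf(y, k, p11) + (1.0 - psi) * binom_pmf(y, k, p10)


# Hypothetical test data: 30 false reports in 1,000 trials with the species absent.
p10_hat = estimate_false_positive_rate(false_reports=30, absent_trials=1000)
likelihood = site_likelihood_fp(y=1, k=3, psi=0.4, p11=0.5, p10=p10_hat)
print(p10_hat, round(likelihood, 4))
```

Unlike the likelihood without false positives, a site with one reported detection is no longer treated as certainly occupied: part of the likelihood comes from the possibility that the report was an error at an unoccupied site.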

A Northern Cricket Frog.  ©Patrick Coin

To evaluate the usefulness of our approach, we applied our model to data from the NAAMP programme, together with data from a field test carried out to estimate the false-positive probabilities of NAAMP volunteers. Without this approach to correct for both sources of uncertainty, existing models would have overestimated the occurrence of the Northern Cricket Frog (Acris crepitans) in the Northeastern US by as much as 67%.

This is just the start though. Online platforms need to be developed to gather more test data to inform false-positive probabilities. Our simulations showed us that for rare species that are hard to detect, we would need to carry out at least 15,000 independent test trials for each species to be able to make accurate inferences using monitoring data collected by trained technicians and scientists, as well as those collected by citizen scientists.
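As a rough intuition for why so many test trials are needed, the precision of an estimated error rate improves only with the square root of the number of trials. The sketch below uses the textbook binomial standard error with a hypothetical false-positive rate of 0.02. It is not the simulation from the paper and does not reproduce the 15,000 figure.

```python
from math import sqrt


def binomial_standard_error(p, n_trials):
    """Standard error of an estimated proportion p from n_trials independent trials."""
    return sqrt(p * (1.0 - p) / n_trials)


# Hypothetical false-positive rate of 2%, at three sample sizes.
for n_trials in (150, 1500, 15000):
    se = binomial_standard_error(0.02, n_trials)
    print(n_trials, round(se, 4))
```

Multiplying the number of trials by 100 shrinks the standard error only by a factor of 10, so a small error rate needs a large number of trials before it can be estimated precisely.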

For this, we recommend that citizen science programmes leverage the popularity of mobile applications to carry out these tests. For example, the Merlin bird identification app could have a “Test your skills” game, where observers could test their visual and audio bird identification skills. This would help scientists (including our group at the Cornell Lab of Ornithology) to get the information they need to make better inferences from bird observations collected by citizen scientists!

