Tracking Auk Breeding With Mixed Modelling

Since 1982 the UK’s Centre for Ecology and Hydrology has been tracking the breeding habits of auks on the Isle of May, a small island east of Scotland. The auk family of sea birds, which includes the iconic puffin, live primarily on the coasts of the northern Atlantic Ocean, and climate change could have significant effects on their ability to breed. We’ll use the CEH’s collected data, plus weather and temperature data from the Meteorological Office, to see how various species of auk have fared over the course of this long-term study. We’ll use the {lme4} package to create random-intercept, mixed-effect linear models that account for the hierarchical structure of the data.

Throughout this report, I’ve included the code used in collapsible sections. For more insight into how I developed a particular solution, just click “code”.

Preparing the Data:

breeding <- read.csv("IMLOTSBSDataset1982-2016.csv")

# the data is in wide format, so we'll pivot it to make a "species" column,
#and use regex to assign new, accurate column names.
br <- breeding %>% 
  pivot_longer(
    cols = 2:13,
    names_to = c("species", "type"),
    names_pattern = "([\\w]+)[.]([\\w]+)")

#now we'll pare it down to breeding success data:

success <- br %>%
  filter(type == "BS")

#looks good!

Now we’ve arranged the data so we have each species’ rate of breeding success for each year of the study easily within reach. We’ll create a basic time series showing how each species’ breeding success has change over time. This will help us see patterns that might be useful for further study.

Visualizing breeding success over time:

birdpal <- c(Guillemot = "#F7DEDE", Razorbill = "#BAD5EE", Puffin = "#5071A0", Shag = "#111A36", Kittiwake = "#922A1C", Fulmar = "#F06A60")

ggplot(success, aes(x=year, y=value, color = species)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = birdpal) +
  ylab("breeding success")

#faceted with lines of best fit:
 
ggplot(success, aes(x=year, y=value, color=species)) +
  geom_point() +
  facet_wrap(~species) +
  geom_smooth(method = "lm") +
  theme_bw() +
  theme(legend.position = "none") +
  scale_color_manual(values = birdpal) +
  ylab("breeding success")

FALSE `geom_smooth()` using formula 'y ~ x'

Our visualizations show a lot of variation from species to species! Shags and Kittiwakes appear to have increased breeding success over time, while the others are showing a mild decline. There’s also a lot of differences in error - how closely those data adhere to the basic linear pattern. To get more numbers in front of us, we’ll run a linear regression on each species in the dataset and collect that important info in a table.

Basic Linear Regressions by Species:

bird_list <- unique(br$species)
t <- c("species", "slope", "conflo", "confhi", "r2", "adj r2", "pval")
r <- data.frame("species", "slope", "conflo", "confhi", "r2", "adj r2", "pval")
names(r) <- t

for (b in 1:length(bird_list)) {
  bird <- bird_list[b]
  df <- success %>%
    filter(species == bird)
  line <- lm(value ~ year, data = df)
  slope <- round(tidy(line)[2,2],5)
  conflo <- round(confint(line)[2,1],5)
  confhi <- round(confint(line)[2,2],5)
  r2 <- round(summary(line)$r.squared,5)
  r2a <- round(summary(line)$adj.r.squared,5)
  pval <- round(tidy(line)[2,5],5)
  newvector <- c(bird, slope, conflo, confhi, r2, r2a, pval)
  names(newvector) <- t
  r <- rbind(r, newvector)
}
r <- r[-1,]
r %>%
  mutate(confhi = as.numeric(confhi),
         conflo = as.numeric(conflo),
         conf.range = (confhi-conflo))

FALSE     species    slope   conflo   confhi      r2   adj r2    pval conf.range
FALSE 1 Guillemot -0.00508 -0.00889 -0.00126 0.18137  0.15657 0.01075    0.00763
FALSE 2 Razorbill -0.00548 -0.00798 -0.00299 0.37659   0.3577   9e-05    0.00499
FALSE 3    Puffin -0.00631 -0.00986 -0.00276 0.28397  0.26228 0.00098    0.00710
FALSE 4      Shag  0.03953  0.01958  0.05948 0.37045  0.34796 0.00036    0.03990
FALSE 5 Kittiwake  0.00763 -0.00742  0.02268 0.03708  0.00269 0.30797    0.03010
FALSE 6    Fulmar -0.00164 -0.00652  0.00324  0.0166 -0.01852 0.49741    0.00976

#visualizing our basic linear models
ggplot(success, aes(x=year, y=value, color=species)) +
  geom_point(alpha = .4) +
  geom_smooth(method = "lm", se=TRUE, alpha=.2) +
  theme_bw() +
  scale_color_manual(values=birdpal) +
  ylab("breeding success") +
  theme(legend.position = c(.1,.70),
           legend.text = element_text(size=8),
           legend.title = element_text(size=9)) +
  ggtitle("Linear models of breeding success by species")

FALSE `geom_smooth()` using formula 'y ~ x'

From our table and chart we can see that Puffins and Shags seem to have the most significant changes in breeding success over time, though one is rising and the other declining. Shags and Kittiwakes seem to be trending toward higher success over time, while Guillemots, Razorbills, Puffins and Fulmars all show a negative slope. The R^2 value shows how much variation in the data is explained by our basic linear models: Razorbills and Shags both have about 38% of their variation explained by this basic model, but overall we’ll need need a model that can account for more of the variation in our data if we want to dig deeper!

Now, we’ll use data from the UK Meteorological Office to assign a maximum temperature to each year of the study. Then we’ll see if we can infer any trends from that data.

Digging into climate:

#shags are on a different track then the other auks, so we'll exclude them from
#this analysis
sc <- success %>% 
  filter(species != "Shag")

#we'll use maximum june temperature as our climate indicator:
climate <- read.csv("clim_east_Scotland.csv")

#pulling out year & june max temp:
june <- climate %>% 
  filter(var == "tmax") %>% 
  select(Year, JUN) %>%
  mutate(Year = as.numeric(Year)) %>%
  filter(Year %in% sc$year)

#a function to find & assign temperature:
find_temp <- function(year) {
  temperature <- june[june$Year == year,2]
}

#moving rowwise to assign temperature by year:
sc <- sc %>%
  mutate(year = as.numeric(year)) %>%
  rowwise() %>%
  mutate(temp = ifelse(year %in% unique(june$Year),
                       find_temp(year),
                       NA)) %>%
  ungroup()

#and modify sc to exclude years with no temperature data, and a little renaming for clarity:
sc <- sc %>%
  filter(year >= 1986) %>%
  rename(bs = value)

#adding a normalized year (ie, 1,2,3 etc rather than 1985...)
sc <- sc %>%
  mutate(nyear = as.factor(year-1985))

#and scaling temperature so the mean is 0 and variation is represented by
#std deviations.
sc$temp <- scale(sc$temp, center = TRUE, scale = TRUE)

#let's visualize how temperature effects breeding success by species!
ggplot(sc, aes(x = temp, y = bs, color = species)) +
  geom_point() +
  facet_wrap(~species) +
  theme_bw() +
  scale_color_manual(values = birdpal) +
  theme(legend.position = "none") +
  labs(x= "temperature (standard deviations)", y="breeding success", title = "how temperature affects breeding success by species")

 ggplot(sc, aes(x=temp, y= bs, color = species)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = birdpal) +
  labs(x= "temperature (standard deviations)", y="breeding success", title = "how temperature affects breeding success by species")

These two charts show why it’s important for us to use a model that acknowledges that each species behaves differently. It’s easy to see from the second chart that we could draw a totally wrong conclusion about how auks as a whole are being affected by climate!

Our question is predicated on the idea that the maximum temperature has changed since 1986. Let’s quickly assess if the maximum temperature has changed significantly over the course of the study:

templine <- lm(temp ~ year, data = sc)
summary(templine)

FALSE 
FALSE Call:
FALSE lm(formula = temp ~ year, data = sc)
FALSE 
FALSE Residuals:
FALSE      Min       1Q   Median       3Q      Max 
FALSE -2.05074 -0.62552 -0.05853  0.65117  1.74435 
FALSE 
FALSE Coefficients:
FALSE               Estimate Std. Error t value Pr(>|t|)  
FALSE (Intercept) -45.654567  17.646441  -2.587   0.0106 *
FALSE year          0.022816   0.008819   2.587   0.0106 *
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 0.982 on 153 degrees of freedom
FALSE Multiple R-squared:  0.04192, Adjusted R-squared:  0.03565 
FALSE F-statistic: 6.694 on 1 and 153 DF,  p-value: 0.01061

ggplot(sc, aes(y = temp, x = year)) +
  geom_smooth(method = "lm", se = TRUE) +
  geom_point() +
  theme_bw()

FALSE `geom_smooth()` using formula 'y ~ x'

A basic linear regression shows that yes, we’re seeing a relatively significant trend toward higher maximum temperatures on the Isle of May from 1986 to 2016.

Now we want to build a model we can use to explain the trends we’re seeing using a random-intercept, mixed-model approach.

The question we want our model to answer is, “How does the breeding success of auks on the Isle of May respond to changes in temperature?” As we’ve seen, there are complicating factors in our data: the fact that data was gathered by species, and the fact that it’s grouped annually. Using random intercepts will allow us to generalize our data to answer that larger question without having to eliminate data, or risk bias from factors we can’t control for.

We’ll use the {lme4} package to construct our model, then summarize it.

Building the model:

#here's how to read our model from left to right:
#we're seeing how bs (breeding success) is explained by temperature -
#and randomizing intercepts based on year and species.

mixed2 <- lmer(bs ~ temp + (1|nyear) + (1|species), data = sc)
summary(mixed2)

FALSE Linear mixed model fit by REML ['lmerMod']
FALSE Formula: bs ~ temp + (1 | nyear) + (1 | species)
FALSE    Data: sc
FALSE 
FALSE REML criterion at convergence: -67.6
FALSE 
FALSE Scaled residuals: 
FALSE      Min       1Q   Median       3Q      Max 
FALSE -2.89596 -0.40043  0.05983  0.45225  3.14695 
FALSE 
FALSE Random effects:
FALSE  Groups   Name        Variance Std.Dev.
FALSE  nyear    (Intercept) 0.007048 0.08395 
FALSE  species  (Intercept) 0.018386 0.13560 
FALSE  Residual             0.027612 0.16617 
FALSE Number of obs: 153, groups:  nyear, 31; species, 5
FALSE 
FALSE Fixed effects:
FALSE             Estimate Std. Error t value
FALSE (Intercept) 0.606930   0.063920   9.495
FALSE temp        0.005903   0.020209   0.292
FALSE 
FALSE Correlation of Fixed Effects:
FALSE      (Intr)
FALSE temp 0.001

This summary gives us some valuable information. We’ve included two random effects, year and species. If we sum up the variance in the random effects section, we can divide by each value to see what percentage is is explained by each factor we randomized.

Year accounts for about 13.3% of of the variation we see in the data, and species accounts for about 34.6% - that leaves about 52.1% of the variation in the data that is explained by temperature or other factors we can’t detect.

We can also see under “fixed effects” (our explanatory variables) that the standard error is bigger than the temperature slope estimate. Thus, according to this model, we’re not seeing a significant positive trend in breeding success among the auk species studied, since our slope could likely be zero or negative. Unlike with other forms of statistical analysis, we’re not worried about significance or insignificance here - we’re just trying to create a model that accurately explains our data.

Now we’ll look into how we can measure the accuracy of our model, and visualize how well our model might predict future trends!

Predicting trends with our model:

plot(mixed2)

qqnorm(resid(mixed2))

# qqplots show how the data produced by a model compare to a theoretical normal distribution. A perfect qq plot would show a 45 degree line from lower left to upper right; our model's output adheres somewhat to the line. Our model is pretty good, but we see some deviance from normal around the extremes.

The “qqplot” above shows how the data produced by a model compare to a theoretical normal distribution. A perfect qqplot would show a 45 degree line from lower left to upper right; our model’s output adheres somewhat to the line. Our model is pretty good, but we see some deviance from normal around the extremes.

#using ggpredict to predict output of model
sp_both <- ggpredict(mixed2, terms = c("temp"))


ggplot(sp_both) +
  geom_line(aes(x=x, y=predicted)) + #slope line
  geom_ribbon(aes(x=x, ymin = predicted - std.error, ymax=predicted + std.error),
              fill = "lightgrey", alpha =.5) + #visualizing error
  geom_point(data = sc,
             aes(x=temp, y=bs, color=species)) +
  labs(x="temperature (scaled)", y = "breeding success", title = "temp explains some variance in auk breeding success") +
  scale_color_manual(values = birdpal) +
  theme_bw()

Here, we used the function ggpredict() with our model to show how it might respond to various inputs, and mapped it over our existing data. We can see there’s a lot of variation in breeding success that temperature alone doesn’t account for: while rising maximum temperatures could be a factor in changing breeding success in Isle of May auks, there are definitely many other factors at play.

Conclusions

By combining random effects with a standard linear regression, we were able to build a model that shows how breeding success in auks may change over time in response to temperature - while acknowledging factors like species that might complicate our statistical approach. Mixed models like these are useful across many disciplines that deal with messy, large scale data with complex structure and many contributing variables.

Acknowledgments:

This project was adapted from the Master Modeller challenge by the University of Edinburgh Coding Club.

The data were provided by the NERC Environmental Data Center.