Timeseries using fable


Modeling COVID-era pediatric mental health emergencies







Will Simmons
7 September 2023
WCM Biostatistics Computing Club

What is timeseries data?


  • Data with a time component

  • Can vary by frequency (e.g., every year, every minute)

  • Examples
    • Global rainfall by month
    • NYC electricity usage by day of week
    • Hourly temperature in your apartment

Timeseries in R

The tidyverts

Based on tsibble objects


  • Similar to a tibble or data.frame, but optimized for timeseries data

  • Primary timeseries components in a tsibble:

    • Index: unit of time measured
    • Key: (optional) grouping variable(s)
    • Measured variables: relevant data measured over time

Based on tsibble objects


Example: Yearly population counts across world countries


tsibbledata::global_economy |> 
  select(
    Country,    # Key
    Year,       # Index
    Population  # Measured var.
  ) |> 
  head()
# A tsibble: 6 x 3 [1Y]
# Key:       Country [1]
  Country      Year Population
  <fct>       <dbl>      <dbl>
1 Afghanistan  1960    8996351
2 Afghanistan  1961    9166764
3 Afghanistan  1962    9345868
4 Afghanistan  1963    9533954
5 Afghanistan  1964    9731361
6 Afghanistan  1965    9938414

fable: tidy timeseries!


  • Fit multiple timeseries models simultaneously

  • Explore, diagnose, plot, forecast

  • Huge upgrade from the forecast package

Motivating example

COVID-era pediatric mental health emergencies in NYC


  • Kids are brought to the ED for a number of reasons

  • Mental health (MH) increasingly one of those reasons

  • MH-related pediatric ED visits seasonal and increasing

  • Note: results are in-press, so please don’t share!

Peds MH ED visit rate data1


# Simulated representative data
ts_weekly_simulated |> 
  as_tsibble(index = week) |> 
  head()
# A tsibble: 6 x 3 [1W]
      week   rate covid_wave
    <week>  <dbl>      <dbl>
1 2016 W01 0.0509          0
2 2016 W02 0.0502          0
3 2016 W03 0.0624          0
4 2016 W04 0.0437          0
5 2016 W05 0.0504          0
6 2016 W06 0.0516          0


  • Index: week
  • Key: none (i.e., no grouping other than week)
  • Measured variables: MH ED visit rate2, COVID wave3

Peds MH ED visit rate data

Let’s plot the rate data, smoothed with a rolling average

code
rate_plot <- ts_weekly |> 
  mutate(
    roll_rate = slider::slide_index_dbl(
      .x = rate,
      .i = week,
      .f = mean,
      .before = 2,
      .after = 2
    )
  ) |> 
  ggplot(aes(x = week)) +
  geom_line(aes(y = rate, color = "Raw weekly rate")) +
  geom_line(aes(y = roll_rate, color = "4-week smoothed rate")) +
  geom_segment(aes(color = "COVID",
                   x = ymd("2020-03-01"), xend = ymd("2020-03-01"),
                   y = 0, yend = 0.08)) +
  scale_color_manual(
    name = NULL,
    breaks = c("Raw weekly rate", "4-week smoothed rate", "COVID"),
    values = c("lightgray", "black", "#2c8c99")
  ) +
  my_plot_styling()

rate_plot

Peds MH ED visit rate data

Let’s plot the rate data, smoothed with a rolling average


  • Seasonality: peaks April/November, troughs January/July


  • Trend fairly flat over time


  • Changes post-COVID (blue line)?

We can use timeseries methods and fable to estimate

ARIMA models

Specific class of timeseries models


AR:

I:

MA:

autoregressive (lagged observations as inputs)

integrated (differencing to make series stationary)

moving average (lagged errors as inputs)


Seasonal and multivariable (“dynamic”) extensions exist, which is what we’ll use

Fitting an ARIMA model with our data


Research question: What would pediatric MH ED rates have looked like had COVID not happened?


Process:

  • Fit ARIMA model on baseline (pre-COVID) data
  • Forecast baseline model into the COVID time period
    • “Counterfactual” forecast ignoring COVID
  • Compare counterfactual forecast with reality

1. Fit ARIMA model on baseline data


baseline_data <- ts_weekly |>
  filter(week < yearweek("2020 W10"))

baseline_fit <- baseline_data |>
  model(
    arima = ARIMA(rate ~ 1 + pdq(2, 0, 0) + PDQ(0, 1, 1))
  )

baseline_fit
# A mable: 1 x 1
                               arima
                             <model>
1 <ARIMA(2,0,0)(0,1,1)[52] w/ drift>

2. Forecast baseline model

Code
fc <- baseline_fit |>
  forecast(
    h = nrow(ts_weekly |>
               filter(week >= yearweek("2020 W10")))  # COVID period
  )

fc |>
  autoplot(ts_weekly |> filter(week < yearweek("2020 W10"))) +
  my_plot_styling()

3. Compare counterfactual forecast with reality

By COVID wave

Effect estimates


  • Forecasts from baseline data are useful for visualization

  • However, we didn’t get statistical estimates of excess rates
    • i.e., difference between observed and actual

  • We can get these by fitting a model on all data, using COVID waves as covariates

Full ARIMA model

Note: rate multiplied by 10,000 for interpretability


full_fit <- ts_weekly |>
  model(
    arima = ARIMA(rate ~ 1 + pdq(2, 0, 0) + PDQ(0, 1, 1) + covid_wave)
  )

full_fit |> my_cleaning()
# A tibble: 5 × 4
  wave    est `2.5 %` `97.5 %`
  <chr> <dbl>   <dbl>    <dbl>
1 wave1  70.0   13.2      127.
2 wave2 141.    97.9      185.
3 wave3  58.8    1.06     117.
4 wave4 108.    35.7      180.
5 wave5  39.2  -33.0      111.

What did we learn?


From our analysis,

  • NYC pediatric mental health is in crisis, and COVID may have exacerbated it


In general,

  • Timeseries methods allow us to answer interesting, important questions using data over time

  • The tidyverts suite of packages can help us along the way

thanks

questions? (I’m still learning!)

Citations and further reading