Synthetic Controls

Assess treatment effects by comparing real history to an alternate reality without the treatment.

Published

22 May 2026, 15:23:44

Introduction

Imagine an alternate reality in which Russia’s large scale invasion of Ukraine never happened. Or where the British referendum vote had rejected Brexit. Would global migration patterns look different in these alternate realities?

In this exercise, we will learn how to develop Bayesian synthetic control models to infer migration under these alternative scenarios and compare them to observed migration to asses the effect of these world events.

Background

Synthetic control models are really just a fancy weighted average.

We have a timeseries of observations from our “treated unit” with observations before and after some intervention date. This could be an economic indicator before and after a policy change, for example. We want a model that can estimate what we would have observed in the post-intervention period if there had been no intervention. We do that by taking a weighted average of observations from untreated units in our “donor pool”. The “donors” could be countries where no policy change was introduced.

Estimating the weights is the real trick. We need to find weights for our weighted average that reproduce the pre-intervention observations in the treated country based on observations from the donor countries. These weights are often determined through optimisation algorithms, but we’re going to learn how to estimate them statistically in a Bayesian context. This has some advantages in terms of estimating uncertainty as well as flexibility to specify different models depending on the characteristics of our data.

Traditional synthetic controls

We can see a nice (non-Bayesian) synthetic controls model from the example analysis for the tidysynth R package (see here). The example is based on a study from Abadie et al. (2010) which evaluated the impact that a large-scale tobacco control program in California had on cigarette sales. The model uses cigaratte sales in “donor” states to estimate what the trend would have been in California without the tobacco law.

In the figure below, we can see that the synthetic control predictions in the pre-intervetion period look quite close to reality which gives us confidence that the weighted average is working well. The observed cigarette sales declined significantly compared to the synthetic control in the post-intervention period which indicates to us that the 1988 California law was successful in reducing tobacco sales.

Show the R code

library(tidysynth)
library(dplyr)

data("smoking")

smoking_out <-
  smoking %>%

  # initial the synthetic control object
  synthetic_control(
    outcome = cigsale, # outcome
    unit = state, # unit index in the panel data
    time = year, # time index in the panel data
    i_unit = "California", # unit where the intervention occurred
    i_time = 1988, # time period when the intervention occurred
    generate_placebos = T # generate placebo synthetic controls (for inference)
  ) %>%

  # Generate the aggregate predictors used to fit the weights

  # average log income, retail price of cigarettes, and proportion of the
  # population between 15 and 24 years of age from 1980 - 1988
  generate_predictor(
    time_window = 1980:1988,
    ln_income = mean(lnincome, na.rm = T),
    ret_price = mean(retprice, na.rm = T),
    youth = mean(age15to24, na.rm = T)
  ) %>%

  # average beer consumption in the donor pool from 1984 - 1988
  generate_predictor(
    time_window = 1984:1988,
    beer_sales = mean(beer, na.rm = T)
  ) %>%

  # Lagged cigarette sales
  generate_predictor(time_window = 1975, cigsale_1975 = cigsale) %>%
  generate_predictor(time_window = 1980, cigsale_1980 = cigsale) %>%
  generate_predictor(time_window = 1988, cigsale_1988 = cigsale) %>%

  # Generate the fitted weights for the synthetic control
  generate_weights(
    optimization_window = 1970:1988, # time to use in the optimization task
    margin_ipop = .02,
    sigf_ipop = 7,
    bound_ipop = 6 # optimizer options
  ) %>%

  # Generate the synthetic control
  generate_control()

# Plotting the result
smoking_out %>% plot_trends()

This example analysis used covariates (age and beer sales) along with lagged effects that we will not explore in our Bayesian synthetic controls, although they could be implemented as extensions to the models we will develop below.

A Bayesian analysis: Ukrainian refugees

Now let’s see if we can build our own Bayesian synthetic control model. Here we’ll look at how the Russian invasion of Ukraine on 22 February 2022 affected migration from Ukraine to Germany.

Note: You can easily change the countries and intervation date for the analysis if you want to investigate a different topic (e.g. maybe you want to look at effects of Brexit implementation on 1 January 2021 or some other policy change).

R environment

Before we get started, we have a few quick housekeeping items to get out of the way. In this tutorial, we’re going to need the following R packages:

library(cmdstanr)
library(bayesplot)
library(posterior)
library(tools)
library(tibble)
library(tidyr)
library(tidybayes)
library(dplyr)
library(ggplot2)

If you haven’t yet installed cmdstanr, please go back to the Getting Started tutorial.

If you don’t have any of the other R packages, you can install them with this code:

Show the R code

pkgs <- c(
  "tools",
  "bayesplot",
  "posterior",
  "tibble",
  "tidyr",
  "tidybayes",
  "dplyr",
  "ggplot2"
)

pkgs_missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(pkgs_missing) > 0) {
  install.packages(pkgs_missing)
}

Let’s also go ahead and setup our working directory and output directory:

outdir <- file.path(getwd(), "synthetic_controls")
dir.create(outdir, recursive = TRUE)

Raw data

We will be using Meta’s global migration dataset which contains monthly flows of migrants between countries estimated based on where people are when they connect to Facebook and Instagram (Chi et al. 2025). These data are feely available for download from Meta on the Humanitarian Data Exchange.

These data give us counts of migrants that moved from one country to another (including almost all countries globally) during each month from January 2019 to December 2022. A person needs to stay in a country for at least 12 months to be counted as a resident there. The data generating process accounted for differences in social media use to produce a relatively unbiased estimator of migration.

Now we’ll download the data into our working directory and load it into our R environment:

# filename to save the data
dat_filename <- file.path(outdir, "international_migration_flow.csv")

# download the data if we don't already have it
if (!file.exists(dat_filename)) {
  download.file(
    "https://data.humdata.org/dataset/e09595bd-f4c5-4a66-8130-5a05f14d5e64/resource/67b2d6fc-ff4f-4d04-a75e-80e3de81b072/download/international_migration_flow.csv",
    destfile = dat_filename
  )
}

# load the data into R
dat <- read.csv(dat_filename)

So you can get a better idea of what we’re working with, let’s take a quick look at the slice of data where Ukraine (UA) is the origin country:

Note: Countries are identified in the dataset based on their ISO-2 country codes which can be found on Wikipedia.

Poisson model

The basic Bayesian synthetic controls model that we will start with uses a Poisson likelihood since we are dealing with count data.

\[ \begin{aligned} y_t &\sim Poisson(\lambda_t) \\ \lambda_t &= e^{\alpha} \sum_{k=1}^K w_k x_{k,t} \\ \end{aligned} \]

where \(y_{t}\) are counts of migrants from Ukraine to Germany each month \(t\) and \(x_{k,t}\) are counts of migrants from each donor country \(k\) to Germany each month. The values of \(x_{k,t}\) have been re-scaled by dividing by the maximum value for each donor \(k\) so that values are comparable across countries and so that the values of \(x_{k,t}\) are always positive.

The weights \(w_k\) for each donor country are parameters that control how much each donor influences the synthetic control (i.e. estimate of alternate reality) for our treatment country. The weights are positive values and must sum to one across all donor countries. So the weighted average is defined as \(\sum_{k=1}^K w_k x_{k,t}\).

The parameter \(\alpha\) is essentially just a scaling factor that rescales our weighted average to match the scale of \(y_t\). For example, if \(\alpha=10\), then it would would scale a weighted average of 1 up to an outcome of 22,000.

We assign minimally informative priors to the parameters.

\[ \begin{aligned} w_k &\sim Dirichlet(rep(1, K)) \\ \alpha &\sim Normal(0, 10) \end{aligned} \]

We chose a \(Normal(0, 10)\) for the prior on \(\alpha\) because the weighted average will always be between zero and one, while the outcome \(y_t\) is on a scale of thousands of migrants.

We chose a Dirichlet prior because it is appropriate for a simplex–a vector of values between zero and one that must sum to one. Setting its parameter to a vector of ones results in a flat prior.

Stan code

We will now code this Poisson model in Stan.

data {
  int<lower=1> T; // Total number of months
  int<lower=1> T0; // End of pre-treatment period
  int<lower=1> K; // Number of donor countries
  array[T] int y; // Treated country migration counts (Ukraine)
  matrix[T, K] x; // Donor country migration counts/rates
}
parameters {
  real alpha;
  simplex[K] weights;
}
transformed parameters {
  vector[T] lambda;
  lambda = exp(alpha) * (x * weights);
}
model {
  // likelihood
  y[1 : T0] ~ poisson(lambda[1 : T0]);
  
  // priors
  alpha ~ normal(0, 10);
  weights ~ dirichlet(rep_vector(1, K));
}
generated quantities {
  array[T] int y_synthetic;
  array[T] int effect;
  
  for (t in 1 : T) {
    y_synthetic[t] = poisson_rng(lambda[t]);
    effect[t] = y[t] - y_synthetic[t];
  }
}

Notice that we are only including data in the likelihood for the pre-intervention period (i.e. [1:T0]). This is essential for the synthetic control model to represent pre-intervention expectations.

Note that we included a generated quantities block to produce posterior predictions across all time steps (i.e. y_synthetic[1:T]) which we will use as our synthetic control. These posterior predictions take into account the expectation from our scaled weighted average (\(\lambda_{k,t}\)) as well as the uncertainty (unexplained variance) associated with the Poisson distribution.

Save this Stan model code into a file named poisson.stan in the outdir that you defined above.

Important: If you don’t save the Stan model to this file, the code below will fail!

Select countries

In this example, we will look at migration out of Ukraine as a result of the large scale Russian invasion in February of 2022. Our goal is to build a synthetic estimate of migration without the war so that we can quantify the difference compared to observed migration. We can assess any destination country, but let’s start with Germany (DE).

# select countries
origin <- 'UA' # Ukraine
destination <- 'DE' # Germany

# identify last month before the intervention
last_pretreatment_month <- '2022-01'

With these selections, let’s take a look at the full list of donor countries (i.e. flows to Germany) that will be used to estimate our synthetic flows from Ukraine to Germany.

# identify "donor" countries for synthetic controls
donors <- dat %>%
  filter(
    country_to == destination,
    country_from != origin
  ) %>%
  distinct(country_from) %>%
  arrange() %>%
  pull()

print(donors)

  [1] "AD" "AE" "AF" "AL" "AM" "AO" "AR" "AT" "AU" "AZ" "BA" "BB" "BD" "BE" "BF"
 [16] "BG" "BH" "BI" "BJ" "BN" "BO" "BR" "BS" "BT" "BW" "BY" "BZ" "CA" "CD" "CF"
 [31] "CG" "CH" "CI" "CL" "CM" "CO" "CR" "CV" "CY" "CZ" "DJ" "DK" "DO" "DZ" "EC"
 [46] "EE" "EG" "ER" "ES" "ET" "FI" "FJ" "FM" "FR" "GA" "GB" "GD" "GE" "GH" "GM"
 [61] "GN" "GQ" "GR" "GT" "GW" "GY" "HK" "HN" "HR" "HT" "HU" "ID" "IE" "IL" "IN"
 [76] "IQ" "IS" "IT" "JM" "JO" "JP" "KE" "KG" "KH" "KI" "KM" "KR" "KW" "KZ" "LA"
 [91] "LB" "LC" "LK" "LR" "LS" "LT" "LU" "LV" "LY" "MA" "MD" "ME" "MG" "MK" "ML"
[106] "MM" "MN" "MO" "MR" "MT" "MU" "MV" "MW" "MX" "MY" "MZ" "NE" "NG" "NI" "NL"
[121] "NO" "NP" "NZ" "OM" "PA" "PE" "PG" "PH" "PK" "PL" "PT" "PY" "QA" "RO" "RS"
[136] "RU" "RW" "SA" "SB" "SD" "SE" "SG" "SI" "SK" "SL" "SN" "SR" "SS" "ST" "SV"
[151] "SY" "SZ" "TD" "TG" "TH" "TJ" "TL" "TM" "TN" "TO" "TR" "TT" "TW" "TZ" "UG"
[166] "US" "UY" "UZ" "VC" "VE" "VN" "VU" "WS" "XK" "YE" "ZA" "ZM" "ZW"

And, the months with data available:

# identify months
months <- dat %>%
  distinct(migration_month) %>%
  arrange() %>%
  pull()

# take a look
print(months)

 [1] "2019-01" "2019-02" "2019-03" "2019-04" "2019-05" "2019-06" "2019-07"
 [8] "2019-08" "2019-09" "2019-10" "2019-11" "2019-12" "2020-01" "2020-02"
[15] "2020-03" "2020-04" "2020-05" "2020-06" "2020-07" "2020-08" "2020-09"
[22] "2020-10" "2020-11" "2020-12" "2021-01" "2021-02" "2021-03" "2021-04"
[29] "2021-05" "2021-06" "2021-07" "2021-08" "2021-09" "2021-10" "2021-11"
[36] "2021-12" "2022-01" "2022-02" "2022-03" "2022-04" "2022-05" "2022-06"
[43] "2022-07" "2022-08" "2022-09" "2022-10" "2022-11" "2022-12"

Model data

We’ll extract the migration flows from Ukraine to Germany as a vector that we can use later:

# get treated (y) data: flows from origin to destination
y_treated <- dat %>%
  filter(country_from == origin, country_to == destination) %>%
  slice(match(months, migration_month)) %>%
  pull(num_migrants) %>%
  replace_na(0)

Let’s just plot that to get an idea of the change in migration from Ukraine to Germany after the Russian invasion.

Show the R code

plot_data <- data.frame(
  month = 1:length(y_treated),
  migration_count = y_treated
)

ggplot(plot_data, aes(x = month, y = migration_count)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_vline(
    xintercept = which(months == last_pretreatment_month),
    color = "firebrick",
    linetype = "dashed",
    linewidth = 0.8
  ) +
  annotate(
    "text",
    x = which(months == last_pretreatment_month),
    y = max(y_treated),
    label = "Russian invasion",
    hjust = 1.2,
    color = "firebrick"
  ) +
  theme_minimal() +
  labs(
    title = paste0(
      "Migration Counts Over Time (",
      origin,
      "->",
      destination,
      ")"
    ),
    x = "Month",
    y = "Migrants"
  )

And, we will also save the flows from donor countries to Germany as a matrix that we can use later:

# get donor (x) data: flows from other countries to destination
x_donors <- dat %>%
  filter(country_to == destination, country_from %in% donors) %>%
  pivot_wider(
    id_cols = migration_month,
    names_from = country_from,
    values_from = num_migrants,
    values_fill = 0
  ) %>%
  slice(match(months, migration_month)) %>%
  column_to_rownames("migration_month") %>%
  select(all_of(donors)) %>%
  as.matrix()

# take a look
head(x_donors[, 1:10])

        AD  AE  AF   AL AM AO  AR   AT  AU  AZ
2019-01  4 154 131  925 69 37 148 1052 403  99
2019-02  4 143 112 1005 63 42 223 1130 517 114
2019-03  1 230 165 1115 59 50 259 1270 557 181
2019-04 16 183 160 1025 90 31 247 1670 555 113
2019-05  0 160 154 1007 71 43 250 1185 476 119
2019-06 16 220 123  960 89 48 264 1184 475 131

Next, we want to scale our donor flows so that they are comparable to one another. This will improve stability estimating our model weights \(w_{k,t}\) because they will be more comparable across countries. We will simply rescale by dividing the count of migrants from each country by the maximum count from that country in the pre-treatment period.

Note: In this case, one criterion for scaling is that we want \(x_{k,t}\) to always be positive so that our linear predictor \(\lambda_t\) is always positive (refer back to the model, if needed).

# identify pre-treatment months
months_pre_treat <- months[months <= last_pretreatment_month]

# get the scaling factor
# (i.e. max of each donor in pre-treatment period)
scalar <- x_donors[months_pre_treat, ] %>%
  apply(2, max)

# scale donor data
x_donors_scaled <- x_donors %>%
  sweep(2, scalar, FUN = "/")

# take a look
head(round(x_donors_scaled[, 1:10], 3))

           AD    AE    AF    AL    AM   AO    AR    AT    AU    AZ
2019-01 0.250 0.670 0.055 0.510 0.473 0.74 0.364 0.400 0.422 0.406
2019-02 0.250 0.622 0.047 0.554 0.432 0.84 0.548 0.429 0.541 0.467
2019-03 0.062 1.000 0.069 0.614 0.404 1.00 0.636 0.483 0.583 0.742
2019-04 1.000 0.796 0.067 0.565 0.616 0.62 0.607 0.634 0.581 0.463
2019-05 0.000 0.696 0.064 0.555 0.486 0.86 0.614 0.450 0.498 0.488
2019-06 1.000 0.957 0.051 0.529 0.610 0.96 0.649 0.450 0.497 0.537

Lastly, we need to format the data for our model into a list object that is required by Stan.

# set seed for random number generators (important for reproducibility)
seed <- round(runif(1, 1, 1e6))
set.seed(seed)

# stan model data object
md <- list(
  T = length(months),
  T0 = which(row.names(x_donors) == last_pretreatment_month),
  K = length(donors),
  y = y_treated,
  x = x_donors_scaled,
  seed = seed
)

# save data to disk
saveRDS(
  object = md,
  file = file.path(outdir, "md.rds")
)

Markov chain Monte Carlo

Let’s actualy run MCMC to fit the model now!

It’s always good practice to create a function to generate random initial values for the MCMC chains.

# function to generate initials
init_generator <- function(md = md, chain_id = 1) {
  result <- list(
    alpha = rnorm(1, log(mean(md$y[1:md$T0])), 0.1),
    weights = rep(1 / md$K, md$K)
  )
  return(result)
}

# generate random initials for four chains
inits <- lapply(1:4, function(id) {
  init_generator(md = md, chain_id = id)
})

Now let’s compile and run the model:

# compile the stan model
mod <- cmdstan_model(file.path(outdir, "poisson.stan"))

# run MCMC
fit_po <- mod$sample(
  data = md, # model data (generated above)
  init = inits, # initials (generated above)
  parallel_chains = 4, # number of MCMC chains
  iter_warmup = 1e3, # warmup MCMC iterations to discard
  iter_sampling = 2e3, # MCMC iterations to keep
  seed = md$seed
)

# save model to disk
fit_po$save_object(file.path(outdir, "fit_poisson.rds"))

MCMC Diagnostics

Before looking at the substantive results, we need to do a quick sanity check just to confirm that the MCMC results are giving us reliable parameter estimates. Let’s take a look at a few traceplots to get a visual confirmation that chains are behaving well.

Show the R code

draws <- fit_po$draws()
pars <- c("alpha", paste0("weights[", sample(1:md$K, 11), "]"))

p <- mcmc_trace(draws, pars)
print(p)

These traceplots look great, but we need to check the r-hat statistic for each parameter to formally confirm that everything converged (i.e. rhat < 1.01). We can check this in the fit summary alongside our parameter estimates. Just sort this table by the rhat column to check for any unconverged parameters.

Show the R code

# get fit summary
fit_summary <- fit_po$summary()

# print summary to table
fit_summary[-1, ] %>%
  column_to_rownames('variable') %>%
  round(4) %>%
  datatable(
    options = list(
      pageLength = 25,
      scrollX = TRUE,
      scrollY = "400px",
      fixedHeader = TRUE
    ),
    filter = 'top',
    rownames = TRUE
  )

Results

Now that we have confidence that the MCMC chains are converged, we’ll plot the synthetic control results.

Show the R code

# extract the posterior predictions for y_synthetic
draws_df <- fit_po$draws(
  variables = "y_synthetic",
  format = "df"
)

# summarise posterior predictions (mean, lower, upper)
synth_draws <- draws_df %>%
  pivot_longer(
    cols = starts_with("y_synthetic"),
    names_to = "parameter",
    values_to = "value"
  ) %>%
  mutate(t = as.integer(gsub(".*\\[(\\d+)\\]", "\\1", parameter))) %>%
  group_by(t) %>%
  summarise(
    y_hat = mean(value),
    lower = quantile(value, 0.025),
    upper = quantile(value, 0.975),
    .groups = "drop"
  ) %>%
  mutate(date = months[t])

# observed (real) data
real_data <- data.frame(
  date = row.names(md$x),
  y_obs = md$y
)

# data for plot
plot_df <- left_join(synth_draws, real_data, by = "date") %>%
  mutate(date = as.Date(paste0(date, "-01"))) %>%
  mutate(diff = y_obs - y_hat)

treatment_date <- as.Date(paste0(months[md$T0], "-01"))

p <- ggplot(plot_df, aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue", alpha = 0.4) +
  geom_line(aes(y = y_hat, color = "Synthetic Control"), linetype = "dashed") +
  geom_line(aes(y = y_obs, color = "Actual Data")) +
  geom_vline(xintercept = treatment_date, color = "black") +
  scale_y_log10(labels = scales::comma) + # This makes the ribbon visible!
  labs(
    title = "Migration reality and synthetic control timeseries",
    subtitle = paste(origin, "to", destination),
  ) +
  ylab('Migration flows (log scale)') +
  theme_minimal()

print(p)

We can see that the synthetic control matches the actual data in the pre-intervention period which indicates good fit. We can also see that the actual data diverge significantly after the Russian invasion which indicates a massive effect on Ukrainian migration flows.

One problem is that the prediction intervals for the synthetic control appear to be too tight. In other words, the intervals often do not include the actual data in the pre-intervention period. Let’s take a closer look at this “coverage” issue. For 95% credible intervals, we expect the actual data to be inside the intervals 95% of the time.

Show the R code

# get prediction intervals
predictive_intervals <- fit_po$draws("y_synthetic", format = "df") %>%
  spread_draws(y_synthetic[t]) %>%
  median_qi(y_synthetic, .width = 0.95) %>% # Calculates 2.5% and 97.5% quantiles
  filter(t <= md$T0) # Only look at the pre-treatment period

# join with observed data
coverage_df <- predictive_intervals %>%
  mutate(observed = md$y[t]) %>%
  mutate(is_covered = observed >= .lower & observed <= .upper)

# calculate coverage
coverage_rate <- mean(coverage_df$is_covered) * 100

# visualise coverage
p <- ggplot(coverage_df, aes(x = t)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = y_synthetic), color = "blue", linetype = "dashed") +
  geom_point(aes(y = observed, color = is_covered)) +
  scale_color_manual(values = c("TRUE" = "black", "FALSE" = "red")) +
  labs(
    title = paste(
      "Interval Coverage Diagnostic (",
      round(coverage_rate, 1),
      "%)"
    ),
    subtitle = "Red points fall outside the 95% prediction interval in the pre-treatment period",
    x = "Month Index",
    y = "Migration Count"
  ) +
  theme_minimal()

print(p)

This confirms that we have undercoverage of our prediction intervals. Let’s try to revise the model using a negative binomial likelihood instead of a Poisson to add flexibility for wider intervals. Remember that the Poisson distribution does not include a variance parameter; The variance is always equal to the mean which is quite restrictive.

Overdispersion: Negative binomial likelihood

Our new negative binomial model will look like this:

\[ \begin{aligned} y_t &\sim NegativeBinomial(\mu_t, \phi) \\ \mu_t &= e^{\alpha} \sum_{k=1}^K w_k x_{k,t} \\ \\ w_k &\sim Dirichlet(rep(1, K)) \\ \alpha &\sim Normal(0, 10) \\ \phi &\sim Exponential(0.1) \end{aligned} \]

Notice that the negative binomial includes a precision parameter \(\phi\) which allows for the overdispersion (extra-poisson variance) that we need. We’ve used an exponential as a moderately uninformative prior on this precision parameter.

Stan model

data {
  int<lower=1> T; // Total number of months
  int<lower=1> T0; // End of pre-treatment period
  int<lower=1> K; // Number of donor countries
  array[T] int y; // Treated country migration counts (Ukraine)
  matrix[T, K] x; // Donor country migration counts/rates
}
parameters {
  real alpha;
  simplex[K] weights;
  real phi;
}
transformed parameters {
  vector[T] mu;
  
  mu = exp(alpha) .* (x * weights);
}
model {
  // likelihood
  y[1 : T0] ~ neg_binomial_2(mu[1 : T0], phi);
  
  // priors
  weights ~ dirichlet(rep_vector(1, K));
  
  alpha ~ normal(0, 10);
  phi ~ exponential(0.1);
}
generated quantities {
  array[T] int y_synthetic;
  array[T] int effect;
  
  for (t in 1 : T) {
    y_synthetic[t] = neg_binomial_2_rng(mu[t], phi);
    effect[t] = y[t] - y_synthetic[t];
  }
}

We used Stan’s neg_binomial_2 distribution as our likelihood because this parameterises the negative binomial using its mean (for comparison, see the standard parameterisation here).

Save this Stan model code into a file named negbin.stan in the outdir that you defined above.

Important: If you don’t save the Stan model to file, the code below will fail!

MCMC

We just need to update our initials generator function to include our new parameter phi, and we can re-run the model.

# function to generate initials
init_generator <- function(md = md, chain_id = 1) {
  result <- list(
    mu = rnorm(1, log(mean(md$y[1:md$T0])), 0.1),
    phi = runif(1, 5, 15),
    weights = rep(1 / md$K, md$K)
  )
  return(result)
}

# random initials for four chains
inits <- lapply(1:4, function(id) {
  init_generator(md = md, chain_id = id)
})

# compile the stan model
mod <- cmdstan_model(file.path(outdir, "negbin.stan"))

# run MCMC
fit_nb <- mod$sample(
  data = md, # model data (generated above)
  init = inits, # initials (generated above)
  parallel_chains = 4, # number of MCMC chains
  iter_warmup = 1e3, # warmup MCMC iterations to discard
  iter_sampling = 2e3, # MCMC iterations to keep
  seed = md$seed
)

# save model to disk
fit_nb$save_object(file.path(outdir, "fit_negbin.rds"))

Diagnostics

As always, we first need to confirm that the MCMC chains have converged and check for any warnings or errors from Stan when the model finishes running.

Show the R code

draws <- fit_nb$draws()
pars <- c("alpha", "phi", paste0("weights[", sample(1:md$K, 10), "]"))

p <- mcmc_trace(draws, pars)
print(p)

Traceplots look good… check.

Show the R code

# get fit summary
fit_summary <- fit_nb$summary()

# print summary to table
fit_summary[-1, ] %>%
  column_to_rownames('variable') %>%
  round(4) %>%
  datatable(
    options = list(
      pageLength = 25,
      scrollX = TRUE,
      scrollY = "400px",
      fixedHeader = TRUE
    ),
    filter = 'top',
    rownames = TRUE
  )

Our rhat statistics indicate good convergence (i.e. rhat < 1.01)… check.

Results

Let’s take a look out our results to see if we have better coverage of prediction intervals now with the negative binomial likelihood.

Show the R code

# extract the posterior predictions for y_synthetic
draws_df <- fit_nb$draws(
  variables = "y_synthetic",
  format = "df"
)

# summarise posterior predictions (mean, lower, upper)
synth_draws <- draws_df %>%
  pivot_longer(
    cols = starts_with("y_synthetic"),
    names_to = "parameter",
    values_to = "value"
  ) %>%
  mutate(t = as.integer(gsub(".*\\[(\\d+)\\]", "\\1", parameter))) %>%
  group_by(t) %>%
  summarise(
    y_hat = mean(value),
    lower = quantile(value, 0.025),
    upper = quantile(value, 0.975),
    .groups = "drop"
  ) %>%
  mutate(date = months[t])

# observed (real) data
real_data <- data.frame(
  date = row.names(md$x),
  y_obs = md$y
)

# data for plot
plot_df <- left_join(synth_draws, real_data, by = "date") %>%
  mutate(date = as.Date(paste0(date, "-01"))) %>%
  mutate(diff = y_obs - y_hat)

treatment_date <- as.Date(paste0(months[md$T0], "-01"))

p <- ggplot(plot_df, aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue", alpha = 0.4) +
  geom_line(aes(y = y_hat, color = "Synthetic Control"), linetype = "dashed") +
  geom_line(aes(y = y_obs, color = "Actual Data")) +
  geom_vline(xintercept = treatment_date, color = "black") +
  scale_y_log10(labels = scales::comma) +
  labs(
    title = "Migration reality and synthetic control timeseries",
    subtitle = paste(origin, "to", destination),
  ) +
  ylab('Migration flows (log scale)') +
  theme_minimal()

print(p)

It now appears that we have good coverage in addition to fairly good fit in the pre-intervention period and a significant effect in the post-intervention period. Let’s just take a closer look at the coverage to confirm.

Show the R code

# get prediction intervals
predictive_intervals <- fit_nb$draws("y_synthetic", format = "df") %>%
  spread_draws(y_synthetic[t]) %>%
  median_qi(y_synthetic, .width = 0.95) %>% # Calculates 2.5% and 97.5% quantiles
  filter(t <= md$T0) # Only look at the pre-treatment period

# join with observed data
coverage_df <- predictive_intervals %>%
  mutate(observed = md$y[t]) %>%
  mutate(is_covered = observed >= .lower & observed <= .upper)

# calculate coverage
coverage_rate <- mean(coverage_df$is_covered) * 100

# visualise coverage
p <- ggplot(coverage_df, aes(x = t)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = y_synthetic), color = "blue", linetype = "dashed") +
  geom_point(aes(y = observed, color = is_covered)) +
  scale_color_manual(values = c("TRUE" = "black", "FALSE" = "red")) +
  labs(
    title = paste(
      "Interval Coverage Diagnostic (",
      round(coverage_rate, 1),
      "%)"
    ),
    subtitle = "Red points fall outside the 95% prediction interval in the pre-treatment period",
    x = "Month Index",
    y = "Migration Count"
  ) +
  theme_minimal()

print(p)

We expect 95% of the observations to be within the 95% prediction intervals. In this case, a value of round(coverage_rate, 1)% is good. Remember, we only have 37 pre-intervention time steps, so getting a coverage of exactly 95% may not be possible.

Now that we have reliable estimates and prediction intervals, lets look at the difference in the number of migrants from reality compared to the synthetic control (also with prediction intervals).

Show the R code

total_diff <- plot_df %>%
  filter(t > which(months == last_pretreatment_month)) %>%
  pull(y_obs) %>%
  sum()

ggplot(plot_df, aes(x = date)) +
  geom_ribbon(
    aes(ymin = y_obs - upper, ymax = y_obs - lower),
    fill = "firebrick",
    alpha = 0.4
  ) +
  geom_line(aes(y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = treatment_date, color = "black") +
  labs(
    title = "Estimated Treatment Effect (Actual - Synthetic)",
    subtitle = paste(origin, "to", destination),
    y = "Number of migrants (a.k.a. refugees) above expectation",
    x = "Date"
  ) +
  theme_minimal()

Notice that the magnitude of the effect size is massive compared to the uncerainty intervals, indicating strong evidence of a treatment effect (which is obvious in this case).

Next, let’s take a look to see if individual donor countries were particularly heavily weighted in our weighted average or if they contribute more evenly. Remember, we included almost all countries in the world, so we would expect some with migration to Germany that better mirrors the usual migration patterns from Ukraine to Germany.

Show the R code

# How many donors to plot
top_x <- 15

# Summarize weights and map names
weights_summary <- fit_nb$summary("weights") %>%
  mutate(
    donor_index = 1:n(),
    donor_name = donors[donor_index]
  ) %>%
  arrange(desc(mean)) %>%
  mutate(cumulative_weight = cumsum(mean)) %>%
  slice_head(n = top_x)

# Plot the top donors with names
p <- ggplot(weights_summary, aes(x = reorder(donor_name, -mean), y = mean)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  geom_errorbar(
    aes(ymin = q5, ymax = q95),
    width = 0.2,
    color = "firebrick"
  ) +
  geom_text(aes(label = round(mean, 3)), vjust = -1.2, size = 3.5) +
  labs(
    title = paste(
      "Top",
      top_x,
      "donors for synthetic",
      origin,
      "->",
      destination
    ),
    x = "Donor Country",
    y = "Posterior Weight (Mean)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    panel.grid.major.x = element_blank()
  )

print(p)

We see here that Geoergia (GE), Afghanistan (AF), and Tajikistan (TJ) were the donors with the highest weights, but that all weights are very similar. Across 178 donor countries, a perfectly even distribution of weights would results in weights of 0.006.

Next, let’s encourage the model to drop some countries from the donor pool while emphasising those that provide the best fit using a concept known as “shrinkage” (i.e. shrinking their weights to zero).

Shrinkage: Select the most important donors

We begin with the exact same negative binomial model from above.

\[ \begin{aligned} y_t &\sim NegativeBinomial(\mu_t, \phi) \\ \mu_t &= e^{\alpha} \sum_{k=1}^K w_k x_{k,t} \\ \\ \alpha &\sim Normal(0, 10) \\ \phi &\sim Exponential(0.1) \end{aligned} \]

We will implement a “horseshoe prior” to shrink weights of the least important donors to zero. \[ \begin{aligned} \tilde{w}_k &= w^{raw}_k \tau \lambda_k \\ w_k &= \frac{\tilde{w}_k}{\sum_{j=1}^K \tilde{w}_j} \end{aligned} \\ \]

The horseshoe prior consists of three parts:

Raw weights \(w^{raw}_k\).
Global scale \(\tau\) which controls the overall “sparsity” of the model (i.e. the proportion of donors allowed to have an effect). This parameter pulls all weights towards zero by default.
Local scale \(\lambda_k\) which allows some donors to escape the pull towards zero. If a donor country is a great match for migration trends from Ukraine to Germany in the pre-intervention period, then their local scale will be large allowing the country to have a stronger weight.

We implement minimally informative priors for these parameters.

\[ \begin{aligned} \lambda_k &\sim Cauchy^+(0,1) \\ \tau &\sim Cauchy^+(0,1) \\ w^{raw}_k &\sim Normal(0,1) \end{aligned} \\ \]

Stan model

// bayesian synthetic controls
// negative binomial likelihood
// horseshoe shrinkage
data {
  int<lower=1> T; // Total number of months
  int<lower=1> T0; // End of pre-treatment period
  int<lower=1> K; // Number of donor countries
  array[T] int y; // Treated country migration counts (Ukraine)
  matrix[T, K] x; // Donor country migration counts/rates
}
parameters {
  real alpha;
  real phi;
  
  vector<lower=0>[K] weights_raw;
  vector<lower=0>[K] local_scales;
  real<lower=0> global_scale;
}
transformed parameters {
  vector[T] mu;
  vector<lower=0>[K] weights_shrunk;
  simplex[K] weights;
  
  weights_shrunk = weights_raw .* local_scales * global_scale;
  weights = weights_shrunk / sum(weights_shrunk);
  
  mu = exp(alpha) .* (x * weights);
}
model {
  // likelihood
  y[1 : T0] ~ neg_binomial_2(mu[1 : T0], phi);
  
  // priors
  alpha ~ normal(0, 10);
  phi ~ exponential(0.1);
  
  weights_raw ~ std_normal();
  local_scales ~ cauchy(0, 1);
  global_scale ~ cauchy(0, 1);
}
generated quantities {
  array[T] int y_synthetic;
  array[T] int effect;
  
  for (t in 1 : T) {
    y_synthetic[t] = neg_binomial_2_rng(mu[t], phi);
    effect[t] = y[t] - y_synthetic[t];
  }
}

Save this Stan model code into a file named shrinkage.stan in the outdir that you defined above.

Important: If you don’t save the Stan model to file, the code below will fail!

MCMC

Again, we just update our initials generating function to include the new parameters and then rerun the model.

# function to generate initials
init_generator <- function(md = md, chain_id = 1) {
  result <- list(
    mu = rnorm(1, log(mean(md$y[1:md$T0])), 0.1),
    phi = runif(1, 5, 15),
    weights_raw = rep(1, md$K),
    local_scales = rep(1, md$K),
    global_scale = 1
  )
  return(result)
}

# random initials for four chains
inits <- lapply(1:4, function(id) {
  init_generator(md = md, chain_id = id)
})

# compile the stan model
mod <- cmdstan_model(file.path(outdir, "shrinkage.stan"))

# run MCMC
fit_sh <- mod$sample(
  data = md, # model data (generated above)
  init = inits, # initials (generated above)
  parallel_chains = 4, # number of MCMC chains
  iter_warmup = 1e3, # warmup MCMC iterations to discard
  iter_sampling = 2e3, # MCMC iterations to keep
  seed = md$seed
)

# save model to disk
fit_sh$save_object(file.path(outdir, "fit_shrinkage.rds"))

Diagnostics

Show the R code

draws <- fit_sh$draws()
pars <- c("alpha", "phi", paste0("weights[", sample(1:md$K, 10), "]"))

p <- mcmc_trace(draws, pars)
print(p)

Traceplots look good… check.

Show the R code

# get fit summary
fit_summary <- fit_sh$summary()

# print summary to table
fit_summary[-1, ] %>%
  column_to_rownames('variable') %>%
  round(4) %>%
  datatable(
    options = list(
      pageLength = 25,
      scrollX = TRUE,
      scrollY = "400px",
      fixedHeader = TRUE
    ),
    filter = 'top',
    rownames = TRUE
  )

Our rhat statistics indicate good convergence (i.e. rhat < 1.01)… check.

Results

Now, the first thing we notice in our plot of the synthetic control versus reality is that the prediction intervals are tighter and the synthetic control is hugging more closely to the actual data in the pre-intervention period. This is because the model is now emphasising fewer closely matching countries in the weights.

Show the R code

# extract the posterior predictions for y_synthetic
draws_df <- fit_sh$draws(
  variables = "y_synthetic",
  format = "df"
)

# summarise posterior predictions (mean, lower, upper)
synth_draws <- draws_df %>%
  pivot_longer(
    cols = starts_with("y_synthetic"),
    names_to = "parameter",
    values_to = "value"
  ) %>%
  mutate(t = as.integer(gsub(".*\\[(\\d+)\\]", "\\1", parameter))) %>%
  group_by(t) %>%
  summarise(
    y_hat = mean(value),
    lower = quantile(value, 0.025),
    upper = quantile(value, 0.975),
    .groups = "drop"
  ) %>%
  mutate(date = months[t])

# observed (real) data
real_data <- data.frame(
  date = row.names(md$x),
  y_obs = md$y
)

# data for plot
plot_df <- left_join(synth_draws, real_data, by = "date") %>%
  mutate(date = as.Date(paste0(date, "-01"))) %>%
  mutate(diff = y_obs - y_hat)

treatment_date <- as.Date(paste0(months[md$T0], "-01"))

p <- ggplot(plot_df, aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue", alpha = 0.4) +
  geom_line(aes(y = y_hat, color = "Synthetic Control"), linetype = "dashed") +
  geom_line(aes(y = y_obs, color = "Actual Data")) +
  geom_vline(xintercept = treatment_date, color = "black") +
  scale_y_log10(labels = scales::comma) +
  labs(
    title = "Migration reality and synthetic control timeseries",
    subtitle = paste(origin, "to", destination),
  ) +
  ylab('Migration flows (log scale)') +
  theme_minimal()

print(p)

We see that we still have good coverage of prediction intervals, although it is worth noting that 100% coverage is higher than expected.

Show the R code

# get prediction intervals
predictive_intervals <- fit_sh$draws("y_synthetic", format = "df") %>%
  spread_draws(y_synthetic[t]) %>%
  median_qi(y_synthetic, .width = 0.95) %>% # Calculates 2.5% and 97.5% quantiles
  filter(t <= md$T0) # Only look at the pre-treatment period

# join with observed data
coverage_df <- predictive_intervals %>%
  mutate(observed = md$y[t]) %>%
  mutate(is_covered = observed >= .lower & observed <= .upper)

# calculate coverage
coverage_rate <- mean(coverage_df$is_covered) * 100

# visualise coverage
p <- ggplot(coverage_df, aes(x = t)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = y_synthetic), color = "blue", linetype = "dashed") +
  geom_point(aes(y = observed, color = is_covered)) +
  scale_color_manual(values = c("TRUE" = "black", "FALSE" = "red")) +
  labs(
    title = paste(
      "Interval Coverage Diagnostic (",
      round(coverage_rate, 1),
      "%)"
    ),
    subtitle = "Red points fall outside the 95% prediction interval in the pre-treatment period",
    x = "Month Index",
    y = "Migration Count"
  ) +
  theme_minimal()

print(p)

When we look at the contribution of the top 10 donors, we can see that now the predictions are heavily weighted towards Georgia (GE). This was our top donor in the negative binomial model, but nowhere near this degree of influence (i.e. weight of 0.56)!

Show the R code

# How many donors to plot
top_x <- 10

# Summarize weights and map names
weights_summary <- fit_sh$summary("weights") %>%
  mutate(
    donor_index = 1:n(),
    donor_name = donors[donor_index]
  ) %>%
  arrange(desc(mean)) %>%
  mutate(cumulative_weight = cumsum(mean)) %>%
  slice_head(n = top_x)

# Plot the top donors with names
p <- ggplot(weights_summary, aes(x = reorder(donor_name, -mean), y = mean)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  geom_errorbar(
    aes(ymin = q5, ymax = q95),
    width = 0.2,
    color = "firebrick"
  ) +
  geom_text(aes(label = round(mean, 3)), vjust = -1.2, size = 3.5) +
  labs(
    title = paste(
      "Top",
      top_x,
      "donors for synthetic",
      origin,
      "->",
      destination
    ),
    x = "Donor Country",
    y = "Posterior Weight (Mean)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    panel.grid.major.x = element_blank()
  )

print(p)

It is unusual for a single donor to have such a strong influence, but in this case the result holds across a number of different parameterisations aimed at penalising over-reliance on too few donors. In normal conditions, Georgian migration to Germany apparently strongly mirrors Ukrainian migration patterns to Germany.

Conclusions

We have now implemented three different Bayesian synthetic control models with some discussion about why each modification was needed. The Poisson likelihood had prediction intervals with inadequate coverage. The negative binomial likelihood had good coverage, but it didn’t seem to be giving enough weight to the best donors. The negative binomial with a horseshoe prior provided strong shrinkage that isolated the effects of the best donors. There are many extensions and modifications that could be made to this model. I hope that you have the basic tools that you would need to start exploring these modelling alternatives yourself.

Please feel free to go back to repeat the exercise for different destination countries. Or look at a completely different context. For example, Brexit was implemented on 1 January 2021. You could look for effects of that policy on migration trends to/from the United Kindgom (GB). Remember, the Meta migration dataset has global coverage, so you could investigate any bilateral migration trends that you find interesting! I have included the full analysis script for the negative binomial model with the horseshoe prior in the appendix for your convenience.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of Californiaś Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505. https://doi.org/10.1198/jasa.2009.ap08746.

Chi, Guanghua, Guy J Abel, Drew Johnston, Eugenia Giraudy, and Mike Bailey. 2025. “Measuring Global Migration Flows Using Online Data.” Proceedings of the National Academy of Sciences 122 (18): e2409418122. https://doi.org/10.1073/pnas.2409418122.

Appendix: Full analysis

Note: This analysis script assumes that you have saved the Stan code for the negative binomial model with a horseshoe prior

#---- R environment ----#

# install packages
pkgs <- c(
  "tools",
  "bayesplot",
  "posterior",
  "tibble",
  "tidyr",
  "tidybayes",
  "dplyr",
  "ggplot2"
)

pkgs_missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(pkgs_missing) > 0) {
  install.packages(pkgs_missing)

# load libraries
library(cmdstanr)
library(bayesplot)
library(posterior)
library(tools)
library(tibble)
library(tidyr)
library(tidybayes)
library(dplyr)
library(ggplot2)

# output directory
outdir <- file.path(getwd(), "synthetic_controls")
dir.create(outdir, recursive = TRUE)

#---- Raw data ----#

# filename to save the data
dat_filename <- file.path(outdir, "international_migration_flow.csv")

# download the data if we don't already have it
if (!file.exists(dat_filename)) {
  download.file(
    "https://data.humdata.org/dataset/e09595bd-f4c5-4a66-8130-5a05f14d5e64/resource/67b2d6fc-ff4f-4d04-a75e-80e3de81b072/download/international_migration_flow.csv",
    destfile = dat_filename
  )
}

# load the data into R
dat <- read.csv(dat_filename)


#---- Model data ----#

# get treated (y) data: flows from origin to destination
y_treated <- dat %>%
  filter(country_from == origin, country_to == destination) %>%
  slice(match(months, migration_month)) %>%
  pull(num_migrants) %>%
  replace_na(0)

# get donor (x) data: flows from other countries to destination
x_donors <- dat %>%
  filter(country_to == destination, country_from %in% donors) %>%
  pivot_wider(
    id_cols = migration_month,
    names_from = country_from,
    values_from = num_migrants,
    values_fill = 0
  ) %>%
  slice(match(months, migration_month)) %>%
  column_to_rownames("migration_month") %>%
  select(all_of(donors)) %>%
  as.matrix()

# identify pre-treatment months
months_pre_treat <- months[months <= last_pretreatment_month]

# get the scaling factor
# (i.e. max of each donor in pre-treatment period)
scalar <- x_donors[months_pre_treat, ] %>%
  apply(2, max)

# scale donor data
x_donors_scaled <- x_donors %>%
  sweep(2, scalar, FUN = "/")

# set seed for random number generators (important for reproducibility)
seed <- round(runif(1, 1, 1e6))
set.seed(seed)

# stan model data object
md <- list(
  T = length(months),
  T0 = which(row.names(x_donors) == last_pretreatment_month),
  K = length(donors),
  y = y_treated,
  x = x_donors_scaled,
  seed = seed
)

# save data to disk
saveRDS(
  object = md,
  file = file.path(outdir, "md.rds")
)

#---- Stan model ----#

stan_code <- "
// bayesian synthetic controls
// negative binomial likelihood
// horseshoe shrinkage
data {
  int<lower=1> T; // Total number of months
  int<lower=1> T0; // End of pre-treatment period
  int<lower=1> K; // Number of donor countries
  array[T] int y; // Treated country migration counts (Ukraine)
  matrix[T, K] x; // Donor country migration counts/rates
}
parameters {
  real alpha;
  real phi;
  
  vector<lower=0>[K] weights_raw;
  vector<lower=0>[K] local_scales;
  real<lower=0> global_scale;
}
transformed parameters {
  vector[T] mu;
  vector<lower=0>[K] weights_shrunk;
  simplex[K] weights;
  
  weights_shrunk = weights_raw .* local_scales * global_scale;
  weights = weights_shrunk / sum(weights_shrunk);
  
  mu = exp(alpha) .* (x * weights);
}
model {
  // likelihood
  y[1 : T0] ~ neg_binomial_2(mu[1 : T0], phi);
  
  // priors
  alpha ~ normal(0, 10);
  phi ~ exponential(0.1);
  
  weights_raw ~ std_normal();
  local_scales ~ cauchy(0, 1);
  global_scale ~ cauchy(0, 1);
}
generated quantities {
  array[T] int y_synthetic;
  array[T] int effect;
  
  for (t in 1 : T) {
    y_synthetic[t] = neg_binomial_2_rng(mu[t], phi);
    effect[t] = y[t] - y_synthetic[t];
  }
}

"
writeLines(stan_code, con = file.path(outdir, "shrinkage.stan"))


#---- MCMC ----# 

# function to generate initials
init_generator <- function(md = md, chain_id = 1) {
  result <- list(
    mu = rnorm(1, log(mean(md$y[1:md$T0])), 0.1),
    phi = runif(1, 5, 15),
    weights_raw = rep(1, md$K),
    local_scales = rep(1, md$K),
    global_scale = 1
  )
  return(result)
}

# random initials for four chains
inits <- lapply(1:4, function(id) {
  init_generator(md = md, chain_id = id)
})

# compile the stan model
mod <- cmdstan_model(file.path(outdir, "shrinkage.stan"))

# run MCMC
fit_sh <- mod$sample(
  data = md, # model data (generated above)
  init = inits, # initials (generated above)
  parallel_chains = 4, # number of MCMC chains
  iter_warmup = 1e3, # warmup MCMC iterations to discard
  iter_sampling = 2e3, # MCMC iterations to keep
  seed = md$seed
)

# save model to disk
fit_sh$save_object(file.path(outdir, "fit_shrinkage.rds"))


#---- Diagnostics ----# 

# get and save fit summary
fit_summary <- fit_sh$summary()

write.csv(
  fit_summary,
  file.path(
    outdir,
    "fit_summary.csv"
  )
)

# posterior parameter estimates
print(fit_summary)


# check convergence for all parameters
# rhat < 1.01 means the chains converged (< 1.1 is okay for testing purposes)
fit_summary %>%
  filter(rhat > 1.01) %>%
  select(variable, rhat)

# traceplots
draws <- fit_sh$draws()
params <- fit_sh$metadata()$variables

pdf(
  file.path(
    outdir,
    "traceplots.pdf"
  ),
  width = 11,
  height = 8.5
)

plots_per_page <- 12
for (i in seq(1, length(params), by = plots_per_page)) {
  current_batch <- params[i:min(i + 11, length(params))]
  p <- mcmc_trace(draws, pars = current_batch) +
    facet_wrap(~parameter, ncol = 3, nrow = 4, scales = "free_y") +
    ggtitle(paste("Parameters", i, "to", min(i + 11, length(params))))
  print(p)
}
dev.off()

#---- assess coverage ----#

# get prediction intervals
predictive_intervals <- fit_sh$draws("y_synthetic", format = "df") %>%
  spread_draws(y_synthetic[t]) %>%
  median_qi(y_synthetic, .width = 0.95) %>% # Calculates 2.5% and 97.5% quantiles
  filter(t <= md$T0) # Only look at the pre-treatment period

# join with observed data
coverage_df <- predictive_intervals %>%
  mutate(observed = md$y[t]) %>%
  mutate(is_covered = observed >= .lower & observed <= .upper)

# calculate coverage
coverage_rate <- mean(coverage_df$is_covered) * 100

print(paste0("Pre-treatment Coverage: ", round(coverage_rate, 2), "%"))

# visualise coverage
pdf(
  file.path(
    outdir,
    "coverage_plot.pdf"
  ),
  width = 11,
  height = 8.5
)

ggplot(coverage_df, aes(x = t)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = y_synthetic), color = "blue", linetype = "dashed") +
  geom_point(aes(y = observed, color = is_covered)) +
  scale_color_manual(values = c("TRUE" = "black", "FALSE" = "red")) +
  labs(
    title = paste(
      "Interval Coverage Diagnostic (",
      round(coverage_rate, 1),
      "%)"
    ),
    subtitle = "Red points fall outside the 95% prediction interval in the pre-treatment period",
    x = "Month Index",
    y = "Migration Count"
  ) +
  theme_minimal()

dev.off()


#---- synthetic controls plot ----#

# extract the posterior predictions for y_synthetic
draws_df <- fit_sh$draws(
  variables = "y_synthetic",
  format = "df"
)

# summarise posterior predictions (mean, lower, upper)
synth_draws <- draws_df %>%
  pivot_longer(
    cols = starts_with("y_synthetic"),
    names_to = "parameter",
    values_to = "value"
  ) %>%
  mutate(t = as.integer(gsub(".*\\[(\\d+)\\]", "\\1", parameter))) %>%
  group_by(t) %>%
  summarise(
    y_hat = mean(value),
    lower = quantile(value, 0.025),
    upper = quantile(value, 0.975),
    .groups = "drop"
  ) %>%
  mutate(date = months[t])

# observed (real) data
real_data <- data.frame(
  date = row.names(md$x),
  y_obs = md$y
)

# data for plot
plot_df <- left_join(synth_draws, real_data, by = "date") %>%
  mutate(date = as.Date(paste0(date, "-01"))) %>%
  mutate(diff = y_obs - y_hat)

treatment_date <- as.Date(paste0(months[md$T0], "-01"))


## plot reality and synthetic control

pdf(
  file.path(
    outdir,
    "synthetic_controls_plot.pdf"
  ),
  width = 11,
  height = 8.5
)

ggplot(plot_df, aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue", alpha = 0.4) +
  geom_line(aes(y = y_hat, color = "Synthetic Control"), linetype = "dashed") +
  geom_line(aes(y = y_obs, color = "Actual Data")) +
  geom_vline(xintercept = treatment_date, color = "black") +
  scale_y_log10(labels = scales::comma) + # This makes the ribbon visible!
  labs(
    title = "Migration reality and synthetic control timeseries",
    subtitle = paste(origin, "to", destination),
  ) +
  ylab('Migration flows (log scale)') +
  theme_minimal()

dev.off()


## plot difference
pdf(
  file.path(
    outdir,
    "treatment_effect_plot.pdf"
  ),
  width = 11,
  height = 8.5
)

ggplot(plot_df, aes(x = date)) +
  geom_ribbon(
    aes(ymin = y_obs - upper, ymax = y_obs - lower),
    fill = "firebrick",
    alpha = 0.4
  ) +
  geom_line(aes(y = diff)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = treatment_date, color = "black") +
  labs(
    title = "Estimated Treatment Effect (Actual - Synthetic)",
    subtitle = paste(origin, "to", destination),
  ) +
  theme_minimal()

dev.off()

#---- check shrinkage (i.e. dominant donors) ----#

# Set your threshold
top_x <- 10

# Summarize weights and map names
weights_summary <- fit_sh$summary("weights") %>%
  mutate(
    donor_index = 1:n(),
    # Map the index to the name in the 'donors' object
    donor_name = donors[donor_index]
  ) %>%
  arrange(desc(mean)) %>%
  mutate(cumulative_weight = cumsum(mean)) %>%
  slice_head(n = top_x)

# Plot the top donors with names
pdf(
  file.path(
    outdir,
    "top_donors_plot.pdf"
  ),
  width = 11,
  height = 8.5
)

p <- ggplot(weights_summary, aes(x = reorder(donor_name, -mean), y = mean)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  geom_errorbar(
    aes(ymin = q5, ymax = q95),
    width = 0.2,
    color = "firebrick"
  ) +
  geom_text(aes(label = round(mean, 3)), vjust = -1.2, size = 3.5) +
  labs(
    title = paste(
      "Top",
      top_x,
      "donors for synthetic",
      origin,
      "->",
      destination
    ),
    subtitle = paste0(
      paste0("Total weight of top ", top_x, " donors: "),
      round(max(weights_summary$cumulative_weight) * 100, 1),
      "%"
    ),
    x = NULL, # Remove axis label as names are self-explanatory
    y = "Posterior Weight (Mean)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    panel.grid.major.x = element_blank()
  )

print(p)

dev.off()