Generalization Bounds for Causal Regression
In causal regression, we seek predictions that estimate the effects of interventions. In particular, we reason about what would happen, rather than what happened. This is fundamentally hard, because we never observe what would have happened to a given individual under an intervention that did not take place.
Many algorithms have been proposed to predict causal quantities such as potential outcomes and treatment effects, but they come with little supporting theory. Moreover, since the true counterfactuals are unavailable in observational data, faithful benchmarking is impossible – so we really need theory here!
In this work, we establish fundamental and general results that yield strong guarantees for modern causal prediction methods. Our techniques can be easily adapted to new algorithms, naturally adjusting to their intricacies.
Our main result relates the unobservable causal quantity to its observable analogue, bounding their gap by the variance of the treatment propensities over the population (taking into account potentially unobserved confounders):
\[ \begin{aligned} \underbrace{\mathbb{E}[\mathrm{Loss}]}_\text{unobservable...} \!\!\!\!\leq \underbrace{\mathbb{E}[\mathrm{Loss} | T=a]}_\text{observable!} + \lambda \underbrace{\mathbb{E}\left[\left( w(X) \frac{\mathbb{P}[T=a | X,U]}{\mathbb{P}[T=a]} - 1 \right)^2\right]}_{\Delta = \text{variance of the treatment propensities}} + \frac{1}{4\lambda\sigma^2}, \quad \text{for any}\ \lambda > 0. \end{aligned} \]
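To make the role of \(\lambda\) concrete, here is a minimal numeric sketch (the helper name and interface are ours, not from the paper) that evaluates the right-hand side of the bound. When no \(\lambda\) is supplied, it uses the minimizer \(\lambda^* = 1/(2\sigma\sqrt{\Delta})\), which collapses the last two terms to \(\sqrt{\Delta}/\sigma\):

```python
import numpy as np

def causal_bound(observable_loss, delta, sigma, lam=None):
    """Evaluate the right-hand side of the bound
        E[Loss] <= E[Loss | T=a] + lam * Delta + 1 / (4 * lam * sigma^2).

    If lam is None, use the minimizer lam* = 1 / (2 * sigma * sqrt(Delta)),
    for which the last two terms sum to sqrt(Delta) / sigma.
    (Illustrative helper; name and signature are ours, not the paper's.)
    """
    if lam is None:
        lam = 1.0 / (2.0 * sigma * np.sqrt(delta))
    return observable_loss + lam * delta + 1.0 / (4.0 * lam * sigma**2)
```

Any \(\lambda > 0\) gives a valid bound, so the minimized value is the tightest one this family offers.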
Moreover, we show that this variance term can be – somewhat surprisingly – upper bounded by a quantity that can be estimated in practice regardless of any unobserved confounders, as a function of the quality of a treatment assignment classification model \(\nu\) (also known as a propensity scoring model):
\[ \begin{align*} \Delta \leq \frac{2}{\mathbb{P}[T=a]^2} \cdot \left( \underbrace{\mathbb{E}\left[w^2(X) \cdot \left( \nu(X) - \mathbb{1}[T=a] \right)^2\right]}_\text{reweighted Brier Score of the model $\nu$} + \underbrace{\mathbb{E}\left[\left( w(X) \mathbb{1}[T=a] - \mathbb{P}[T=a] \right)^2\right]}_\text{propensity balancing term} \right). \end{align*} \]
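Since every term on this right-hand side is an expectation over observables, it can be estimated by plain sample averages. A minimal plug-in sketch (the function name and estimator choices are ours, assumed rather than taken from the paper):

```python
import numpy as np

def delta_upper_bound(nu_x, w_x, t, a):
    """Plug-in estimate of the upper bound on Delta:
        (2 / P[T=a]^2) * (reweighted Brier score + propensity balancing term).

    nu_x : propensity-model outputs nu(X_i) in [0, 1]
    w_x  : weights w(X_i)
    t    : observed treatments T_i
    a    : treatment value of interest
    (Illustrative sketch; estimator names are ours, not from the paper.)
    """
    ind = (t == a).astype(float)                  # indicator 1[T = a]
    p_a = ind.mean()                              # estimate of P[T = a]
    brier = np.mean(w_x**2 * (nu_x - ind)**2)     # reweighted Brier score of nu
    balance = np.mean((w_x * ind - p_a)**2)       # propensity balancing term
    return 2.0 / p_a**2 * (brier + balance)
```

A better propensity model \(\nu\) shrinks the Brier term, directly tightening the bound.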
With both of these inequalities in hand, we can connect them with generalization bounds for base algorithms. We give an example in the paper with generalization bounds based on the Rademacher complexity, but the same approach works for pretty much any existing generalization bound (e.g., those based on VC dimension, algorithmic stability, PAC-Bayes, etc.).
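As an illustration of that composition (our notation, a hedged sketch rather than the paper's exact statement): if a standard Rademacher-complexity bound gives, with probability at least \(1-\delta\),
\[ \mathbb{E}[\mathrm{Loss} \mid T=a] \leq \hat{L}_n + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\tfrac{\log(1/\delta)}{2n}}, \]
then substituting into the main inequality yields, for any \(\lambda > 0\),
\[ \mathbb{E}[\mathrm{Loss}] \leq \hat{L}_n + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\tfrac{\log(1/\delta)}{2n}} + \lambda\Delta + \frac{1}{4\lambda\sigma^2}, \]
where \(\hat{L}_n\) is the empirical loss on the treated sample and \(\mathfrak{R}_n(\mathcal{F})\) the Rademacher complexity of the hypothesis class.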
We also prove analogous results for treatment effect estimation for a wide family of models (T-learners, S-learners, and X-learners), which extend naturally to even more settings.
Our bounds are remarkably tight, vastly outperforming the closest matching bounds previously available in the literature, which are several orders of magnitude looser (!).


Full details in the paper! You can find it here and on arXiv.