If you have two probability distributions \(P\) and \(Q\) over the same sample space \(\Omega\), it is often useful to have some notion of “distance” between them. There are many candidates, but one appealing option is the Wasserstein distance, \(W(P,Q)\).
There are two equivalent, commonly-used definitions of the Wasserstein distance. The more common one comes from optimal transport theory, where the Wasserstein distance is, somewhat intuitively, defined as “the minimum amount of effort necessary to transport one distribution into the other”. That said, we will opt here for the alternative definition, often termed the representation of the Wasserstein distance as an Integral Probability Metric (IPM), or simply its dual representation. It is defined as follows:
\[W(P,Q) ≔ \sup\limits_{\left\| \phi \right\|_{\text{Lip }} \leq 1}\left| {{\mathbb{E}}_{X \sim P}\left\lbrack \phi(X) \right\rbrack - {\mathbb{E}}_{Y \sim Q}\left\lbrack \phi(Y) \right\rbrack} \right|,\]
where \(\left\| \phi \right\|_{\text{Lip}}\) is the Lipschitz constant of \(\phi\). (Strictly speaking, this is the Wasserstein distance of order 1, often written \(W_1\); we drop the subscript throughout.) That the two definitions are indeed equivalent is a famous result known as the Kantorovich-Rubinstein theorem.
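As a quick sanity check of this definition, consider two point masses \(\delta_{a}\) and \(\delta_{b}\): any 1-Lipschitz \(\phi\) satisfies \(\left| {\phi(a) - \phi(b)} \right| \leq \left\| {a - b} \right\|\), and \(\phi(x) = \left\| {x - b} \right\|\) attains this, so \[W\left( \delta_{a},\delta_{b} \right) = \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}\left| {\phi(a) - \phi(b)} \right| = \left\| {a - b} \right\|,\] which is exactly the cost of transporting a unit of mass from \(a\) to \(b\).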
The Wasserstein distance is interesting for a number of reasons. For one, it is a proper metric between probability distributions (it is non-negative, zero only if the two distributions are equal, symmetric, and satisfies the triangle inequality), and it places no restrictions on the two distributions (e.g., one can be discrete and the other continuous). But more than that, it also gives a very handy change-of-measure inequality:
Proposition (Wasserstein Change-of-Measure). For any two probability distributions \(P\) and \(Q\) over \(\Omega\), and for any Lipschitz \(f:\Omega \rightarrow {\mathbb{R}},\) it holds that: \[\left| {{\mathbb{E}}_{X \sim P}\left\lbrack f(X) \right\rbrack - {\mathbb{E}}_{Y \sim Q}\left\lbrack f(Y) \right\rbrack} \right| \leq \left\| f \right\|_{\text{Lip}}W(P,Q).\]
Proof: This follows immediately from the definition of the Wasserstein distance as an IPM: apply it to \(f/\left\| f \right\|_{\text{Lip}}\), which is 1-Lipschitz, and multiply both sides by \(\left\| f \right\|_{\text{Lip}}\). □
In particular, note that this means that for any \(P\), \(Q\), and Lipschitz \(f\), it must hold that
\[{\mathbb{E}}_{X \sim P}\left\lbrack f(X) \right\rbrack \leq {\mathbb{E}}_{Y \sim Q}\left\lbrack f(Y) \right\rbrack + \left\| f \right\|_{\text{Lip }}W(P,Q).\]
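For a concrete instance, take \(\Omega \subseteq {\mathbb{R}}^{d}\) with the Euclidean norm and \(f(x) = \langle w,x \rangle\), whose Lipschitz constant is \(\left\| w \right\|\). The change-of-measure proposition then gives \[\left| \left\langle w,\;{\mathbb{E}}_{X \sim P}\left\lbrack X \right\rbrack - {\mathbb{E}}_{Y \sim Q}\left\lbrack Y \right\rbrack \right\rangle \right| \leq \left\| w \right\|\,W(P,Q),\] and taking the supremum over \(\left\| w \right\| \leq 1\) yields \(\left\| {{\mathbb{E}}_{X \sim P}\left\lbrack X \right\rbrack - {\mathbb{E}}_{Y \sim Q}\left\lbrack Y \right\rbrack} \right\| \leq W(P,Q)\): the means of two distributions can never be further apart than their Wasserstein distance.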
This handy fact has been used, e.g., in [Johansson et al., 2020] to produce generalization bounds for causal machine learning algorithms, and more recently in my paper [Csillag et al., 2025] to make certain calculations more tractable.
All that said, the Wasserstein distance is, unfortunately, infamous for being hard to estimate when the dimension of \(\Omega\) is high. In the 1-dimensional case there is a simple and widely implemented procedure (available, e.g., in SciPy), but in high dimensions estimation suffers from the curse of dimensionality, and no efficient, statistically robust general-purpose estimator is known. This has even led to the design of alternative notions of distance that are much easier to estimate, such as the Sliced Wasserstein distance.
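To make the 1-dimensional case concrete, here is a minimal sketch using SciPy's `scipy.stats.wasserstein_distance` (which computes the exact 1-D Wasserstein distance between the empirical distributions of the two samples), together with a numerical sanity check of the change-of-measure inequality above; the particular distributions are made up purely for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)  # samples from P
y = rng.normal(loc=0.5, scale=1.5, size=10_000)  # samples from Q

# Exact 1-D Wasserstein distance between the two empirical distributions.
w1 = wasserstein_distance(x, y)
print(w1)

# Change-of-measure sanity check with f = sin, which is 1-Lipschitz:
# |E_P[f] - E_Q[f]| <= ||f||_Lip * W(P, Q) holds exactly for the empirical measures.
lhs = abs(np.sin(x).mean() - np.sin(y).mean())
print(lhs <= w1)  # True
```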
Though the Wasserstein distance is hard to estimate, it actually isn’t that hard to bound empirically. Below, I’d like to share a simple proof that it is upper bounded by the mean distance between independent samples from the two distributions.
Proposition. For any distributions \(P\) and \(Q\) over some normed space \(\Omega\), it holds that \[W(P,Q) \leq {\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left\| {X - Y} \right\| \right\rbrack.\]
Proof: By definition, \[\begin{aligned} W(P,Q) & = \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}\left| {{\mathbb{E}}_{X \sim P}\left\lbrack \phi(X) \right\rbrack - {\mathbb{E}}_{Y \sim Q}\left\lbrack \phi(Y) \right\rbrack} \right| \\ & = \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}\left| {{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \phi(X) - \phi(Y) \right\rbrack} \right|, \end{aligned}\] where in the second equality we take \(X\) and \(Y\) to be independent (though any joint distribution with marginals \(P\) and \(Q\) would do just as well).
By Jensen,
\[\sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}\left| {{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \phi(X) - \phi(Y) \right\rbrack} \right| \leq \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left| {\phi(X) - \phi(Y)} \right| \right\rbrack.\]
And since \(\phi\) must be 1-Lipschitz (by the constraint in the supremum), \[\begin{aligned} \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left| {\phi(X) - \phi(Y)} \right| \right\rbrack & \leq \sup\limits_{\left\| \phi \right\|_{\text{Lip}} \leq 1}{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left\| {X - Y} \right\| \right\rbrack \\ & = {\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left\| {X - Y} \right\| \right\rbrack, \end{aligned}\]
as we desired. □
If we consider the case in which \(\Omega = {\mathbb{R}}\) (and the norm is the usual one), this upper bound is the very familiar mean absolute error (MAE):
\[\Omega = {\mathbb{R}} \Longrightarrow W(P,Q) \leq {\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left| {X - Y} \right| \right\rbrack.\]
If one wants to relate this instead to the more usual mean squared error (MSE) (well, the root mean squared error (RMSE), really), then all that is needed is to apply Jensen once more:
\[{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left| {X - Y} \right| \right\rbrack = {\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \sqrt{\left| {X - Y} \right|^{2}} \right\rbrack \leq \sqrt{{\mathbb{E}}_{X \sim P,Y \sim Q}\left\lbrack \left| {X - Y} \right|^{2} \right\rbrack}.\]
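Here is a small 1-dimensional illustration of the resulting chain \(W(P,Q) \leq \text{MAE} \leq \text{RMSE}\), where the MAE and RMSE are taken over independent pairs of samples from the two distributions. This is just a sketch with two made-up Gaussians; in 1-D we can compare against the exact Wasserstein distance from SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=2_000)  # samples from P
y = rng.normal(2.0, 1.0, size=2_000)  # samples from Q

# All pairwise absolute differences between the independent samples of P and Q.
diff = np.abs(x[:, None] - y[None, :])

w1 = wasserstein_distance(x, y)      # exact W1 between the two empirical measures
mae = diff.mean()                    # estimates E|X - Y|
rmse = np.sqrt((diff ** 2).mean())   # estimates sqrt(E|X - Y|^2)

# For these two Gaussians the true W1 is exactly the shift (2.0),
# while the MAE and RMSE bounds are somewhat larger.
print(w1, mae, rmse)
print(w1 <= mae <= rmse)  # True
```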
Now, this bound is occasionally tight, but not necessarily so. In particular, note that though \(W(P,P) = 0\) for any \(P\), our bound gives us merely that
\[W(P,P) \leq {\mathbb{E}}_{X \sim P,Y \sim P}\left\lbrack \left\| {X - Y} \right\| \right\rbrack,\]
which, generally speaking, will be zero only if \(P\) is a point mass. But here’s an example in which the bound is legitimately useful (inspired by [Csillag et al., 2025]):
Example (Imputing with predictions doesn’t hurt if your procedure is stable). Suppose you have \((X,Y) \sim P\) with support in \(\mathcal{X} \times \mathcal{Y}\), and let \(\mu:\mathcal{X} \rightarrow \mathcal{Y}\) be a machine learning model for predicting \(Y\) from \(X\) (i.e., a function for which hopefully \(\mu(X) \approx Y\)). Writing \(W\left( \mu(X),Y \right)\) for the Wasserstein distance between the distributions of \(\mu(X)\) and of \(Y\), it then holds (applying the bound above to the pair \(\left( \mu(X),Y \right)\), which is jointly distributed rather than independent; as noted in the proof, this makes no difference):
\[W\left( \mu(X),Y \right) \leq {\mathbb{E}}_{(X,Y) \sim P}\left\lbrack \left\| {\mu(X) - Y} \right\| \right\rbrack.\]
In particular, note: if \(g:\mathcal{Y} \rightarrow {\mathbb{R}}\) is some “post-processing” function that does something useful/interesting atop the label values \(Y\), and \(g\) is Lipschitz with a not-too-large Lipschitz constant (i.e., it is reasonably ‘stable’), then replacing the true values \(Y\) with predicted values \(\mu(X)\) can only hurt by about the mean distance between them, regardless of the particulars of \(\mu\) (which could be an arbitrarily complicated neural network):
\[\begin{aligned} \left| {{\mathbb{E}}\left\lbrack g(Y) \right\rbrack - {\mathbb{E}}\left\lbrack g\left( \mu(X) \right) \right\rbrack} \right| & \leq \left\| g \right\|_{\text{Lip}}W\left( \mu(X),Y \right) \\ & \leq \left\| g \right\|_{\text{Lip}}{\mathbb{E}}\left\lbrack \left\| {Y - \mu(X)} \right\| \right\rbrack. \end{aligned}\] □
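Here is a minimal, self-contained sketch of this in action. Everything below (the data-generating process, the least-squares “model” \(\mu\), and the 1-Lipschitz post-processing \(g\)) is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Purely illustrative synthetic data: Y depends on X plus noise.
n = 10_000
X = rng.uniform(-2.0, 2.0, size=n)
Y = 1.5 * X + 0.5 + rng.normal(0.0, 0.3, size=n)

# A stand-in for a fitted model mu: here, just a least-squares line.
a, b = np.polyfit(X, Y, deg=1)
mu_X = a * X + b

# A 1-Lipschitz post-processing g (capping values at 1.0), so ||g||_Lip <= 1.
def g(y):
    return np.minimum(y, 1.0)

lhs = abs(g(Y).mean() - g(mu_X).mean())  # error incurred by imputing Y with mu(X)
rhs = np.abs(Y - mu_X).mean()            # ||g||_Lip * E||Y - mu(X)||, with ||g||_Lip = 1
print(lhs, rhs)
print(lhs <= rhs)  # True, no matter how mu was obtained
```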
Another relevant example is that of a neat bound on the Wasserstein distance between a (multivariate) Gaussian and a Dirac:
Example (Bound on the Wasserstein distance between a Gaussian and a Dirac). Combining the proposition with the Jensen step above, \[\begin{aligned} W\left( \mathcal{N}(\mu,\Sigma),\delta_{x} \right) & \leq \sqrt{{\mathbb{E}}_{X \sim \mathcal{N}(\mu,\Sigma)}\left\lbrack \left\| {X - x} \right\|^{2} \right\rbrack} \\ & = \sqrt{{\mathbb{E}}_{\widetilde{X} \sim \mathcal{N}(\mu - x,\Sigma)}\left\lbrack \left\| \widetilde{X} \right\|^{2} \right\rbrack} \\ & = \sqrt{\sum_{i = 1}^{d}{\mathbb{E}}_{{\widetilde{X}}_{i} \sim \mathcal{N}(\mu_{i} - x_{i},\Sigma_{i,i})}\left\lbrack {\widetilde{X}}_{i}^{2} \right\rbrack} \\ & = \sqrt{\sum_{i = 1}^{d}\left( \Sigma_{i,i} + \left( \mu_{i} - x_{i} \right)^{2} \right)} \\ & = \sqrt{\text{tr}(\Sigma) + \left\| {\mu - x} \right\|^{2}}. \end{aligned}\]
And by the way, this also works just as well in infinite dimensions (i.e., a Gaussian process). □
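As a quick Monte Carlo check of the closed form (with an arbitrary, made-up \(\mu\), \(\Sigma\) and \(x\) in 3 dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)

# An arbitrary 3-D Gaussian N(mu, Sigma) and a point x, purely for illustration.
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)  # a valid (positive-definite) covariance matrix
x = np.array([0.0, 1.0, 2.0])

# Monte Carlo estimate of sqrt(E ||X - x||^2) for X ~ N(mu, Sigma).
X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.sqrt(np.mean(np.sum((X - x) ** 2, axis=1)))

# The closed form derived above: sqrt(tr(Sigma) + ||mu - x||^2).
closed = np.sqrt(np.trace(Sigma) + np.sum((mu - x) ** 2))

print(mc, closed)  # these should agree up to Monte Carlo error
```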
If for some reason you'd like to cite this post, you can use the BibTeX entry below.