Why Margins.contrast exists: joint covariance for two arms

The problem

A common task is to compute a contrast between two counterfactual specifications — for example, the population average treatment effect (PATE):

\[ \Delta \;=\; \mathbb{E}\bigl[g_A(\beta)\bigr] \;-\; \mathbb{E}\bigl[g_B(\beta)\bigr], \]

where \(g_A\) averages the predicted response with treat=1 and \(g_B\) does the same with treat=0. A user could compute this naïvely:

r_a = M.predict(values={"treat": 1})
r_b = M.predict(values={"treat": 0})
diff = r_a.estimate - r_b.estimate           # correct
naive_se = np.sqrt(r_a.se**2 + r_b.se**2)    # WRONG

The point estimate is fine. The standard error is wrong, often badly.

Why the naïve SE is wrong

The two predictions \(g_A\) and \(g_B\) depend on the same \(\hat\beta\). Their sampling errors are correlated — strongly, since they share every coefficient. The variance of the difference is:

\[ \operatorname{Var}\bigl(g_A(\hat\beta) - g_B(\hat\beta)\bigr) \;=\; \operatorname{Var}(g_A) + \operatorname{Var}(g_B) - 2\,\operatorname{Cov}(g_A, g_B). \]

The naïve formula sets the covariance to zero. For PATE in a logit, where \(g_A\) and \(g_B\) differ only by one regressor, the covariance term is typically the same order of magnitude as the variances, and the naïve SE can be off by a factor of 2 or more.

The delta-method derivation

Let \(J_A\) and \(J_B\) be the Jacobians of \(g_A\) and \(g_B\) with respect to \(\beta\), evaluated at \(\hat\beta\). Stack them:

\[\begin{split} J \;=\; \begin{bmatrix} J_A \\ J_B \\ J_A - J_B \end{bmatrix}, \qquad g(\hat\beta) \;=\; \begin{bmatrix} g_A(\hat\beta) \\ g_B(\hat\beta) \\ g_A(\hat\beta) - g_B(\hat\beta) \end{bmatrix}. \end{split}\]

The first-order delta-method approximation gives a single joint covariance matrix:

\[ \operatorname{Var}\bigl(g(\hat\beta)\bigr) \;\approx\; J\,\widehat V\,J^\top, \]

where \(\widehat V\) is the estimated covariance of \(\hat\beta\). The (3, 3) block of this matrix is the correct variance of the difference — and it automatically picks up the off-diagonal \(\operatorname{Cov}(g_A, g_B)\) because of the way \(J_A - J_B\) is built into the stack.

Equivalently, observe that

\[ \bigl[J\,\widehat V\,J^\top\bigr]_{33} \;=\; J_A \widehat V J_A^\top + J_B \widehat V J_B^\top - 2\,J_A \widehat V J_B^\top, \]

so the joint computation is simply doing the bookkeeping the user would otherwise have to do by hand.

What the implementation does

Margins.contrast (smmargins/core.py:1442) builds exactly that stacked statistic and Jacobian:

  1. Resolve arm A’s frames via the same path predict() uses (with values=a or a_newdata=).

  2. Resolve arm B’s frames the same way.

  3. Form \(g(\beta) = [g_A(\beta), g_B(\beta), g_A(\beta) - g_B(\beta)]\) as a single vector-valued statistic.

  4. Form \(J(\beta)\) by stacking \(J_A\), \(J_B\), and \(J_A - J_B\).

  5. Pass through the standard _delta runner. Inference kwargs (vce=, cov_type=, ci_method=) compose unchanged.

The result is a standard MarginsResult with three rows. The third row’s SE is the joint-covariance SE. The full \(3 \times 3\) block is available as result.vcov for users who want to construct further linear combinations.

Composition with non-delta inference

For vce="bootstrap" and vce="simulation", the joint computation is automatic: a single resample (or a single \(\hat\beta\) draw) produces both \(g_A^{(b)}\) and \(g_B^{(b)}\), and the contrast at draw \(b\) is just \(g_A^{(b)} - g_B^{(b)}\). The empirical covariance of those draws gives the joint covariance with no extra work — the bookkeeping is implicit in the resampling design.

For ci_method="sup-t", the simultaneous CI uses the joint draw matrix across all three rows, so the band on the difference is calibrated against \(g_A\) and \(g_B\) jointly.

Tradeoffs and pitfalls

  • a_newdata and b_newdata should usually be the same length. They don’t have to be — M.contrast(a_newdata=df_treated, b_newdata=df_control) is well-defined when treated and control populations differ — but the contrast then estimates a target-population-specific effect, not PATE on the original sample. Document which population you mean.

  • The joint \(\widehat V\) assumption requires the two arms to be evaluated against the same \(\hat\beta\). If you fit two separate models and want to contrast them, this machinery is the wrong tool — you need a different covariance estimator (typically a stacked-GMM or seemingly-unrelated approach).

  • Bootstrap arm consistency. Each resample refits once and computes both arms against the refit’s coefficients. Don’t bootstrap arms independently — that throws away the very correlation the joint computation is designed to capture.

See also

  • Tutorial: contrast walkthrough (forthcoming).

  • API: smmargins.Margins.contrast (see API reference).

  • Theory: Mathematical motivation for the general delta-method derivation underlying every MarginsResult.