Mathematical motivation

This page derives, in one place, every statistic that Margins computes and the standard error attached to each one. The motivation is identical to Stata’s margins-delta-method FAQ and the cases enumerated in Richard Williams’ Margins01 notes; the goal here is to write them down using a single primitive so the implementation reduces to one Jacobian.

The delta method

For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives

\[g(\hat\beta) \;\approx\; g(\beta_0) + G\,(\hat\beta - \beta_0), \qquad G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}.\]

Taking variances on both sides yields

\[\widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr] \;\approx\; G\,\widehat V\,G^\top .\]

Three things to notice:

  1. The approximation depends only on first derivatives of \(g\).

  2. Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\) — no further differentiation is needed. That is why contrast() is exact under the same approximation.

  3. The standard error of \(g(\hat\beta)\) is just the square-root of the diagonal of \(G\widehat V G^\top\); we report both.

A single statistic schema

Let \(f(\eta) = E[Y \mid \eta]\) be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices:

Statistic

\(g(\beta)\)

AAP

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\)

APM

\(f(\bar x^\top\beta)\)

APR

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed

AME

\(\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\)

MEM

\(\partial f(\bar x^\top\beta)/\partial x_j\)

MER

AME with some \(x\) held at representative values

Contrast

\(E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]\)

Every entry is a linear combination of mean-predictions \(\frac{1}{n}\sum_i f(x_i^\top\beta)\) on different design matrices. The single Jacobian primitive is therefore

\[\frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_i f(x_i^\top\beta) \;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i .\]

Substituting back into the delta method gives a closed form for the covariance of every statistic in the table.

Marginal effects of a single variable

For a continuous covariate \(x_{ij}\),

\[\frac{\partial f(x_i^\top\beta)}{\partial x_{ij}} \;=\; f'(x_i^\top\beta)\,\frac{\partial (x_i^\top\beta)}{\partial x_{ij}} .\]

If the design matrix is just \(X = [\,\dots,\, x_j,\,\dots]\), the inner derivative is \(\beta_j\) and the AME collapses to the familiar \(\overline{f'(x^\top\beta)}\,\beta_j\). With a formula involving I(x**2), x1:x2, bs(x, df=4), or a Categorical contrast, the inner derivative is whatever patsy produces when the data column is perturbed and the design matrix is rebuilt — which is exactly what Margins does.

Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing \(f\) at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix \(C = [-1, +1]\).

Elasticities

The four method= choices on dydx() apply a per-observation transform to the raw derivative \(d_i = \partial f(x_i^\top\beta)/\partial x_{ij}\) before averaging:

Method

Per-obs transform

Interpretation

dydx

\(d_i\)

Level change (default).

dyex

\(d_i\,x_i\)

Semi-elasticity, \(dy/d(\ln x)\).

eyex

\(d_i\,x_i / y_i\)

Full elasticity.

eydx

\(d_i / y_i\)

Semi-elasticity, \(d(\ln y)/dx\).

Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one.

Outer Jacobian: analytic vs. finite differences

Stata’s margins computes the entire \(G\) numerically. We compute it analytically when possible:

  • For any GLM, \(f'\) is read off family.link.inverse_deriv. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box.

  • For OLS / WLS / GLS, \(f' \equiv 1\) (identity link).

  • For MNLogit, we use an analytic softmax-derivative gradient.

Whenever an analytic \(f'\) or Jacobian is available and the linear predictor is \(\eta = X\beta\) (no offset/exposure), the chain rule gives \(\partial g/\partial\beta\) in closed form, saving \(p\) forward predict calls per statistic.

The fallback is central differences,

\[\frac{\partial f}{\partial x_j} \approx \frac{f(x + h e_j) - f(x - h e_j)}{2 h_j}, \qquad h_j = \epsilon^{1/3}\cdot\max(|x_j|, 1),\]

with \(h_j\) chosen to balance truncation error (\(O(h^2)\)) against round-off (\(O(\epsilon/h)\)); the optimum scales as \(h \sim \epsilon^{1/3}\). See _central_jacobian() and Nocedal & Wright, Numerical Optimization (2006), Chapter 8.

  • Analytic-vs-FD parity matrix. Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with analytic=True and analytic=False, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / get_margeff.

Beyond delta: simulation and bootstrap VCEs

The delta method is a first-order approximation: \(G\widehat V G^\top\) is exact only to the extent that \(g\) is locally linear in \(\beta\) near \(\hat\beta\). For statistics that are sharply nonlinear (a probability near 0 or 1, an elasticity at a near-zero prediction) this can produce symmetric Wald intervals that extend beyond the natural support, or under-cover when the curvature is non-negligible.

smmargins exposes two alternatives behind the same vce= keyword.

Krinsky–Robb simulation (vce="simulation"). Draw \(S\) parameter vectors \(\beta_s\sim N(\hat\beta,\widehat V)\), evaluate \(g(\beta_s)\) for each, and read inference off the empirical distribution of the draws:

\[\widehat{\operatorname{Var}}_{\mathrm{KR}}\bigl[g(\hat\beta)\bigr] = \frac{1}{S-1}\sum_{s=1}^S \bigl(g(\beta_s) - \bar g\bigr)\!\bigl(g(\beta_s) - \bar g\bigr)^{\!\top}.\]

The reported point estimate stays the analytic \(g(\hat\beta)\), not \(\bar g\); this matches Stata’s vce(simulation) convention and avoids spurious bias from finite \(S\). Pointwise intervals default to empirical quantiles of the draws and so are not forced to be symmetric — an asymmetric CI for an extreme probability is the practical reason to reach for KR.

Bootstrap (vce="bootstrap"). Refit the model on \(B\) resamples of the rows and read inference off the bootstrap distribution. Three resampling schemes are built in (boot_method=):

  • "pairs"\((y_i, x_i)\) resampled IID (default);

  • "cluster" — whole clusters resampled with replacement, required for robustness to within-cluster correlation;

  • "block" — moving-block resampling for time series, with block_size= chosen to span the dependence horizon.

Failed refits are caught and counted; a RuntimeWarning fires when the failure rate exceeds 5%. The point estimate is again the analytic \(g(\hat\beta)\), not the bootstrap mean.

Simultaneous confidence intervals

For a family of \(m\) margins returned in a single call, ci_method= controls the critical value. With pointwise critical value \(z_{1-\alpha/2}\):

  • "bonferroni" uses \(z_{1-\alpha/(2m)}\) — works with any VCE;

  • "sidak" uses \(z_{(1+(1-\alpha)^{1/m})/2}\), slightly narrower than Bonferroni — also works with any VCE;

  • "sup-t" uses the \((1-\alpha)\) quantile of the maximum standardised absolute deviation across the family, computed from the simulation/bootstrap draw matrix:

    \[c_{\sup\text{-}t}^{\,(1-\alpha)} = Q_{1-\alpha}\!\left( \max_{j=1,\dots,m} \left|\frac{g_j(\beta_s) - g_j(\hat\beta)}{\widehat\sigma_j}\right| \right).\]

Sup-t requires draws (so vce must be "simulation" or "bootstrap") and is typically narrower than Bonferroni / Šidák when the margins in the family are correlated — exactly the case for margins evaluated at neighbouring covariate profiles.

Why the response scale matters for DiD (Ai & Norton 2003)

In a logit, the coefficient on a group:condition interaction is on the log-odds scale. On the probability scale, the difference-in-differences

\[\mathrm{DiD}(x_*) = \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr] - \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr]\]

is a nonlinear function of every parameter and every covariate profile \(x_*\). You cannot read it off the interaction coefficient. did() evaluates the four cells on the response scale, with their joint delta-method covariance, and forms the DiD as a contrast — Ai & Norton’s recommendation in code.

References