Mathematical motivation¶
This page derives, in one place, every statistic that
Margins computes and the standard error attached to
each one. The motivation is identical to Stata’s margins-delta-method
FAQ
and the cases enumerated in Richard Williams’ Margins01 notes; the goal
here is to write them down using a single primitive so the
implementation reduces to one Jacobian.
The delta method¶
For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives
Taking variances on both sides yields
Three things to notice:
The approximation depends only on first derivatives of \(g\).
Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\) — no further differentiation is needed. That is why
contrast()is exact under the same approximation.The standard error of \(g(\hat\beta)\) is just the square-root of the diagonal of \(G\widehat V G^\top\); we report both.
A single statistic schema¶
Let \(f(\eta) = E[Y \mid \eta]\) be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices:
Statistic |
\(g(\beta)\) |
|---|---|
AAP |
\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) |
APM |
\(f(\bar x^\top\beta)\) |
APR |
\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed |
AME |
\(\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\) |
MEM |
\(\partial f(\bar x^\top\beta)/\partial x_j\) |
MER |
AME with some \(x\) held at representative values |
Contrast |
\(E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]\) |
Every entry is a linear combination of mean-predictions \(\frac{1}{n}\sum_i f(x_i^\top\beta)\) on different design matrices. The single Jacobian primitive is therefore
Substituting back into the delta method gives a closed form for the covariance of every statistic in the table.
Marginal effects of a single variable¶
For a continuous covariate \(x_{ij}\),
If the design matrix is just \(X = [\,\dots,\, x_j,\,\dots]\), the
inner derivative is \(\beta_j\) and the AME collapses to the
familiar \(\overline{f'(x^\top\beta)}\,\beta_j\). With a formula
involving I(x**2), x1:x2, bs(x, df=4), or a
Categorical contrast, the inner derivative is whatever
patsy produces when the data column is perturbed and the design
matrix is rebuilt — which is exactly what
Margins does.
Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing \(f\) at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix \(C = [-1, +1]\).
Elasticities¶
The four method= choices on dydx() apply a
per-observation transform to the raw derivative
\(d_i = \partial f(x_i^\top\beta)/\partial x_{ij}\) before
averaging:
Method |
Per-obs transform |
Interpretation |
|---|---|---|
|
\(d_i\) |
Level change (default). |
|
\(d_i\,x_i\) |
Semi-elasticity, \(dy/d(\ln x)\). |
|
\(d_i\,x_i / y_i\) |
Full elasticity. |
|
\(d_i / y_i\) |
Semi-elasticity, \(d(\ln y)/dx\). |
Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one.
Outer Jacobian: analytic vs. finite differences¶
Stata’s margins computes the entire \(G\) numerically. We
compute it analytically when possible:
For any GLM, \(f'\) is read off
family.link.inverse_deriv. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box.For
OLS/WLS/GLS, \(f' \equiv 1\) (identity link).For
MNLogit, we use an analytic softmax-derivative gradient.
Whenever an analytic \(f'\) or Jacobian is available and the linear
predictor is \(\eta = X\beta\) (no offset/exposure), the chain rule
gives \(\partial g/\partial\beta\) in closed form, saving
\(p\) forward predict calls per statistic.
The fallback is central differences,
with \(h_j\) chosen to balance truncation error
(\(O(h^2)\)) against round-off (\(O(\epsilon/h)\)); the
optimum scales as \(h \sim \epsilon^{1/3}\). See
_central_jacobian() and Nocedal & Wright,
Numerical Optimization (2006), Chapter 8.
Analytic-vs-FD parity matrix. Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with analytic=True and analytic=False, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / get_margeff.
Why the response scale matters for DiD (Ai & Norton 2003)¶
In a logit, the coefficient on a group:condition interaction is on
the log-odds scale. On the probability scale, the
difference-in-differences
is a nonlinear function of every parameter and every covariate
profile \(x_*\). You cannot read it off the interaction
coefficient. did() evaluates the four cells
on the response scale, with their joint delta-method covariance, and
forms the DiD as a contrast — Ai & Norton’s recommendation in code.
References¶
Stata FAQ, How are the standard errors computed with margins? https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/
Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12(2), 308–331. https://www3.nd.edu/~rwilliam/stats/Margins01.pdf
Ai, C., & Norton, E. C. (2003). Interaction terms in logit and probit models. Economics Letters, 80(1), 123–129. https://doi.org/10.1016/S0165-1765(03)00032-6
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization, 2nd ed., Springer. Chapter 8.