Mathematical motivation¶
This page derives, in one place, every statistic that
Margins computes and the standard error attached to
each one. The motivation is identical to Stata’s margins-delta-method
FAQ
and the cases enumerated in Richard Williams’ Margins01 notes; the goal
here is to write them down using a single primitive so the
implementation reduces to one Jacobian.
The delta method¶
For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives
Taking variances on both sides yields
Three things to notice:
The approximation depends only on first derivatives of \(g\).
Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\) — no further differentiation is needed. That is why
contrast()is exact under the same approximation.The standard error of \(g(\hat\beta)\) is just the square-root of the diagonal of \(G\widehat V G^\top\); we report both.
A single statistic schema¶
Let \(f(\eta) = E[Y \mid \eta]\) be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices:
Statistic |
\(g(\beta)\) |
|---|---|
AAP |
\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) |
APM |
\(f(\bar x^\top\beta)\) |
APR |
\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed |
AME |
\(\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\) |
MEM |
\(\partial f(\bar x^\top\beta)/\partial x_j\) |
MER |
AME with some \(x\) held at representative values |
Contrast |
\(E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]\) |
Every entry is a linear combination of mean-predictions \(\frac{1}{n}\sum_i f(x_i^\top\beta)\) on different design matrices. The single Jacobian primitive is therefore
Substituting back into the delta method gives a closed form for the covariance of every statistic in the table.
Marginal effects of a single variable¶
For a continuous covariate \(x_{ij}\),
If the design matrix is just \(X = [\,\dots,\, x_j,\,\dots]\), the
inner derivative is \(\beta_j\) and the AME collapses to the
familiar \(\overline{f'(x^\top\beta)}\,\beta_j\). With a formula
involving I(x**2), x1:x2, bs(x, df=4), or a
Categorical contrast, the inner derivative is whatever
patsy produces when the data column is perturbed and the design
matrix is rebuilt — which is exactly what
Margins does.
Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing \(f\) at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix \(C = [-1, +1]\).
Elasticities¶
The four method= choices on dydx() apply a
per-observation transform to the raw derivative
\(d_i = \partial f(x_i^\top\beta)/\partial x_{ij}\) before
averaging:
Method |
Per-obs transform |
Interpretation |
|---|---|---|
|
\(d_i\) |
Level change (default). |
|
\(d_i\,x_i\) |
Semi-elasticity, \(dy/d(\ln x)\). |
|
\(d_i\,x_i / y_i\) |
Full elasticity. |
|
\(d_i / y_i\) |
Semi-elasticity, \(d(\ln y)/dx\). |
Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one.
Outer Jacobian: analytic vs. finite differences¶
Stata’s margins computes the entire \(G\) numerically. We
compute it analytically when possible:
For any GLM, \(f'\) is read off
family.link.inverse_deriv. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box.For
OLS/WLS/GLS, \(f' \equiv 1\) (identity link).For
MNLogit, we use an analytic softmax-derivative gradient.
Whenever an analytic \(f'\) or Jacobian is available and the linear
predictor is \(\eta = X\beta\) (no offset/exposure), the chain rule
gives \(\partial g/\partial\beta\) in closed form, saving
\(p\) forward predict calls per statistic.
The fallback is central differences,
with \(h_j\) chosen to balance truncation error
(\(O(h^2)\)) against round-off (\(O(\epsilon/h)\)); the
optimum scales as \(h \sim \epsilon^{1/3}\). See
_central_jacobian() and Nocedal & Wright,
Numerical Optimization (2006), Chapter 8.
Analytic-vs-FD parity matrix. Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with analytic=True and analytic=False, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / get_margeff.
Beyond delta: simulation and bootstrap VCEs¶
The delta method is a first-order approximation: \(G\widehat V G^\top\) is exact only to the extent that \(g\) is locally linear in \(\beta\) near \(\hat\beta\). For statistics that are sharply nonlinear (a probability near 0 or 1, an elasticity at a near-zero prediction) this can produce symmetric Wald intervals that extend beyond the natural support, or under-cover when the curvature is non-negligible.
smmargins exposes two alternatives behind the same vce=
keyword.
Krinsky–Robb simulation (vce="simulation"). Draw
\(S\) parameter vectors \(\beta_s\sim N(\hat\beta,\widehat
V)\), evaluate \(g(\beta_s)\) for each, and read inference off
the empirical distribution of the draws:
The reported point estimate stays the analytic \(g(\hat\beta)\),
not \(\bar g\); this matches Stata’s vce(simulation)
convention and avoids spurious bias from finite \(S\). Pointwise
intervals default to empirical quantiles of the draws and so are not
forced to be symmetric — an asymmetric CI for an extreme probability
is the practical reason to reach for KR.
Bootstrap (vce="bootstrap"). Refit the model on
\(B\) resamples of the rows and read inference off the bootstrap
distribution. Three resampling schemes are built in (boot_method=):
"pairs"— \((y_i, x_i)\) resampled IID (default);"cluster"— whole clusters resampled with replacement, required for robustness to within-cluster correlation;"block"— moving-block resampling for time series, withblock_size=chosen to span the dependence horizon.
Failed refits are caught and counted; a RuntimeWarning fires when
the failure rate exceeds 5%. The point estimate is again the analytic
\(g(\hat\beta)\), not the bootstrap mean.
Simultaneous confidence intervals¶
For a family of \(m\) margins returned in a single call,
ci_method= controls the critical value. With pointwise critical
value \(z_{1-\alpha/2}\):
"bonferroni"uses \(z_{1-\alpha/(2m)}\) — works with any VCE;"sidak"uses \(z_{(1+(1-\alpha)^{1/m})/2}\), slightly narrower than Bonferroni — also works with any VCE;"sup-t"uses the \((1-\alpha)\) quantile of the maximum standardised absolute deviation across the family, computed from the simulation/bootstrap draw matrix:\[c_{\sup\text{-}t}^{\,(1-\alpha)} = Q_{1-\alpha}\!\left( \max_{j=1,\dots,m} \left|\frac{g_j(\beta_s) - g_j(\hat\beta)}{\widehat\sigma_j}\right| \right).\]
Sup-t requires draws (so vce must be "simulation" or
"bootstrap") and is typically narrower than Bonferroni / Šidák
when the margins in the family are correlated — exactly the case for
margins evaluated at neighbouring covariate profiles.
Prediction scales and the chain rule¶
The scale= kwarg on predict() and
dydx() swaps the response-scale map
\(f(\eta)\) for an arbitrary \(g(\eta)\). Built-in scales
("linear", "or", "exp", "log", …) and any user-defined
Transform provide \(g\), \(g'\), and (for
dydx) \(g''\) analytically.
Predicted values. For a single design \(X\), the gradient of \(g(X\beta)\) w.r.t. \(\beta\) is
Only \(g'\) is needed; this is why predict(scale=t) accepts
Transform instances with hess=None.
AME on the g-scale. For a continuous covariate \(x_k\), the AME on the g-scale is
Differentiating w.r.t. \(\beta\):
Both terms appear: the first comes from differentiating
\(g'(\eta_i)\) through \(\eta_i = x_i^\top\beta\), the second
from differentiating the explicit \(\beta_k\). smmargins
computes this Jacobian analytically when Transform.hess is
provided (every built-in does). Without hess, dydx raises —
this is the package’s strict-derivative contract.
Discrete contrasts under a g-scale are simpler: the contrast is \(g(\eta_a) - g(\eta_b)\) averaged over the sample, and its gradient is \(g'(\eta_a)\,x_a - g'(\eta_b)\,x_b\). Only \(g'\) enters; \(g''\) is not used on the discrete path.
Composition with elasticities. Each method= ("eyex",
"dyex", "eydx") wraps the g-scale formulas above with the
appropriate \(x\) or \(g(\eta)\) factor, e.g.
The Jacobian inherits an extra factor or quotient but no new analytic term, so the same \(g'\) / \(g''\) machinery powers every elasticity × scale combination.
Subgroup AMEs (over=)¶
over= partitions the sample into \(G\) subgroups and computes
the AME within each. Because every subgroup AME is a function of the
same \(\hat\beta\), the joint covariance of the \(G\)
estimates is not block-diagonal:
Computing each subgroup independently and concatenating would discard
the off-diagonal blocks and break Wald tests across subgroups, so
smmargins always builds the stacked Jacobian.
Joint Wald and pairwise contrasts¶
For a result with point estimate \(\hat\theta\) and covariance
\(V\), wald() computes
distributed as \(\chi^2_{\mathrm{rank}(CVC^\top)}\) under
\(H_0\). pairwise(by=…) builds \(C\) for all
level-vs-level comparisons of a factor and returns a
MarginsResult whose ci_method= flows through
the simultaneous-CI machinery — including "sup-t" when the
underlying result carries draws.
Why the response scale matters for DiD (Ai & Norton 2003)¶
In a logit, the coefficient on a group:condition interaction is on
the log-odds scale. On the probability scale, the
difference-in-differences
is a nonlinear function of every parameter and every covariate
profile \(x_*\). You cannot read it off the interaction
coefficient. did() evaluates the four cells
on the response scale, with their joint delta-method covariance, and
forms the DiD as a contrast — Ai & Norton’s recommendation in code.
References¶
Stata FAQ, How are the standard errors computed with margins? https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/
Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12(2), 308–331. https://www3.nd.edu/~rwilliam/stats/Margins01.pdf
Ai, C., & Norton, E. C. (2003). Interaction terms in logit and probit models. Economics Letters, 80(1), 123–129. https://doi.org/10.1016/S0165-1765(03)00032-6
Nocedal, J., & Wright, S. J. (2006). Numerical Optimization, 2nd ed., Springer. Chapter 8.