Mathematical motivation ======================= This page derives, in one place, every statistic that :class:`~smmargins.Margins` computes and the standard error attached to each one. The motivation is identical to Stata's `margins-delta-method FAQ `_ and the cases enumerated in Richard Williams' `Margins01 notes `_; the goal here is to write them down using a single primitive so the implementation reduces to one Jacobian. The delta method ---------------- For a fitted parameter vector :math:`\hat\beta` with estimated covariance :math:`\widehat V`, and a (possibly vector-valued) statistic :math:`g(\beta)`, a first-order Taylor expansion gives .. math:: g(\hat\beta) \;\approx\; g(\beta_0) + G\,(\hat\beta - \beta_0), \qquad G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}. Taking variances on both sides yields .. math:: \widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr] \;\approx\; G\,\widehat V\,G^\top . Three things to notice: 1. The approximation depends only on **first** derivatives of :math:`g`. 2. Once :math:`G\widehat V G^\top` is in hand, *any* linear combination :math:`C\,g(\hat\beta)` has covariance :math:`C(G\widehat V G^\top)C^\top` — no further differentiation is needed. That is why :meth:`~smmargins.MarginsResult.contrast` is exact under the same approximation. 3. The standard error of :math:`g(\hat\beta)` is just the square-root of the diagonal of :math:`G\widehat V G^\top`; we report both. A single statistic schema ------------------------- Let :math:`f(\eta) = E[Y \mid \eta]` be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices: .. list-table:: :header-rows: 1 :widths: 15 60 * - Statistic - :math:`g(\beta)` * - AAP - :math:`\frac{1}{n}\sum_i f(x_i^\top\beta)` * - APM - :math:`f(\bar x^\top\beta)` * - APR - :math:`\frac{1}{n}\sum_i f(x_i^\top\beta)` with some :math:`x` fixed * - AME - :math:`\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}` * - MEM - :math:`\partial f(\bar x^\top\beta)/\partial x_j` * - MER - AME with some :math:`x` held at representative values * - Contrast - :math:`E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]` Every entry is a linear combination of mean-predictions :math:`\frac{1}{n}\sum_i f(x_i^\top\beta)` on different design matrices. The single Jacobian primitive is therefore .. math:: \frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_i f(x_i^\top\beta) \;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i . Substituting back into the delta method gives a closed form for the covariance of every statistic in the table. Marginal effects of a single variable ------------------------------------- For a continuous covariate :math:`x_{ij}`, .. math:: \frac{\partial f(x_i^\top\beta)}{\partial x_{ij}} \;=\; f'(x_i^\top\beta)\,\frac{\partial (x_i^\top\beta)}{\partial x_{ij}} . If the design matrix is just :math:`X = [\,\dots,\, x_j,\,\dots]`, the inner derivative is :math:`\beta_j` and the AME collapses to the familiar :math:`\overline{f'(x^\top\beta)}\,\beta_j`. With a formula involving ``I(x**2)``, ``x1:x2``, ``bs(x, df=4)``, or a :class:`~patsy.Categorical` contrast, the inner derivative is whatever patsy produces when the *data column* is perturbed and the design matrix is rebuilt — which is exactly what :class:`~smmargins.Margins` does. Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing :math:`f` at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix :math:`C = [-1, +1]`. Elasticities ------------ The four ``method=`` choices on :meth:`~smmargins.Margins.dydx` apply a per-observation transform to the raw derivative :math:`d_i = \partial f(x_i^\top\beta)/\partial x_{ij}` before averaging: .. list-table:: :header-rows: 1 :widths: 10 30 60 * - Method - Per-obs transform - Interpretation * - ``dydx`` - :math:`d_i` - Level change (default). * - ``dyex`` - :math:`d_i\,x_i` - Semi-elasticity, :math:`dy/d(\ln x)`. * - ``eyex`` - :math:`d_i\,x_i / y_i` - Full elasticity. * - ``eydx`` - :math:`d_i / y_i` - Semi-elasticity, :math:`d(\ln y)/dx`. Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one. Outer Jacobian: analytic vs. finite differences ----------------------------------------------- Stata's ``margins`` computes the entire :math:`G` numerically. We compute it analytically when possible: - For any GLM, :math:`f'` is read off ``family.link.inverse_deriv``. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box. - For ``OLS`` / ``WLS`` / ``GLS``, :math:`f' \equiv 1` (identity link). - For ``MNLogit``, we use an analytic softmax-derivative gradient. Whenever an analytic :math:`f'` or Jacobian is available **and** the linear predictor is :math:`\eta = X\beta` (no offset/exposure), the chain rule gives :math:`\partial g/\partial\beta` in closed form, saving :math:`p` forward ``predict`` calls per statistic. The fallback is central differences, .. math:: \frac{\partial f}{\partial x_j} \approx \frac{f(x + h e_j) - f(x - h e_j)}{2 h_j}, \qquad h_j = \epsilon^{1/3}\cdot\max(|x_j|, 1), with :math:`h_j` chosen to balance truncation error (:math:`O(h^2)`) against round-off (:math:`O(\epsilon/h)`); the optimum scales as :math:`h \sim \epsilon^{1/3}`. See :func:`~smmargins.utils._central_jacobian` and Nocedal & Wright, *Numerical Optimization* (2006), Chapter 8. - **Analytic-vs-FD parity matrix.** Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with `analytic=True` and `analytic=False`, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / `get_margeff`. Beyond delta: simulation and bootstrap VCEs ------------------------------------------- The delta method is a first-order approximation: :math:`G\widehat V G^\top` is exact only to the extent that :math:`g` is locally linear in :math:`\beta` near :math:`\hat\beta`. For statistics that are sharply nonlinear (a probability near 0 or 1, an elasticity at a near-zero prediction) this can produce symmetric Wald intervals that extend beyond the natural support, or under-cover when the curvature is non-negligible. ``smmargins`` exposes two alternatives behind the same ``vce=`` keyword. **Krinsky–Robb simulation** (``vce="simulation"``). Draw :math:`S` parameter vectors :math:`\beta_s\sim N(\hat\beta,\widehat V)`, evaluate :math:`g(\beta_s)` for each, and read inference off the empirical distribution of the draws: .. math:: \widehat{\operatorname{Var}}_{\mathrm{KR}}\bigl[g(\hat\beta)\bigr] = \frac{1}{S-1}\sum_{s=1}^S \bigl(g(\beta_s) - \bar g\bigr)\!\bigl(g(\beta_s) - \bar g\bigr)^{\!\top}. The reported point estimate stays the *analytic* :math:`g(\hat\beta)`, not :math:`\bar g`; this matches Stata's ``vce(simulation)`` convention and avoids spurious bias from finite :math:`S`. Pointwise intervals default to empirical quantiles of the draws and so are not forced to be symmetric — an asymmetric CI for an extreme probability is the practical reason to reach for KR. **Bootstrap** (``vce="bootstrap"``). Refit the model on :math:`B` resamples of the rows and read inference off the bootstrap distribution. Three resampling schemes are built in (``boot_method=``): - ``"pairs"`` — :math:`(y_i, x_i)` resampled IID (default); - ``"cluster"`` — whole clusters resampled with replacement, required for robustness to within-cluster correlation; - ``"block"`` — moving-block resampling for time series, with ``block_size=`` chosen to span the dependence horizon. Failed refits are caught and counted; a ``RuntimeWarning`` fires when the failure rate exceeds 5%. The point estimate is again the analytic :math:`g(\hat\beta)`, not the bootstrap mean. Simultaneous confidence intervals --------------------------------- For a family of :math:`m` margins returned in a single call, ``ci_method=`` controls the critical value. With pointwise critical value :math:`z_{1-\alpha/2}`: - ``"bonferroni"`` uses :math:`z_{1-\alpha/(2m)}` — works with any VCE; - ``"sidak"`` uses :math:`z_{(1+(1-\alpha)^{1/m})/2}`, slightly narrower than Bonferroni — also works with any VCE; - ``"sup-t"`` uses the :math:`(1-\alpha)` quantile of the maximum standardised absolute deviation across the family, computed from the simulation/bootstrap draw matrix: .. math:: c_{\sup\text{-}t}^{\,(1-\alpha)} = Q_{1-\alpha}\!\left( \max_{j=1,\dots,m} \left|\frac{g_j(\beta_s) - g_j(\hat\beta)}{\widehat\sigma_j}\right| \right). Sup-t requires draws (so ``vce`` must be ``"simulation"`` or ``"bootstrap"``) and is typically narrower than Bonferroni / Šidák when the margins in the family are correlated — exactly the case for margins evaluated at neighbouring covariate profiles. Prediction scales and the chain rule ------------------------------------ The ``scale=`` kwarg on :meth:`~smmargins.Margins.predict` and :meth:`~smmargins.Margins.dydx` swaps the response-scale map :math:`f(\eta)` for an arbitrary :math:`g(\eta)`. Built-in scales (``"linear"``, ``"or"``, ``"exp"``, ``"log"``, …) and any user-defined :class:`~smmargins.Transform` provide :math:`g`, :math:`g'`, and (for ``dydx``) :math:`g''` analytically. **Predicted values.** For a single design :math:`X`, the gradient of :math:`g(X\beta)` w.r.t. :math:`\beta` is .. math:: \frac{\partial g(\eta)}{\partial \beta} = g'(\eta) \cdot X. Only :math:`g'` is needed; this is why ``predict(scale=t)`` accepts ``Transform`` instances with ``hess=None``. **AME on the g-scale.** For a continuous covariate :math:`x_k`, the AME on the g-scale is .. math:: \mathrm{AME}_k^{(g)} = \frac{1}{n}\sum_i g'(\eta_i) \cdot \beta_k. Differentiating w.r.t. :math:`\beta`: .. math:: \frac{\partial \mathrm{AME}_k^{(g)}}{\partial \beta_j} = \frac{1}{n}\sum_i \bigl[\,g''(\eta_i) \cdot x_{ij} \cdot \beta_k + g'(\eta_i) \cdot \delta_{jk}\,\bigr]. Both terms appear: the first comes from differentiating :math:`g'(\eta_i)` through :math:`\eta_i = x_i^\top\beta`, the second from differentiating the explicit :math:`\beta_k`. ``smmargins`` computes this Jacobian analytically when ``Transform.hess`` is provided (every built-in does). Without ``hess``, ``dydx`` raises — this is the package's strict-derivative contract. **Discrete contrasts** under a g-scale are simpler: the contrast is :math:`g(\eta_a) - g(\eta_b)` averaged over the sample, and its gradient is :math:`g'(\eta_a)\,x_a - g'(\eta_b)\,x_b`. Only :math:`g'` enters; :math:`g''` is not used on the discrete path. **Composition with elasticities.** Each ``method=`` (``"eyex"``, ``"dyex"``, ``"eydx"``) wraps the g-scale formulas above with the appropriate :math:`x` or :math:`g(\eta)` factor, e.g. .. math:: \mathrm{eyex}_k^{(g)} = \mathrm{AME}_k^{(g)} \cdot \frac{\overline{x_k}}{\overline{g(\eta)}}. The Jacobian inherits an extra factor or quotient but no new analytic term, so the same :math:`g'` / :math:`g''` machinery powers every elasticity × scale combination. Subgroup AMEs (``over=``) ------------------------- ``over=`` partitions the sample into :math:`G` subgroups and computes the AME within each. Because every subgroup AME is a function of the *same* :math:`\hat\beta`, the joint covariance of the :math:`G` estimates is **not block-diagonal**: .. math:: \widehat{\mathrm{Var}}\bigl(\hat\theta_{1\ldots G}\bigr) = J\,\widehat V\,J^\top, \quad J = \begin{bmatrix} \partial \theta_1 / \partial \beta \\ \vdots \\ \partial \theta_G / \partial \beta \end{bmatrix}. Computing each subgroup independently and concatenating would discard the off-diagonal blocks and break Wald tests across subgroups, so ``smmargins`` always builds the stacked Jacobian. Joint Wald and pairwise contrasts --------------------------------- For a result with point estimate :math:`\hat\theta` and covariance :math:`V`, :meth:`~smmargins.MarginsResult.wald` computes .. math:: W = (C\hat\theta - c_0)^\top (C V C^\top)^{-1} (C\hat\theta - c_0), distributed as :math:`\chi^2_{\mathrm{rank}(CVC^\top)}` under :math:`H_0`. ``pairwise(by=…)`` builds :math:`C` for all level-vs-level comparisons of a factor and returns a :class:`~smmargins.MarginsResult` whose ``ci_method=`` flows through the simultaneous-CI machinery — including ``"sup-t"`` when the underlying result carries draws. Why the response scale matters for DiD (Ai & Norton 2003) --------------------------------------------------------- In a logit, the coefficient on a ``group:condition`` interaction is on the *log-odds* scale. On the *probability* scale, the difference-in-differences .. math:: \mathrm{DiD}(x_*) = \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr] - \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr] is a nonlinear function of *every* parameter and *every* covariate profile :math:`x_*`. You cannot read it off the interaction coefficient. :meth:`~smmargins.Margins.did` evaluates the four cells on the response scale, with their joint delta-method covariance, and forms the DiD as a contrast — Ai & Norton's recommendation in code. References ---------- - Stata FAQ, *How are the standard errors computed with margins?* https://www.stata.com/support/faqs/statistics/compute-standard-errors-with-margins/ - Williams, R. (2012). *Using the margins command to estimate and interpret adjusted predictions and marginal effects.* Stata Journal, 12(2), 308–331. https://www3.nd.edu/~rwilliam/stats/Margins01.pdf - Ai, C., & Norton, E. C. (2003). *Interaction terms in logit and probit models.* Economics Letters, 80(1), 123–129. https://doi.org/10.1016/S0165-1765(03)00032-6 - Nocedal, J., & Wright, S. J. (2006). *Numerical Optimization*, 2nd ed., Springer. Chapter 8.