How to set covariate profiles with values=, Expr, and newdata=¶
Prerequisites¶
Tutorial: First steps with smmargins — fitting a model and computing basic predictions
Tutorial: Adjusted predictions —
at=andatexog=
Problem statement¶
You want to evaluate predictions or marginal effects under hypothetical covariate profiles: setting income to its 25th percentile, increasing everyone’s age by five years, or using an entirely new data frame. The values= keyword provides a per-variable DSL for this; Expr lets you write formula-based transformations; and newdata= is the escape hatch for arbitrary frames.
Minimal working solution¶
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from smmargins import Margins, Expr
rng = np.random.default_rng(7)
N = 5_000
df = pd.DataFrame({
"age": rng.normal(45, 12, N).clip(18, 90),
"income": rng.lognormal(10.5, 0.4, N),
"educ": rng.choice(["hs", "college", "grad"], N, p=[0.4, 0.4, 0.2]),
"female": rng.integers(0, 2, N),
})
df["voted"] = (rng.uniform(0, 1, N) < 1 / (1 + np.exp(-(
-4 + 0.05 * df.age + 0.00001 * df.income
+ 0.8 * (df.educ == "college") + 1.4 * (df.educ == "grad")
+ 0.3 * df.female
)))).astype(int)
fit = smf.logit("voted ~ age + income + C(educ) + female", data=df).fit(disp=False)
M = Margins(fit)
# Hold income at its median; everything else stays as observed
print(M.predict(values={"income": "p50"}))
# Increase everyone's age by 5 years
print(M.predict(values={"age": Expr("age + 5")}))
# Evaluate at an out-of-sample profile
new = pd.DataFrame({"age": [30, 50, 70], "income": [30000, 50000, 90000],
"educ": ["college", "hs", "grad"], "female": [0, 1, 0]})
print(M.predict(newdata=new))
The values= DSL¶
values= is a mapping from variable names to specifications. Each specification can be one of several kinds:
Kind |
Example |
Result |
|---|---|---|
Scalar |
|
Column set to 45 for every row |
Categorical level |
|
Column set to |
Reducer |
|
Column reduced to its mean |
Percentile |
|
Column reduced to its 25th percentile |
Sequence |
|
Cartesian product over the list |
Callable |
|
Per-row transform via function |
Expr |
|
Per-row transform via |
Built-in reducers¶
Valid reducer strings are:
"mean","median","mode","min","max","zero"Percentiles:
"p0"through"p100""asobserved"(the default fordefault_values=)
Numeric reducers raise on non-numeric columns; "mode" and "zero" work for any dtype. "zero" on a categorical column sets it to the reference (first observed) level.
Sequences and the cartesian product¶
If any value is a list, tuple, or ndarray, smmargins computes the cartesian product over all sequence specs:
# 3 ages × 2 sexes = 6 predictions
M.predict(values={"age": [25, 45, 65], "female": [0, 1]})
Non-sequence specs (scalars, reducers, callables, Expr) are applied first and do not multiply the grid.
Callables¶
A callable receives the original data frame and must return a pd.Series or np.ndarray of the same length:
M.predict(values={"income": lambda d: d["income"] * 1.10})
Expr¶
:class:~smmargins.Expr wraps a string that is evaluated via :meth:pandas.DataFrame.eval against the original data frame. Use it when the transformation is easier to write as a formula:
M.predict(values={"income": Expr("income * 1.10")})
M.predict(values={"age": Expr("(age - age.mean()) / age.std()")})
Strings that are not wrapped in Expr are interpreted as reducer names or categorical levels, not as formulas.
default_values= — what happens to the rest¶
By default, columns not mentioned in values= or atexog= keep their observed values (default_values="asobserved"). You can change this:
# Hold age at 45; set all other numerics to their mean
M.predict(values={"age": 45}, default_values="mean")
# Set all numerics to their median; factors stay as observed
M.predict(values={"age": 45}, default_values="median")
Valid options for default_values are "asobserved", "mean", "median", "mode", and "zero".
newdata= — arbitrary frames¶
newdata= accepts any :class:pandas.DataFrame with the same columns as the fitting data. It is mutually exclusive with at, atexog, values, and over:
hypo = pd.DataFrame({"age": [25, 45, 65], "income": [30000, 50000, 80000],
"educ": ["college", "hs", "grad"], "female": [0, 1, 0]})
M.predict(newdata=hypo)
Use newdata= for out-of-sample profiles or when the DSL does not give you enough control.
Using the DSL with dydx and contrast¶
values= and default_values= work on every post-estimation method, not just predict:
# AME of age when income is held at its 25th percentile
M.dydx("age", values={"income": "p25"})
# Marginal effect of education when everyone is female
M.dydx("educ", values={"female": 1})
# Contrast: female=1 vs female=0, holding age at 45
M.contrast(a={"female": 1, "age": 45}, b={"female": 0, "age": 45})
When to use this¶
Use values= for the common case where you want to tweak a few variables and leave the rest at observed or mean values. Use Expr when the counterfactual is naturally expressed as a formula (e.g. “income + 10%”). Use newdata= when you have a full data frame of profiles, especially out-of-sample ones.
When NOT to use this¶
⚠️ Trade-off: Do not mix
newdata=withat/atexog/values/over— they are mutually exclusive. Pass one complete frame OR use the DSL, not both.⚠️ Trade-off:
default_values="mean"reduces unspecified numeric columns to their mean, but factors stay as observed (or at their mode/reference depending onfactor_stat=). If you want a single representative individual, useat="mean"instead.
See also¶
How to compute counterfactual predictions with values, Expr, and newdata — additional examples including joint contrasts
How to compute elasticities —
values=inside elasticity callsReference: Expr — API reference for the
Exprwrapper