# Why `Expr` and the `values=` DSL

## The problem

Counterfactual profiling requires three distinct operations:

1. **Reduce** a column to a summary (mean, median, 25th percentile).
2. **Transform** every observation by a formula ("income + 10%").
3. **Replace** the data with an arbitrary frame.

A single API surface has to serve all three without ambiguity. In particular, bare strings are already overloaded: `"mean"` is a reducer, `"college"` is a categorical level, and `"income * 1.1"` is a formula. `smmargins` resolves this with a small, explicit type system.

## The design

### Reducers vs. scalars

Strings in `values=` are inspected at runtime:

- If the string matches a built-in reducer (`"mean"`, `"median"`, `"p25"`, …) or a percentile pattern (`p0`–`p100`), it is interpreted as a reducer.
- Otherwise it is treated as a scalar categorical level.

This is why `"p50"` reduces income to its median, while `"college"` sets education to that level.

### `Expr` as an explicit opt-in

Formula strings are **not** parsed automatically, because a formula like `"age + 5"` could be mistaken for a categorical level or a mistyped reducer. :class:`~smmargins.Expr` is an explicit wrapper that says "evaluate this via `df.eval`":

```python
M.predict(values={"income": Expr("income * 1.10")})
```

This avoids the ambiguity that would arise if bare strings were passed to `pandas.eval`, and it makes the user's intent visible at the call site.

### Callables for programmatic transforms

For transforms that are easier to write as Python functions than as strings, `values=` also accepts callables. The callable receives the original data frame and must return a Series or array of the same length. This is useful when the transform reuses intermediate computations or depends on external parameters:

```python
def add_tax(d):
    return d["income"] * 1.10 + 500

M.predict(values={"income": add_tax})
```

### `newdata=` as the escape hatch

The `values=` DSL always starts from the original fitting data. If you need out-of-sample profiles or a completely different population, `newdata=` replaces the entire frame. It is mutually exclusive with `values=` because the two concepts do not compose: there is no sensible way to apply a per-variable DSL on top of an arbitrary foreign frame.

## Tradeoffs

- **Explicit over implicit.** `Expr` requires an import and a wrapper. The payoff is that strings never silently change meaning when a new reducer is added.
- **Observed-values default.** `default_values="asobserved"` means that forgetting a variable leaves the full distribution intact. This is conservative but can produce wide confidence intervals when many variables vary. Switch to `default_values="mean"` for a tighter, MEM-style profile.
- **Cartesian product only.** Sequences in `values=` are expanded as a full cartesian product. There is no "meshgrid-only" mode; if you need a custom grid shape, build the frame yourself and use `newdata=`.

## What the implementation does

The resolution pipeline lives in `smmargins/_design.py`:

1. `_classify_value` inspects each `values=` entry and returns a `ValueSpec` (`scalar`, `sequence`, `reducer`, `callable`, or `expr`).
2. `expand_values` applies `default_values` to unspecified columns, then applies non-sequence specs, then takes the cartesian product over sequences.
3. The resulting frames are passed to `_Profile.materialize`, which rebuilds the design matrix via patsy (or raw exog lookup) for each profile.

This separation means the DSL is pure data manipulation: it knows nothing about predictions, derivatives, or scales.
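
To make the classify-then-expand idea concrete, here is a minimal sketch in plain pandas. It is not the code in `smmargins/_design.py`: the names `classify`, `resolve`, and `expand` are hypothetical, the `Expr` class is a stand-in for the real wrapper, and the reducer table covers only `"mean"`, `"median"`, and the `p0`–`p100` pattern. It also assumes that percentiles map to `Series.quantile`, that sequence elements are used as literal values, and that unspecified columns stay as observed.

```python
import itertools
import re
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for smmargins.Expr: marks a string as a formula."""
    text: str


_PERCENTILE = re.compile(r"p(\d{1,3})")
_NAMED_REDUCERS = {"mean", "median"}  # assumed subset of the built-in reducers


def classify(value):
    """Return the kind of a values= entry (labels mirror the ValueSpec kinds)."""
    if isinstance(value, Expr):
        return "expr"
    if callable(value):
        return "callable"
    if isinstance(value, str):
        m = _PERCENTILE.fullmatch(value)
        if value in _NAMED_REDUCERS or (m and int(m.group(1)) <= 100):
            return "reducer"
        return "scalar"  # e.g. a categorical level such as "college"
    if isinstance(value, (list, tuple)):
        return "sequence"
    return "scalar"


def resolve(df, column, value):
    """Turn one non-sequence entry into a scalar or a full-length Series."""
    kind = classify(value)
    if kind == "expr":
        return df.eval(value.text)  # formula evaluated against the original data
    if kind == "callable":
        return value(df)  # callable receives the original frame
    if kind == "reducer":
        if value in _NAMED_REDUCERS:
            return getattr(df[column], value)()
        return df[column].quantile(int(value[1:]) / 100)
    return value


def expand(df, values):
    """Apply non-sequence entries, then take the cartesian product of sequences."""
    base = df.copy()
    grid = {}
    for col, val in values.items():
        if classify(val) == "sequence":
            grid[col] = list(val)
        else:
            base[col] = resolve(df, col, val)
    frames = []
    for combo in itertools.product(*grid.values()):
        frame = base.copy()
        for col, level in zip(grid, combo):
            frame[col] = level  # scalar broadcast to every row
        frames.append(frame)
    return frames


df = pd.DataFrame(
    {"income": [100.0, 200.0, 300.0], "education": ["hs", "college", "hs"]}
)
# "p50" is a reducer; the list yields one profile frame per education level.
profiles = expand(df, {"income": "p50", "education": ["hs", "college"]})
```

With no sequences in `values`, `itertools.product()` yields a single empty combination, so `expand` returns exactly one frame. That matches the idea that a profile without a grid is just the transformed original data.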