
Add ContinuousDiD estimator (Callaway, Goodman-Bacon & Sant'Anna 2024)#177

Merged
igerber merged 17 commits into main from worktree-continuous-did
Feb 22, 2026

Conversation

@igerber
Owner

@igerber igerber commented Feb 21, 2026

Summary

  • Add ContinuousDiD estimator implementing Callaway, Goodman-Bacon & Sant'Anna (2024) "Difference-in-Differences with a Continuous Treatment" (NBER WP 32117)
  • B-spline dose-response curves for ATT(d) and ACRT(d) derivative estimation
  • Full staggered adoption support with (g,t) cell iteration, dose aggregation, and event-study aggregation
  • Multiplier bootstrap inference with analytical SE fallback
  • Extract shared bootstrap utilities from staggered_bootstrap.py into bootstrap_utils.py
  • Fix plt.show() blocking test suite by setting non-interactive matplotlib backend in conftest.py

New files

  • diff_diff/continuous_did.py — Main ContinuousDiD estimator class
  • diff_diff/continuous_did_results.py — ContinuousDiDResults and DoseResponseCurve dataclasses
  • diff_diff/continuous_did_bspline.py — B-spline basis construction, evaluation, derivatives
  • diff_diff/bootstrap_utils.py — Shared bootstrap weight generation, CI, p-value helpers
  • docs/methodology/continuous-did.md — Full methodology documentation
  • tests/test_continuous_did.py — 46 unit/integration tests
  • tests/test_methodology_continuous_did.py — 15 equation verification + 6 R benchmark tests

Methodology references (required if estimator / math changes)

  • Method name(s): Continuous Difference-in-Differences
  • Paper / source link(s): Callaway, Goodman-Bacon & Sant'Anna (2024), NBER WP 32117 — https://www.nber.org/papers/w32117
  • Reference implementation: R contdid v0.1.0 (https://bcallaway11.github.io/contdid/)
  • Any intentional deviations from the source (and why):
    • Use training boundary knots for evaluation grid (R's contdid v0.1.0 has a boundary knot inconsistency where splines2::bSpline(dvals) uses range(dvals) instead of range(dose))
    • Properly propagate control_group to overall ATT computation (R's contdid v0.1.0 always uses notyettreated internally regardless of user setting)

Validation

  • Tests added/updated:
    • tests/test_continuous_did.py — 46 tests covering init, data validation, fit, results, B-splines, dose grid, aggregation, bootstrap, edge cases
    • tests/test_methodology_continuous_did.py — 15 tests: hand-calculable equation verification + 6 R contdid benchmarks (ATT(d), ACRT(d), overall ATT/ACRT all within 1-2% relative tolerance)
    • tests/conftest.py — Added MPLBACKEND=Agg to prevent plt.show() blocking
  • Full test suite: 1485 passed, 64 skipped, 0 failures (excluding TROP)
  • Bootstrap refactor safe: all 95 CallawaySantAnna tests pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Implement Callaway, Goodman-Bacon & Sant'Anna (2024) continuous
treatment DiD estimator with B-spline dose-response curves, ACRT
derivatives, staggered adoption support, and multiplier bootstrap
inference. Validated against R contdid v0.1.0 across 6 benchmarks.

Also extract shared bootstrap utilities to bootstrap_utils.py and
fix plt.show() blocking in test suite via non-interactive backend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Overall assessment: ⛔ Blocker

Executive summary

  • P0: control_group="not_yet_treated" incorrectly includes the treated cohort itself in pre-period control sets, biasing pre-trend/event-study results and violating the Methodology Registry definition.
  • P1: New inline inference in DoseResponseCurve.to_dataframe() violates the required safe_inference() pattern.
  • P1: No validation that treated doses are strictly positive; negative/zero treated doses are silently excluded and can break spline setup or bias estimates.

Methodology

  • Severity: P0. Impact: For control_group="not_yet_treated", the control set includes the treated cohort itself when t < g because unit_cohorts > t is true for g. This contaminates the control mean, attenuates pre-trends, and violates the registry’s “not‑yet‑treated excludes the treated cohort” requirement. Location: diff_diff/continuous_did.py in _compute_dose_response_gt where control_mask = never_treated_mask | (unit_cohorts > t). Concrete fix: exclude the treated cohort explicitly, e.g. control_mask = never_treated_mask | ((unit_cohorts > t) & (unit_cohorts != g)), or build control mask as (unit_cohorts == 0) | (unit_cohorts > t) and then control_mask &= (unit_cohorts != g).
  • Severity: P1. Impact: The estimator assumes D_+ strictly positive (see Methodology Registry), but the code only drops units with zero dose; negative treated doses are treated as “untreated” in treated_mask = (unit_cohorts == g) & (dose_vector > 0), which can silently drop treated units and lead to empty dose grids or biased estimates. Location: diff_diff/continuous_did.py in fit() and _compute_dose_response_gt. Concrete fix: validate that all treated units have dose > 0 (or explicitly support negative doses by adjusting masks/grid construction), and raise a clear error if violated.
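
A minimal NumPy sketch of the masking fix (hypothetical cohort data; `unit_cohorts`, `never_treated_mask`, `g`, and `t` mirror the names cited above):

```python
import numpy as np

# Hypothetical panel: cohort 0 = never treated; otherwise first treatment period.
unit_cohorts = np.array([0, 0, 2, 2, 3, 4])
g, t = 3, 2  # evaluating cohort g at a pre-period t < g

never_treated_mask = unit_cohorts == 0

# Buggy mask: cohort g itself satisfies unit_cohorts > t whenever t < g,
# so the treated cohort leaks into its own control set.
buggy_control = never_treated_mask | (unit_cohorts > t)

# Fixed mask: not-yet-treated controls must exclude cohort g explicitly.
control_mask = never_treated_mask | ((unit_cohorts > t) & (unit_cohorts != g))

print(buggy_control)  # cohort-3 units incorrectly flagged as controls
print(control_mask)   # cohort-3 units excluded
```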

Code Quality

  • Severity: P1. Impact: Inline inference in DoseResponseCurve.to_dataframe() violates the “use safe_inference()” rule and risks inconsistent NaN propagation or distributional assumptions. Location: diff_diff/continuous_did_results.py in DoseResponseCurve.to_dataframe (manual t_stat/p_value). Concrete fix: compute t_stat, p_value, and CI via safe_inference() in a vectorized loop or helper to ensure consistent NaN handling.

Performance

  • No issues found in the diff.

Maintainability

  • No issues found in the diff.

Tech Debt

  • No issues found in the diff.

Security

  • No issues found in the diff.

Documentation/Tests

  • Severity: P2. Impact: The not‑yet‑treated control bug is not covered; current tests only assert that the estimator runs. Without a pre‑period validation test, regressions to control‑group logic won’t be caught. Location: tests/test_continuous_did.py. Concrete fix: add a test that checks pre‑period event‑study effects with control_group="not_yet_treated" and asserts the control group excludes the treated cohort (e.g., by constructing a small panel where contamination would produce a known bias).

Suggested next steps

  1. Fix the control group logic for not_yet_treated in _compute_dose_response_gt and add a targeted test.
  2. Replace inline inference in DoseResponseCurve.to_dataframe() with safe_inference() and add a small test for NaN propagation.
  3. Add explicit validation for treated doses (dose > 0) and a test for negative/zero dose handling.

…lidation

- Fix not_yet_treated control group to exclude cohort g from its own control
  set (matches staggered.py behavior)
- Replace inline t_stat/p_value computation in DoseResponseCurve.to_dataframe()
  with safe_inference() loop per project convention
- Add validation rejecting negative doses among treated units
- Fix test_inf_first_treat_normalization CI failure (cast to float before inf)
- Add test for not_yet_treated control group correctness and negative dose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b9e97f057b84657c0ba9ae6fbdfb87967f1a3eec


Overall assessment: ⛔ Blocker

Executive summary

  • Affected methods: new ContinuousDiD estimator (Callaway, Goodman-Bacon & Sant’Anna 2024) and shared bootstrap utilities used by CallawaySantAnna (refactor only).
  • P0 Analytical SEs from influence functions appear mis-scaled, likely understating SEs and overstating significance in the default n_bootstrap=0 path.
  • P1 Registry-required discrete-dose detection/warning is missing; discrete treatments are silently fit with splines.
  • P1 Event-study aggregation ignores anticipation in relative-time indexing; post-periods can be labeled as pre.
  • P1 No guard for empty post-treatment cells; aggregates default to 0.0 rather than error/NaN.

Methodology

  • P0 diff_diff/continuous_did.py:L747-L788 Impact: Influence-function SEs are scaled by / n_t and / n_c and then sqrt(mean(IF^2)), which underestimates SEs by roughly sqrt(n) (default inference is too optimistic). Concrete fix: define IFs without the 1/n_t/1/n_c scaling and compute se = sqrt(mean(IF^2) / n_units), or keep current IFs but multiply by sqrt(n_units) at the end; validate against bootstrap for parity.
  • P1 diff_diff/continuous_did.py:L170-L210 Impact: Registry says discrete dose should be detected and warned; current validation never checks for integer-valued or low-unique dose. Users may get misleading smooth curves without warning. Concrete fix: add a discrete-dose detection check on treated doses (e.g., all-close-to-integers or small unique count) and warn/raise with guidance that saturated regression is not yet implemented.
  • P1 diff_diff/continuous_did.py:L670-L690 Impact: Event-study relative time uses e = t - g even when anticipation > 0; this mislabels post-treatment periods as pre-treatment in the event-study output. Concrete fix: set e = t - (g - self.anticipation) or adjust the event-study aggregation to shift by anticipation consistently with the post-treatment filter.
  • P1 diff_diff/continuous_did.py:L269-L307 Impact: If post_gt is empty (e.g., all treatments occur after the last observed period), overall_att/overall_acrt stay at 0.0 with NaN inference, which is misleading rather than a hard failure. Concrete fix: after building post_gt, error or return NaNs with a clear warning when no post-treatment cells exist.
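
The P0 scaling convention can be sanity-checked in isolation on the simplest case, the sample mean, whose influence function is x_i - mean(x) (synthetic data only, not the estimator's actual IFs):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)
n = x.size

# Influence function of the sample mean, *without* any 1/n pre-scaling.
infl = x - x.mean()

# Correct convention: se = sqrt( mean(IF^2) / n ), matching the textbook
# standard error of the mean.
se_correct = np.sqrt(np.mean(infl**2) / n)

# Mis-scaled variant flagged above: pre-scaling the IFs by 1/n and then
# taking sqrt(mean(IF^2)) understates the SE by a factor of about sqrt(n).
se_misscaled = np.sqrt(np.mean((infl / n) ** 2))

se_textbook = x.std(ddof=0) / np.sqrt(n)
print(se_correct, se_textbook, se_misscaled)
```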

Code Quality

  • P2 diff_diff/continuous_did.py:L520-L551 Impact: Invalid control_group values silently fall into the not_yet_treated branch, which can mis-specify the estimator without any error. Concrete fix: validate control_group and base_period against allowed values in __init__ or fit and raise ValueError on unknown inputs.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2 tests/test_continuous_did.py Impact: No tests covering anticipation with event-study aggregation or empty post_gt handling; discrete-dose warning not tested. Concrete fix: add tests for anticipation>0 event-study labels, no-post-treatment error/NaN behavior, and discrete-dose warning path.

Tests not run (not requested).

- Fix analytical SE: use sqrt(sum(IF^2)) instead of sqrt(mean(IF^2)) to
  match CallawaySantAnna's influence function convention (P0)
- Add discrete dose detection warning for integer-valued treatments (P1)
- Guard empty post-treatment cells: warn and return NaN instead of 0.0 (P1)
- Validate control_group and base_period params in __init__ and set_params (P2)
- Add 7 new tests: SE parity, discrete dose, anticipation event study,
  empty post_gt, and parameter validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c3980b413a2aa20ecf5747d17bcb3ee091e4bcd


Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: ContinuousDiD allows estimation without any D=0 units under control_group="not_yet_treated", but the Methodology Registry requires P(D=0)>0 (Remark 3.1 not implemented).
  • P1: Bootstrap inference is used only for SEs; overall/event‑study p‑values and CIs are normal‑approx, and dose‑response p‑values are normal while CIs are bootstrap percentiles, which is inconsistent with the Registry’s “multiplier bootstrap for inference.”
  • P2: Event‑study inference is left as NaN when n_bootstrap=0, despite “analytical SE fallback” being advertised.
  • Otherwise the core B‑spline construction, (g,t) iteration, and weighting logic look aligned with the new methodology docs.

Methodology

  • Severity: P1 | Impact: Violates the Registry’s requirement of an untreated group (P(D=0)>0), so estimates can be produced in unidentified settings without warning when control_group="not_yet_treated". | Concrete fix: Enforce n_control > 0 regardless of control_group, or implement the “lowest‑dose‑as‑control” Remark 3.1 logic and document it. | diff_diff/continuous_did.py:L262-L267
  • Severity: P1 | Impact: When n_bootstrap>0, overall ATT/ACRT and event‑study p‑values/CIs are computed via normal approximation (safe_inference), while dose‑response CIs are percentile bootstrap; this is inconsistent with the Registry’s stated “multiplier bootstrap for inference” and with existing CS/SA bootstrap behavior. | Concrete fix: Use compute_effect_bootstrap_stats outputs for overall and event‑study (and optionally per‑dose) CIs/p‑values; if you want bootstrap‑based ACRT intervals later, add acrt_glob into bootstrap_info and shift boot_acrt_glob by the original effect before computing percentiles. | diff_diff/continuous_did.py:L378-L406, diff_diff/continuous_did.py:L1012-L1032, diff_diff/continuous_did.py:L688-L705, diff_diff/continuous_did.py:L954-L959, diff_diff/continuous_did_results.py:L47-L66
  • Severity: P2 | Impact: aggregate="eventstudy" with n_bootstrap=0 yields NaN SE/t/p/CI for event‑study effects, which undercuts the “analytical SE fallback” claim. | Concrete fix: Add an analytical event‑study SE path (re‑use influence functions + event‑time weights), or explicitly document that event‑study inference requires bootstrap. | diff_diff/continuous_did.py:L328-L406
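
A sketch of the suggested direction, with a hypothetical `percentile_bootstrap_stats` standing in for the output handling of `compute_effect_bootstrap_stats`:

```python
import numpy as np

def percentile_bootstrap_stats(estimate, boot_dist, alpha=0.05):
    """Percentile CI and a symmetric bootstrap p-value for one effect.

    Hypothetical helper: boot_dist holds bootstrap replicates of the effect,
    already centered at the point estimate.
    """
    boot = np.asarray(boot_dist, dtype=float)
    boot = boot[np.isfinite(boot)]
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # p-value: share of centered replicates at least as extreme as the estimate.
    centered = boot - estimate
    p = np.mean(np.abs(centered) >= abs(estimate))
    return (lo, hi), p

rng = np.random.default_rng(1)
est = 0.8
boot = est + rng.normal(scale=0.1, size=5_000)
(lo, hi), p = percentile_bootstrap_stats(est, boot)
print(lo, hi, p)
```

Using the same percentile machinery for overall, event-study, and per-dose effects keeps the inference regime consistent whenever n_bootstrap > 0.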

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

…l event-study SEs

- Use percentile CI/p-value from bootstrap (not normal approx) for overall
  ATT/ACRT and event-study effects, matching CallawaySantAnna convention
- Add P(D=0)>0 warning when control_group='not_yet_treated' has no
  never-treated units (Remark 3.1 in Callaway et al.)
- Compute IF-based analytical SEs for event-study bins when n_bootstrap=0
  (previously yielded NaN)
- Add tests for all three fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 21, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 40d22d4530fb45aed6315b92a27d7a33f7d27e93


Overall assessment: ⚠️ Needs changes

Executive Summary

  • ContinuousDiD is well-scaffolded (estimator, B-splines, results, tests, registry), and bootstrap helpers were cleanly extracted.
  • P1 methodology gap: estimation proceeds without any D=0 units under control_group="not_yet_treated", but the registry says Remark 3.1 is not implemented and P(D=0)>0 is required.
  • P1 methodology gap: balanced-panel check only compares per-unit counts, not identical time sets, so some unbalanced panels slip through.
  • P3 test reliability: R benchmark tests write fixed /tmp/r_bench*.csv paths, which can collide under parallel pytest.

Methodology

  • P1 — Impact: Allows estimation when P(D=0)=0 despite registry stating lowest‑dose‑as‑control is not implemented and P(D=0)>0 is required, so ATT/ACRT are not identified. | Fix: raise a ValueError when control_group="not_yet_treated" and n_control==0, or implement Remark 3.1 and document it. diff_diff/continuous_did.py:L269-L277 and docs/methodology/REGISTRY.md:L425-L429
  • P1 — Impact: “Balanced panel required” is enforced only by equal counts per unit; panels with different missing periods can pass and propagate NaNs into ΔY and OLS, violating the registry requirement. | Fix: verify each unit has the same set of time values as the global set (e.g., compare sets or pivot and check for any NaNs) and raise on mismatch. diff_diff/continuous_did.py:L247-L252 and docs/methodology/REGISTRY.md:L429-L429
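
A sketch of the stronger set-based check (hypothetical `assert_balanced` helper; column names are placeholders):

```python
import pandas as pd

def assert_balanced(df, unit_col="unit", time_col="time"):
    """Raise if any unit's set of periods differs from the global period set.

    A count-only check misses units that have the *same number* of rows but
    cover different periods; comparing sets catches that case.
    """
    all_periods = frozenset(df[time_col].unique())
    bad = df.groupby(unit_col)[time_col].agg(
        lambda s: frozenset(s) != all_periods
    )
    if bad.any():
        raise ValueError(f"Unbalanced panel: units {list(bad[bad].index)}")

# Two units with 2 rows each, but covering different periods:
df = pd.DataFrame({"unit": [1, 1, 2, 2], "time": [1, 2, 1, 3]})
try:
    assert_balanced(df)
except ValueError as e:
    print(e)  # a per-unit count check would pass; the set check fails
```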

Code Quality

  • P3 — Impact: Invalid aggregate values silently fall back to dose aggregation, making typos hard to detect. | Fix: validate aggregate in {None, "dose", "eventstudy"} at the start of fit() and raise on invalid input. diff_diff/continuous_did.py:L338-L343

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 — Impact: Fixed /tmp/r_bench*.csv paths can collide in parallel test runs or leave stale files, causing flaky R benchmarks. | Fix: use tempfile.NamedTemporaryFile (or pass a temp path into the R script) and read back that same path in Python. tests/test_methodology_continuous_did.py:L477-L493, tests/test_methodology_continuous_did.py:L551-L568, tests/test_methodology_continuous_did.py:L621-L636
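
A sketch of the `tempfile`-based alternative (benchmark rows invented for illustration):

```python
import csv
import tempfile
from pathlib import Path

# Instead of a fixed /tmp/r_bench.csv (which can collide under parallel
# pytest), give each test run its own unique file.
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.writer(f)
    writer.writerow(["dose", "att"])
    writer.writerow([0.5, 0.12])
    bench_path = Path(f.name)

# The same unique path would be passed to the R script, then read back here.
with bench_path.open() as fh:
    rows = list(csv.DictReader(fh))
bench_path.unlink()  # clean up; no stale files between runs
print(rows)
```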

- Upgrade P(D=0)=0 warning to ValueError for not_yet_treated (P1)
- Strengthen balanced-panel check to verify identical time sets (P1)
- Add aggregate parameter validation at fit() entry (P3)
- Replace hardcoded /tmp paths with tempfile in R benchmarks (P3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1e01fd45b73f49bd7b04939b293c7a6f86b4ae72


Overall assessment: ⚠️ Needs changes

  • P1: Bootstrap ACRT^{glob} distribution omits the baseline term, so percentile CI/p-values for ACRT^{glob} are centered at 0 instead of the estimate when n_bootstrap > 0.
  • Methodology registry alignment looks solid for ATT(d)/ACRT(d) construction, control-group requirements, and edge-case handling.
  • Refactor to shared bootstrap utilities appears consistent with prior behavior; no NaN-guard regressions observed.
  • Docs/tests coverage is strong for the new estimator and edge cases.

Methodology

  • P1: Bootstrap ACRT^{glob} uses only the perturbation term (delta_beta @ dpsi_mean) and never adds the baseline acrt_glob, because _bootstrap_info doesn’t include acrt_glob. This yields a bootstrap distribution centered at 0, making percentile CI and p-values for ACRT^{glob} incorrect whenever bootstrap is enabled. Impact: misleading inference for ACRT^{glob} (CI/p-values wrong; SEs unaffected but still paired with wrong CI/p). Concrete fix: either store acrt_glob in _bootstrap_info and add it when building boot_acrt_glob, or compute acrt_glob_b directly as beta_b @ dpsi_mean and aggregate without adding a separate baseline. diff_diff/continuous_did.py:L741-L788, diff_diff/continuous_did.py:L987-L1042
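
The centering problem is easy to reproduce with synthetic draws standing in for the `delta_beta @ dpsi_mean` perturbations:

```python
import numpy as np

rng = np.random.default_rng(2)
acrt_glob = 0.4  # point estimate (the omitted baseline term)
perturbations = rng.normal(scale=0.05, size=2_000)  # stand-in bootstrap draws

# Buggy distribution: perturbation term only -> centered at 0.
boot_buggy = perturbations

# Fixed distribution: baseline + perturbation -> centered at the estimate.
boot_fixed = acrt_glob + perturbations

lo, hi = np.percentile(boot_fixed, [2.5, 97.5])
print(lo, hi)  # percentile CI now brackets the estimate, not zero
```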

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No findings.

Store acrt_glob in _bootstrap_info so the bootstrap distribution is
centered at the point estimate instead of 0.  Add test that verifies
bootstrap ACRT CI brackets the estimate rather than zero.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0f2849bf7b0091802a6c1053c6cd2ed56564cbc1


Overall assessment: ⚠️ Needs changes

Executive Summary

  • ContinuousDiD aligns with the registry on core estimation flow and the test coverage is thorough.
  • P1: B‑spline evaluation uses training boundary knots, but the methodology registry claims a strict match to R contdid; this deviation is not documented.
  • P1: Several __init__ parameters are not persisted into results objects, violating the project rule and losing provenance.
  • P3: Per‑cell OLS computes vcov that is never used.

Methodology

  • P1 | Impact: ATT(d)/ACRT(d) near tails will differ from the stated R reference because evaluation is clamped to training boundary knots, while the registry claims “matching R’s splines2::bSpline”; this is an undocumented deviation from the reference implementation. | Fix: Document the boundary‑knot choice in the methodology docs and registry (or add a switch to reproduce R’s dvals‑boundary behavior for benchmarking). | Location: diff_diff/continuous_did_bspline.py:L89-L100, docs/methodology/REGISTRY.md:L433-L436, docs/methodology/continuous-did.md:L357-L364

Code Quality

  • P1 | Impact: Results omit base_period, anticipation, n_bootstrap, bootstrap_weights, seed, rank_deficient_action, so outputs are not fully reproducible and violate the “new init param” rule. | Fix: Add these fields to ContinuousDiDResults and pass them in ContinuousDiD.fit (optionally include in summary). | Location: diff_diff/continuous_did.py:L84-L109, diff_diff/continuous_did.py:L547-L571, diff_diff/continuous_did_results.py:L91-L114

Performance

  • P3 | Impact: solve_ols(..., return_vcov=True) computes a covariance matrix per (g,t) cell that is never used, adding avoidable cost. | Fix: Drop return_vcov=True (or use vcov if intended). | Location: diff_diff/continuous_did.py:L727-L731

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the Methodology documentation gap noted above.

…e, drop unused vcov

- Document B-spline boundary knot deviation in REGISTRY.md and continuous-did.md
- Add base_period, anticipation, n_bootstrap, bootstrap_weights, seed,
  rank_deficient_action fields to ContinuousDiDResults with passthrough from fit()
- Switch per-cell OLS to return_vcov=False to skip unused covariance computation
- Add test_results_contain_init_params verifying param roundtrip

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f737ab6b99eb4ad6fbc2a001586bc59ec87c465a


Overall assessment: ⚠️ Needs changes

Executive Summary

  • The B‑spline knots are built once from all treated doses, but the Methodology Registry requires per‑cell (g,t) training‑dose boundary knots; this can induce extrapolation and change ATT/ACRT when dose supports differ by cohort.
  • Dose‑response aggregation weights are group‑averaged across post periods, while the Registry specifies n_treated weights per (g,t) cell; this changes the estimand and needs to be reconciled or documented.
  • Minor doc mismatch: ContinuousDiDResults docstring omits the supported "webb" bootstrap weight type.

Methodology

  • Severity: P1 | Impact: Violates the Registry requirement that boundary knots use the training‑dose range per (g,t) cell; using global knots can extrapolate outside cell‑specific supports and distort ATT(d)/ACRT(d), especially when cohorts have different dose ranges. | Fix: Build knots inside _compute_dose_response_gt from treated_doses for that cell (or cache per‑group), and store cell‑specific knots, Psi_eval, and dPsi_eval in _bootstrap_info. | Location: diff_diff/continuous_did.py:L293-L313 and _compute_dose_response_gt uses knots passed in, diff_diff/continuous_did.py:L593-L736; Registry requirement at docs/methodology/REGISTRY.md:L430-L436.
  • Severity: P1 | Impact: Aggregation weights for dose‑response/overall ATT/ACRT are group‑averaged across post periods (pg / n_post_cells), which conflicts with the Registry’s “n_treated weights” requirement and changes the estimand relative to documented methodology. | Fix: If n_treated weighting is intended, set cell_weights[(g,t)] = r["n_treated"] / total_treated for each post cell (no division by number of post cells). If group‑averaged weighting is intentional to match contdid, update the Registry/continuous‑did docs and tests to reflect that explicitly. | Location: diff_diff/continuous_did.py:L367-L395, Registry at docs/methodology/REGISTRY.md:L434-L436.
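
The n_treated weighting described in the fix reduces to a one-liner (hypothetical cell counts; `post_gt` and `n_treated` mirror the names above):

```python
# Hypothetical post-treatment (g, t) cells with their treated-unit counts.
post_gt = {
    (2, 2): {"n_treated": 30},
    (2, 3): {"n_treated": 30},
    (3, 3): {"n_treated": 10},
}

total_treated = sum(r["n_treated"] for r in post_gt.values())

# Registry-style weighting: each post cell weighted by its treated count,
# with no extra division by the number of post cells per group.
cell_weights = {
    gt: r["n_treated"] / total_treated for gt, r in post_gt.items()
}
print(cell_weights)
```

Under group-averaged weighting, cohort 3's single cell would instead get the same weight as cohort 2's two cells combined, which is why the two schemes imply different estimands.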

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3 | Impact: Results docstring omits "webb" as a supported bootstrap weight type, which can mislead users and reviewers. | Fix: Update the docstring to include "webb". | Location: diff_diff/continuous_did_results.py:L95-L100.

…weights in docs, add "webb" to docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ce8240e94719555f0e30ec9f54bcd86ea19daf4b


Overall assessment: ⛔ Blocker

Executive summary:

  • P0: Bootstrap stats helper can return se=NaN while still emitting finite CI/p-value for tiny valid bootstrap counts, violating NaN‑propagation guidance and producing misleading inference. Affects ContinuousDiD and CallawaySantAnna. diff_diff/bootstrap_utils.py:L237-L265
  • Methodology alignment otherwise looks solid against the registry: balanced panel enforcement, P(D=0)>0 requirement for not_yet_treated, global knots, and group‑proportional aggregation are all implemented.
  • Add a regression test for the NaN‑propagation edge case (e.g., boot_dist length 1 or all‑non‑finite) once the guard is fixed.

Methodology

  • Severity: P0 | Impact: Partial NaN guard in bootstrap inference can yield se=NaN with finite CI/p‑value when n_valid is 1 (or when se is non‑finite). This violates the registry’s “all inference fields NaN when SE invalid” rule and can mislead users. Affects both ContinuousDiD and CallawaySantAnna since they share this helper. | Fix: In compute_effect_bootstrap_stats, if n_valid < 2 or se is not finite/<=0, return (np.nan, (np.nan, np.nan), np.nan); alternatively gate CI/p‑value on np.isfinite(se) and se > 0. Add a unit test covering the n_valid=1 case. diff_diff/bootstrap_utils.py:L237-L265

Code Quality
No issues found.

Performance
No issues found.

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests
No issues found.

When n_valid=1 (ddof=1 → NaN) or all-identical samples (SE=0),
compute_effect_bootstrap_stats now returns NaN for all inference
fields instead of mixing finite CI/p-value with NaN SE. Adds
regression tests and updates REGISTRY.md edge-case documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 98082803a6039e5328ddd61a1296dce0ad34e458


Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: Event‑study aggregation is only tested with default control_group and base_period, leaving key parameter interactions untested.
  • P2: DoseResponseCurve.to_dataframe() uses normal‑approx t‑stats/p‑values even when bootstrap inference is used, which mixes inference regimes.
  • The core ContinuousDiD implementation otherwise appears aligned with the Methodology Registry’s high‑level requirements (global knots, group‑proportional aggregation, D=0 enforcement).

Methodology

  • P2 Mixed inference regime for dose‑response outputs. to_dataframe() always computes t‑stat/p‑value via safe_inference (normal approximation), while bootstrap mode uses percentile CIs and already computes bootstrap p‑values but discards them for ATT(d)/ACRT(d). Impact: users can see bootstrap CIs alongside normal‑approx p‑values, which is internally inconsistent. Fix: store and expose bootstrap p‑values per grid point (keep p from compute_effect_bootstrap_stats) and switch to_dataframe() to use them when n_bootstrap>0, or omit t‑stat/p‑value for bootstrap runs.
    refs: diff_diff/continuous_did.py:L1078-L1093, diff_diff/continuous_did_results.py:L47-L66

Documentation/Tests

  • P1 Missing parameter‑interaction coverage for event‑study aggregation. Current event‑study tests only exercise defaults (no control_group="not_yet_treated" or base_period="universal"), and the standalone tests for those parameters don’t use aggregate="eventstudy". Impact: untested code paths in a core aggregation mode; issues would slip through. Fix: add event‑study tests that set control_group="not_yet_treated" and base_period="universal" (both analytic and bootstrap paths), and assert relative‑period coverage plus finite inference outputs.
    refs: tests/test_continuous_did.py:L300-L326, tests/test_continuous_did.py:L418-L446

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Tests not run in this review environment.

…ter tests

Address PR #177 review round 9:
- P2: DoseResponseCurve now stores bootstrap p-values per grid point and uses
  them in to_dataframe() instead of recomputing normal-approx t-stats/p-values,
  eliminating mixed inference regimes (bootstrap CIs + normal p-values).
- P1: Add event-study tests for control_group="not_yet_treated",
  base_period="universal", and not_yet_treated with bootstrap inference.
- Add test verifying bootstrap dose-response p-value consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b9bd264ef802107ac6169ab601c7a9d6afb1216c


Overall assessment: ⚠️ Needs changes

Executive Summary

  • Methods affected: ContinuousDiD (new estimator) and shared bootstrap utilities used by CallawaySantAnna.
  • Methodology largely matches the Registry (global knots, group‑proportional weights, bootstrap NaN guards), but one edge case deviates.
  • P1: Rank‑deficient spline cases (all‑same or highly discrete doses) can propagate NaNs into ATT(d)/ACRT(d), conflicting with the documented “ACRT(d)=0” behavior.
  • Tests don’t currently assert dose‑response behavior in the all‑same‑dose edge case, so the regression would slip.

Methodology

  • P1 | Impact: In rank‑deficient spline cases (e.g., all treated doses identical), solve_ols will return NaN coefficients for dropped columns and att_d/acrt_d are computed via Psi_eval @ beta_hat, which will propagate NaNs. This contradicts the Registry’s stated edge‑case behavior (“ACRT(d) = 0 everywhere”) and R‑style prediction behavior where fitted values remain defined. | Fix: After solve_ols, compute predictions using only identified columns (or replace dropped‑column coefficients with 0 for prediction), or short‑circuit all‑same dose cells by setting att_d to the constant mean effect and acrt_d to 0; keep bootstrap_info consistent with that path. | Locations: diff_diff/continuous_did.py:L722-L753, docs/methodology/REGISTRY.md:L423-L428
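
A toy illustration of the NaN propagation and the suggested prediction fix (`beta_hat` and `Psi_eval` values are invented):

```python
import numpy as np

# Rank-deficient design: with all treated doses identical, spline columns
# beyond the intercept are collinear and a pivoted solver returns NaN for
# the dropped coefficients.
beta_hat = np.array([0.5, np.nan, np.nan])  # hypothetical dropped columns
Psi_eval = np.array([[1.0, 0.2, 0.1],
                    [1.0, 0.4, 0.3]])       # basis evaluated on the dose grid

# Naive prediction propagates NaN into ATT(d):
att_naive = Psi_eval @ beta_hat

# R-style prediction: treat dropped coefficients as 0 before predicting.
beta_pred = np.where(np.isfinite(beta_hat), beta_hat, 0.0)
att_d = Psi_eval @ beta_pred
print(att_naive)  # NaN everywhere
print(att_d)      # constant ATT(d); the implied ACRT(d) is 0
```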

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P2 | Impact: The all‑same‑dose test only checks overall_att, so NaN dose‑response outputs would not be caught. | Fix: Extend the edge‑case test to assert dose_response_att.effects is constant and dose_response_acrt.effects are zeros (or explicitly document/expect NaN if that is intended). | Location: tests/test_methodology_continuous_did.py:L159-L178

Zero out NaN coefficients from dropped rank-deficient columns before
prediction (ATT(d), ACRT(d), acrt_glob), producing correct constant
ATT and zero ACRT when all treated doses are identical. Original
NaN-bearing beta_hat preserved in bootstrap_info for diagnostics.

Extend test_all_same_dose to assert dose-response behavior: ATT(d) =
overall_att everywhere and ACRT(d) = 0 everywhere.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 55b0d2d948dded4cf43eb5fc8aa6200fd5b47690


Overall assessment: ⚠️ Needs changes

Executive summary:

  • ContinuousDiD largely follows the Methodology Registry and the new methodology doc (global knots, boundary clamping, P(D=0)>0 guard, not‑yet‑treated control exclusion, anticipation/base_period handling).
  • P1: Bootstrap inference in ContinuousDiD can collapse to all‑NaN when rank‑deficient columns are dropped, diverging from the point‑estimate path and R’s NA‑coefficient prediction behavior.
  • Minor cleanup: unused parameter in _aggregate_event_study.
  • Tests added are extensive and cover many edge cases; I did not run the suite.

Methodology

  • P1 — Bootstrap path uses beta_hat with NaNs (dropped columns) without the same NA→0 handling used for point estimates, so any rank‑deficient cell yields NaN bootstrap distributions/SE/CI/p even though estimates exist. Impact: inference becomes invalid or silently NaN for rank‑deficient cells, inconsistent with registry notes and R lm prediction behavior. Fix: store beta_pred (NaNs→0) in _bootstrap_info and use that in _run_bootstrap when constructing beta_b (or replace NaNs in beta_hat before bootstrap). Also apply the same adjustment for ACRT_glob perturbations. Location: diff_diff/continuous_did.py in _compute_dose_response_gt and _run_bootstrap.

Code Quality

  • P3 — Unused parameter treatment_groups in _aggregate_event_study. Impact: dead parameter increases cognitive load. Fix: remove the parameter or use it. Location: diff_diff/continuous_did.py _aggregate_event_study.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found. Tests not run (not requested).

…sed param

Use beta_pred (NaN→0 for dropped columns) instead of beta_hat in bootstrap
path to prevent NaN from propagating into ATT(d)/ACRT(d) distributions.
Remove unused treatment_groups parameter from _aggregate_event_study.
Add bootstrap regression test to test_all_same_dose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a8c1c2ecd7c6be496b553105e16a4a33b213f630


Overall assessment: ⚠️ Needs changes

Executive Summary

  • Methods affected: ContinuousDiD (new estimator) and CallawaySantAnna bootstrap utilities refactor; reviewed against Methodology Registry and the new ContinuousDiD methodology doc.
  • P1 methodology mismatch: SunAbraham bootstrap inference path not updated to the new NaN-gating requirement in the registry.
  • Tests and docs for ContinuousDiD are extensive; no other blocking issues found.

Methodology

  • P1 — SunAbraham bootstrap inference currently computes SE/CI/p directly from the bootstrap distribution without the registry-mandated NaN gating when SE is non-finite/zero or when <50% of samples are valid. Impact: degenerate bootstrap draws can yield finite CIs/p-values, contradicting the Methodology Registry and risking misleading inference. Fix: reuse compute_effect_bootstrap_stats (or equivalent) for both event-study and overall bootstrap stats, including non-finite filtering and SE<=0 guard; propagate NaN outputs accordingly. Locations: diff_diff/sun_abraham.py:L1107-L1128, requirement in docs/methodology/REGISTRY.md:L488-L492. (Helper: diff_diff/bootstrap_utils.py:L205-L260.)
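The NaN-gating contract attributed to `compute_effect_bootstrap_stats` can be sketched as below; the function name, signature, and exact thresholds are assumptions based on this review's description, not the helper's real code.

```python
import math
import numpy as np

def bootstrap_stats(effect, boot_dist, alpha=0.05):
    """Return (se, (ci_low, ci_high), p_value), all NaN when inference is
    undefined: non-finite point estimate, <50% finite draws, or SE <= 0."""
    nan_out = (np.nan, (np.nan, np.nan), np.nan)
    if not np.isfinite(effect):
        return nan_out                       # NaN/Inf estimate gates everything
    draws = np.asarray(boot_dist, dtype=float)
    valid = draws[np.isfinite(draws)]
    if draws.size == 0 or valid.size < 0.5 * draws.size:
        return nan_out                       # too few valid bootstrap samples
    se = float(valid.std(ddof=1))
    if not np.isfinite(se) or se <= 0:
        return nan_out                       # degenerate (zero-variance) draws
    ci = tuple(np.quantile(valid, [alpha / 2, 1 - alpha / 2]))
    p = math.erfc(abs(effect / se) / math.sqrt(2))  # two-sided normal p-value
    return se, ci, p

rng = np.random.default_rng(0)
print(bootstrap_stats(1.0, rng.normal(1.0, 0.2, 999)))   # finite inference
print(bootstrap_stats(1.0, np.ones(999)))                # SE=0 -> all NaN
```

The all-or-nothing gate means callers never see a finite CI paired with a NaN SE, which is the inconsistency this finding is about.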

Code Quality
No issues found.

Performance
No issues found.

Maintainability
No issues found.

Tech Debt
No issues found.

Security
No issues found.

Documentation/Tests

  • P3 — Tests not run (review only). Impact: CI status unknown. Fix: run test suite or at least ContinuousDiD and SunAbraham-related tests.

Use compute_effect_bootstrap_stats from bootstrap_utils for both
event-study and overall ATT bootstrap paths, adding non-finite
filtering, SE<=0 guard, and <50% valid sample check. Remove unused
_compute_percentile_ci and _compute_bootstrap_pvalue methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6e570a73ec3d73124f252ebb173fca7e8183be64


Overall assessment: ⚠️ Needs changes

Executive Summary

  • ContinuousDiD implementation tracks the Methodology Registry and the new continuous-DiD doc well (global knots, balanced panel enforcement, control-group logic, aggregation, bootstrap/analytic inference).
  • Bootstrap utilities consolidate NaN-gating and are wired into CallawaySantAnna and SunAbraham, with tests and registry updates aligning to the new behavior.
  • Needs changes: bootstrap NaN‑gating is still missing in ImputationDiD and TwoStageDiD despite the new shared helper and updated methodology standard, leaving misleading inference possible in those estimators.

Methodology

  • Severity: P1. Impact: ImputationDiD and TwoStageDiD bootstrap paths still compute p‑values/CI without non‑finite filtering or SE<=0 gating, which conflicts with the updated registry’s “all‑NaN when SE is non‑finite/zero” standard and can mislead inference in edge cases. Concrete fix: migrate these estimators to compute_effect_bootstrap_stats (or replicate its non‑finite filtering + all‑or‑nothing NaN gate) and remove the local _compute_percentile_ci/_compute_bootstrap_pvalue usage. References: diff_diff/imputation_bootstrap.py:L58-L97, diff_diff/two_stage_bootstrap.py:L416-L438, TODO.md:L42-L48.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

If you want, I can also point out the minimal diff needed to update the two remaining bootstrap modules.

Add noise to outcomes in bootstrap regression test so residuals are
non-zero across platforms. Identical outcomes produce zero residuals,
giving zero-variance bootstrap distributions and SE=0 → NaN on macOS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

Replace random noise approach with naturally heterogeneous outcomes
(different y_pre per unit) while keeping identical dose. This exercises
the rank-deficient bootstrap path with real sampling variance rather
than papering over zero-residual data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fddb6c34bb96ab7e32c514511dd22074fbf3c407


Overall assessment: ⚠️ Needs changes

Executive summary

  • Adds ContinuousDiD with B‑spline dose‑response, plus new methodology doc/registry entries and extensive tests.
  • Centralizes bootstrap stats in bootstrap_utils and wires it into Sun‑Abraham/Callaway‑Sant’Anna.
  • P1: compute_effect_bootstrap_stats does not guard non‑finite original_effect; Sun‑Abraham now calls it directly, so NaN point estimates can yield finite SE/CI/p‑values.
  • P1: The NaN‑gating bootstrap fix is only partially applied; ImputationDiD and TwoStageDiD still use manual inference without the new non‑finite/SE<=0 safeguards.

Methodology

  • Severity: P1 — Impact: Bootstrap inference can return finite SE/CI/p‑values when the point estimate is NaN/Inf, which violates the registry’s “NaN propagation for undefined estimates” intent and can mislead users. This is a regression in Sun‑Abraham after switching to the shared helper. — Concrete fix: Add an early guard in compute_effect_bootstrap_stats to return (NaN, (NaN, NaN), NaN) when original_effect is non‑finite (or add the guard at call sites). Add a regression test for NaN original_effect with finite boot_dist. diff_diff/bootstrap_utils.py:L205-L269, diff_diff/sun_abraham.py:L1108-L1123
  • Severity: P1 — Impact: The bootstrap NaN‑gating fix is not consistently applied across estimators; ImputationDiD and TwoStageDiD still compute SE/CI/p without filtering non‑finite draws or enforcing the “SE<=0 ⇒ all NaN” rule, leading to estimator‑specific inference behavior that contradicts the new standard. — Concrete fix: Migrate imputation_bootstrap.py and two_stage_bootstrap.py to compute_effect_bootstrap_stats (or equivalent non‑finite filtering + SE<=0 gating) for overall/event‑study/group inference. diff_diff/imputation_bootstrap.py:L262-L298, diff_diff/two_stage_bootstrap.py:L225-L236

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found beyond the P1 partial‑migration noted above.

Security

  • No issues found.

Documentation/Tests

  • No issues found.

Return all-NaN inference when the point estimate is NaN/Inf, preventing
finite SE/CI/p-value from a valid bootstrap distribution when the
estimate itself is undefined. Adds parametrized regression tests and
updates REGISTRY.md bootstrap notes for both CallawaySantAnna and
SunAbraham sections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 5449bbb361c838b113ae963df86f814df376d959


Overall assessment: ⛔ Blocker

Executive summary

  • P0: ContinuousDiD’s not-yet-treated control group ignores anticipation, so cohorts already in the anticipation window can be used as controls, biasing ATT(d)/ACRT(d) when anticipation > 0.
  • P1: No regression test covers anticipation > 0 with control_group="not_yet_treated" (and bootstrap), so this control-group bug can regress silently.
  • Bootstrap refactor (bootstrap_utils) aligns NaN-gating across estimators and matches the updated methodology registry for CS/SunAbraham.
  • ContinuousDiD otherwise tracks the registry on global knots, group-proportional weights, and balanced-panel requirements.

Methodology

  • Severity: P0. Impact: For control_group="not_yet_treated" with anticipation > 0, the control pool includes cohorts already within the anticipation window (i.e., effectively treated), which biases ATT/ACRT and violates the intended control definition. Concrete fix: update the control mask to exclude cohorts with effective treatment start G_i - anticipation <= t (e.g., use unit_cohorts > t + self.anticipation and != g), and ensure this applies for all (g,t) cells. Location: diff_diff/continuous_did.py:L688-L694.
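The proposed control-mask fix can be sketched as follows; `not_yet_treated_mask` and the cohort encoding (`np.inf` for never-treated) are illustrative assumptions, while the `G > t + anticipation` condition is the fix described above.

```python
import numpy as np

def not_yet_treated_mask(unit_cohorts, g, t, anticipation=0):
    """Valid controls for the (g, t) cell under control_group='not_yet_treated'.

    A cohort is a control only if its effective treatment start
    G - anticipation is still after t, i.e. G > t + anticipation;
    the treated cohort g itself is always excluded."""
    cohorts = np.asarray(unit_cohorts, dtype=float)
    return (cohorts > t + anticipation) & (cohorts != g)

cohorts = np.array([2004, 2006, 2007, np.inf])  # inf = never treated
# Buggy G > t at t=2005 would keep the 2006 cohort; with anticipation=1
# that cohort is already in its anticipation window and must be dropped.
mask = not_yet_treated_mask(cohorts, g=2004, t=2005, anticipation=1)
print(mask)  # 2006 cohort excluded; 2007 and never-treated remain
```

With `anticipation=0` the mask reduces to the usual `G > t` rule, so the fix only changes behavior when anticipation is positive.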

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

  • Severity: P1. Impact: The anticipation/control-group interaction isn’t covered by tests, so the bias bug above won’t be caught by CI. Concrete fix: add a regression test to tests/test_continuous_did.py that sets anticipation > 0, control_group="not_yet_treated", and verifies that cohorts in the anticipation window are excluded from controls (and that pre/anticipation effects are near zero).

The not-yet-treated control group in both ContinuousDiD and
CallawaySantAnna used `G > t` instead of `G > t + anticipation`,
incorrectly including cohorts in the anticipation window as controls.
This matches R's `did::compute.att_gt()` logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Feb 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 999da34cf6edc2d652d6572d0fbb89fc15d5a8a0


Overall assessment: Looks good

Executive Summary

  • Affected methods: ContinuousDiD (new), CallawaySantAnna control mask with anticipation, and SunAbraham bootstrap inference utilities.
  • Cross-checked against docs/methodology/REGISTRY.md and docs/methodology/continuous-did.md; no methodology mismatches or undocumented deviations found.
  • Edge-case handling and NaN-gating look consistent with the registry updates.
  • One minor performance improvement suggestion in ContinuousDiD’s data prep loop.

Methodology
Affected methods: ContinuousDiD, CallawaySantAnna, SunAbraham. Cross-check vs docs/methodology/REGISTRY.md and docs/methodology/continuous-did.md found no P0/P1 mismatches.

Code Quality
No findings.

Performance

  • P3 | Impact: iterrows() and per-call time_periods.index(t) add Python-level overhead for large panels; this can become a bottleneck for big n_units * n_periods. | Fix: vectorize outcome matrix construction (e.g., pivot_table/to_numpy with ordered categories) and precompute a prev_period mapping to avoid O(T) lookups in the inner loop. diff_diff/continuous_did.py:L607-L612, diff_diff/continuous_did.py:L667-L670
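The vectorization suggested above can be sketched as follows, with a toy panel standing in for the estimator's real data prep (column names and shapes are illustrative):

```python
import numpy as np
import pandas as pd

# Toy balanced panel: 2 units x 3 periods.
df = pd.DataFrame({
    "unit":   [1, 1, 1, 2, 2, 2],
    "period": [2000, 2001, 2002] * 2,
    "y":      [1.0, 1.5, 2.0, 0.5, 0.7, 0.9],
})
time_periods = sorted(df["period"].unique())

# One vectorized pivot replaces the per-row iterrows() fill: rows are
# units (sorted), columns follow the time_periods ordering.
Y = (df.pivot_table(index="unit", columns="period", values="y")
       .reindex(columns=time_periods)
       .to_numpy())

# Precomputed predecessor map replaces O(T) time_periods.index(t) calls
# in the inner (g, t) loop.
prev_period = {t: time_periods[i - 1]
               for i, t in enumerate(time_periods) if i > 0}

print(Y.shape, prev_period[2002])
```

This keeps the (g, t) loop itself intact while moving the per-cell Python overhead into a single pandas/NumPy pass, which is where the P3 gain would come from.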

Maintainability
No findings.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests
No findings.

@igerber igerber merged commit 3e57eac into main Feb 22, 2026
10 checks passed
@igerber igerber deleted the worktree-continuous-did branch February 22, 2026 16:06