Pre-Diagnostics (`src.diagnostics.pre_diagnostics`)

Overview

`pre_diagnostics.py` runs three statistical checks before model fitting and posterior interpretation: stationarity tests (ADF and KPSS) on the target, variance inflation factor (VIF) checks for multicollinearity among regressors, and pairwise transfer-entropy screening. Outputs are written to the `10_pre_diagnostics/` stage folder.

Function Signatures

from src.diagnostics.pre_diagnostics import (
    run_stationarity_tests,
    run_vif_tests,
    run_transfer_entropy,
    run_all_pre_diagnostics,
)

run_stationarity_tests(
    data: pd.DataFrame,
    date_col: str,
    cols: list[str],
    *,
    kpss_regression: str = "c",
    kpss_nlags: str | int | None = "auto",
    adf_maxlag: int | None = None,
    adf_regression: str = "c",
    dropna: bool = True,
) -> pd.DataFrame

run_vif_tests(
    data: pd.DataFrame,
    cols: list[str],
    *,
    include_constant: bool = True,
    dropna: str = "pairwise",
) -> pd.DataFrame

run_transfer_entropy(
    data: pd.DataFrame,
    date_col: str,
    x_cols: list[str],
    y_col: str,
    *,
    max_lag: int = 1,
    bins: int = 8,
    permutations: int = 200,
    random_state: int = 42,
    normalize: bool = True,
    dropna: bool = True,
) -> pd.DataFrame

run_all_pre_diagnostics(
    data: pd.DataFrame,
    config: dict[str, Any],
    results_dir: str,
    *,
    stationarity_cols: list[str] | None = None,
    vif_cols: list[str] | None = None,
    te_x_cols: list[str] | None = None,
    te_y_col: str | None = None,
    te_include_controls_in_x: bool = False,
    stationarity_kwargs: dict[str, Any] | None = None,
    vif_kwargs: dict[str, Any] | None = None,
    te_kwargs: dict[str, Any] | None = None,
) -> dict[str, str]

Parameters (Orchestrator)

Parameter	Description
`data`	Processed modelling dataframe.
`config`	Pipeline configuration; used to infer `date_col`, `target_col`, media `spend_col`s, and `extra_features_cols`.
`results_dir`	Run root directory; `save_csv(...)` routes outputs into `10_pre_diagnostics/`.
`stationarity_cols`	Optional override for ADF/KPSS variables. Default is target only.
`vif_cols`	Optional override for VIF variables. Default is media + controls.
`te_x_cols`, `te_y_col`	Optional overrides for transfer-entropy direction setup.
`te_include_controls_in_x`	When `True`, adds the controls to the TE X set.
`*_kwargs`	Extra keyword arguments passed to each underlying diagnostic function.
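
The documented defaults (target only for stationarity, media plus controls for VIF) can be sketched as a small helper. This is a hypothetical illustration rather than the module's code: `infer_default_columns` is an assumed name, and the config keys follow those listed for `config` above.

```python
from typing import Any


def infer_default_columns(config: dict[str, Any]) -> dict[str, list[str]]:
    """Hypothetical helper mirroring the documented column defaults."""
    # Media spend columns come from each entry's `spend_col`.
    media_cols = [m["spend_col"] for m in config.get("media", [])]
    # Controls are taken from `extra_features_cols`.
    control_cols = list(config.get("extra_features_cols", []))
    return {
        "stationarity_cols": [config["target_col"]],  # default: target only
        "vif_cols": media_cols + control_cols,        # default: media + controls
    }


example_config = {
    "date_col": "DATE",
    "target_col": "revenue",
    "media": [
        {"display_name": "TV", "spend_col": "tv_spend"},
        {"display_name": "Search", "spend_col": "search_spend"},
    ],
    "extra_features_cols": ["price_index", "competitor_sales"],
}

defaults = infer_default_columns(example_config)
print(defaults)
```

Passing `stationarity_cols` or `vif_cols` explicitly to the orchestrator replaces these inferred lists.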

Artefacts Produced

Filename	Stage folder	Description
`stationarity_summary.csv`	`10_pre_diagnostics/`	ADF + KPSS metrics and combined stationarity conclusion by variable.
`vif_summary.csv`	`10_pre_diagnostics/`	VIF/tolerance/correlation summary with high-VIF flagging.
`transfer_entropy_summary.csv`	`10_pre_diagnostics/`	Pairwise TE(X→Y), TE(Y→X), permutation p-values, and direction label.
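
As a rough sketch, the expected artefact locations for a run can be derived from the table above. `expected_artefact_paths` is a hypothetical helper; the keys actually returned by `run_all_pre_diagnostics` are not documented here, so filenames are used as keys purely for illustration.

```python
from pathlib import PurePosixPath

STAGE_DIR = "10_pre_diagnostics"
ARTEFACTS = (
    "stationarity_summary.csv",
    "vif_summary.csv",
    "transfer_entropy_summary.csv",
)


def expected_artefact_paths(results_dir: str) -> dict[str, str]:
    """Hypothetical helper: map each artefact filename to its expected path."""
    stage = PurePosixPath(results_dir) / STAGE_DIR
    return {name: str(stage / name) for name in ARTEFACTS}


artefact_paths = expected_artefact_paths("results/run_20260304_101500")
for name, path in artefact_paths.items():
    print(f"{name}: {path}")
```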

Interpretation Guidance

1. Stationarity (ADF + KPSS)

ADF result	KPSS result	Conclusion
Reject H0 (p < 0.05)	Fail to reject H0 (p >= 0.05)	Likely stationary
Fail to reject H0 (p >= 0.05)	Reject H0 (p < 0.05)	Likely unit root
Other combinations	Other combinations	Inconclusive

Remediation for likely unit-root behaviour:

First differencing.
Detrending.
Log transform for multiplicative trends.
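
The decision table and the first remediation can be sketched as follows. The function names and the returned label strings are assumptions for illustration; the labels written to `stationarity_summary.csv` may differ. The only facts relied on are that the ADF null hypothesis is a unit root and the KPSS null hypothesis is stationarity.

```python
def combine_stationarity(adf_p: float, kpss_p: float, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into one conclusion (labels are illustrative)."""
    adf_rejects = adf_p < alpha    # ADF H0: the series has a unit root
    kpss_rejects = kpss_p < alpha  # KPSS H0: the series is stationary
    if adf_rejects and not kpss_rejects:
        return "likely stationary"
    if kpss_rejects and not adf_rejects:
        return "likely unit root"
    return "inconclusive"


def first_difference(series: list[float]) -> list[float]:
    """Return series[t] - series[t-1]; the result is one element shorter."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


print(combine_stationarity(adf_p=0.01, kpss_p=0.10))
print(combine_stationarity(adf_p=0.40, kpss_p=0.01))
print(combine_stationarity(adf_p=0.01, kpss_p=0.01))
print(first_difference([1.0, 3.0, 6.0, 10.0]))
```

After differencing, the tests would normally be re-run on the transformed series before drawing a conclusion.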

2. Variance Inflation Factor (VIF)

VIF	Severity	Action
`< 5`	Low collinearity	Usually no action.
`5` to `< 10`	Moderate collinearity	Monitor and stress-test estimates.
`>= 10`	High collinearity	Combine or remove regressors, or apply stronger regularisation.
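
For reference, the VIF of regressor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on all the others, and tolerance is its reciprocal. The sketch below applies the thresholds from the table; the function names and severity strings are illustrative assumptions, not the module's API.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF = 1 / (1 - R^2) for a regressor regressed on the other regressors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def vif_severity(vif: float) -> str:
    """Map a VIF value to the severity bands in the table (labels are illustrative)."""
    if vif < 5:
        return "low"
    if vif < 10:
        return "moderate"
    return "high"


for r2 in (0.50, 0.85, 0.95):
    vif = vif_from_r_squared(r2)
    tolerance = 1.0 / vif  # tolerance is the reciprocal of VIF
    print(f"R^2={r2:.2f} VIF={vif:.2f} tolerance={tolerance:.2f} -> {vif_severity(vif)}")
```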

3. Transfer Entropy (Pairwise, Unconditional)

Condition	Direction	Interpretation
TE(X→Y) significant and stronger than TE(Y→X)	`x→y`	X may contain predictive information for Y.
TE(Y→X) significant and stronger than TE(X→Y)	`y→x`	Reverse predictive direction may dominate.
Both significant	`bidirectional`	Mutual predictive relationship.
Neither significant	`none`	No strong directional signal.
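
One possible reading of the table as a labelling rule is sketched below. Two points are assumptions rather than documented behaviour: the "both significant" row is checked first, and significance means a permutation p-value below 0.05. `label_direction` is a hypothetical name.

```python
def label_direction(
    te_xy: float,
    te_yx: float,
    p_xy: float,
    p_yx: float,
    alpha: float = 0.05,  # assumed significance level
) -> str:
    """Assign a direction label from TE estimates and permutation p-values."""
    sig_xy = p_xy < alpha
    sig_yx = p_yx < alpha
    if sig_xy and sig_yx:
        return "bidirectional"  # assumed to take precedence over the one-way rows
    if sig_xy and te_xy > te_yx:
        return "x→y"
    if sig_yx and te_yx > te_xy:
        return "y→x"
    return "none"


print(label_direction(te_xy=0.12, te_yx=0.01, p_xy=0.01, p_yx=0.60))
print(label_direction(te_xy=0.02, te_yx=0.15, p_xy=0.40, p_yx=0.02))
print(label_direction(te_xy=0.10, te_yx=0.09, p_xy=0.01, p_yx=0.03))
print(label_direction(te_xy=0.01, te_yx=0.01, p_xy=0.50, p_yx=0.70))
```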

Important caveat:

This implementation computes pairwise TE and does not control for confounders.
Use it as an exploratory diagnostic, not as evidence for causal identification.
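
To make "pairwise, unconditional" concrete, here is a minimal plug-in estimator of TE(X→Y) on binned data with a permutation p-value. It is a sketch under stated assumptions (equal-width bins, natural logarithm, `lag >= 1`, conditioning only on Y's own past) and is not the module's implementation; the `normalize` option is not reproduced, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def discretise(values: list[float], bins: int = 8) -> list[int]:
    """Map values to equal-width bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def transfer_entropy(x: list[int], y: list[int], lag: int = 1) -> float:
    """Plug-in estimate of TE(X→Y) in nats for discrete sequences (lag >= 1)."""
    # Each triple is (y_t, y_{t-lag}, x_{t-lag}); no third variable is conditioned on.
    triples = list(zip(y[lag:], y[:-lag], x[:-lag]))
    n = len(triples)
    joint = Counter(triples)
    y_pair = Counter((y_now, y_past) for y_now, y_past, _ in triples)
    past_pair = Counter((y_past, x_past) for _, y_past, x_past in triples)
    y_only = Counter(y_past for _, y_past, _ in triples)
    te = 0.0
    for (y_now, y_past, x_past), count in joint.items():
        with_x = count / past_pair[(y_past, x_past)]          # p(y_t | y_past, x_past)
        without_x = y_pair[(y_now, y_past)] / y_only[y_past]  # p(y_t | y_past)
        te += (count / n) * math.log(with_x / without_x)
    return te


def permutation_p_value(x: list[int], y: list[int], permutations: int = 200, seed: int = 42) -> float:
    """Share of shuffled-X surrogates whose TE(X→Y) reaches the observed value."""
    rng = random.Random(seed)
    observed = transfer_entropy(x, y)
    shuffled = list(x)
    exceed = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # destroys the temporal link between X and Y
        if transfer_entropy(shuffled, y) >= observed:
            exceed += 1
    return (exceed + 1) / (permutations + 1)


# Synthetic check: y copies x with a one-step delay, so information flows X→Y only.
rng = random.Random(42)
x = discretise([rng.random() for _ in range(500)], bins=2)
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
p_xy = permutation_p_value(x, y)
print(te_xy, te_yx, p_xy)
```

Because the estimate conditions only on Y's own past, a third series driving both X and Y would also produce a large TE(X→Y), which is why the result should be read as a screening signal.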

Usage Example

import pandas as pd

from src.diagnostics.pre_diagnostics import run_all_pre_diagnostics

# Example only: in production, the V2 driver prepares the processed dataframe.
data = pd.read_csv("processed_input.csv")

config = {
    "date_col": "DATE",
    "target_col": "revenue",
    "media": [
        {"display_name": "TV", "spend_col": "tv_spend"},
        {"display_name": "Search", "spend_col": "search_spend"},
    ],
    "extra_features_cols": ["price_index", "competitor_sales"],
}

paths = run_all_pre_diagnostics(
    data=data,
    config=config,
    results_dir="results/run_20260304_101500",
)

print(paths)

V2 Driver Integration

Pre-diagnostics are orchestrated through the V2 driver workflow (`MMMBaseDriverV2` -> `WorkflowExecutor`) rather than through a standalone script entry point.

from src.driver.base import MMMBaseDriverV2

driver = MMMBaseDriverV2(
    config_filename="data-config/demo_config.yml",
    input_filename="data-config/demo_data.csv",
    holidays_filename="data-config/holidays.csv",
    results_filename="results",
)

driver.main()

Relationship to Workflow Stages and Gates

Stage: 10_pre_diagnostics/.
These checks are pre-fit quality diagnostics that support gate g1 interpretation readiness and reduce downstream model risk.
They are advisory diagnostics; hard machine-readable gate states are emitted later by convergence/calibration diagnostics in 50_diagnostics/.