Input Validator (`src.diagnostics.input_validator`)
Overview
`input_validator.py` provides fail-fast data integrity checks that run before staged diagnostics and model fitting. It does not write artefacts; it raises `ValueError` when a critical input issue is detected.
Function Signatures
```python
from src.diagnostics.input_validator import (
    check_nans,
    check_duplicate_columns,
    check_date_column,
    check_column_variance,
)

check_nans(
    dataframe: pd.DataFrame,
    target_col: str,
    media_cols: list[str],
    control_cols: list[str],
) -> None

check_duplicate_columns(
    dataframe: pd.DataFrame,
) -> None

check_date_column(
    date_series: pd.Series,
    config: dict[str, Any],
) -> None

check_column_variance(
    dataframe: pd.DataFrame,
    columns: list[str],
    check_zeros_only: bool = False,
) -> None
```

Parameters
| Function | Key parameters |
|---|---|
| `check_nans` | `target_col`, `media_cols`, and `control_cols` define which columns must be non-null. |
| `check_duplicate_columns` | Checks for duplicate column names in `dataframe.columns`. |
| `check_date_column` | Validates date parseability, sort order, inferred frequency, missing dates, and weekly start-day consistency. Reads `config.get("date_format")` if needed. |
| `check_column_variance` | Checks for constant or all-zero columns among `columns`. If `check_zeros_only=True`, only all-zero columns are flagged. |
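
To make the `check_zeros_only` switch concrete, here is a minimal sketch of the behaviour described in the table. `sketch_column_variance`, `flagged_message`, and the sample frame are hypothetical stand-ins written for illustration; the module's actual implementation and error messages may differ.

```python
import pandas as pd


def sketch_column_variance(
    dataframe: pd.DataFrame,
    columns: list[str],
    check_zeros_only: bool = False,
) -> None:
    """Hypothetical stand-in for check_column_variance; not the module's code."""
    flagged = []
    for col in columns:
        series = dataframe[col]
        all_zero = bool((series == 0).all())
        constant = series.nunique(dropna=False) <= 1
        # With check_zeros_only=True, a constant non-zero column is tolerated.
        if all_zero or (constant and not check_zeros_only):
            flagged.append(col)
    if flagged:
        raise ValueError(f"Non-informative columns: {flagged}")


# Invented sample data covering the three cases of interest.
df = pd.DataFrame(
    {
        "tv_spend": [0.0, 0.0, 0.0],      # all zeros
        "price_index": [1.0, 1.0, 1.0],   # constant but non-zero
        "search_spend": [3.0, 5.0, 4.0],  # varies
    }
)


def flagged_message(check_zeros_only: bool) -> str:
    """Return the error message raised by the sketch, or '' if it passes."""
    try:
        sketch_column_variance(df, list(df.columns), check_zeros_only)
    except ValueError as exc:
        return str(exc)
    return ""


strict_msg = flagged_message(check_zeros_only=False)
zeros_only_msg = flagged_message(check_zeros_only=True)
print(strict_msg)
print(zeros_only_msg)
```

In this sketch the default mode reports both `tv_spend` and `price_index`, while `check_zeros_only=True` reports only `tv_spend`.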
Artefacts Produced
This module does not produce files and does not target a stage folder.

| Output | Stage folder | Description |
|---|---|---|
| None | N/A | Validation runs in-memory and raises exceptions on failure. |
Interpretation Guidance
| Check | Failure meaning | Typical action |
|---|---|---|
| NaN check | Missing values in required model columns | Impute, drop, or fix upstream extract/joins before fitting. |
| Duplicate columns | Ambiguous feature references | Deduplicate headers before preprocessing. |
| Date validation | Irregular or unsorted time index | Correct sort order, parsing, and frequency gaps. |
| Variance check | Constant/all-zero regressors | Remove or repair non-informative predictors. |
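
The typical actions in the table map onto a few standard pandas operations. The sketch below is one possible remediation pass over an invented sample frame; the column names, the choice of linear interpolation, and the decision to drop a constant column rather than repair it are all assumptions made for illustration, not recommendations from the module.

```python
import pandas as pd

# Invented data with a duplicated header, unsorted dates, a NaN, and an all-zero column.
raw = pd.DataFrame(
    [
        ["2024-01-15", 120.0, 120.0, 0.0, 10.0],
        ["2024-01-01", 100.0, 100.0, 0.0, 12.0],
        ["2024-01-08", 110.0, 110.0, 0.0, float("nan")],
    ],
    columns=["DATE", "revenue", "revenue", "tv_spend", "search_spend"],
)

# Duplicate columns: keep the first occurrence of each header.
clean = raw.loc[:, ~raw.columns.duplicated()].copy()

# Date validation: parse with an explicit format, then sort ascending.
clean["DATE"] = pd.to_datetime(clean["DATE"], format="%Y-%m-%d")
clean = clean.sort_values("DATE").reset_index(drop=True)

# NaN check: interpolate the gap (one option; dropping rows is another).
clean["search_spend"] = clean["search_spend"].interpolate()

# Variance check: drop regressors that carry no information.
regressors = ["tv_spend", "search_spend"]
constant_cols = [col for col in regressors if clean[col].nunique() <= 1]
clean = clean.drop(columns=constant_cols)

print(clean)
```

Whether imputation or removal is appropriate depends on why the data is missing or constant, so a fix upstream in the extract is often preferable to patching the frame in place.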
Usage Example
```python
import pandas as pd

from src.diagnostics.input_validator import (
    check_nans,
    check_duplicate_columns,
    check_date_column,
    check_column_variance,
)

# `df` is assumed to be a DataFrame loaded upstream that contains a DATE
# column, the target column, and the media and control columns named below.
config = {"date_format": "%Y-%m-%d"}
media_cols = ["tv_spend", "search_spend"]
control_cols = ["price_index", "competitor_sales"]

# Each check returns None on success and raises ValueError on failure.
check_duplicate_columns(df)
check_date_column(df["DATE"], config)
check_nans(df, target_col="revenue", media_cols=media_cols, control_cols=control_cols)
check_column_variance(df, columns=media_cols + control_cols, check_zeros_only=False)
```

Relationship to Workflow Stages and Gates
- This validator runs before stage-folder artefacts are produced.
- It acts as an entry condition for the staged workflow and must pass before `10_pre_diagnostics/` and later gate checks (`g1` to `g6`) are meaningful.
- Pass/fail behaviour is exception-based (`ValueError`) rather than report-based.
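
One way to wire an exception-based entry condition is sketched below. `run_entry_gate`, `no_duplicate_columns`, and `no_nans` are hypothetical names used only for illustration; in a real pipeline the callables would be the validator functions from `src.diagnostics.input_validator`, and how the workflow actually invokes them is not specified here.

```python
from typing import Callable

import pandas as pd


def run_entry_gate(checks: list[tuple[str, Callable[[], None]]]) -> None:
    """Run each check in order; the first ValueError stops the run."""
    for name, check in checks:
        try:
            check()
        except ValueError as exc:
            # Re-raise with the check name so the failing step is obvious.
            raise ValueError(f"input validation failed at '{name}': {exc}") from exc


# Hypothetical stand-ins for the real validator functions.
def no_duplicate_columns(dataframe: pd.DataFrame) -> None:
    dupes = dataframe.columns[dataframe.columns.duplicated()].tolist()
    if dupes:
        raise ValueError(f"duplicate columns: {dupes}")


def no_nans(dataframe: pd.DataFrame, columns: list[str]) -> None:
    bad = [col for col in columns if dataframe[col].isna().any()]
    if bad:
        raise ValueError(f"NaN values in: {bad}")


# Invented sample frame with a missing revenue value.
df = pd.DataFrame({"revenue": [100.0, None, 120.0], "tv_spend": [5.0, 6.0, 7.0]})

stages_started = False
gate_error = ""
try:
    run_entry_gate(
        [
            ("duplicate_columns", lambda: no_duplicate_columns(df)),
            ("nans", lambda: no_nans(df, ["revenue", "tv_spend"])),
        ]
    )
    stages_started = True  # only reached when every check passes
except ValueError as exc:
    gate_error = str(exc)

print(stages_started, gate_error)
```

Because the gate raises instead of writing a report, the staged workflow never starts on invalid input, and the caller decides whether to log the message, abort, or repair the data and retry.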