Data Preparation Guide
Correct input structure is essential for a stable AMMM run. This page explains what the pipeline expects and what is validated automatically, whether you run via python runme.py or directly via MMMBaseDriverV2.
Minimum required columns
Your dataset must include:
- Date column
  - Name set by date_col in config (the default is often date)
  - Should be parseable by pandas (recommended format: YYYY-MM-DD)
- Target column
  - Name set by target_col
  - Numeric KPI (for example revenue, sales, conversions)
- Media spend columns
  - Defined in media[*].spend_col
  - Numeric values for each channel
Optional but common:
- Control columns in extra_features_cols
- Ignored columns in ignore_cols
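As a quick pre-flight check, the column requirements above can be tested with pandas before starting a run. The following is a minimal sketch, not AMMM's own validator: the `config` dict is a hypothetical in-memory stand-in for your config file, and `check_required_columns` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical config values; in practice these come from your config file.
config = {
    "date_col": "DATE",
    "target_col": "revenue",
    "media": [
        {"spend_col": "channel_1_spend"},
        {"spend_col": "channel_2_spend"},
    ],
    "extra_features_cols": ["competitor_sales", "events"],
    "ignore_cols": [],
}


def check_required_columns(df: pd.DataFrame, cfg: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    spend_cols = [m["spend_col"] for m in cfg["media"]]
    numeric_cols = [cfg["target_col"], *spend_cols, *cfg.get("extra_features_cols", [])]

    # Every configured column must exist in the data.
    for col in [cfg["date_col"], *numeric_cols]:
        if col not in df.columns:
            problems.append(f"missing column: {col}")

    # Dates should be parseable by pandas.
    if cfg["date_col"] in df.columns:
        parsed = pd.to_datetime(df[cfg["date_col"]], errors="coerce")
        if parsed.isna().any():
            problems.append(f"unparseable dates in: {cfg['date_col']}")

    # Target, spend, and control columns should be numeric.
    for col in numeric_cols:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"non-numeric column: {col}")
    return problems


df = pd.DataFrame({
    "DATE": ["2023-01-02", "2023-01-09", "2023-01-16"],
    "revenue": [1500, 1520, 1490],
    "channel_1_spend": [25000, 26000, 24000],
    "channel_2_spend": [12000, 11800, 12500],
    "competitor_sales": [200, 195, 205],
    "events": [0, 0, 1],
})
print(check_required_columns(df, config))
```

An empty list means the columns named in the config are present and have usable types. Columns listed in ignore_cols are deliberately left unchecked here.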
Data quality expectations
- Consistent granularity (raw_data_granularity, for example weekly)
- Chronologically sorted dates
- No unintended gaps for the chosen frequency
- Numeric modelling columns
- No duplicate column names
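The date-related expectations in this list can be checked by comparing the observed dates against a regular calendar. This is a rough sketch under stated assumptions: the data is weekly and anchored on Mondays (pandas frequency alias `W-MON`), and `check_dates` is an illustrative helper, not part of AMMM.

```python
import pandas as pd


def check_dates(dates: pd.Series, freq: str = "W-MON") -> list[str]:
    """Report ordering problems and missing periods for a date column."""
    problems = []
    parsed = pd.to_datetime(dates)

    # Non-decreasing order; repeated dates are not flagged by this check.
    if not parsed.is_monotonic_increasing:
        problems.append("dates are not sorted chronologically")

    # Build the full expected calendar and look for periods that are absent.
    expected = pd.date_range(parsed.min(), parsed.max(), freq=freq)
    missing = expected.difference(pd.DatetimeIndex(parsed))
    if len(missing) > 0:
        problems.append(f"missing periods: {[d.date().isoformat() for d in missing]}")
    return problems


complete = pd.Series(["2023-01-02", "2023-01-09", "2023-01-16"])
with_gap = pd.Series(["2023-01-02", "2023-01-16"])
print(check_dates(complete))
print(check_dates(with_gap))
```

Adjust `freq` to match your own granularity, for example `D` for daily data.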
Automatic validation at load time
During V2 driver initialisation, AMMM runs input validation (via diagnostics/input_validator.py) and raises ValueError on critical failures.
Automatic checks include:
- duplicate column names,
- NaN values in modelling columns,
- date parsing/sorting/frequency consistency checks,
- zero-variance (constant) feature checks.
If one of these fails, fix the input data or configuration before re-running.
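To reproduce this kind of failure before launching a run, similar checks can be approximated in a few lines of pandas. The sketch below is an assumption-laden illustration and does not reflect the actual code in diagnostics/input_validator.py: `validate_input` is a hypothetical name, the error messages are invented, and the date checks are left out for brevity.

```python
import pandas as pd


def validate_input(df: pd.DataFrame, modelling_cols: list[str]) -> None:
    """Raise ValueError on the first critical problem found."""
    # Duplicate column names.
    dupes = df.columns[df.columns.duplicated()].tolist()
    if dupes:
        raise ValueError(f"duplicate column names: {dupes}")

    # NaN values in modelling columns.
    nan_cols = [c for c in modelling_cols if df[c].isna().any()]
    if nan_cols:
        raise ValueError(f"NaN values in: {nan_cols}")

    # Zero-variance (constant) features.
    constant = [c for c in modelling_cols if df[c].nunique(dropna=False) <= 1]
    if constant:
        raise ValueError(f"zero-variance columns: {constant}")


df = pd.DataFrame({"revenue": [1500, 1520, 1490], "events": [0, 0, 0]})
try:
    validate_input(df, ["revenue", "events"])
except ValueError as exc:
    print(exc)  # zero-variance columns: ['events']
```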
Automatic pre-diagnostics
After data loading, the workflow runs pre-diagnostics automatically and writes outputs to 10_pre_diagnostics/:
- stationarity_summary.csv
- vif_summary.csv
- transfer_entropy_summary.csv
- prior_predictive_check.png
- prior_predictive_summary.csv
These are early warning signals for misspecification risk and should be reviewed before trusting downstream outputs.
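One simple way to make that review a habit is to confirm that all five files were written before moving on. The sketch below assumes the 10_pre_diagnostics/ folder sits directly under a results directory that you pass in. That location, and the helper name `missing_pre_diagnostics`, are assumptions for illustration.

```python
from pathlib import Path

# File names as listed on this page.
EXPECTED_OUTPUTS = [
    "stationarity_summary.csv",
    "vif_summary.csv",
    "transfer_entropy_summary.csv",
    "prior_predictive_check.png",
    "prior_predictive_summary.csv",
]


def missing_pre_diagnostics(results_dir: str) -> list[str]:
    """Return the expected pre-diagnostic files that are not present."""
    folder = Path(results_dir) / "10_pre_diagnostics"
    return [name for name in EXPECTED_OUTPUTS if not (folder / name).is_file()]


# "results" is a hypothetical output directory; substitute your own path.
missing = missing_pre_diagnostics("results")
if missing:
    print("Missing pre-diagnostic outputs:", missing)
```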
Relation to gate policy
diagnostics_gating (strict, warn, off) controls how strongly the pipeline enforces diagnostic outcomes downstream.
Practical guidance:
- use strict for production-style governance,
- use warn during exploratory modelling,
- avoid consuming 40_decomposition/, 60_response_curves/, or 70_optimisation/ outputs when core diagnostics are poor.
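To make the three modes concrete, here is one plausible reading of how such a gate could behave. This page does not document AMMM's internal gating logic, so treat the function below as a hypothetical illustration: `enforce_gate`, the exception type, and the message format are all assumptions.

```python
import warnings


def enforce_gate(mode: str, failures: list[str]) -> None:
    """Apply a gating mode to a list of failed diagnostic names."""
    if mode not in {"strict", "warn", "off"}:
        raise ValueError(f"unknown gating mode: {mode}")
    if not failures or mode == "off":
        return  # nothing failed, or enforcement is disabled

    message = "diagnostics failed: " + ", ".join(failures)
    if mode == "strict":
        # Stop the run so downstream outputs are never produced.
        raise RuntimeError(message)
    # mode == "warn": continue, but make the problem visible.
    warnings.warn(message)


try:
    enforce_gate("strict", ["vif", "stationarity"])
except RuntimeError as exc:
    print(exc)  # diagnostics failed: vif, stationarity
```

Under this reading, warn keeps exploratory work moving while still flagging the problem, and strict prevents poor diagnostics from reaching decomposition, response-curve, or optimisation outputs.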
Example CSV layout
DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
...
Use your own column names, but ensure they match config keys exactly.
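To confirm that a file in this layout parses as intended, it can be loaded with pandas and inspected. The snippet below is a small sketch that reads the three example rows from an in-memory string. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The example rows from this page, held in memory for demonstration.
csv_text = """DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
"""

# Parse the date column on load, then sort chronologically.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])
df = df.sort_values("DATE").reset_index(drop=True)

print(df.dtypes)  # DATE should be a datetime type, the rest numeric
```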
Practical tips
- Keep all time-series columns aligned to the same frequency.
- Encode categorical controls numerically before ingestion.
- Keep a clean data-config/ folder with one config and one data file for auto-detection.
- Use holidays.csv/holidays.xlsx when holiday controls are enabled.
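For the tip on encoding categorical controls, one common approach is one-hot encoding with pandas. The sketch below uses a hypothetical `promo_type` column. Dropping the first level is a design choice made here to avoid perfectly collinear indicator columns, not an AMMM requirement.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16"]),
    "promo_type": ["none", "discount", "bundle"],  # hypothetical categorical control
})

# One 0/1 column per category, omitting the first (alphabetical) level.
encoded = pd.get_dummies(df, columns=["promo_type"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The resulting indicator columns can then be listed in extra_features_cols like any other numeric control.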