Data Preparation Guide

Correct input structure is essential for a stable AMMM run. This page explains what the pipeline expects and what is validated automatically, whether you run via python runme.py or directly via MMMBaseDriverV2.

Minimum required columns

Your dataset must include:

Date column
- Name set by date_col in config (commonly date)
- Should be parseable by pandas (recommended format: YYYY-MM-DD)
Target column
- Name set by target_col
- Numeric KPI (for example revenue, sales, conversions)
Media spend columns
- Defined in media[*].spend_col
- Numeric values for each channel

Optional but common:

Control columns in extra_features_cols
Ignored columns in ignore_cols
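
Before a run, it can help to confirm that every column named in the config is present in the data. The following is a minimal sketch, not AMMM's own check: the config is shown as a plain dict whose key names follow this guide, and missing_columns is a hypothetical helper. The real config schema may differ.

```python
import pandas as pd

# Hypothetical config fragment; key names follow this guide,
# but the real config schema may differ.
config = {
    "date_col": "date",
    "target_col": "revenue",
    "media": [
        {"spend_col": "channel_1_spend"},
        {"spend_col": "channel_2_spend"},
    ],
    "extra_features_cols": ["competitor_sales", "events"],
}


def missing_columns(df: pd.DataFrame, cfg: dict) -> list:
    """Return config-referenced column names that are absent from df."""
    expected = [cfg["date_col"], cfg["target_col"]]
    expected += [m["spend_col"] for m in cfg.get("media", [])]
    expected += cfg.get("extra_features_cols", [])
    return [c for c in expected if c not in df.columns]


# Toy frame that is deliberately missing one spend column.
df = pd.DataFrame(
    {
        "date": ["2023-01-02", "2023-01-09"],
        "revenue": [1500, 1520],
        "channel_1_spend": [25000, 26000],
        "competitor_sales": [200, 195],
        "events": [0, 0],
    }
)

missing = missing_columns(df, config)
print(missing)  # ['channel_2_spend']
```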

Data quality expectations

Consistent granularity matching raw_data_granularity (for example, weekly)
Chronologically sorted dates
No unintended gaps for the chosen frequency
Numeric values in all modelling columns
No duplicate column names
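
The expectations above can be checked by hand with pandas before handing data to the pipeline. This is a sketch under stated assumptions: quality_report is a hypothetical helper, the data is weekly with a fixed 7-day step, and the toy frame is made up for illustration. It does not reproduce AMMM's own validator.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, date_col: str, freq: str = "7D") -> dict:
    """Summarise basic structural problems in a time-series frame."""
    dates = pd.DatetimeIndex(pd.to_datetime(df[date_col]))
    # Every date a gap-free series would contain between the first and last row.
    expected = pd.date_range(dates.min(), dates.max(), freq=freq)
    non_numeric = [
        col
        for col, dtype in zip(df.columns, df.dtypes)
        if col != date_col and not pd.api.types.is_numeric_dtype(dtype)
    ]
    return {
        "sorted": bool(dates.is_monotonic_increasing),
        "missing_dates": [d.strftime("%Y-%m-%d") for d in expected.difference(dates)],
        "non_numeric": non_numeric,
        "duplicate_columns": df.columns[df.columns.duplicated()].tolist(),
    }


# Toy frame with one missing week and one spend column stored as text.
df = pd.DataFrame(
    {
        "date": ["2023-01-02", "2023-01-09", "2023-01-23"],
        "revenue": [1500, 1520, 1490],
        "channel_1_spend": ["25000", "26000", "24000"],
    }
)

report = quality_report(df, "date")
print(report)
```

Here the report flags 2023-01-16 as a missing week and channel_1_spend as non-numeric; the dates are sorted and no column names are duplicated.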

Automatic validation at load time

During V2 driver initialisation, AMMM runs input validation (via diagnostics/input_validator.py) and raises ValueError on critical failures.

Automatic checks include:

duplicate column names,
NaN values in modelling columns,
date parsing/sorting/frequency consistency checks,
zero-variance (constant) feature checks.

If one of these fails, fix the input data or configuration before re-running.
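
To see what a failure of this kind looks like, the sketch below mimics two of the checks (NaN values and zero variance) and raises ValueError in the same spirit. It is an illustration only: validate_modelling_columns is a hypothetical function, and the actual logic and messages in diagnostics/input_validator.py may differ.

```python
import pandas as pd


def validate_modelling_columns(df: pd.DataFrame, cols: list) -> None:
    """Raise ValueError if any listed column has NaN values or is constant."""
    problems = []
    for col in cols:
        if df[col].isna().any():
            problems.append(f"{col}: contains NaN")
        elif df[col].nunique() <= 1:
            problems.append(f"{col}: zero variance")
    if problems:
        raise ValueError("; ".join(problems))


# Toy frame with a NaN in the target and a constant control column.
df = pd.DataFrame(
    {
        "revenue": [1500.0, None, 1490.0],
        "channel_1_spend": [25000, 26000, 24000],
        "events": [0, 0, 0],
    }
)

try:
    validate_modelling_columns(df, ["revenue", "channel_1_spend", "events"])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # revenue: contains NaN; events: zero variance
```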

Automatic pre-diagnostics

After data loading, the workflow runs pre-diagnostics automatically and writes outputs to 10_pre_diagnostics/:

stationarity_summary.csv
vif_summary.csv
transfer_entropy_summary.csv
prior_predictive_check.png
prior_predictive_summary.csv

These are early warning signals for misspecification risk and should be reviewed before trusting downstream outputs.
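
A quick way to confirm that pre-diagnostics ran is to check that the files listed above exist. The sketch below assumes the 10_pre_diagnostics/ folder sits directly under a results directory; the "results" path and the missing_pre_diagnostics helper are hypothetical, so adjust them to your own output location.

```python
from pathlib import Path

# File names as listed in this guide.
EXPECTED_FILES = [
    "stationarity_summary.csv",
    "vif_summary.csv",
    "transfer_entropy_summary.csv",
    "prior_predictive_check.png",
    "prior_predictive_summary.csv",
]


def missing_pre_diagnostics(results_dir) -> list:
    """Return the expected pre-diagnostic files that are not on disk."""
    folder = Path(results_dir) / "10_pre_diagnostics"
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# "results" is a placeholder for your actual output directory.
print(missing_pre_diagnostics("results"))
```

An empty list means all five outputs are available for review.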

Relation to gate policy

diagnostics_gating (strict, warn, off) controls how strongly the pipeline enforces diagnostic outcomes downstream.

Practical guidance:

use strict for production-style governance,
use warn during exploratory modelling,
avoid consuming 40_decomposition/, 60_response_curves/, or 70_optimisation/ outputs when core diagnostics are poor.
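
One way to picture the three modes is as a small gate function. The sketch below is an interpretation, not AMMM's implementation: it assumes strict stops the run, warn reports and continues, and off ignores failures. The apply_gate function and the check names passed to it are hypothetical.

```python
import warnings


def apply_gate(mode: str, failed_checks: list) -> bool:
    """Return True if downstream steps may proceed under the given mode."""
    if mode not in {"strict", "warn", "off"}:
        raise ValueError(f"unknown diagnostics_gating mode: {mode!r}")
    if not failed_checks or mode == "off":
        return True
    message = "diagnostics failed: " + ", ".join(failed_checks)
    if mode == "strict":
        # Assumed behaviour: strict blocks downstream stages.
        raise RuntimeError(message)
    # Assumed behaviour: warn reports the problem but lets the run continue.
    warnings.warn(message)
    return True


print(apply_gate("off", ["vif"]))  # True
```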

Example CSV layout

DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
...

Use your own column names, but ensure they exactly match the names given in the config (date_col, target_col, media[*].spend_col, and so on).
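
As a sanity check, the example layout can be loaded with pandas to confirm that the dates parse and are spaced one week apart. This sketch reads the three sample rows from an in-memory string; in practice you would pass your own file path to read_csv. Note that pandas column lookups are case-sensitive, so a config value of date would not find a column named DATE.

```python
import io

import pandas as pd

csv_text = """DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
"""

# parse_dates converts the DATE column to datetime64 at load time.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

# Column lookups are case-sensitive.
has_lower = "date" in df.columns
has_upper = "DATE" in df.columns

# Distinct gaps between consecutive rows, in days.
step_days = df["DATE"].diff().dropna().dt.days.unique().tolist()

print(has_lower, has_upper, step_days)  # False True [7]
```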

Practical tips

Keep all time-series columns aligned to the same frequency.
Encode categorical controls numerically before ingestion.
Keep the data-config/ folder clean, with one config file and one data file, so that auto-detection is unambiguous.
Provide holidays.csv or holidays.xlsx when holiday controls are enabled.
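
The tip on encoding categorical controls can be illustrated with pandas one-hot encoding. This is one possible approach, not a requirement of AMMM: the promo_type column is hypothetical, and drop_first=True is used so the indicator columns are not perfectly collinear.

```python
import pandas as pd

# Toy frame with a hypothetical categorical control.
df = pd.DataFrame(
    {
        "date": pd.date_range("2023-01-02", periods=4, freq="7D"),
        "promo_type": ["none", "discount", "none", "bundle"],
    }
)

# One 0/1 column per category, dropping the first (alphabetical) level.
encoded = pd.get_dummies(df, columns=["promo_type"], drop_first=True, dtype=int)

print(list(encoded.columns))
# ['date', 'promo_type_discount', 'promo_type_none']
```

The resulting indicator columns can then be listed in extra_features_cols like any other numeric control.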