Data Preparation Guide

Correct input structure is essential for a stable AMMM run. This page explains what the pipeline expects and what is validated automatically, whether you run via python runme.py or directly via MMMBaseDriverV2.

Your dataset must include:

  • Date column
    • Name set by date_col in the config (commonly date)
    • Should be parseable by pandas (recommended format: YYYY-MM-DD)
  • Target column
    • Name set by target_col
    • Numeric KPI (for example revenue, sales, conversions)
  • Media spend columns
    • Defined in media[*].spend_col
    • Numeric values for each channel
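
As a rough pre-flight check, the required columns listed above can be verified with pandas before launching a run. The sketch below is not part of AMMM: the `config` dictionary only mirrors the keys named on this page (`date_col`, `target_col`, `media[*].spend_col`), and `check_required_columns` is a hypothetical helper.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical config mirroring the keys named on this page.
config = {
    "date_col": "date",
    "target_col": "revenue",
    "media": [
        {"spend_col": "channel_1_spend"},
        {"spend_col": "channel_2_spend"},
    ],
}

def check_required_columns(df: pd.DataFrame, config: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    spend_cols = [m["spend_col"] for m in config["media"]]
    required = [config["date_col"], config["target_col"], *spend_cols]

    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")

    # The date column should be parseable by pandas.
    date_col = config["date_col"]
    if date_col in df.columns:
        parsed = pd.to_datetime(df[date_col], errors="coerce")
        if parsed.isna().any():
            problems.append(f"unparseable dates in: {date_col}")

    # Target and spend columns should be numeric.
    for col in [config["target_col"], *spend_cols]:
        if col in df.columns and not is_numeric_dtype(df[col]):
            problems.append(f"non-numeric column: {col}")
    return problems

df = pd.DataFrame({
    "date": ["2023-01-02", "2023-01-09"],
    "revenue": [1500, 1520],
    "channel_1_spend": [25000, 26000],
    "channel_2_spend": ["12000", "n/a"],  # strings, so flagged as non-numeric
})
print(check_required_columns(df, config))
```

Returning a list of problems rather than raising on the first one makes it easier to fix several input issues in a single pass.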

Optional but common:

  • Control columns in extra_features_cols
  • Ignored columns in ignore_cols

The data should also satisfy:

  • Consistent granularity (raw_data_granularity, for example weekly)
  • Chronologically sorted dates
  • No unintended gaps for the chosen frequency
  • Numeric modelling columns
  • No duplicate column names
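
The date-related expectations above (sorted dates, no unintended gaps) can be checked by comparing the observed dates with a regular grid. This is a minimal sketch, assuming weekly data spaced exactly seven days apart; `check_dates` is a hypothetical helper, not an AMMM function.

```python
import pandas as pd

def check_dates(dates: pd.Series, freq: str = "7D") -> dict:
    """Summarise ordering, duplicates and gaps for a date column."""
    parsed = pd.to_datetime(dates)
    # Regular grid from the first to the last observed date.
    expected = pd.date_range(parsed.min(), parsed.max(), freq=freq)
    return {
        "sorted": bool(parsed.is_monotonic_increasing),
        "duplicates": int(parsed.duplicated().sum()),
        "missing": expected.difference(pd.DatetimeIndex(parsed)).tolist(),
    }

dates = pd.Series(["2023-01-02", "2023-01-09", "2023-01-23"])  # 2023-01-16 absent
report = check_dates(dates)
print(report["sorted"], report["duplicates"], report["missing"])
```

For daily or monthly data, pass a different pandas frequency string (for example `"D"` or `"MS"`) that matches the granularity you configured.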

During V2 driver initialisation, AMMM runs input validation (via diagnostics/input_validator.py) and raises ValueError on critical failures.

Automatic checks include:

  • duplicate column names,
  • NaN values in modelling columns,
  • date parsing/sorting/frequency consistency checks,
  • zero-variance (constant) feature checks.

If one of these fails, fix the input data or configuration before re-running.
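
To catch these failures before starting a run, similar checks can be approximated locally. The sketch below is only in the spirit of the list above and does not reproduce diagnostics/input_validator.py; `preflight` and the sample frame are hypothetical.

```python
import pandas as pd

def preflight(df: pd.DataFrame, model_cols: list[str]) -> list[str]:
    """Collect issues for duplicate names, NaN values and constant columns."""
    issues = []

    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        issues.append(f"duplicate column names: {dupes}")
        return issues  # column selection below is ambiguous with duplicates

    nan_cols = [c for c in model_cols if df[c].isna().any()]
    if nan_cols:
        issues.append(f"NaN values in: {nan_cols}")

    # A column with at most one distinct value has zero variance.
    constant = [c for c in model_cols if df[c].nunique(dropna=True) <= 1]
    if constant:
        issues.append(f"zero-variance columns: {constant}")
    return issues

df = pd.DataFrame({
    "revenue": [1500.0, 1520.0, 1490.0],
    "channel_1_spend": [25000.0, None, 24000.0],  # contains a NaN
    "events": [0, 0, 0],                          # constant
})
issues = preflight(df, ["revenue", "channel_1_spend", "events"])
for issue in issues:
    print(issue)
```

A caller that wants to match the documented behaviour could raise ValueError when the returned list is non-empty.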

After data loading, the workflow runs pre-diagnostics automatically and writes outputs to 10_pre_diagnostics/:

  • stationarity_summary.csv
  • vif_summary.csv
  • transfer_entropy_summary.csv
  • prior_predictive_check.png
  • prior_predictive_summary.csv

These are early warning signals for misspecification risk and should be reviewed before trusting downstream outputs.

diagnostics_gating (strict, warn, off) controls how strongly the pipeline enforces diagnostic outcomes downstream.
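
One plausible reading of the three modes is sketched below. It is an illustration of the idea only and an assumption about semantics, not AMMM's implementation: `apply_gating`, the exception type, and the check names are all hypothetical.

```python
import warnings

def apply_gating(mode: str, failed_checks: list[str]) -> bool:
    """Return True if downstream steps may proceed under the given mode."""
    if mode not in {"strict", "warn", "off"}:
        raise ValueError(f"unknown diagnostics_gating mode: {mode!r}")
    if not failed_checks or mode == "off":
        return True
    message = "diagnostics failed: " + ", ".join(failed_checks)
    if mode == "strict":
        # Assumed behaviour: strict blocks downstream steps.
        raise RuntimeError(message)
    # mode == "warn": continue, but flag the problem.
    warnings.warn(message)
    return True

print(apply_gating("off", ["vif"]))
```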

Practical guidance:

  • use strict for production-style governance,
  • use warn during exploratory modelling,
  • avoid consuming 40_decomposition/, 60_response_curves/, or 70_optimisation/ outputs when core diagnostics are poor.

Example input file (CSV):

DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
...

Use your own column names, but ensure they match config keys exactly.
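
Because pandas column lookups are case-sensitive, a configured name such as date will not find a column called DATE. The sketch below loads the sample rows with pandas and reports configured names that have no exact match; `find_name_mismatches` and the configured names passed to it are hypothetical.

```python
import io
import pandas as pd

csv_text = """DATE,revenue,channel_1_spend,channel_2_spend,competitor_sales,events
2023-01-02,1500,25000,12000,200,0
2023-01-09,1520,26000,11800,195,0
2023-01-16,1490,24000,12500,205,1
"""

# parse_dates converts the DATE column to datetime64 on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

def find_name_mismatches(configured: list[str], columns: list[str]) -> dict:
    """Map each configured name without an exact match to a near match, if any."""
    lookup = {c.strip().lower(): c for c in columns}
    return {
        name: lookup.get(name.strip().lower())
        for name in configured
        if name not in columns
    }

mismatches = find_name_mismatches(["date", "revenue", "tv_spend"], list(df.columns))
print(mismatches)
```

A name mapped to a near match usually points to a casing or whitespace difference, while a name mapped to None is missing from the data altogether.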

  • Keep all time-series columns aligned to the same frequency.
  • Encode categorical controls numerically before ingestion.
  • Keep a clean data-config/ folder with one config and one data file for auto-detection.
  • Use holidays.csv / holidays.xlsx when holiday controls are enabled.
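
For the numeric-encoding tip above, one common option is one-hot encoding with pandas. This is a sketch under assumptions: the `promo_type` control is hypothetical, and whether to drop a reference level depends on your model setup.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-16"]),
    "revenue": [1500, 1520, 1490],
    "promo_type": ["none", "discount", "bundle"],  # hypothetical categorical control
})

# drop_first removes one level so the dummy columns are not perfectly collinear.
encoded = pd.get_dummies(df, columns=["promo_type"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The resulting dummy columns can then be listed in extra_features_cols like any other numeric control.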