Can you trust your MMM?
A Marketing Mix Model (MMM) can produce challenging results, especially if you have been relying on last-touch (LT) attribution. When presented with challenging results, your first question may well be “can I trust the model?”.
The answer depends largely on the quality of the model as expressed through a set of diagnostics. The problem is that not all diagnostics are shared.
Let’s explore MMM diagnostics in some more detail so you are better placed to form your own conclusions on model quality.
What you should get from your MMM partner
Let’s start with the diagnostics you should see from your MMM partner. They should present you with these test diagnostics:
- Exploratory Data Analysis (EDA)
- Correlation Analysis (before modelling): Identifies highly correlated variables early, helping avoid multicollinearity in the model build and guiding variable selection – this is important because correlated variables can destabilise any modelling and results.
- Trend & Seasonality Checks: These checks confirm long‑term trends and seasonal peaks are properly captured before modelling media effects – this is important because if you don’t isolate these components, you risk misattributing their effects to media.
- ACF (Autocorrelation Function): Ensures time‑based patterns are understood and removed so the model isn’t biased by serial correlation – this is also important. Your sales or revenue data can include patterns or “tidal effects” where sales in the current week are partly related to sales in the previous week. Again, if you don’t control for these relationships, your model will produce erroneous results.
- Model Fit & Accuracy
- R²: This is a diagnostic that most marketers will be familiar with – it shows how much of the sales movement the model can account for – but beware, this is by no means a full diagnostic of the model.
- Adjusted R²: A fairer version of R² that penalises unnecessary variables, but again, not a full diagnostic.
- Actual vs Fitted plot which shows how well the model’s estimates match the actual data.
- MAE (Mean Absolute Error): The average size of the model’s prediction error in real sales units. For example, if a typical week delivers 1,000 sales and the model predicts values between 900 and 1,100, the absolute error is 100 units.
- MAPE (Mean Absolute Percentage Error): The model’s average prediction error expressed as a percentage of actual sales. Using the example above, an absolute error of 100 on 1,000 sales corresponds to a MAPE of 10%.
- Residual Diagnostics
- Durbin–Watson test: Checks whether any time‑based pattern is left in the errors; values near 2 indicate no remaining autocorrelation.
- Breusch–Pagan test: This test checks whether error variance is stable; instability can affect model coefficient reliability.
- Model Stability & Interpretability
- VIF (Variance Inflation Factor): Measures the correlation between channels in a model and helps ensure channels aren’t too correlated to separate cleanly. If your model contains variables that are highly correlated with each other, its coefficients are unlikely to be reliable.
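If you want to sanity-check some of these numbers yourself, several of the diagnostics above can be computed with plain NumPy. This is a minimal sketch, not a full MMM toolkit – the sales figures reuse the worked example from the MAE/MAPE bullets, and the media variables are made up purely to show how correlated channels inflate VIF:

```python
import numpy as np

def mae(actual, fitted):
    """Mean Absolute Error: average prediction error in sales units."""
    return np.mean(np.abs(actual - fitted))

def mape(actual, fitted):
    """Mean Absolute Percentage Error: error as a share of actual sales."""
    return np.mean(np.abs((actual - fitted) / actual)) * 100

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no serial correlation."""
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

def vif(X):
    """Variance Inflation Factor for each column of the driver matrix X."""
    out = []
    for j in range(X.shape[1]):
        y, others = X[:, j], np.delete(X, j, axis=1)
        # Regress column j on an intercept plus the remaining columns
        A = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - np.var(y - A @ beta) / np.var(y)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# The article's worked example: ~1,000 sales a week, errors of 100 units
actual = np.array([1000.0, 1000.0, 1000.0, 1000.0])
fitted = np.array([900.0, 1100.0, 900.0, 1100.0])
print(mae(actual, fitted))   # 100.0 sales units
print(mape(actual, fitted))  # 10.0 (percent)

# Two near-duplicate media channels inflate each other's VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)  # almost a copy of x1
x3 = rng.normal(size=100)             # independent driver
print(vif(np.column_stack([x1, x2, x3])))  # x1, x2 very large; x3 near 1
```

A common rule of thumb is that VIF above roughly 10 signals multicollinearity worth investigating, though the exact cut-off varies by practitioner.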
These tests are all mission critical for evaluating both the data going into the model and the model itself. But while they diagnose the quality of the model, they do not evaluate its predictive quality. Predictive quality tests are further tests that assess the model’s ability to predict outcomes on previously unseen test data. In some ways these validation tests are the ‘acid test’ in model diagnostics. When you combine the standard diagnostics (the four groups above) with the predictive quality tests you will see:
- Good results in both mean you are dealing with a reliable model.
- Good results in one but not the other should be cause for concern.
- Poor results in both diagnostics and validation tests mean the model should be ignored and re-specified.
Unfortunately, the predictive quality tests are not always shared by MMM providers – I’ll leave you to figure out why.
So, what are the predictive quality tests?
MMM validation route 1: The holdout test
- How does the test work? We evaluate the model on a subset of the data, usually towards the end of the sample. Let’s assume you have a model built on three years’ worth of weekly data – that’s 156 observations. You can ‘slice off’ the last 26 weeks and use the first 130 observations to “train” the model. You can then apply the trained model to the last 26 weeks, also known as the “test” period.
- What should you see? You will have the actual sales data for the 26-week test period. The key ‘acid test’ question for you is how well the model’s predictions for those 26 test weeks match actual sales in the test period. If your model is predicting within +/- 20% that’s a strong result. If it’s predicting within +/- 10% that is a very strong result.
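The holdout route can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a real MMM: the sales series is generated from two made-up media channels with known coefficients, and the “model” is a simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 156  # three years of weekly data

# Made-up data: sales driven by two media channels plus weekly noise
media = rng.uniform(0, 100, size=(n, 2))
sales = 500 + media @ np.array([3.0, 1.5]) + rng.normal(0, 30, size=n)

X = np.column_stack([np.ones(n), media])  # intercept + media drivers
train, test = slice(0, 130), slice(130, 156)

# "Train" the model on the first 130 weeks only
beta, *_ = np.linalg.lstsq(X[train], sales[train], rcond=None)

# Apply the trained model to the 26 held-out weeks and compare with actuals
pred = X[test] @ beta
holdout_mape = np.mean(np.abs((sales[test] - pred) / sales[test])) * 100
print(f"Holdout MAPE: {holdout_mape:.1f}%")  # within 10% would be very strong
```

Because the synthetic data really is generated by a linear process, this toy model passes easily; a real MMM holdout is much harder to pass, which is exactly why it is informative.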
MMM validation route 2: Cross-fold validation
- How does this test work? We slice your 156 weeks of data into 10 consecutive time “blocks” of 15-16 weeks. For each block, the model is fitted on the remaining data; in a similar way to the holdout test, we then provide the explanatory variables for the held-out block and ask the model to predict sales in that block from its coefficients. We then examine the prediction vs the actual and the size of the error in each block.
- What should you see? We will have the actual sales data for each of the blocks. Again, the key ‘acid test’ question for you is how well the model’s predictions for those blocks match actual sales in each of those blocks. If your model is predicting within +/- 20% that’s a strong result. If it’s predicting within +/- 10% that is a very strong result. It’s worth noting that in some categories 20% might be too tight; you could relax it to, say, 20-30%, but you need to be close to that range to have a validated model.
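The blocked cross-fold routine described above can be sketched as follows. Again this is illustrative only – same made-up two-channel data as before, with a plain linear regression standing in for the MMM:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 156, 10  # 156 weeks split into 10 consecutive blocks of 15-16 weeks

media = rng.uniform(0, 100, size=(n, 2))
sales = 500 + media @ np.array([3.0, 1.5]) + rng.normal(0, 30, size=n)
X = np.column_stack([np.ones(n), media])

# Block boundaries: consecutive, near-equal time slices
bounds = np.linspace(0, n, k + 1).astype(int)

block_mapes = []
for i in range(k):
    lo, hi = bounds[i], bounds[i + 1]
    hold = np.zeros(n, dtype=bool)
    hold[lo:hi] = True
    # Fit on every week outside the block, then predict the held-out block
    beta, *_ = np.linalg.lstsq(X[~hold], sales[~hold], rcond=None)
    pred = X[hold] @ beta
    block_mapes.append(np.mean(np.abs((sales[hold] - pred) / sales[hold])) * 100)

print([f"{m:.1f}%" for m in block_mapes])  # per-block error vs actual
```

Keeping the blocks consecutive in time (rather than shuffling weeks randomly) matters for MMM: it stops information from adjacent weeks leaking between train and test, which would flatter the validation result.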
What do these results mean? If you see a model prediction within 20% of the actual in the holdout period, then you have a model that is working well and is unlikely to be ‘overfitted’. An overfitted model looks good within the sample it was built on but performs poorly outside it, and would therefore fail these tests.

