Can matching models accurately estimate lift?
Attribution is the problem of assigning value back to a company’s various marketing channels and is a tricky proposition even for a company that only advertises online and has clean, reliable data. For instance, let’s imagine a website that sells fancy socks and a customer who visits the website three times via three channels, as follows:
- Paid Search
- Social media (a Facebook remarketing advert)
- Affiliate (25% discount)
On the final visit, the consumer buys some socks with the discount.
Attribution is hard because there are so many unknown variables. We don’t know whether the consumer only found the website because of the paid search advert or whether they would have found the website anyway. Likewise, we don’t know whether the discount was the deciding factor determining the sale, or whether the consumer would have bought the socks anyway.
Clearly, accurate attribution is difficult. Most companies rely on simple heuristic methods such as assigning all value to the last channel (Last Click) or assigning value to each channel equally (Linear).
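To make the two heuristics concrete, here is a minimal sketch of how Last Click and Linear would divide credit for the sock purchase across the three-channel journey above. The channel names and order value are illustrative:

```python
# The customer's journey from the example above, plus a made-up order value.
path = ["Paid Search", "Social", "Affiliate"]
sale_value = 50.0

# Last Click: the final touchpoint gets all of the credit.
last_click = {channel: 0.0 for channel in path}
last_click[path[-1]] = sale_value

# Linear: credit is split equally across every touchpoint.
linear = {channel: sale_value / len(path) for channel in path}

print(last_click)  # the Affiliate channel gets the full 50.0
print(linear)      # each channel gets one third of the value
```

Both rules are trivial to compute, which is precisely their appeal; neither makes any claim about what *caused* the sale.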
More recently, the data-science community has proposed a number of more complex algorithms. The claim is that these methods are more accurate; however, there is no evidence to support this claim. Instead, the rationale seems to be that because the theory is “better” than the simple heuristic methods described above, the results must be more accurate. This is somewhat surprising for a field with “science” in the name. Theory is cheap; evidence is gold.
I write this blog because in 2019 a study was published by Gordon et al. which compared the output of data-driven methods to a series of large-scale randomised controlled trials (RCTs) and revealed some very telling results. For those who are unaware, RCTs are the gold standard of experimental design and, when applied to the marketing domain, can provide an accurate measure of a marketing channel’s value.
Before reviewing the results of this study, I’ll briefly go over what an RCT is because if you don’t know, you really should.
Randomised Controlled Trials (RCTs)
As every good scientist will tell you, the only way to put a definitive value on your channel is to run a rigorous experiment. In this case, rigorous means the experiment should be both double-blind and randomised.
Randomised means the individuals are assigned to the control or test groups randomly. The aim is to ensure that both groups are statistically identical except for the experimental manipulation.
Double-blind means that neither the experimenter nor the subjects know whether they are in the test or the control group.
To meet these criteria when testing online advertising, we first have to identify our target group of users. Let’s imagine we are running prospecting advertising on Facebook and we want to target female users aged 20–29 in the San Francisco Bay Area. This is our target group, but before we roll out a full advertising campaign, we want to run a trial experiment, in which we show adverts only to a randomly chosen percentage of the target group. This is the test group; the remaining users belong to the control group. After running the experiment for a set period of time, we compare the conversion rates of the two groups and draw our conclusions. Simple.
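The procedure above can be sketched in a few lines. This is a minimal illustration, not a real experimentation platform: the function names, the 10% exposure rate, and the conversion counts in the usage example are all made up for the sake of the sketch.

```python
import random

def run_rct(user_ids, exposure_rate=0.1, seed=0):
    """Randomly assign each targeted user to the test group (shown
    adverts) with probability `exposure_rate`; everyone else falls
    into the control group (adverts withheld)."""
    rng = random.Random(seed)
    test, control = [], []
    for uid in user_ids:
        (test if rng.random() < exposure_rate else control).append(uid)
    return test, control

def conversion_lift(test_buyers, test_size, control_buyers, control_size):
    """Absolute difference in conversion rate between the two groups:
    the RCT estimate of the advert's effect."""
    return test_buyers / test_size - control_buyers / control_size

# Hypothetical usage: expose 10% of 100,000 targeted users.
test, control = run_rct(range(100_000), exposure_rate=0.1)
# After the experiment, suppose 150 test users and 1,100 control users buy:
lift = conversion_lift(150, len(test), 1_100, len(control))
```

Because assignment is random, any persistent difference in conversion rate between the two groups can be attributed to the adverts themselves.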
One may ask why advertisers don’t always run RCTs on their marketing channels. The simple answer is that many advertising channels do not provide the facility. While Facebook makes it very clear how to run an RCT on their platform, I do not know how to run an RCT on Google’s search platform.
The aforementioned study by Gordon et al. (2019) compared the results of data-driven methods to a series of large-scale RCT experiments on Facebook. The trial was large, with 15 RCTs including over 1.6 million ad impressions.
There are a number of flavours of data-driven attribution, and this paper evaluated a class known as the matching methodology. These methods work by identifying two groups of users: those who did see an advert and those who did not. We could call these groups the pseudo-test and the pseudo-control. The two groups are matched so that they are as similar as possible along a number of dimensions, such as age, gender, socio-economic status or propensity to buy; each method differs in its exact matching approach. Thus, unlike the RCTs, which use the principle of randomisation to ensure statistical equality between the test and control groups, the data-driven methods look through the historical data and explicitly attempt to identify groups that are equal on the various metrics except for exposure to the advert in question.
The key takeaway is that none of the data-driven methods evaluated could replicate the results of the RCT trials, and most massively overestimated the value of the advertising, often by over 300%. The interested reader is advised to examine Figure 10 of Gordon et al. (2019) for a full breakdown of the results.
The upshot is that if you rely on these data-driven methods to determine the success of your advertising, you are likely overvaluing your advertising or even mistaking a loss for a profit.
Indeed, let me speculate that this is why many advertising portals do not provide a facility for you to run RCTs. Call me a cynic, but I suspect most advertising portals are happy for you to overestimate the value of your advertising.
Gordon, BR, et al. “A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook.” Marketing Science 38.2 (2019): 193–225.
So what’s going on? In science lingo, there must be an unobserved variable unaccounted for by the data-driven methods. The algorithms fail to fully match the pseudo-control to the pseudo-test; thus there must be something other than exposure to the advert that differentiates the two groups.
The users in the pseudo-test group are those who have seen the advert, so by definition they must have been on the advertising portal during the period of the experiment, whilst those in the pseudo-control group were not, otherwise they would have seen the advert. Thus one theory is that those in the pseudo-test group are simply more active at the time of the experiment than those in the pseudo-control group, and it is not a huge jump to suggest that those who are more active are also more likely to buy. This proposal was first put forth by Lewis et al. (2011) and makes for a great read.
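This activity-bias story is easy to demonstrate with a toy simulation. Here a single latent "activity" factor drives both the chance of seeing the advert and the baseline chance of buying; every number in it (the 0.02 base rate, the 0.08 activity coefficient, the 0.01 true ad effect) is invented purely to illustrate the mechanism:

```python
import random

def simulate_activity_bias(n=100_000, true_ad_effect=0.01, seed=1):
    """Toy model of the Lewis et al. (2011) story: a user's latent
    activity raises both their probability of being exposed to the
    advert and their baseline probability of buying. Comparing the
    exposed and unexposed groups naively then inflates the measured
    lift well beyond `true_ad_effect`. All parameters are illustrative."""
    rng = random.Random(seed)
    exposed_buys = exposed_n = unexposed_buys = unexposed_n = 0
    for _ in range(n):
        activity = rng.random()            # latent "time online" factor
        sees_ad = rng.random() < activity  # active users see more ads
        p_buy = 0.02 + 0.08 * activity     # active users buy more anyway
        if sees_ad:
            p_buy += true_ad_effect
            exposed_n += 1
            exposed_buys += rng.random() < p_buy
        else:
            unexposed_n += 1
            unexposed_buys += rng.random() < p_buy
    return exposed_buys / exposed_n - unexposed_buys / unexposed_n
```

Running this, the naive exposed-versus-unexposed lift comes out several times larger than the true effect of 0.01, because the exposed group is systematically more active, and no amount of matching helps unless "activity at the time of the experiment" happens to be among the matched variables.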
If true, then the only way to truly control for this effect is to deliberately not show adverts to a certain proportion of those visiting the advertising portal. Ergo, run a randomised control trial.
Lewis, RA et al. “Here, there, and everywhere: correlated online behaviours can lead to overestimates of the effects of advertising.” Proceedings of the 20th international conference on World Wide Web. 2011.
In this article, we have reviewed the effectiveness of a type of data-driven attribution known as the matching method. The results are pretty clear: the methods do not work and tend to overestimate the uplift of Facebook adverts. Other popular methods exist, such as the Markov and Shapley methods, and we cannot definitively say these also do not work. However, we cannot definitively say they will work either, or indeed that they are more accurate than the simple heuristic methods.
Herein lies a problem inherent to much of data science. Companies have an explosion of data, but knowing how to derive value from that data is complex. Companies that navigate this space well will have people with a footing in both the scientific and commercial worlds, people for whom the above arguments are familiar. Importantly, these people will have a key role in the decision-making process.