Matt Williamson: Busting the ROI Myth
In this day and age, digital marketers have the ability to be pickier than ever when it comes to new software. For every marketing problem, there are at least a handful of solutions to choose from; however, they all come with a price attached. One of the most painstaking jobs that marketers are tasked with is accurately measuring ROI from all of their channels and tools. Every tool comes with promises of dramatically increasing ROI, but how do you, as a marketer, actually know whether the software you’re using is working for you?
Generally, marketers will look at two common methods for attribution and ROI modeling: last touch and multi-touch attribution. Both can be imperfect methods for really getting to the meat of what a specific marketing effort or tool is doing for your business. Let’s look at each of these in turn.
Last Touch Attribution:
Last touch attribution is a simplified way of allocating ROI: it looks at the last channel a customer came through and gives 100% of the credit for the sale to that channel. In fact, up to 98% of visitors to your site will not buy on their first visit, meaning that on any given sale, the chances that a visitor has interacted with your brand through various channels are high. Last touch gives you insight into the very last channel a customer used before converting, but it doesn’t tell the story of every touch point that drove a customer to purchase.
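To make that concrete, here’s a minimal sketch (in Python, with a made-up customer journey and order value) of how last touch credit assignment works - the final channel gets all of the credit, regardless of what came before:

```python
# Last touch attribution: 100% of the order value is credited to the final
# channel in the journey. The journey and order value below are hypothetical.
journey = ["organic search", "display ad", "facebook ad", "email"]
order_value = 80.00

credit = {channel: 0.0 for channel in journey}
credit[journey[-1]] = order_value  # the last channel gets everything

print(credit)
# {'organic search': 0.0, 'display ad': 0.0, 'facebook ad': 0.0, 'email': 80.0}
```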
In our business, one of the scoring models we use is Predictive Customer Value. If the future lifetime value of a specific customer is going to be high, it is arguable that a campaign triggered by our platform isn’t what created that next purchase. The customer was already predicted to be a high spender. Just because they touched our email or Facebook ad last doesn’t mean that we get full credit (as much as we may want it!) for the transaction.
Said simply, let’s say you had a group of 10,000 best customers. Left to their own devices, and based on your current marketing mix, you’d see 100 of them make a purchase this month. If a vendor (such as Windsor Circle) came in, applied their models, identified this group of buyers, sent them a series of emails, and 100 of them made a purchase, that doesn’t mean that Windsor Circle made that happen - 100 of them were going to buy anyway! But the last touch attribution model shows all 100 of these clicking on the email before making their purchase, so Windsor Circle would claim credit for those sales.
Last touch attribution really struggles here for obvious reasons.
Every marketer lives in this world of having to hold vendors accountable for results, and, with last touch attribution, it’s often a struggle to get an accurate picture of which marketing techniques are really bringing home the bacon and which are just noise.
Multi-Touch Attribution:
Multi-touch attribution is a more complex method of assigning percentages of the total conversion to the different channels that customers touched on their way to making a purchase. Multi-touch attribution models can be created manually, or you can use predictive software to do this for you. If you decide to use a predictive tool to assign attribution, make sure you trust the tool and that your data is accurate. While multi-touch attribution gives you an idea of the journey that a customer took to get to the sale, the percentage attributed to each channel can be biased (if you’re manually creating the scoring system), and it can still leave out a wide swath of “channels” such as offline sources.
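As a rough illustration, here are two common weighting schemes - a linear (even) split and a position-based “40/20/40” split - applied to the same hypothetical journey. Neither is necessarily the model your tool uses; the weights themselves are a choice, which is exactly where bias can creep in:

```python
# Two common multi-touch weighting schemes applied to one hypothetical journey.
journey = ["organic search", "display ad", "facebook ad", "email"]
order_value = 80.00

# Linear: every touch point gets equal credit.
linear = {channel: order_value / len(journey) for channel in journey}

# Position-based ("40/20/40"): first and last touches get 40% each,
# and the remaining 20% is split across the middle touches.
middle = journey[1:-1]
position_based = {channel: 0.0 for channel in journey}
position_based[journey[0]] += order_value * 0.40
position_based[journey[-1]] += order_value * 0.40
for channel in middle:
    position_based[channel] += order_value * 0.20 / len(middle)

print(linear)          # every channel credited $20
print(position_based)  # $32 / $8 / $8 / $32
```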
Done correctly, multi-touch attribution gives a better sense of the influencers or dials that you have at your disposal to try to accelerate or increase buying behavior.
Controlled Group Testing - Treated vs. Non-Treated Customers:
The most effective way to truly understand the value that certain tools and marketing initiatives may be driving for you is to set up a test - treat a certain percent of your customers and measure lift in revenue from that subset as compared to your non-treated (control) group. At Windsor Circle we automate this process within the platform to inform our clients of the lift that they are achieving (or not!) from the campaigns they are running.
Step 1: Identify Cohorts
The first step is to identify your cohort (what campaigns are you testing? What customer segments are a part of it?) and the percentage of customers you want to treat. In most cases we treat 90% and leave a 10% control group. We leverage a Universal Control Group (meaning that this 10% of the customer base receives absolutely no treatment). Within the Treatment Group, we then create control and treatment groups for each automated campaign. This allows us to see if treatment of any kind is creating statistically significant lift, and then, separately, to measure the individual lift created by each specific automated campaign. The question “is Windsor Circle creating lift?” is different from “is the abandoned cart automator or the predictive replenishment campaign creating lift?” These control and treatment groups, at both the universal and the campaign level, allow us to test at both levels.
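A hypothetical sketch of that nested structure might look like this (the campaign names and split percentages here are illustrative, not our actual configuration):

```python
# Nested test design: a universal split first, then an independent split for
# each automated campaign within the universal treatment group.
cohort_design = {
    "universal": {"treatment": 0.90, "control": 0.10},
    "campaigns": {
        "abandoned_cart_recovery":  {"treatment": 0.90, "control": 0.10},
        "predictive_replenishment": {"treatment": 0.90, "control": 0.10},
    },
}
```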
Step 2: Randomize Assignment to Cohorts
For each customer segment, we randomize who is added to the treatment group and the control group. So, continuing from the logic above, we randomly assign all customers to either the control group of 10% or the treatment group of 90% upon initial load, and then, as we update with ongoing purchases, continue that random assignment. This is also true at the automator level. A customer that is randomly assigned to the universal control group will never receive our campaigns. That said, a customer in the universal treatment group is eligible to receive campaigns (treatments). From there, they will be randomly assigned to the treatment or control groups for each specific campaign. So, it’s entirely possible for someone to be in the treatment group for Abandoned Cart Recovery (and thus receiving campaigns) and in the control group for the Predictive Replenishment Campaigns (and thus not receiving automated replenishment reminders). This randomization helps to eliminate noise in the data.
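Building on the hypothetical cohort_design sketched above, a simple randomized assignment could look like the following (again, an illustrative sketch, not our production code):

```python
import random

def assign_groups(customer_ids, design, seed=42):
    """Randomly assign customers to universal and per-campaign groups.

    Customers in the universal control group never receive campaigns; everyone
    else is independently randomized into a treatment or control group for
    each individual campaign.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    assignments = {}
    for cid in customer_ids:
        universal = "treatment" if rng.random() < design["universal"]["treatment"] else "control"
        record = {"universal": universal, "campaigns": {}}
        if universal == "treatment":
            for campaign, split in design["campaigns"].items():
                in_treatment = rng.random() < split["treatment"]
                record["campaigns"][campaign] = "treatment" if in_treatment else "control"
        assignments[cid] = record
    return assignments

groups = assign_groups(range(10_000), cohort_design)
```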
Step 3: Treat the Treatment Group (and Don’t Treat the Control Group!)
Once you have your cohorts assigned, it’s time to run the test. Most of Windsor Circle’s campaigns are triggered through our platform partners (Salesforce, Oracle, IBM, Bronto, WhatCounts, MailChimp, etc.). As such, you set up the automators in our system, press go, and we handle the rest.
Step 4: Measure the Results
Here’s where it gets interesting.
If you can measure the transactional purchase data (we can through our platform), you can bypass the process of trying to measure attribution via clicks, coupons, or other proxies for assigning value to marketing treatments.
We can simply measure the randomized control and treatment groups to assess whether there is a difference in spending patterns. Said simply, we’re going to sprinkle some magic dust on a randomly selected group of recipients, and not sprinkle it on others, and then over time, we’re going to measure whether or not those that got the magic dust grew more than those that didn’t.
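In code, the core measurement is almost embarrassingly simple - compare average spend per customer between the two randomized groups over the same window. The data below is made up (and far too small to mean anything); it just shows the mechanics:

```python
# Hypothetical total spend per customer during the measurement window.
spend_by_customer = {"c1": 120.0, "c2": 0.0, "c3": 45.0,
                     "c4": 0.0,   "c5": 60.0, "c6": 0.0}
treated_ids = ["c1", "c2", "c3"]  # got the magic dust
control_ids = ["c4", "c5", "c6"]  # did not

def mean_spend(customer_ids):
    return sum(spend_by_customer.get(cid, 0.0) for cid in customer_ids) / len(customer_ids)

treated_mean = mean_spend(treated_ids)  # 55.0
control_mean = mean_spend(control_ids)  # 20.0
lift = (treated_mean - control_mean) / control_mean

print(f"Observed lift: {lift:.0%}")  # a toy number, not a real result
```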
Interesting side note for fellow data nerds… We’ve found in our work that it’s important to handle, in industry-accepted ways, the outliers that inevitably show up in the data. That process is called “winsorizing” (not “Windsorizing,” although that’d be cool if it were). It’s basically a systematic way of trimming the excessively small or large values in the data set. You can read more about “winsorized means” here if it tickles your fancy.
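For the curious, here’s what winsorizing might look like using scipy on a made-up set of spend values with one extreme outlier (the 10% limits chosen here are purely illustrative):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical per-customer spend, with one extreme outlier (a single $5,000 order).
spend = np.array([20, 35, 0, 50, 42, 0, 28, 15, 60, 5000], dtype=float)

# Winsorizing at 10% on each tail replaces the most extreme values with the
# nearest remaining value instead of dropping them outright.
clipped = winsorize(spend, limits=[0.10, 0.10])

print(spend.mean())    # heavily skewed by the $5,000 order
print(clipped.mean())  # much closer to what a typical customer spends
```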
Step 5: Analyze for Statistical Significance
A trap that many marketers fall into is trying to conclude, too soon, what the effect of a given treatment is. It is important that you observe enough data points to truly achieve statistical significance.
If I wanted to know the average number of trips to a buffet that a restaurant patron makes, I could sit down at the local cafeteria and just start counting the number of trips each person makes. But let’s say that I only came to measure for one evening. And that evening happened to be the night before a big collegiate football game, and the patrons that showed up were 50 ravenously hungry, massively large football players who each made 5-6 trips to the buffet.
Can I assume that the average is 5.5 trips from this small observation set? Of course not. We took too small (and too skewed) a sample, at least for the purpose of observing restaurant patrons as a whole, and that would lead us to the wrong conclusion (perhaps not about football players, but that wasn’t what we were measuring).
Even we at Windsor Circle have to be careful not to declare successes too early. We often coach our team to avoid saying anything until statistical significance is achieved, even if they see a lift percentage associated with the data. In other words, don’t say “it looks like a 15% lift and we’ll confirm it as we go.” We coach our team to simply say “the data is inconclusive right now, and it’d be irresponsible of me to try to infer anything until we have enough observations.”
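If you want to see what “checking for significance” can look like in practice, here’s one common approach - a two-sample t-test on the (winsorized) spend per customer in each group. The simulated data and the 0.05 threshold are illustrative, and this isn’t the only valid test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated (already winsorized) spend per customer for each group.
treated = rng.gamma(shape=2.0, scale=30.0, size=9_000)
control = rng.gamma(shape=2.0, scale=28.0, size=1_000)

# Welch's t-test: is the difference in mean spend distinguishable from noise?
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

if p_value < 0.05:
    lift = treated.mean() / control.mean() - 1
    print(f"Statistically significant lift of {lift:.1%} (p = {p_value:.4f})")
else:
    print(f"Inconclusive so far (p = {p_value:.4f}) - keep collecting observations")
```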
Depending on the desired outcome, inconclusive results can drive different behaviors. If you’re pretty sure that the treatment is working, then you’ll want to maximize the treatment group while more slowly establishing statistical significance because it maximizes potential revenue. This has to be a gut call… you can’t infer anything from the data yet, so you just have to do your job as a marketer and make the judgment call.
If, on the other hand, you want to more rapidly get clarity, you may choose to forego higher levels of treatment by adding more people to the control group (perhaps a 70/30 split or even a 50/50 split) to get to statistical significance as quickly as possible. In this case, you’re willing to forego potential revenue lift in exchange for knowing sooner if you can really rely on the new treatment to drive growth.
One decision maximizes potential revenue. The other maximizes speed to getting clarity. Neither decision is right or wrong… just depends on what you most value.
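One way to put numbers on that trade-off is a simple power calculation: for the same audience, how likely is each split to detect a small real effect? The audience size, effect size, and thresholds below are illustrative, and statsmodels is just one tool that can do this:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
total_customers = 100_000
effect_size = 0.02  # a small standardized lift in mean spend (Cohen's d), made up

for treatment_share in (0.9, 0.7, 0.5):
    n_treatment = total_customers * treatment_share
    n_control = total_customers - n_treatment
    power = analysis.power(effect_size=effect_size, nobs1=n_treatment,
                           ratio=n_control / n_treatment, alpha=0.05)
    print(f"{treatment_share:.0%}/{1 - treatment_share:.0%} split -> "
          f"power to detect the effect ~ {power:.2f}")

# A 50/50 split detects the same small effect far more reliably than 90/10,
# which is why rebalancing toward the control group gets you clarity sooner.
```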
Step 6: Infer
If your results are considered statistically significant, what did you find? Did you see incremental lift? If so, the tool you’re using can be considered a success. We’re finding that individual treatments (such as the Replenishment Automator) are generating as high as an 8% lift for treated individuals. We can see across hundreds of thousands of observations that the combination of data-driven campaigns is creating 30%+ lift in revenue for treated customers.
But not for everyone!
We see some clients where the results for an individual automator reach statistical significance and show little to no effect. When this happens, one must revisit that specific campaign and ask whether it’s worth the investment of time and energy, or whether something needs to be modified in the cadence, the messaging, or some other aspect of the automated campaign.
Summary
One of the most important jobs that a marketer has is to discern whether a given product or strategy works for their business. While attribution models paint a nice picture of the customer journey, ultimately you need to be able to understand, in a data-driven, statistically significant way, the lift or impact that a certain product or strategy has on your business in order to make smart financial decisions to grow revenue and customer lifetime value. Leveraging controlled group testing provides a means for assessing performance with rigor.