3 Reasons Why Subject Line Split Tests Aren’t Teaching You Anything About Your Customers


(Editor’s note: the full title of this post is “3 pretty complicated but incredibly important reasons why your subject line split tests aren’t teaching you very much about your customers.”)

Split testing works.

It works amazingly well… if you plan the tests, and analyze and interpret the results correctly. I’ve worked with hundreds of marketers across the world who implement split testing with varying levels of statistical robustness and success.

What I’ve learned from this is that most email marketers don’t run their experiments optimally.

And this means they aren’t learning very much from their split tests. They focus on one-off lifts in response rates… and miss the all-important cognitive buying signals that correct split testing methodology reveals.

Following is a whole bunch of stuff that may make your head hurt a little bit. I’m not gonna lie, some of this stuff is pretty heavy. But if you want to truly improve your email subject lines, split testing is, and will remain, the only way to do so. So read on…

Here are the 3 things you need to start doing now to ensure your split tests deliver long-run value to your business, not just fleeting and unpredictable one-off uplifts.

1. Experimental design matters. A lot.

Consider these two subject lines:

“Get up to 50% off our beautiful autumn wardrobe plus free delivery”

“Need to update your wardrobe? Our autumn collection is out now… save up to 50%!”

This is based upon an actual A/B test conducted last week by an (unnamed) fashion company. The second subject line won by about 2% in opens.

So what have they learned? Perhaps one of the below:

-       Leading with a discount doesn’t work

-       Free delivery doesn’t work

-       Questions work

-       Ending with a discount works

-       Using superlative adjectives doesn’t work

-       “Collection” is superior to “Wardrobe” sometimes, but not always

-       Using 3 periods to separate sentences works

-       3 sentence fragments work better than one full sentence

-       Exclamation points and question marks increase response

-       “Save” works better than “Get”

Or is it something else? Or all of them? Or none of them?

In essence, while they fortunately got 2% more opens, there are so many differences between the two subject lines that it’s impossible to identify what caused the increase.

If you assume that one of these explanations is the true one, say that “Save” works better than “Get,” and build your next subject line around it, then you will face massive response uncertainty in your next experiment.

Poor experimental design results in unpredictable fluctuations in response. This is annoying, but the problem is more fundamental:

Poor experimental design teaches you the wrong lessons.

Well-designed experiments will allow you to learn about your audience. Poorly-designed experiments will lead you up the garden path.
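To make this concrete, here is a minimal sketch of one way to design a cleaner test: a full factorial layout, where every combination of a few factors gets its own variant. The factors and wording below are hypothetical, loosely borrowed from the two subject lines above, and this is just one possible approach rather than the only correct design.

```python
from itertools import product

# Hypothetical factors, loosely based on the two subject lines above.
# Each factor has exactly two levels so its effect can be isolated.
FACTORS = {
    "verb": ["Get", "Save"],
    "free_delivery": ["", " plus free delivery"],
    "punctuation": [".", "!"],
}

def build_variants(factors):
    """Return every combination of factor levels (a full factorial design)."""
    names = list(factors)
    variants = []
    for levels in product(*(factors[name] for name in names)):
        cell = dict(zip(names, levels))
        text = (
            f"{cell['verb']} up to 50% off our autumn collection"
            f"{cell['free_delivery']}{cell['punctuation']}"
        )
        variants.append((cell, text))
    return variants

def differing_factors(cell_a, cell_b):
    """List the factors on which two variants differ."""
    return [name for name in cell_a if cell_a[name] != cell_b[name]]

variants = build_variants(FACTORS)
for cell, text in variants:
    print(text)
```

Any two variants that differ on exactly one factor form a clean comparison: if one beats the other, you know which change to credit. The two subject lines in the fashion example differ on roughly ten things at once, which is why that test can’t tell you anything.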

2. Correlation Does Not Equal Causation

Jot this one down, stick it on a post-it note, and memorize it: Correlation does not equal causation.

Why does this matter?

Take this example from something we all experience – food.

It is widely believed that if a higher proportion of your calorific intake is fat, then you will have a higher propensity to develop heart disease.

Makes sense, right?

But hang on – is this true? Let’s look at the country perhaps most famous for culinary expertise, France.

Anyone who enjoys French food knows that they love their butter, cream and other animal fat forms. Julia Child made herself famous by adding butter to everything.

And yet, France’s rates of heart disease are amongst the lowest on the planet!

This phenomenon is known as the French Paradox (Wikipedia that one later.)

What it doesn’t mean is that fat intake and heart disease incidence aren’t related.

What it does mean is that the relationship is not necessarily causal. It’s certainly broadly correlated, but there are other variables at work here as well (for example, exercise rates, levels of processed food, quality of data collected by health authorities, and the like.)

Fat intake is correlated with heart disease. But it doesn’t necessarily cause it.

Going back to our fashion subject line example above, let’s say you take your assumptions from that campaign and send out your next campaign with this subject line:

“Our new autumn collection is out! Are you ready? You can save up to 50%!”

And let’s say it gets the same open rate as the winner from last time.

Great, good job… and now you believe you’ve cracked the code.

Until it stops working and you’re back to square one.
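If you want to see how this trap plays out in numbers, here is a small simulation. Everything in it is invented for illustration: a hidden factor (labeled “sale week” here) makes marketers more likely to use the word “Save” and also makes customers more likely to open, while the word itself has no effect at all.

```python
import random

def simulate(n=100000, seed=1):
    """Simulate sends where a hidden factor drives both word choice and opens."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sale_week = rng.random() < 0.5              # hidden factor (assumed)
        uses_save = rng.random() < (0.8 if sale_week else 0.2)
        open_prob = 0.30 if sale_week else 0.20     # the word has NO effect here
        opened = rng.random() < open_prob
        rows.append((sale_week, uses_save, opened))
    return rows

def open_rate(rows):
    """Share of simulated sends that were opened."""
    return sum(opened for _, _, opened in rows) / len(rows)

rows = simulate()

# Naive comparison: "Save" versus no "Save", ignoring the hidden factor.
naive_gap = open_rate([r for r in rows if r[1]]) - open_rate([r for r in rows if not r[1]])

# Fair comparison: look only within sale weeks, so the hidden factor is held fixed.
sale_rows = [r for r in rows if r[0]]
adjusted_gap = open_rate([r for r in sale_rows if r[1]]) - open_rate([r for r in sale_rows if not r[1]])

print(f"naive gap: {naive_gap:.3f}, adjusted gap: {adjusted_gap:.3f}")
```

The naive comparison shows “Save” winning by a healthy margin, even though the simulation gives it no power whatsoever. Hold the hidden factor fixed and the gap all but disappears. That is correlation without causation.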

3. The Future Changes, But the Past Doesn't

This is a hugely important point.

What worked yesterday won’t necessarily work today. Consumers change, public tastes change, and what’s popular today may be the subject of a backlash tomorrow.

Strong, statistically robust analysis of your experimental results is how you learn things.

So let’s say you’ve done a bunch of well-designed controlled experiments. Congrats! Step one is done.

Now, you have to analyze them.

One simple way is to line up all of the different hypotheses you’ve tested, take the “winner” from each, and form a subject line like Voltron.

If everything else stayed the same, this would be the dominant strategy. However, unless you can stop time by snapping your fingers like Zack Morris, this simply isn’t feasible.

A pure, straight-line analysis doesn’t adapt over time.

A single experiment will indeed tell you which variable is superior at that point in time. For this, a frequentist perspective (that is, normalised confidence intervals) is appropriate.

One obvious limitation of purely looking at the frequentist “winner” of a test is that you’re not looking at the effect size – that is, how *much* better the winning variant is compared with the other variants. By considering only statistical significance without also comparing 1) the variance and 2) the variance of the variance, your analytic model lacks context.
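As a rough sketch of what looking at effect size means in practice, the snippet below reports the lift in open rate along with an approximate 95% interval around it, rather than just declaring a winner. The counts are made up, and the interval uses the standard normal (Wald) approximation, which is one common choice and assumes reasonably large send volumes.

```python
import math

def open_rate_lift(opens_a, sent_a, opens_b, sent_b, z=1.96):
    """Difference in open rate (B minus A) with an approximate 95% interval.

    Uses the normal (Wald) approximation for a difference of two proportions.
    """
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    diff = rate_b - rate_a
    std_err = math.sqrt(
        rate_a * (1 - rate_a) / sent_a + rate_b * (1 - rate_b) / sent_b
    )
    return diff, diff - z * std_err, diff + z * std_err

# Made-up counts: 10,000 sends per arm, 2,000 opens for A and 2,200 for B.
diff, low, high = open_rate_lift(2000, 10000, 2200, 10000)
print(f"lift: {diff:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

With these invented numbers the whole interval sits above zero, so B is a genuine winner, but the interval is also wide relative to the lift itself. A “win” can be real and still tell you very little about how big the advantage is.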

But that’s not the main limitation. The whole point of subject line testing is to predict the future. While a frequentist perspective is appropriate to determine individual winners, it’s not very useful when you need to combine results and create stronger subject lines in your next campaign.

To predict the future, the appropriate model is Bayesian.

Bayesian inference is rooted in conditional probability. It answers questions like:

“What is the probability of my next subject line being amazing if it contains the word ‘Save’ and includes a ‘Free Delivery’ option… given that in two previous experiments ‘Save’ outperformed ‘Get’ and ‘Free Delivery’ didn’t perform well?”

See the difference? It’s not just answering a simple question. It’s answering the question while taking into account the relative importance of previous results.

If you want to predict the future, you need a Bayesian inference model that will translate all your available frequentist learnings into a robust probabilistic equation.
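Here is a minimal sketch of the idea, using the simplest Bayesian model for open rates: a Beta prior updated with open and non-open counts. This is an illustration under stated assumptions, not a full model. The counts are made up, the prior is uniform, and it pools evidence about a single word choice (“Get” versus “Save”) rather than many variables at once.

```python
import random

def posterior(opens, sends, prior_alpha=1.0, prior_beta=1.0):
    """Beta posterior parameters for an open rate after observing opens/sends."""
    return prior_alpha + opens, prior_beta + (sends - opens)

def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws

# Made-up results from a first test of "Get" (A) versus "Save" (B).
get_post = posterior(opens=400, sends=2000)
save_post = posterior(opens=440, sends=2000)

# A second test updates the first posterior instead of starting from scratch.
get_post = posterior(410, 2000, *get_post)
save_post = posterior(450, 2000, *save_post)

p = prob_b_beats_a(get_post, save_post)
print(f"P(Save beats Get) = {p:.3f}")
```

The key move is the second pair of `posterior` calls: yesterday’s conclusion becomes today’s starting point. Each new experiment shifts the probability up or down instead of overwriting what you learned before, which is what lets the model adapt when your audience changes.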

Math is hard. But it’s worth the effort.

If you’ve made it this far, you’re either interested in improving your subject line strategy, or you are a masochist. Or both.

The point is, to quote a wise man, “Stupid is as stupid does.” If you have a poorly designed experiment, if you analyze it incorrectly, and if you interpret results ineffectively, your split test results are not going to help you learn anything about your customers.

Of course, option two is clear: if you want to save time and get lots of opens on your next campaign, just use this subject line:

“Free beer!”

It’ll work the first time and crash the second time, but it’ll be pretty easy to figure out why.


Editor’s Note: Parry is conducting a split test survey. To fill it out go here:


The data will be available to readers of the OI newsletter shortly.



Wednesday, 27 May 2020