12 A/B Split Testing Mistakes I See Businesses Make All The Time

https://conversionxl.com/blog/12-ab-split-testing-mistakes-i-see-businesses-make-all-the-time/

 

A/B testing is fun. With so many easy-to-use tools around, anyone can (and should) do it. However, there’s actually more to it than just setting up a test. Tons of companies are wasting their time and money by making these 12 mistakes.

Here are the top mistakes I see again and again. Are you guilty of making these mistakes? Read and find out.

#1: A/B tests are called early

Statistical significance is what tells you whether version A is actually better than version B—if the sample size is large enough. 50% statistical significance is a coin toss. If you’re calling tests at 50%, you should change your profession. And no, 75% statistical confidence is not good enough either.

Any seasoned tester has had plenty of experiences where a “winning” variation at 80% confidence ends up losing badly after giving it a chance (read: more traffic).

What about 90%? Come on, that’s pretty good!

Nope. Not good enough. You’re performing a science experiment here. Yes, you want it to be true. You want that 90% to win, but more important than having a “declared winner” is getting to the truth.

As an optimizer, your job is to figure out the truth. You have to put your ego aside. It’s very human to get attached to your hypothesis or design treatment, and it can hurt when your best hypotheses end up not being significantly different. Been there, done that. Truth above all, or it all loses meaning.

A very common scenario, even for companies that test a lot: they run one test after another for 12 months, declare many of those tests winners, and roll them out. A year later, the conversion rate of their site is the same as it was when they started. Happens all the damn time.

Why? Because tests are called too early and/or sample sizes are too small. You should not call tests before you’ve reached 95% confidence or higher. 95% means that there’s only a 5% chance that the results are a complete fluke. A/B split testing tools like Optimizely or VWO both tend to call tests too early: their minimum sample sizes are way too small.

Here’s the problem with what Optimizely tells you: a sample size of 100 visitors per variation is not enough. Optimizely leads many people to call tests early and doesn’t have a setting where you can change the minimum sample size needed before declaring a winner.

VWO has a sample size feature, but their default is incredibly low. You can configure it in the test settings.

Conspiracy theorists say VWO and Optimizely do it on purpose to generate excitement about testing so users keep on paying them. Not sure that’s true, but they really should stop calling tests early. Here’s an example I’ve used before. Two days after starting a test, these were the results:
The variation I built was losing badly—by more than 89% (and with no overlap in the margin of error). Some tools would already call it and say statistical significance was 100%. The software I used said Variation 1 had a 0% chance to beat the Control. My client was ready to call it quits.

However, since the sample size here was too small (only a little over 100 visits per variation), I persisted, and this is what it looked like 10 days later: that’s right, the variation that had a 0% chance of beating the control was now winning with 95% confidence.

Watch out for A/B testing tools “calling it early” and always double check the numbers. The worst thing you can do is have confidence in data that’s actually inaccurate. That’s going to lose you money and quite possibly waste months of work.
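
If you ever want to double-check a tool’s verdict, a plain two-proportion z-test is enough. Here’s a minimal Python sketch; the visitor and conversion counts are made up for illustration, not taken from the test above:

```python
# Sanity-check a "winner" with a two-proportion z-test.
# All numbers below are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [12, 5]    # control, variation
visitors = [110, 108]    # traffic per variation

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# With only ~100 visitors per variation, p usually stays well above 0.05,
# no matter how dramatic the difference in conversion rates looks.
```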

How big of a sample size do I need?

You don’t want to make conclusions based on a small sample size. A good ballpark is to aim for at least 350-400 conversions per variation (it can be less in certain circumstances – like when the discrepancy between control and treatment is very large). BUT – magic numbers don’t exist. Don’t get stuck on a number – this is science, not magic.

You NEED TO calculate the required sample size ahead of time, using a sample size calculator. This is a pretty useful tool for understanding the relation between uplift percentages and needed sample sizes: http://www.testsignificance.com.
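
If you’d rather script it than use a web calculator, the standard two-proportion formula is straightforward. A sketch in Python, with a hypothetical 3% baseline conversion rate and a 20% relative uplift as the smallest effect worth detecting:

```python
# Rough per-variation sample size for a two-proportion z-test.
# Baseline, uplift, alpha and power below are illustrative choices.
from scipy.stats import norm

baseline = 0.03            # 3% baseline conversion rate
uplift = 0.20              # smallest relative lift worth detecting
alpha, power = 0.05, 0.80  # 95% confidence, 80% power

p1, p2 = baseline, baseline * (1 + uplift)
z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
z_beta = norm.ppf(power)

n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"~{round(n):,} visitors per variation")   # roughly 13,900 here
```

At 3% vs. 3.6% conversion rates, that works out to roughly 400-500 conversions per variation, which lands in the same ballpark as the rule of thumb above.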

What if I have 350 conversions per variation, and confidence is still not 95% (or higher)?

If the needed sample size has been reached, it means there is no significant difference between the variations. Check the test results across segments to see whether significance was achieved in one segment or another (great insights always lie in the segments – but you also need a large enough sample size for each segment). In any case, you need to improve your hypothesis and run a new test.

#2: Tests are not run for full weeks

Let’s say you have a high traffic site. You achieve 98% confidence and 250 conversions per variation in 3 days. Is the test done? Nope.

We need to rule out seasonality and test for full weeks. Did you start the test on Monday? Then you need to end it on a Monday as well. Why? Because your conversion rate can vary greatly depending on the day of the week.

So if you don’t test a full week at a time, you’re again skewing your results. Run a conversions-per-day-of-the-week report on your site and see how much fluctuation there is. Here’s an example: Thursdays make 2x more money than Saturdays and Sundays, and the conversion rate on Thursdays is almost 2x better than on a Saturday.
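
If your analytics tool doesn’t give you this report out of the box, it’s a simple group-by on raw session data. A sketch in Python; the sessions.csv export and its column names are hypothetical:

```python
# Conversions and revenue per day of the week, from a hypothetical session export.
import pandas as pd

df = pd.read_csv("sessions.csv", parse_dates=["timestamp"])
df["weekday"] = df["timestamp"].dt.day_name()

report = df.groupby("weekday").agg(
    sessions=("session_id", "nunique"),
    conversions=("converted", "sum"),
    revenue=("revenue", "sum"),
)
report["conversion_rate"] = report["conversions"] / report["sessions"]
print(report.sort_values("conversion_rate", ascending=False))
```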

If we didn’t test for full weeks, the results would be inaccurate. So this is what you must always do: run tests for 7 days at a time. If confidence is not achieved within the first 7 days, run it another 7 days. If it’s not achieved within 14 days, run it another 7 days.

Of course, you first need to run your tests for a minimum of 2 weeks anyway (my personal minimum is 4 weeks, since 2 weeks is often inaccurate), and then apply the 7-day rule.

The only time when you can break this rule is when your historical data says with confidence that every single day the conversion rate is the same. But it’s better to test 1 week at a time even then.

Always pay attention to external factors

Is it Christmas? Your winning test during the holidays might not be a winner in January. If you have tests that win during shopping seasons like Christmas, you definitely want to run repeat tests on them once the shopping season is over. Are you doing a lot of TV advertising or running other massive campaigns? That may also skew your results. You need to be aware of what your company is doing.

External factors definitely affect your test results. When in doubt, run a follow-up test.

#3: A/B split testing is done even when there isn’t enough traffic (or conversions)

If you make 1 or 2 sales per month and run a test where B converts 15% better than A – how would you know? Nothing changes!

I love A/B split testing as much as the next guy, but it’s not something you should use for conversion optimization when you have very little traffic. The reason is that even if version B is much better, it might take many months to achieve statistical significance.

So if your test took 5 months to run, you wasted a lot of money. Instead, you should go for massive, radical changes – and just switch to B. No testing, just switch – and watch your bank account. The idea here is that you’re going for massive lifts – like 50% or 100%. And you should notice that kind of an impact on your bank account (or in the number of incoming leads) right away. Time is money. Don’t waste time waiting for a test result that takes many months.
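
To see how painful this gets, here’s a back-of-the-envelope sketch with made-up numbers; the ~14,000-per-variation figure is the kind of output a sample size calculation (like the one under mistake #1) might give for a low-converting page:

```python
# How long a test drags on at low traffic. All numbers are hypothetical.
required_per_variation = 14_000   # from a sample size calculation
monthly_visitors = 2_000          # total traffic to the tested page

months = required_per_variation * 2 / monthly_visitors
print(f"~{months:.0f} months to finish a simple A/B test")   # ~14 months
```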

#4: Tests are not based on a hypothesis

I like spaghetti. But spaghetti testing (throw it against the wall, see if it sticks), not so much. It’s when you test random ideas just to see what works. Testing random ideas comes at a huge expense—you’re wasting precious time and traffic. Never do that. You need to have a hypothesis. What’s a hypothesis?

A hypothesis is a proposed statement made on the basis of limited evidence that can be proved or disproved and is used as a starting point for further investigation.

And this shouldn’t be a spaghetti hypothesis either (crafting a random statement). You need to complete proper conversion research to discover where the problems lie, and then perform analysis to figure out what the problems might be, ultimately coming up with a hypothesis for overcoming the site’s problems.

If you test A vs. B without a clear hypothesis and B wins by 15%, that’s nice, but what have you learned? Nothing. When you test with a proper hypothesis, what’s even more important than the lift is what you learn about your audience. That helps you improve your customer theory and come up with even better tests.

#5: Test data is not sent to Google Analytics

Averages lie, always remember that. If A beats B by 10%, that’s not the full picture. You need to segment the test data, that’s where the insights lie.

While Optimizely has some built-in segmentation of results, it’s still no match for what you can do within Google Analytics. You need to send your test data to Google Analytics and segment it there. If you use Visual Website Optimizer, they have a nice global setting for tests, so the integration is automatically turned on for each test you run.

Set it and forget it. Optimizely makes you suffer for whatever stupid reason: they make you switch on the integration for each test separately.

They should know that people are not robots and sometimes forget. Guys, please make a global setting for it. So what happens here is that they send the test info into Google Analytics as custom variables. You can run advanced segments and custom reports on it. It’s super useful, and it’s how you can actually learn from A/B tests (including losing and no-difference tests).

But Monetate – which should be a class above the other two services, since it costs way more – is not even able to send custom reports. Ridiculous, I know. They can only send test data as events. So in order to get more useful data, create an advanced segment for each variation based on the event label. Then you can check whatever metrics you want in GA with a segment for each variation applied. Bottom line: always send your test data to Google Analytics. And segment the crap out of the results.
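
However your tool gets the data out, segmenting it boils down to a group-by per variation and segment. A sketch with pandas; the export file and its column names are hypothetical:

```python
# Conversion rate per variation, broken down by device category.
# The CSV export and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("ab_test_export.csv")   # one row per session

seg = df.groupby(["variation", "device_category"]).agg(
    sessions=("session_id", "nunique"),
    conversions=("converted", "sum"),
)
seg["conversion_rate"] = seg["conversions"] / seg["sessions"]
print(seg)
# A flat overall lift can hide a big win on mobile and a loss on desktop.
```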

#6: Precious time and traffic are wasted on stupid tests

So you’re testing colors, huh? Stop.

There is no best color; it’s always about visual hierarchy. Sure, you can find tests online where somebody found gains by testing colors, but they’re all no-brainers. Don’t waste time testing no-brainers, just implement. You don’t have enough traffic; nobody does. Use your traffic on high-impact stuff. Test data-driven hypotheses.

#7: They give up after the first test fails

You set up a test, and it failed to produce a lift. Oh well. Let’s try running tests on another page?

Not so fast! Most first tests fail. It’s true. I know you’re impatient, so am I, but the truth is iterative testing is where it’s at. You run a test, learn from it, and improve your customer theory and hypotheses. Run a follow-up test, learn from it, and improve your hypotheses. Run a follow-up test, and so on.

Here’s a case study where it took 6 tests (testing the same page) to achieve the kind of lift we were happy with. That’s what real testing life is like. People who approve testing budgets—your bosses, your clients—need to know this.

If the expectation is that the first test will knock it out of the park, money will get wasted and people will get fired. It doesn’t have to be that way. It can be lots of money for everyone instead. Just run iterative tests. That’s where the money is.

#8: They don’t understand false positives

Statistical significance is not the only thing to pay attention to. You need to understand false positives too. Impatient testers will want to skip A/B testing, and move on to A/B/C/D/E/F/G/H testing. Yeah, now we’re talking!

Or why stop there – Google tested 41 shades of blue! But that’s not a good idea. The more variations you test against each other, the higher the chance of a false positive. In the case of 41 shades of blue, even at a 95% confidence level the chance of a false positive is 88%.
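
That 88% isn’t magic; it falls straight out of the confidence level, assuming the comparisons are independent:

```python
# Chance of at least one false positive when 41 variations are each
# compared to the control at a 95% confidence level (independent comparisons).
alpha = 0.05
variations = 41
p_any_false_positive = 1 - (1 - alpha) ** variations
print(f"{p_any_false_positive:.0%}")   # ~88%
```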

Watch this video, you’ll learn a thing or three:

https://player.vimeo.com/video/54004040?wmode=transparent

Main takeaway: don’t test too many variations at once. It’s better to do simple A/B testing anyway: you’ll get results faster, and you’ll learn faster—improving your hypothesis sooner.

#9: They’re running multiple tests at the same time with overlapping traffic

You found a way to cut corners by running multiple tests at the same time: one on the product page, one on the cart page, one on the home page (while measuring the same goal). Saving time, right?

This may skew the results if you’re not careful. It’s actually likely to be fine unless you suspect strong interactions between tests and there’s a large overlap of traffic between them. Things get trickier when interactions and traffic overlap are both likely.

If you want to test a new version of several layouts in the same flow at once—for instance running tests on all 3 steps of your checkout—you might be better off using multi-page experiments or MVT to measure interactions, and do attribution properly.

If you decide to run A/B tests with overlapping traffic, keep the distribution even. Traffic should always be split evenly: if you test product page A vs. B and checkout page C vs. D, you need to make sure that traffic from B is split 50/50 between C and D (as opposed to, say, 25/75).
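
One simple way to get an even, independent split is to bucket visitors by hashing the visitor ID together with the test name, so assignments are sticky within a test but independent across tests. A bare-bones sketch, not how any particular tool implements it:

```python
# Deterministic per-test bucketing: sticky for a visitor within a test,
# independent across tests. Illustrative only.
import hashlib

def bucket(visitor_id: str, test_name: str, variants=("A", "B")) -> str:
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(bucket("visitor-42", "product_page"))           # A or B
print(bucket("visitor-42", "checkout", ("C", "D")))   # C or D, independent of the line above
```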

#10: They’re ignoring small gains

Your treatment beat the control by 4%. “Bah, that’s way too small of a gain! I won’t even bother to implement it,” I’ve heard people say.

Here’s the thing. If your site is pretty good, you’re not going to get massive lifts all the time. In fact, massive lifts are very rare. If your site is crap, it’s easy to run tests that get a 50% lift all the time. But even that will run out.

Most winning tests are going to give small gains—1%, 5%, 8%. Sometimes a 1% lift can result in millions of dollars in revenue. It all depends on the absolute numbers we’re dealing with. But the main point is this: you need to look at it from a 12-month perspective.

One test is just one test. You’re going to do many, many tests. If you increase your conversion rate by 5% each month, that compounds to roughly an 80% lift over 12 months. That’s compound interest; that’s just how the math works. And 80% is a lot.
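
The arithmetic behind that claim:

```python
# A 5% lift per month, compounded over a year.
monthly_lift = 0.05
annual_lift = (1 + monthly_lift) ** 12 - 1
print(f"{annual_lift:.0%}")   # ~80%
```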

So keep getting those small wins. It will all add up in the end.

#11: They’re not running tests at all times

Every single day without a test is a wasted day. Testing is learning: learning about your audience, learning what works and why. All the insight you get can be used across your marketing, in PPC ads and whatnot.

You don’t know what works until you test it. Tests need time and traffic (lots of it).

Having one test up and running at all times doesn’t mean you should put up garbage tests. Absolutely not. You still need to do proper research, have a proper hypothesis and so on.

Have a test going all the time. Learn how to create winning A/B testing plans. Never stop optimizing.

#12: Not being aware of validity threats

Just because you have a decent sample size, confidence level and test duration doesn’t mean that your test results were actually valid. There are several threats to the validity of your test.

Instrumentation effect

This is the most common issue. It’s when something happens with the testing tools (or instruments) that causes flawed data in the test.

It’s often due to wrong code implementation on the website, and it will skew all of the results. You’ve got to really watch for this. When you set up a test, watch it like a hawk. Check that every single goal and metric you track is being recorded. If some metric is not sending data (e.g. add-to-cart click data), stop the test, find and fix the problem, and start over by resetting the data.

History effect

Something happens in the outside world that causes flawed data in the test. This could be a scandal about your business or an executive working there, a special holiday season (Christmas, Mother’s Day, etc.), or a media story that gets people biased against a variation in your test. Pay attention to what is happening in the external world.

Selection effect

This occurs when we wrongly assume some portion of the traffic represents the totality of the traffic. Example: you send promotional traffic from your email list to a page that you’re running a test on. People who subscribe to your list like you way more than your average visitor. So now you optimize the page (e.g. landing page, product page etc) to work with your loyal traffic, thinking they represent the total traffic. But that’s rarely the case!

Broken code effect

One of the variations has bugs that cause flawed data in the test. You create a treatment and push it live. However, it doesn’t win, or there’s no difference. What you don’t know is that your treatment displayed poorly on some browsers and/or devices. Whenever you create a new treatment or two, conduct quality assurance testing on them to make sure they display properly in all browsers and devices.

Conclusion

Today there are so many great tools available that make testing easy, but they don’t do the thinking for you. I understand statistics wasn’t your favorite subject in college, but it’s time to brush up. Learn from these 12 mistakes so you can avoid them, and start making real progress with testing.
