Jun 30, 2017 by Flaviu

A/B Testing for Mobile Apps at Scale

Here at Softvision, we’ve been collaborating with one client in particular who is a great fan and advocate of A/B testing. They adopted A/B testing early and, today, any new feature that goes into the mobile or web apps is A/B tested thoroughly through a tried and proven protocol. This has led to a steady increase in conversion rates over the years and, alongside good marketing and SEO, A/B testing contributes significantly to driving the business forward. The client’s main product is one of the top e-commerce apps in the world.

The client has native apps for both iOS and Android devices, alongside mobile web and desktop web. For each of the native mobile apps, we’re looking at close to 200 active experiments at any given time, and for mobile and desktop web, close to 150.

The process through which a new experiment gets added to the roster is pretty straightforward and well documented, and so are the implementation, evaluation and exit phases of its lifetime.

Initially, we need to identify a point of improvement inside the app. Broadly, this is done yearly by the product owner team, which decides the direction the app should be pushed in. Based on those resolutions, new features are outlined and added to the roadmap, which, in turn, is broken down into release cycles (or sprints). While features are not strictly assigned to a sprint (we have a 3-week release cycle, so if something doesn’t make it into one release, it will make it into the next one), there is a general idea of when we expect them to ship. All customer-facing features, without exception, are gated behind at least one A/B experiment.

Once the features are separated into app components and a rough draft of target sprints is laid out for them, they get implemented and released. At this point, there are several things that need to happen:

  1. Tracking: Our client has their own internally developed tracking system, custom built from scratch to match their specific needs. Of course, there’s a plethora of third-party SDKs that offer good tracking alternatives for mobile and web apps (Localytics, AppSee, AppAnalytics, FlightRecorder and so on). Combining our in-house tracking system with a tool capable of searching, monitoring, and analyzing machine-generated big data (we like Splunk, but there are plenty of other alternatives out there, such as Elastic and Graylog) gives us the control we need to collect and analyze the way users interact with our app.
  2. Evaluating results: How we evaluate depends on the type of feature:
    1. Features which directly affect conversion are easy to evaluate. Success of a treatment variant for this type of feature translates directly into a better bottom line than 3 or 6 months prior (most conversion-oriented features run for this amount of time). Tracking here is mostly needed to correlate the bottom line with in-app usage statistics and to rule out data irregularities.
    2. Features which indirectly affect conversion (oriented towards gaining new users, increasing return rates for existing users, making navigation easier, adding eye candy etc.) are easy to evaluate as well, but conversion is not the primary metric to observe. This is where a deep understanding of user behaviour through tracking comes into play. Here we’re looking at clicks, pageviews, deal views, page loads, sign in / sign out / sign up (through dedicated accounts or through Facebook / Google accounts), time spent on various app pages and a lot of other things. Success for this type of feature, while it does eventually show on the bottom line, is primarily observed through three main metrics: the increase in the number of unique visitors per day, the increase in the average time a user spends inside the app, and how the user navigates the app (random browsing vs structured browsing vs search).

    In both cases, though, the end result is the same: if, based on the observed metrics, treatment did better than control, we keep treatment; otherwise we keep control and discard treatment. The success criteria depend on the objective set for that feature/experiment when it was conceived.

  3. Killing the losing variant: Once a winner is determined for an experiment, the A/B test is considered complete. At that point, the experiment and the losing variant are removed from the code, and the winning variant becomes the default app behaviour. If any other experiments later run on that part of the app, the winning variant becomes their new control.
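
As an illustration of the keep-or-discard decision for a conversion-oriented experiment, here is a minimal sketch in Python. It assumes the comparison is made with a two-proportion z-test at a 5% significance level; the article does not say which statistical test the client actually uses, and the function names and sample numbers below are hypothetical.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (z, two-sided p-value) comparing treatment and control conversion rates.

    Assumption: a pooled two-proportion z-test is an acceptable evaluation method.
    """
    rate_control = conv_control / n_control
    rate_treatment = conv_treatment / n_treatment
    # Pooled conversion rate under the null hypothesis that both variants convert equally.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        # No variation at all (for example, zero conversions): nothing to distinguish.
        return 0.0, 1.0
    z = (rate_treatment - rate_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def pick_winner(conv_control, n_control, conv_treatment, n_treatment, alpha=0.05):
    """Keep treatment only if it is significantly better; otherwise keep control."""
    z, p_value = two_proportion_z_test(
        conv_control, n_control, conv_treatment, n_treatment
    )
    if z > 0 and p_value < alpha:
        return "treatment"
    return "control"


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 users per variant.
    print(pick_winner(1000, 20000, 1100, 20000))
    print(pick_winner(1000, 20000, 1010, 20000))
```

Defaulting to control when the difference is not significant mirrors the rule described above: treatment has to earn its place, and a tie keeps the existing behaviour.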

Worth mentioning is how we’re actually able to activate or deactivate experiments for the app without having to submit different versions of the app to the App Store / Play Store. The short answer is the client-server model. Basically, our mobile apps are dumb. The underlying code knows only a few important things: how to build the UI based on local variables (device language, device location, device country) or server-side variables (more on this shortly), how that UI should work when the user interacts with it, how to use any integrated third-party SDKs, and how to send information to and request it from the server. For everything else, it relies on information that it receives from the server in the form of JSON responses to HTTPS requests (this includes deal inventory, prices, filters, user information, decoration, in-app messaging, and the list really goes on).

One of these requests, which gets fired every time the app is cold started (launched after being closed completely, not just suspended), receives a response from the server which contains a dictionary of experiments and their assigned variants for that user. Based on that information, the client app knows how to construct the UI, which features to toggle on, and which features to toggle off. The feature behaviour itself is coded inside the app, but the availability of the feature is determined by the backend. This gives us the ability to control which features any user sees inside the app. Besides the obvious benefit of a streamlined A/B testing process, this piece of functionality also acts as a safety net. If anything goes horribly wrong with a feature, we don’t need to leave the broken app live for days until we’re able to submit a patch (which would cause tremendous loss of revenue, image and credibility). We can just turn that feature off for all users, fix the issue(s), then submit normally at the end of the release cycle (or, if we’re in a hurry, ship a patch while the experiment is turned off and no longer generating negative results).
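
To make the idea concrete, here is a minimal sketch in Python of how a client might consume such a response. The payload shape, the `experiments` key, the experiment names and the `ExperimentStore` class are all assumptions made for illustration; the article only says that the response contains a dictionary of experiments and their variants.

```python
import json

# Hypothetical response body returned by the server on cold start.
SAMPLE_RESPONSE = (
    '{"experiments": {"new_checkout_flow": "treatment", '
    '"deal_card_redesign": "control"}}'
)


class ExperimentStore:
    """Holds the experiment-to-variant assignments received from the server."""

    def __init__(self, assignments):
        self._assignments = dict(assignments)

    @classmethod
    def from_json(cls, body):
        """Parse a response body; fall back to no assignments if it is unusable."""
        try:
            data = json.loads(body)
        except (ValueError, TypeError):
            return cls({})
        experiments = data.get("experiments", {}) if isinstance(data, dict) else {}
        if not isinstance(experiments, dict):
            return cls({})
        return cls(experiments)

    def variant(self, name, default="control"):
        """Return the assigned variant, or the default for unknown experiments."""
        return self._assignments.get(name, default)

    def is_enabled(self, name, variant="treatment"):
        """True when the user is assigned to the given variant of the experiment."""
        return self.variant(name) == variant


if __name__ == "__main__":
    store = ExperimentStore.from_json(SAMPLE_RESPONSE)
    if store.is_enabled("new_checkout_flow"):
        print("render the new checkout flow")
    else:
        print("render the existing checkout flow")
```

In this sketch, an unknown experiment or an unreadable response falls back to control, so a feature stays off unless the server explicitly turns it on. That is one way to get the safety-net behaviour described above: assigning everyone to control on the server switches the feature off without a new app release.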

The bucketing, or availability of a feature to our user base, is computed automatically by upstream services which rely on complex algorithms to create a random but meaningful spread. However, we can control this manually by using some internally developed tools which can override the default allocations. Anyone running a similar setup can achieve the same thing with simple DB manipulation, using whichever DB management tool fits their DB deployment (SQL vs NoSQL).
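
The client’s actual bucketing algorithms are not described here, so the following is only a sketch of one common approach: hashing the user and experiment together to get a stable, evenly spread assignment, with a manual override table checked first. The function name, the 50/50 default split and the in-memory `overrides` dictionary (standing in for a DB table) are all assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5, overrides=None):
    """Return "treatment" or "control" for a user in an experiment.

    Assumption: hash-based bucketing, which is not necessarily what the
    upstream services described above do.
    """
    # Manual overrides win over the computed allocation.
    if overrides and (user_id, experiment) in overrides:
        return overrides[(user_id, experiment)]
    # Hash experiment and user together so each experiment gets its own split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    # Hypothetical override: force one tester into treatment.
    overrides = {("qa-user-1", "new_checkout_flow"): "treatment"}
    print(assign_variant("qa-user-1", "new_checkout_flow", 0.0, overrides))
    print(assign_variant("user-42", "new_checkout_flow"))
```

Because the assignment depends only on the user and the experiment name, the same user lands in the same bucket on every cold start, and setting `treatment_share` to 0 turns the feature off for everyone who has no override.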