Tips and Tricks for Debugging and Fixing Slow/Flaky Capybara Specs

Note: This article has been cross posted on the UrbanBound product blog.

In a previous post, I wrote about how the proper use of Capybara’s APIs can dramatically cut back on the number of flaky/slow tests in your test suite. But, there are several other things you can do to further reduce the number of flaky/slow tests, and also debug flaky tests when you encounter them.

Use have_field(locator, with: value) to check text field contents

Your test may need to ensure that a text field contains a particular value. Such an expectation can be written as:
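
(A sketch in RSpec syntax; the field id and expected value are placeholders.)

# Checks the field's value immediately, with no waiting
expect(find("#some-text-field").value).to eq("Some expected value")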

This can be a flaky expectation, especially if the contents of #some-text-field are loaded via an AJAX request. The problem here is that this expectation will check the value of the text field as soon as it hits this line. If the AJAX request has not yet come back and populated the value of the field, this test will fail.

A better way to write this expectation is to use have_field, which locates the field by its id, name, or label text:
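
(Again a sketch; note that the locator is the field's id, without the CSS "#" prefix.)

# Waits up to Capybara.default_wait_time for the field to contain the value
expect(page).to have_field("some-text-field", with: "Some expected value")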

This expectation will wait Capybara.default_wait_time for the text field to contain the specified content, giving time for any asynchronous responses to complete, and change the DOM accordingly.

Disable animations while running the tests

Animations can be a constant source of frustration for an automated test suite. Sometimes you can get around them with proper use of Capybara’s find functionality by waiting for the end state of the animation to appear, but sometimes they can continue to be a thorn in your side.

At UrbanBound, we disable all animations when running our automated test suite. We have found that disabling animations has stabilized our test suite and made our tests clearer, as we no longer have to write code to wait for animations to complete before proceeding with the test. It also speeds up the suite a bit, as the tests no longer need to wait for animations to complete.

Here is how we went about disabling animations in our application: https://gist.github.com/jwood/90e7bf6873774055b169
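
The gist has the full details, but the general idea is to turn off CSS transitions and animations (and any JavaScript animation library you use) whenever the app is running in the test environment. A rough sketch of that idea, not the exact contents of the gist:

<%# In the application layout, only when running tests %>
<% if Rails.env.test? %>
  <style>
    * {
      transition: none !important;
      animation: none !important;
    }
  </style>
<% end %>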

Use has_no_X instead of !have_X

Testing to make sure that an element does not have a class, or that a page does not contain an element, is a common test to perform. Such a test will sometimes be implemented as:
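
(An illustrative sketch; the selector is a placeholder.)

# Check for the absence of the element by negating a positive check
expect(!page.has_css?(".some-element")).to be(true)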

Here, has_css? (the predicate behind the have_css matcher) will wait for the element to appear on the page. When it does not appear, the expression returns false, which is then negated to true, allowing the expectation to pass. However, there is a big problem with the above code: has_css? will wait Capybara.default_wait_time for the element to appear on the page. So, with the default settings, this expectation will take 2 whole seconds to run!

A better way to check for the non-existence of an element or a class is to use the has_no_X matcher:
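
(Same placeholder selector as above.)

# has_no_css? / have_no_css return as soon as the element's absence is confirmed
expect(page).to have_no_css(".some-element")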

Using to_not will also behave as expected, without waiting unnecessarily:
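
(Another sketch with the same placeholder selector.)

expect(page).to_not have_css(".some-element")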

No sleep for the fast feature test

Calls to sleep are often used to get around race conditions. But, they can considerably increase the amount of time it takes to run your suite. In almost all cases, the sleep can be replaced by waiting for some element or some content to exist on the page. Waiting for an element or some content using Capybara’s built in wait functionality is faster, because you only need to wait the amount of time it takes for that element/content to appear. With a sleep, your test will wait that full amount of time, regardless.
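
For example, instead of sleeping for some fixed amount of time and hoping the work has finished, wait for whatever the page shows once the work is done (a sketch; the button label and message are placeholders):

# Instead of guessing how long the work will take...
sleep 2
click_button "Continue"

# ...wait for the UI state that signals the work is done
expect(page).to have_content("Upload complete")
click_button "Continue"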

So, really scrutinize any use of sleep in a feature test. There are a very small number of cases, for one reason or another, where we have not been able to replace a call to sleep with something better. However, these cases are the exception, and not the rule. Most of the time, using Capybara’s wait functionality is a much better option.

Reproduce race conditions

Flaky tests are usually caused by race conditions. If you suspect a race condition, one way to reproduce it is to slow down the server's response time.

We used the following filter in our Rails application’s ApplicationController to slow all requests down by .25 seconds:
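
(A sketch of such a filter; the method name is illustrative, and older Rails versions would use before_filter instead of before_action.)

class ApplicationController < ActionController::Base
  # Artificially delay every response to flush out race conditions in feature specs
  before_action :simulate_slow_responses

  private

  def simulate_slow_responses
    sleep 0.25
  end
end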

Intentionally slowing down all requests by .25 seconds flushed out a number of race conditions in our test suite, which we were then able to reliably reproduce, and fix.

Capture screenshots on failure

A picture is worth a thousand words, especially when you have no friggin idea why your test is failing. We use the capybara-screenshot gem to automatically capture screenshots when a capybara test fails. This is especially useful when running on CI, when we don’t have an easy way to actually watch the test run. The screenshots will often provide clues as to why the test is failing, and at a minimum, give us some ideas as to what might be happening.
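
Getting started with the gem is straightforward. With RSpec, the setup looks something like this (see the gem's README for other drivers and test frameworks):

# Gemfile
group :test do
  gem "capybara-screenshot"
end

# spec/rails_helper.rb (or spec_helper.rb)
require "capybara-screenshot/rspec"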

Write fewer, less granular tests

When writing unit tests, it is considered best practice to make the tests as small and as granular as possible. It makes the tests much easier to understand if each test only tests a specific condition. That way, if the test fails, there is little doubt as to why it failed.

In a perfect world, this would be the case for feature tests too. However, feature tests are incredibly expensive (slow) to setup and run. Because of this, we will frequently test many different conditions in the same feature test. This allows us to do the expensive stuff, like loading the page, only once. Once that page is loaded, we’ll perform as many tests as we can. This approach lets us increase the number of tests we perform, without dramatically blowing out the run time of our test suite.
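
As a rough illustration (the page, path helper, and assertions here are made up), a single feature spec might pay for one page load and then verify several related things:

scenario "viewing the dashboard" do
  visit dashboard_path   # the expensive part: one page load

  # Several related checks against the same page
  expect(page).to have_content("Welcome back")
  expect(page).to have_css(".recent-activity li", count: 5)
  expect(page).to have_link("Edit Profile")
end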

Sharing is caring

Have any tips or tricks you’d like to share? We’d love to hear them in the comments!

Fix Flaky Feature Tests by Using Capybara’s APIs Properly

Note: This article has been cross posted on the UrbanBound product blog.

A good suite of reliable feature/acceptance tests is a very valuable thing to have. It can also be incredibly difficult to create. Test suites that are driven by tools like Selenium or Poltergeist are usually known for being slow and flaky. And, flaky/erratic tests can cause a team to lose confidence in their test suite, and question the value of the specs as a whole. However, much of this slowness and flakiness is due to test authors not making use of the proper Capybara APIs in their tests, or by overusing calls to sleep to get around race conditions.

The Common Problem

In most cases, flaky tests are caused by race conditions, when the test expects an element or some content to appear on the page, but that element or content has not yet been added to the DOM. This problem is very common in applications that use JavaScript on the front end to manipulate the page by sending an AJAX request to the server, and changing the DOM based on the response it receives. The time that it takes to respond to a request and process the response can vary. Unless you write your tests to account for this variability, you could end up with a race condition. If the response just happens to come back quickly enough and there is time to manipulate the DOM, then your test will pass. But, should the response come a little later, or the rendering take a little longer, your test could end up failing.

Take the following code for example, which clicks a link with the id "foo", and checks to make sure that the message "Loaded successfully" displays in the proper spot on the page.
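
A sketch of such a test, in RSpec syntax (the selectors and message come from the description above):

first("#foo").click
expect(first(".message").text).to eq("Loaded successfully")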

There are a few potential problems here. Let’s talk about them below.

Capybara’s Finders, Matchers, and Actions

Capybara provides several tools for working with asynchronous requests.

Finders

Capybara provides a number of finder methods that can be used to find elements on a page. These finder methods will wait up to the amount of time specified in Capybara.default_wait_time (defaults to 2 seconds) for the element to appear on the page before raising an error that the element could not be found. This functionality provides a buffer, giving time for the AJAX request to complete and for the response to be processed before proceeding with the test, and helps eliminate race conditions if used properly. It will also only wait the amount of time it needs to, proceeding with the test as soon as the element has been found.

In the example above, it should be noted that Capybara's first method will not wait for an element matching .message to appear in the DOM. So if it isn't already there, the test will fail. Using find addresses this issue.
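
Continuing the sketch from above:

# find waits up to Capybara.default_wait_time for .message to appear
expect(find(".message").text).to eq("Loaded successfully")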

The test will now wait for an element with the class .message to appear on the page before checking to see if it contains "Loaded successfully". But, what if .message already exists on the page? It is still possible that this test will fail because it is not giving enough time for the value of .message to be updated. This is where the matchers come in.

Matchers

Capybara provides a series of Test::Unit / Minitest matchers, along with a corresponding set of RSpec matchers, to simplify writing test assertions. However, these matchers are more than syntactical sugar. They have built in wait functionality. For example, if has_text does not find the specified text on the page, it will wait up to Capybara.default_wait_time for it to appear before failing the test. This makes them incredibly useful for testing asynchronous behavior. Using matchers will dramatically cut back on the number of race conditions you will have to deal with.

Looking at the example above, we can see that the test is simply checking to see if the value of the element with the class .message equals "Loaded successfully". But, the test will perform this check right away. This causes a race condition, because the app may not have had time to receive the response and update the DOM by the time the assertion is run. A much better assertion would be:
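
(Still a sketch, now using Capybara's have_text matcher on the element returned by find.)

expect(find(".message")).to have_text("Loaded successfully")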

This assertion will wait Capybara.default_wait_time for the message text to equal "Loaded successfully", giving our app time to process the request, and respond.

Actions

The final item we'll look at is Capybara's Actions. Actions provide a much nicer way to interact with elements on the page. They also take into account a few different edge cases that you could run into with some of the different input types. But in general, they provide a shorter way of interacting with page elements, as the action will take care of performing the find for you.

Looking at the example above, we can re-write the test as such:
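
(Again as a sketch.)

# click_link locates the link, here by its id, and clicks it in one step
click_link("foo")
expect(find(".message")).to have_text("Loaded successfully")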

click_link will not just look for something on the page with the id of foo; it will restrict its search to a link. It will perform a find, and then call click on the element that find returns.

Summary

If you write feature/acceptance tests using Capybara, then you should spend some time getting familiar with Capybara’s Finders, Matchers, and Actions. Learning how to use these APIs effectively will help you steer clear of flaky tests, saving you a whole lot of time and aggravation.

Managing Development Data for a Service Oriented Architecture

A service oriented architecture (SOA) provides many benefits. It allows for better separation of responsibilities. It simplifies deployment by letting you only deploy the services that have changed. It also allows for better scalability, as you can scale out only the services that are being hit the hardest.

However, a SOA does come with some challenges. This blog post addresses one of those challenges: managing a common dataset for a SOA in a development environment.

The Problem

With most SOAs there tends to be some sharing of data between applications. It is common for one application to store a key to data which is owned by another application, so it can fetch more detailed information about that data when necessary. It is also possible that one application may store some data owned by another application locally, to avoid calling the remote service in certain scenarios. Either way, the point is that in most SOAs the applications are interconnected to some degree.

Problems arise with this type of architecture when you attempt to run an application in isolation with an interconnected dataset. At some point, Application A will need to communicate with Application B to perform some task. Unfortunately, simply starting up Application B so Application A can talk to it doesn’t necessarily solve your problem. If Application A is trying to fetch information from Application B by key, and Application B does not have that data (the application datasets are not “in sync”), then the call will obviously fail.

Stubbing Service Calls

Stubbing service calls is one way to approach this issue. If Application A stubs out all calls to Application B, and instead returns some canned response that fulfills Application A‘s expectations, then there is no need to worry about making sure Application B‘s data is in sync. In fact, Application B doesn’t even need to be running. This greatly simplifies your development environment.
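
For example, in a Ruby application, a stubbed call might look something like this (a sketch using the WebMock gem; the URL and payload are made up):

require "webmock"
include WebMock::API
WebMock.enable!

# Pretend Application B returned this employee, without ever calling it
stub_request(:get, "http://application-b.example.com/api/employees/42")
  .to_return(
    status: 200,
    body: '{"id": 42, "name": "Test Employee"}',
    headers: { "Content-Type" => "application/json" }
  )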

Stubbing service calls, however, is very difficult to implement successfully.

First, the stubbed data must meet the expectations of the calling application. In some cases, Application A will be requesting very specific data. Application A, for example, may very well expect the data coming back to contain certain elements or property values. So any stubbing mechanism must be smart enough to know what Application A is asking for, and know how to construct a response that will satisfy those expectations. In other words, the response simply can’t contain random data. This means the stubbing mechanism needs to be much more sophisticated (complex).

Calls that mutate data on the remote service are especially difficult to handle. What happens when the application tries to fetch information that it just changed via another service call? If the requests are stubbed, it may appear that the call to mutate the data had no effect. This could possibly lead to buggy behavior in the application.

Also, if you stub out everything, you’re not really testing the inter-application communication process. Since you’re never actually calling the service, stubs will completely hide any changes made to the API your application uses to communicate with the remote service. This could lead to surprises when actually running your code in an environment that makes a real service call.

Using Production Data

In order for service calls to work properly in a development environment, the services must be running with a common dataset. Most people I’ve spoken with accomplish this by downloading and installing each application’s production dataset for use in development. While this is by far the easiest way to get up and running with a common dataset, it comes with a very large risk.

Production datasets typically contain sensitive information. A lost laptop containing production data could easily turn into a public relations disaster for your company, and more importantly, it could lead to severe problems for your customers. If you’ve ever had personal information lost by a 3rd party, then you know what this feels like. Even if your hard drive is encrypted, there is still a chance that a thief could gain access to the data (unless some sort of smart card or biometric authentication system is used). The best way to prevent sensitive information from being stolen is to keep it locked up and secure, on a production server.

Using Scrubbed Production Data

Using a production dataset that has been scrubbed of sensitive information is also an option. This approach will get you a standardized dataset, without the risk of potentially losing sensitive information (assuming your data scrubbing process is free of errors).

However, if your dataset is very large, this may not be a feasible option. My MacBook Pro has a 256GB SSD drive. I know of datasets that are considerably larger than 256GB. In addition, you have less control over what your test data looks like, which could make it harder to test certain scenarios.

Creating a Standardized Dataset

The approach we’ve taken at Centro to address this issue is to create a common dataset that individual applications can use to populate their database. The common dataset consists of a series of YAML files, and is stored in a format that is not specific to any particular application. The YAML files are all stored together, with the thought that conflicting data is less likely to be introduced if all of the data lives in the same place.

The YAML files may also contain ERB snippets. We are currently using ERB snippets to specify dates.

- id: 1
  name: Test Campaign
  start_date: <%= 4.months.ago.to_date %>
  end_date: <%= 3.months.ago.to_date %>

Specifying relative dates using ERB, instead of hard coding them, gives us a dataset that will not grow stale with time. Simply re-seeding your database with the common dataset will give you a current dataset.

Manually creating the standardized dataset also enables us to construct the dataset in such a way that edge cases that are not often seen in production data are exposed, so we can better test how the application will handle that data.

Importing the Standardized Data into the Application’s Database

A collection of YAML files by itself is useless to the application. We need some way of getting that data out of the YAML files and into the application’s database.

Each of our applications has a Rake task that reads the YAML files that contain the data it cares about, and imports that data into the database by creating an instance of the model object that represents the data.

This process can be fairly cumbersome. Since the data in the YAML files is stored in a format that is not specific to any particular application, attribute names will often need to be modified to match the application’s data model. It is also possible that attributes in the standardized dataset will need to be dropped, since they are unused by this particular application.

We solved this and related issues by building a small library that is responsible for reading the YAML files and providing the Rake tasks with an easy-to-use API for building model objects from the data contained in the YAML files. The library provides methods to iterate over the standardized data, map attribute names, remove unused attributes, and find related data (perhaps in another data file). This API greatly simplifies the application’s Rake task.

In the code below, we are iterating over all of the data in the campaigns.yml file, and creating an instance of our Campaign object with that data.

require 'application_seeds'

namespace :application_seeds do
  desc 'Dump the development database and load it with standardized application seed data'
  task :load, [:dataset] => ['db:drop', 'db:create', 'db:migrate', :environment] do |t, args|
    ApplicationSeeds.dataset = args[:dataset]
    seed_campaigns
  end

  def seed_campaigns
    ApplicationSeeds.campaigns.each do |id, attributes|
      ApplicationSeeds.create_object!(Campaign, id, attributes)
    end
  end
end

With the Rake task in place, getting all of the applications up and running with a standardized dataset is as simple as requiring the seed data library (and the data itself) in the application’s Gemfile, and running the Rake task to import the data.

The application_seeds library

The library we created to work with our standardized YAML files, called application_seeds, has been open sourced, and is available on GitHub at https://github.com/centro/application_seeds.

Drawbacks to this Approach

Making it easy to perform real service calls in development can be a double-edged sword. On one hand, it greatly simplifies working with a SOA. On the other hand, it makes it much easier to perform service calls and ignore the potential issues that come with calling out to a remote service. Service calls should be limited, and all calling code should be capable of handling the issues that may result from a call to a remote service (the service being unavailable, high latency, etc.).

Another drawback is that test data is no substitute for real data. No matter how carefully it is constructed, the test dataset will never contain all of the possible combinations that a production dataset will have. So, it is still a good idea to test with a production dataset. However, that testing should be done in a secure environment, where there is no risk of the data being lost (like a staging environment).

Why Code Coverage Alone Doesn’t Mean Squat

Agile software development is all the rage these days. One of Agile’s cornerstones is the concept of test driven development (TDD). With TDD, you write the test first, and write only enough code to make the test pass. You then repeat this process until all functionality has been implemented, and all tests pass. TDD leads to more modular, more flexible, and better designed code. It also gives you, by the mere nature of the process, a unit test suite that executes 100% of the code. This can be a very nice thing to have.

However, like most things in life, people often focus on the destination, and pay little attention to the journey required to get there. We as human beings are always looking for shortcuts. Some software managers see 100% code coverage as a must-have, not really caring how that goal is achieved. But it is the journey to 100% code coverage that provides the benefits that most associate with simply having 100% code coverage. Without taking the correct roads, you can easily create a unit test suite that exercises 100% of your code base, and still end up with a buggy, brittle, and poorly designed code base.

100% code coverage does not mean that your code is bug free. It doesn’t even mean that your code is being properly tested. Let me walk through a very simple example.

I’ve created a class, MathHelper that I want to test. MathHelper has one method, average, that takes a List of Integers.

/**
 * Helper for some simple math operations.
 */
public class MathHelper {

    /**
     * Average a list of integers.
     * 
     * @param integerList The list of integers to average.
     * @return The average of the integers.
     */
    public float average(List<Integer> integerList) {
        ...
    }
}

Caving into managerial pressure to get 100% code coverage, we quickly whip up a test for this class. Abracadabra, and poof! 100% code coverage!

[Image: the coverage tool reports 100% code coverage with a green bar]

So, we’re done. 100% code coverage means our class is adequately tested and bug free. Right? Wrong!

Let’s take a look at the test suite we put together to reach that goal of 100% code coverage.

public class MathHelperTest {

    private MathHelper _testMe;

    @Before
    public void setup() {
        _testMe = new MathHelper();
    }

    @Test
    public void poor_example_of_a_test() {
        List<Integer> nums = Arrays.asList(2, 4, 6, 8);
        _testMe.average(nums);
    }
}

Ugh! What are we really testing here? Not much at all. poor_example_of_a_test is simply verifying that the call to average doesn’t throw an exception. That’s not much of a test at all. Now, this may seem like a contrived example, but I assure you it is not. I have seen several tests like this testing production code, and I assume that you probably have too.

So, let’s fix this test by actually adding a test!

    @Test
    public void a_better_example_of_a_test() {
        List<Integer> nums = Arrays.asList(2, 4, 6, 8);
        float result = _testMe.average(nums);
        assertEquals(5.0, result, 0.001);
    }

Let’s run it, and see what we get.

java.lang.AssertionError: expected:<5.0> but was:<2.0>

Well, that’s certainly not good. How could the average of 2, 4, 6, and 8 be 2? Let’s take a look at the method under test.

    public float average(List<Integer> integerList) {
        long sum = 0;
        for (int i = 0; i < integerList.size() - 1; i++) {
            sum += integerList.get(i);
        }
        return sum / integerList.size() - 1;
    }

Ok, there’s the bug. We’re not iterating over the full list of integers that we have been passed. Let’s fix it.

    public float average(List<Integer> integerList) {
        long sum = 0;
        for (Integer i : integerList) {
            sum += i;
        }
        return sum / integerList.size();
    }

We run the test once again, and verify that our test now passes. That’s better. But, let’s take a step back for a second. We had a method with unit tests exercising 100% of the code that still contained this very critical, very basic error.

With this bug now fixed, we commit the code to source control, and push a patch to production. All is fine and dandy until we start getting hammered with bug reports describing NullPointerExceptions and ArithmeticExceptions being thrown from our method. Taking another look at the code above, we realize that we have not done any validation of the input parameter to our method. If the integerList is null, the for loop will throw a NullPointerException when it tries to iterate over the list. If the integerList is an empty list, we will end up trying to divide by 0, giving us an ArithmeticException.

First, let’s write some tests that expose these problems. The average method should probably throw an IllegalArgumentException if the argument is invalid, so let’s write our tests to expect that.

    @Test(expected=IllegalArgumentException.class)
    public void test_average_with_an_empty_list() {
        _testMe.average(new ArrayList<Integer>());
    }
    
    @Test(expected=IllegalArgumentException.class)
    public void test_average_with_a_null_list() {
        _testMe.average(null);
    }

We first run the new tests and verify that they fail, with the method throwing a NullPointerException and an ArithmeticException instead of the expected IllegalArgumentException. Now, let’s fix the method.

    public float average(List<Integer> integerList) 
            throws IllegalArgumentException {
        
        if (integerList == null || integerList.isEmpty()) {
            throw new IllegalArgumentException(
                "integerList must contain at least one integer");
        }
        
        long sum = 0;
        for (Integer i : integerList) {
            sum += i;
        }
        return sum / integerList.size();
    }

We run the tests again, and verify everything now passes. So, there wasn’t just one bug that slipped in, but three! And, all in code that had 100% code coverage!

As I said in the beginning of the post, having a test suite that exercises 100% of your code can be a very valuable thing. If achieved using TDD, you will see many or all of the benefits I list at the top of the post. Having a solid test suite also shields you from introducing regressions into your code base, letting you find and fix bugs earlier in the development cycle. However, it is very important to remember that the goal is not to have 100% coverage, but to have complete and comprehensive unit tests. If your tests aren’t really testing anything, or the right thing, then they become virtually useless.

Code coverage tools are great for highlighting areas of your code that you are neglecting in your unit testing. However, they should not be used to determine when you are done writing unit tests for your code.