Tips to fix CI builds

A stable CI setup is important. In this post, I will share my experience fixing some specs that were constantly failing on an open-source project.
Preparation
The project has good spec coverage. For reference, it has 211 feature specs written in Cucumber, but unfortunately it seems that 30 of them are failing. So, as a first step: do those specs fail locally? No, they pass, so we need to think about our next move. Since the specs fail only on CI, we need to know exactly which ones are failing without relying on open pull requests. The reason is simple: on an existing PR, we can’t tell whether the specs are red because of the PR’s changes or for some other reason. So what should we do? Most CI services start a new build when they detect a new commit, that is, when the commit hash changes. We can trigger one by adding an empty commit on a new branch:
  git checkout -b fix-ci-tests
  git commit --allow-empty -m "bump commit history"
  git push origin fix-ci-tests
After opening a new pull request, we have to wait for the build and note the failing specs. (This is the best moment to grab a coffee or do something else :smile: ).
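Noting the failing specs by hand gets tedious with around 30 failures. Here is a minimal sketch that pulls them out of a saved build log. It assumes the log is plain text and contains Cucumber's usual "Failing Scenarios" summary lines; the feature file names in the sample are made up for illustration.

```ruby
# Matches Cucumber's "Failing Scenarios" summary lines, e.g.
#   cucumber features/import.feature:12 # Scenario: Check 3 miles
# Assumption: plain-text log, no ANSI color codes in front of "cucumber".
FAILING_SCENARIO = /\Acucumber\s+(\S+\.feature:\d+)/

# Returns the unique "path:line" locations of failing scenarios in the log.
def failing_scenarios(log_text)
  log_text.each_line
          .map { |line| line.strip[FAILING_SCENARIO, 1] }
          .compact
          .uniq
end

# Hypothetical log excerpt, for illustration only.
sample_log = <<~LOG
  Failing Scenarios:
  cucumber features/import.feature:12 # Scenario: Check 3 miles
  cucumber features/import.feature:12 # Scenario: Check 3 miles
  cucumber features/search.feature:40 # Scenario: Search by name
LOG

puts failing_scenarios(sample_log)
```

Each printed location has the `path:line` form that the `cucumber` command accepts, so it can also be used to run a single scenario locally.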
The build on CI usually takes some time, and we want a faster feedback loop for debugging. One option is to run only the failing specs. How do we do that? Since we are investigating Cucumber specs, we can assign a new tag to just the failing ones. For example:
@fix-ci
Scenario: Check 3 miles
    Given I run the import doit service with a radius of 3 miles
    Then there should be 63 doit volunteer ops stored
    And all imported volunteer ops have latitude and longitude coordinates
The project uses Travis CI, so a quick edit to the script section of its config is required. Previously, it ran the JavaScript (Jasmine) and RSpec tests along with the Cucumber specs:
script:
- USE_JASMINE_RAKE=true bundle exec rake jasmine:ci
- bundle exec rspec
- bundle exec rake cucumber:first_try
- bundle exec rake cucumber:second_try
It was changed to run only the tagged specs:
script:
- bundle exec cucumber -t @fix-ci
With this change, the build time dropped by more than 50%, from 11 minutes to 5 minutes.
Investigation
Now we can start investigating the issue. The main points to check are:
- Any warning logs. In this case, I discovered that a service logged a warning about query usage.
- The environment variables. This is a common issue in CI setups: when developers introduce new environment variables, they tend to forget to set them in the CI config.
- Recently updated gems. This one is sometimes hard to investigate and could be mitigated by following these simple steps.
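For the environment variable check, a small guard can make a missing value fail loudly at the start of the build instead of surfacing as dozens of unrelated red specs. This is only a sketch: the variable names are hypothetical placeholders, and the helper names are my own, not something the project defines.

```ruby
# Hypothetical names for illustration; list the variables your app really reads.
REQUIRED_ENV_VARS = %w[EXAMPLE_API_KEY EXAMPLE_SERVICE_URL].freeze

# Returns the names from `required` that are unset or blank in `env`.
def missing_env_vars(required, env = ENV)
  required.select { |name| env[name].to_s.strip.empty? }
end

# Raises with a readable message so the build fails early.
def ensure_env!(required, env = ENV)
  missing = missing_env_vars(required, env)
  return if missing.empty?

  raise "Missing environment variables: #{missing.join(', ')}"
end

# Example with an explicit hash instead of the real ENV:
puts missing_env_vars(REQUIRED_ENV_VARS, { "EXAMPLE_API_KEY" => "secret" }).inspect
# prints ["EXAMPLE_SERVICE_URL"]
```

Calling `ensure_env!(REQUIRED_ENV_VARS)` from a test support file (for Cucumber, something like `features/support/env.rb`) would be one place to hook it in, assuming the project loads its support files the conventional way.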
The case I worked on recently was caused by a missing environment variable combined with a recently updated VCR gem, which required fixing a lot of cassettes.
Conclusion
Investigating CI issues is always time-consuming and needs some patience. But with a good process and some experience, we can learn the common problems and how to mitigate them.
