Change failure rate is one of the most useful metrics for evaluating the quality of your team’s work and the overall efficiency of your Engineering organization.
There are four key Accelerate metrics that we will cover in detail:
- Lead time for changes;
- Deployment frequency;
- Time to recover; and
- Change failure rate.
In this article, we focus on Change failure rate. Keep reading to learn what Change failure rate is, how we measure it in Pulse, and how your team can start using it to achieve elite engineering performance.
What is Change failure rate?
In a nutshell, Change failure rate measures the percentage of deployments that cause a failure in production and require remediation (e.g., hotfix, rollback, patch).
“A key metric when making changes to systems is what percentage of changes to production (including, for example, software releases and infrastructure configuration changes) fail. In the context of Lean, this is the same as percent complete and accurate for the product delivery process, and is a key quality metric.”
Accelerate: The science of lean software and DevOps: Building and scaling high performing technology organizations
What does Change failure rate measure?
Change failure rate is a good indicator of your team’s capabilities and the overall efficiency of your development process. It is one of the most useful quality metrics an organization can work on to ensure the stability and correct functioning of its software.
This metric is inspired by Lean principles and reflects how well your team ensures that code changes are safe and how deployments are managed.
How to read Change failure rate?
On the one hand, a Change failure rate that is too high suggests broader issues, such as manual or inefficient deployment processes and a lack of testing before deployment. On the other hand, a low Change failure rate shows that your team has shifted enough verification left to catch infrastructure errors and application bugs before deployment.
Your team’s goal should be to continuously reduce your Change failure rate, because a lower rate means your code is being tested thoroughly enough to avoid disruptions in production. As a result, you are not compromising your software's availability and correct functioning.
Change failure rate in Pulse
Products like Pulse make it easy for your team to know what your Change failure rate looks like. Pulse connects seamlessly with your GitHub, Jira, and PagerDuty to give you Change failure rate and other Engineering metrics out-of-the-box. Your team only needs to focus on making informed decisions to improve your results.
How is Pulse measuring Change failure rate?
Pulse measures Change failure rate by calculating the percentage of deployments causing a failure in production (e.g., service impairment or unplanned outage). It’s computed with the following formula:
number of deployments that caused incidents / total number of deployments
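As a quick illustration of the formula above, here is a minimal Python sketch. The function name and the choice to return a percentage are assumptions made for this example; this is not Pulse's actual implementation.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return the Change failure rate as a percentage (0-100).

    Hypothetical helper for illustration only: it applies the formula
    'deployments that caused incidents / total deployments'.
    """
    if total_deployments <= 0:
        raise ValueError("total_deployments must be greater than zero")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100.0 * failed_deployments / total_deployments


# Example: 3 of 40 deployments caused an incident.
print(change_failure_rate(3, 40))  # 7.5
```

With 3 failed deployments out of 40, the rate is 7.5%.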
Pulse calculates the metrics per repository or service and, when displaying them, aggregates the values by time interval using an average.
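One possible reading of this aggregation is sketched below: compute the rate for each repository within each time interval, then average those per-repository rates for the interval. The data shape, the function name, and the weekly interval labels are all assumptions made for this example, and Pulse's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def average_cfr_by_interval(deployments):
    """Average the per-repository Change failure rate for each time interval.

    `deployments` is an iterable of (repository, interval, caused_incident)
    tuples. This shape is hypothetical and chosen only for the example.
    """
    # (repository, interval) -> [failed deployments, total deployments]
    counts = defaultdict(lambda: [0, 0])
    for repository, interval, caused_incident in deployments:
        counts[(repository, interval)][1] += 1
        if caused_incident:
            counts[(repository, interval)][0] += 1

    # interval -> list of per-repository rates, as percentages
    rates_by_interval = defaultdict(list)
    for (_repository, interval), (failed, total) in counts.items():
        rates_by_interval[interval].append(100.0 * failed / total)

    return {interval: mean(rates) for interval, rates in rates_by_interval.items()}


sample = [
    ("api", "2024-W01", True),
    ("api", "2024-W01", False),
    ("api", "2024-W01", False),
    ("api", "2024-W01", False),
    ("web", "2024-W01", False),
    ("web", "2024-W01", False),
    ("api", "2024-W02", True),
    ("api", "2024-W02", False),
]

print(average_cfr_by_interval(sample))
```

In the sample, "api" has a 25% rate and "web" a 0% rate in the first week, so the weekly average is 12.5%. Only "api" deployed in the second week, so that week's value is 50%. Note that averaging per-repository rates weights every repository equally, regardless of how often it deploys.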
Pulse takes care of all the details to ensure your metrics are accurate and reliable. Pulse can automatically detect your deployments by picking signals from GitHub and then associate incidents with the failed deployments, simplifying the integration and enabling you to explore the metrics by repository, service, and each of your GitHub teams.
What Change failure rate should a team have?
The best-practice performance levels for this metric, based on the Accelerate State of DevOps 2021 report, are:
- Elite: 0-15%
- High: 15-30%
- Medium: N/A*
- Low: 30-100%
* The Accelerate State of DevOps 2021 report defines the same range of values for both Medium and Low performance levels, so we’ve opted to skip the Medium level in Pulse.
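The levels above can be expressed as a small lookup. This is a sketch under one assumption worth calling out: the listed ranges share their boundary values (15% and 30%), so the example treats each upper bound as inclusive for the better level. The function name is hypothetical.

```python
def performance_level(cfr_percent: float) -> str:
    """Map a Change failure rate (as a percentage) to a performance level.

    Assumption: boundary values (15 and 30) belong to the better level,
    because the published ranges overlap at those points.
    """
    if not 0 <= cfr_percent <= 100:
        raise ValueError("cfr_percent must be between 0 and 100")
    if cfr_percent <= 15:
        return "Elite"
    if cfr_percent <= 30:
        return "High"
    return "Low"  # Medium is skipped, as in the list above


print(performance_level(7.5))   # Elite
print(performance_level(22.0))  # High
print(performance_level(45.0))  # Low
```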
How to use Change failure rate
Over time, your Change failure rate should decrease as your team moves up through the performance levels, until it reaches best-practice levels.
A low Change failure rate indicates that your team has a sound overall deployment process and can deliver high-quality software quickly.
Causes of high Change failure rate
- Relying on manual deployment processes, which are prone to human error;
- Performing infrastructure changes that are not reproducible;
- Carrying out tests manually;
- Poor code quality, which hinders maintainability and makes it harder to introduce new code, leading to more complex testing and unexpected application errors.
How to reduce Change failure rate
- Work on small and self-contained changes: smaller changes are easier to test and less likely to break something;
- Leverage infrastructure and application configuration as code: guarantee that mission-critical configurations and services' infrastructure are visible and reproducible;
- Use test automation: incorporating automated tests at every stage of the CI/CD pipeline can help you reduce delivery times since it helps catch issues earlier;
- Automate code reviews: using automated code review tools like Codacy can help you improve your code quality and save your team valuable time.
Conclusion
Change failure rate is a fundamental metric to analyze and improve the efficiency of your Engineering team. It is a valuable metric to understand the capabilities of your team and how they learn from previous problems to improve later workflows.
Together with Lead time for changes, Deployment frequency, and Time to recover, this metric provides valuable insights to help your team achieve elite engineering performance.
With Pulse, you can easily measure and analyze the Change failure rate in your projects and make better decisions.
We also suggest you (re)watch our webinar on How to boost your Engineering Speed & Quality with the right Metrics. We talked about how your team can deploy faster and more efficiently without compromising code quality, and how you can make the most out of the engineering data already available to you. Don’t miss out!