From Fragile to Antifragile Software

One of my favourite books is Antifragile by Nassim Taleb where the author talks about things that gain from disorder. Nacim introduces the concept of antifragility which is similar to hormesis in biology or creative destruction (accessible version creative-destruction) in economics and analyses it characteristics in great details. If you find this topic interesting, there are also other authors who have examined the same phenomenon in different industries such as Gary Hamel, C. S. Holling, Jan Husdal. The concept of antifragile is the opposite of the fragile. A fragile thing such as a package of wine glasses is easily broken when dropped but an antifragile object would benefit from such stress. So rather than marking such a box with “Handle with Care”, it would be labelled “Please Mishandle” and the wine would get better with each drop (would be awesome woulnd’t it).

It didn’t take long for the concept of antifragility to be used also for describing some of the software development principles and architectural styles. Some would say that SOLID prinsiples are antifragile, some would say that microservices are antifragile, and some would say software systems cannot be antifragile ever. This article is my take on the subject. According to Taleb, fragility, robustness, resilience and antifragility are all very different. Fragility involves loss and penalisation from disorder. Robustness is enduring to stress with no harm nor gain. Resilience involves adapting to stress and staying the same. And antifragility involves gain and benefit from disorder. If we try to relate these concepts and their characteristics to software systems, one way to define them would be as the following.

Disorder

Different systems are affected by different kind of disorder, such as stress, time, change, volatility, debt, etc. For software systems, the main disorder is the change. The business is in a constant changing environment and the software needs to adapt to the business needs quickly. That is implementing new requirements, changes to existing functionality, even creating new business opportunities through innovation. A software system has to change all the time, otherwise it is obsolete. Apart from development time challenges, there are also runtime challenges for software systems too. Software systems are created and exists to add value by running in a production environment. And while doing so, they are under stress by end users and other systems. This is another kind of disorder, that software systems have to deal with.

Fragile

This property describes systems that suffer when put under stress. Imagine a software project that is not easy to change at development time. For example if it is not easy to extend, modify and deploy to the production environment. Or a system that is not able to handle unexpected user inputs or external system failures and breaks easily. That’s is a fragile system that is harmed by stress and penalised by change, a good example of fragile.

Robust

This is a system that can continue functioning in the presence of internal and external challenges without adaptation. Every system is robust up to a level. For example a bottle is robust until it reaches the level of breaking point. A software system can be made robust to handle unanticipated user input, or failures in external systems. For example handling NPE in Java, using try-catch-finally statements to handle unreliable invocations, having a thread pool to handle concurrent users, creating network connections using timeouts, are all examples of robustness for a software system. But a robust system doesn’t adapt to a chaining environment and when the stress and change threshold is reached it would break. The qualities that define the robustness vary from system to system. An ATM for example needs to be robust and not fail in the middle of a transaction, whereas, a media streaming service can drop a frame or two, as long as it continues streaming under stress.

Resilient

A system is resilient when it can adapt to internal and external challenges by changing its method of operation. The key here is that the system is responding to stress by changing its internal behaviour rather than resisting stress with a predefined buffer.

  • A typical example here is the circuit breaker pattern which changes its internal state to adapt to the external system behaviour to protect itself.

Overview

Before looking at the next level of software evolution — the antifragile systems, let’s visualise and summarise different kind of software system characteristics.

  • A fragile system is difficult to modify, and cannot cope with a changing environment. Even if it provides some value when used in a stable non changing environment, when faced with further stress and change, it quickly turns into a liability. Many organisations have applications (mainframes for example) which are impossible to change, very expensive to maintain, but still running on high cost as they are very critical to the business.

Antifragile

Many things in life are antifragile, such as the human body. When stressed at the right level, a muscle or bone wold come back stronger. But can a software system be antifragile? Certainly there are some tools, platforms, architectural styles, methodologies that can help create software with antifragile characteristics. Let’s see some of the more popular ones.

Auto scaling feature allows applications to handle increasing load by creating more instances of the application. To achieve that, the software system have to able to measure and then react to change and stress. Some good examples here are AWS autoscaling of EC2 instances at infrastructure level, and OpenShift autoscaling of application containers for the application level. This is a feature that transitions applications from resiliency to antifragility since the software system is shifting resources from one part of the system into another to respond to stress.

Microservices. According to Taleb, at times of stress, the large is doomed to breaking. And that phenomen has been observed with mammals, corporations, administrations, etc. In software and large projects, this behaviour has been observed even more often. The bigger a software project is, the harder it becomes to change and react to stress. Microservices is an architecture styles that allows easier change by having autonomous services with well defined APIs — features that allow change. Russ Miles is a strong believer and proponent of Antifragile Software through Microservices (here is an intro video from him).

Chaos engineering is a technique to create antifragility by evolving systems to survive chaos. Rather than waiting for things to break at the worst possible time, the idea of chaos engineering is to proactively inject failures in order to be prepared when disaster strikes. Netflix’s Simian Army is a very good materialization of this technique, designed to generate failures and help isolate system’s weaknesses.

Continuous deployments to a production environment creates continuous partial system failures and forces organisations to react better to failures through redundancy, rolling upgrades, rollbacks, and avoiding single points of failure. Other techniques such as canary release, blue-green deployments, are used to reduce the risk of introducing new software into production environment. Some other methods such as A/B testing even allows experimenting with change and measuring its effect, in order to gain from change.

The human element

Antifragility is not an universal characteristics. Different systems are antigragile towards different kind of disorder. For example Chaos Monkey will make your system antifragile towards EC2 deaths, and autoscaller will make your system respond to specific type of load. But your systems will not be antifragile towards other kinds of stress. And if you look above at the different ways of introdusing antifragility into software systems, all of them are means for making the social-technical system antifragile (and not only the software system):

  • Chaos engineering forces human feedback to the injected randomness and makes the system antifragile.

Innovate

If it takes few weeks to create a new developer environment, you can not react to change. If it takes three months to release a new feature, you can not react to change. If your ops team is watching the metrics dashboard and manually scaling applications up and down, you cannot embrace change. If the team is hardly catching up with the change, there is no way to gain from change. But once you put an appropriate organisational structure, the right tools and culture in place, then you can start gaining from change. Then you can afford having Friday Hackathons, then you can start exploring open source projects and start contributing to them, then you can start open sourcing your internal projects and benefit from a community, and generally be the change itself. And why not the Netflix or the Amazon of tomorrow.

Originally published at www.ofbizian.com.

Author of Kubernetes Patterns | Technical Product Manager @RedHat for @Debezium & Data Integration | Committer @ApacheCamel

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store