One of the characteristics that makes human beings remarkable is resilience: the ability to go through times of crisis, overcome adversity, and recover. The concept originates in physics, but it has been used frequently in psychology and, more recently, in technology to describe the ability of complex systems to self-organize in the face of difficult situations.

If we imagine our body as a complex system, it is resilience that keeps us alive and functioning, even when we are in trouble. We can lose the function of a significant part of our brain and still survive, and even return to a practically normal life. This regenerative and adaptive characteristic is at the base of our structure and can be seen even in our cells, which generally deal very well with duplication errors or malformations. This is so common that problems occur even within a perfectly healthy individual, but the body corrects itself and finds alternatives to keep working well.

Evolution and cloud computing

This parallel helps us understand what is happening with the evolution of the cloud computing infrastructure in the most modern datacenters on the market. These systems are undoubtedly becoming more and more complex, to the point where it is humanly impossible to keep track of everything that happens across thousands of virtual machines, running on hundreds of networks and accessing thousands of disk volumes. And yet, if we look at it from a macro level, everything seems to work fine, at least most of the time.

Test, test and test...

When we started to design our cloud environments, we went through a period of experimentation in which we exhaustively tested different procedures, components, and interfaces. We noticed that everything works very well over short and medium time frames. So well, in fact, that we skipped many checkpoints, because they never presented problems. Early versions of auto.sky came out like this, quite optimistic about the Cloud's responses, much as a program doesn't question whether a processor will correctly compute 1+1.

However, it didn't take many weeks for us to realize that everything can fail, even in the Cloud, in different ways and at any time. This realization came as we began managing thousands of users across hundreds of instances. The significant increase in our sample base exposed the issue and clearly showed us that a good cloud system is like any complex system, and as such requires a high degree of resilience.

We reviewed our systems and changed our strategy, adopting a much more critical posture toward every interaction with the Cloud. In practical terms, every critical point of the solution must have high availability through redundant systems, and those systems must be continuously checked for failures.
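To make this critical posture more concrete, here is a minimal Python sketch of the general idea: never assume a cloud call succeeded, verify the result, and retry with a growing delay when the check fails. It is an illustration only, not the auto.sky implementation. The name `call_with_verification` and its parameters are hypothetical, and `operation` and `verify` stand in for whatever cloud request and validation a real system would use.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_verification(
    operation: Callable[[], T],
    verify: Callable[[T], bool],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, validate its result with `verify`, and retry on failure.

    All names here are hypothetical; `operation` represents any cloud request
    and `verify` represents the check we no longer skip.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            result = operation()
            if verify(result):
                return result
            last_error = RuntimeError(
                f"verification failed on attempt {attempt + 1}"
            )
        except Exception as exc:  # any error from the call counts as a failure
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

A caller might wrap a "start instance" request in `operation` and have `verify` confirm that the reported state is actually "running", rather than trusting that the request worked.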

More sophisticated clouds, such as AWS, have helped us a lot in this regard, allowing the use of multiple availability zones and more reliable verification mechanisms. Furthermore, we assume that any virtual machine can fail, so we have to constantly check that each one is functioning and isolate the ones that are not.
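The idea of constantly checking virtual machines and isolating the faulty ones can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production code and not an AWS API: `HealthMonitor`, `probe`, and `failure_threshold` are hypothetical names, and `probe` stands in for any real check (an HTTP ping, a status query, and so on).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set


@dataclass
class HealthMonitor:
    """Tracks consecutive probe failures and isolates instances that keep failing.

    `probe` is a hypothetical callable returning True when an instance is healthy.
    """

    probe: Callable[[str], bool]
    failure_threshold: int = 3
    failures: Dict[str, int] = field(default_factory=dict)
    isolated: Set[str] = field(default_factory=set)

    def check(self, instance_ids: Iterable[str]) -> Set[str]:
        """Probe each instance once and return the set of isolated instances."""
        for instance_id in instance_ids:
            if instance_id in self.isolated:
                continue  # already out of rotation, nothing to probe
            try:
                healthy = self.probe(instance_id)
            except Exception:
                healthy = False  # a probe that errors is treated as a failure
            if healthy:
                self.failures[instance_id] = 0  # reset the consecutive count
            else:
                self.failures[instance_id] = self.failures.get(instance_id, 0) + 1
                if self.failures[instance_id] >= self.failure_threshold:
                    self.isolated.add(instance_id)
        return set(self.isolated)
```

Requiring several consecutive failures before isolating an instance is a design choice in this sketch: it avoids removing a machine from service because of a single transient error, at the cost of reacting a little more slowly to a real failure.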

We learned that several other procedures are also susceptible to failure. With adequate checks, however, our system can be immune to many of these problems, and we can deliver a service with very high availability and performance to the end user, even knowing that small things can go wrong. …


Written by

Sky.One Team

This content was produced by SkyOne's team of cloud and digital transformation experts.