Simon Ramo wrote a short book describing two types of games played in tennis: the winner’s game and the loser’s game. The former is played by professionals and is won through skill and technique, whereas the latter is played by amateurs, where the winner is the one who makes fewer mistakes and so loses fewer points.
Wouldn’t you agree that operations support is also a sort of game, with the cloud environment as the opponent? It sounds odd, but look closer and you will see the similarity.
Rapid cloud adoption in the recent past has increased the complexity of cloud application operations. It becomes more complicated still with containerized applications that are deployed on cloud-native platforms and also consume PaaS services. The way Site Reliability Engineers (SREs) work is slowly but steadily becoming the de facto operations methodology for cloud environments. Let’s examine why these principles are necessary and how they become critical to ensuring a reliable and resilient environment.
Traditional environments keep throwing up reliability challenges due to multiple factors: application bugs, limited test coverage, misconfigured platforms and monitoring tools, and security vulnerabilities, to name a few. As the opponent, the operations team needs to be fully prepared to handle these challenges. The game can tilt to either side, which is why the operations team keeps looking for solutions: upgrading skills, creating custom tools, and devising new ways of working, just to keep up with the fast pace of technology change.
While the operations team’s key task is to keep applications and environments up and running in all conditions, the advantage in this game swings back and forth between the team and the environment. Let’s explore how to balance it out until we reach a state where the operations team wins irrespective of the kind of game it plays.
Understand the game and strategies
Cloud has changed the operations support paradigm for cloud-native applications by directly impacting the platforms they run on. Conventional ways of working make it even harder to reach an optimal, timely solution collaboratively. So, what are the key elements that directly affect the state of a critical application in any cloud environment?
The first and foremost element is keeping track of changes introduced in the environment. The operations team needs to be aware of the who, what, and how of any new change rollout, and it needs to play an active role in the decision process. Usually this isn’t the protocol, as the teams responsible for rolling out a change seldom talk to the team responsible for managing the environment. In practice, classic DevOps is often reduced to just continuous integration and continuous deployment (CI/CD) and release management. It doesn’t bring development (Dev) and operations (Ops) any closer, leaving the gap unbridged. The problem compounds when changes are made every day or even every hour, as is typical for cloud-native applications. This one problem alone can turn the game into a loser’s game: the operations team cannot predict the opponent’s next move because it was never involved in understanding the new rules of the game, in this case the change impact and dependencies. This is where DevSecOps, an important aspect of SRE practice, comes in. It covers all aspects of release and environment management along with the shift-left approach to security.
One important rule of this new-age game is to understand the impact of any activity on the overall environment. Gaining this visibility will help you decide your next move. If you can go further and create a mind map of the chained consequences of that activity, you can tell whether you are in the winner’s game or the loser’s game.
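The mind map of chained consequences can be thought of as a walk over a dependency graph. The sketch below is only an illustration under stated assumptions: the component names and the DEPENDENTS map are hypothetical, and a real environment would build this map from its own service inventory or tracing data.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration): each key is a
# component, and its list holds the components that depend on it directly.
DEPENDENTS = {
    "database": ["orders-service", "billing-service"],
    "orders-service": ["web-frontend"],
    "billing-service": ["web-frontend", "reporting-job"],
    "web-frontend": [],
    "reporting-job": [],
}


def impacted_components(changed, dependents):
    """Return each component downstream of `changed`, mapped to its hop count."""
    distance = {changed: 0}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in distance:
                # First visit in a breadth-first walk gives the shortest path.
                distance[downstream] = distance[current] + 1
                queue.append(downstream)
    del distance[changed]  # the changed component itself is not "downstream"
    return distance


if __name__ == "__main__":
    impact = impacted_components("database", DEPENDENTS)
    for name, hops in sorted(impact.items()):
        print(f"{name}: {hops} hop(s) from the change")
```

A change with many downstream components, or with a business-critical service one hop away, is a signal to involve the operations team before the rollout rather than after it.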
That brings the most important aspect of operations into the picture: observability. To understand the moves of your opponent, you need observability across the full stack of the environment. This includes sitting with developers and asking them to instrument the code for the monitoring tool used in the environment. In doing so, you will be talking to developers and helping them understand why tracking matters in the monitoring world and how it simplifies the life of the application operations team.
Now that you are talking to developers, isn’t this the kind of collaboration the industry talks about in DevOps? This collaboration among teams is another critical factor in being prepared for a winner’s game.
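To make the instrumentation request concrete, here is a minimal sketch using only the Python standard library. The METRICS dictionary is an in-memory stand-in for whatever monitoring tool the environment uses, and the instrumented decorator and checkout function are hypothetical names introduced for illustration.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a monitoring backend (an assumption for illustration).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative latency under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise  # never swallow the error; only count it
            finally:
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total):
    # Hypothetical business function used only to exercise the decorator.
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return round(cart_total * 1.1, 2)
```

The point of the decorator form is that developers add one line per function, while the operations team decides where the numbers are shipped.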
Winning formula for cloud-native operations
Here are some key points to take care of while setting up a cloud-native operations team. These focus areas are interdependent and constitute a winning formula that will help you play the game skillfully. Focusing on these points will ensure you avoid the common mistakes that can make you lose the game.
The operations team must understand key business processes running in the environment. It will help prioritize focus and prepare a plan for business-critical services. The operations model should be aligned with business KPIs.
Focus on performance, capacity, and security
Before you face chaos in the environment, why not plan for it in advance? Build or use freely available chaos engineering tools to test the environment. The developers should also be involved in planning and share the benefits of collaboration.
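As a rough idea of what planning for chaos can look like at the code level, the sketch below injects failures into a call and checks that the retry logic around it copes. It is not a replacement for a real chaos engineering tool. Every name in it (ChaosError, with_chaos, call_with_retry, fetch_inventory) is hypothetical.

```python
import random


class ChaosError(RuntimeError):
    """Raised when the wrapper injects a failure on purpose."""


def with_chaos(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls raise ChaosError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ChaosError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Resilience logic under test: retry when an injected failure occurs."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ChaosError:
            if attempt == attempts:
                raise  # out of retries, surface the failure


def fetch_inventory():
    # Hypothetical dependency call; a real one would hit a service or database.
    return {"sku-1": 12}


if __name__ == "__main__":
    flaky = with_chaos(fetch_inventory, failure_rate=0.3, rng=random.Random(7))
    try:
        print(call_with_retry(flaky, attempts=5))
    except ChaosError:
        print("all retries were exhausted")
```

Passing in a seeded random generator keeps an experiment repeatable, which makes it easier for developers and the operations team to discuss the same failure.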
It is ideal to do capacity planning well in advance of any major release, as a release directly affects resource consumption. Responsibility for security shouldn’t be left solely to the infosec team, as it will only provide organization-wide policies. The operations team needs to help shift security left toward developers as well. Security now spans code, containers, clusters, and cloud, and it is an essential responsibility of the operations team too.
Regularly liaising with developers and other participating teams at every stage is crucial to avoid major trouble later in operations. It would be troublesome if the platform architects were not consulted while a new feature is being designed. The owner of the container orchestrator is the best person to help the developer arrive at a suitable deployment architecture.
Manual patching, on-the-fly configuration changes, and a lack of self-healing solutions are key causes of environment downtime. End-to-end observability, proactive identification of performance gaps, and collaboration with application architects on capacity planning will all surface ideas for automation in the environment. Prioritizing and implementing that automation will reduce manual errors and the need for a doer-checker process.
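For the capacity planning point above, even simple arithmetic done before a release is useful. The sketch below assumes a hypothetical service whose replicas each handle a known request rate. The function name, the parameters, and the 30% default headroom are illustrative assumptions, not recommendations.

```python
import math


def required_replicas(peak_rps, rps_per_replica, headroom=0.3):
    """Estimate replicas needed to serve `peak_rps` while keeping spare capacity.

    `headroom` is the fraction of each replica's capacity left unused,
    for example 0.3 means each replica is planned at 70% utilization.
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_rps = rps_per_replica * (1 - headroom)
    # Always keep at least one replica, even with no expected traffic.
    return max(1, math.ceil(peak_rps / usable_rps))


if __name__ == "__main__":
    # Hypothetical figures: expected peak after the release and measured capacity.
    print(required_replicas(peak_rps=1000, rps_per_replica=100, headroom=0.5))
```

Comparing this estimate with what is currently deployed gives developers and the operations team a shared number to discuss before the release goes out.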
HCLTech Cloud Smart and its innovative Cloud Application Reliability Engineering (CARE) framework are specifically designed to cater to many such use cases. CARE is inspired by #HCLmodernoperations and SRE principles and is used in multiple cloud operations engagements at HCLTech. The unique add-ons and accelerators of the CARE framework give the operations team the skills it needs to be a winner. One such accelerator is a tool that continuously measures the reliability index of key operations aspects and keeps track of the environment's operational maturity. The CARE framework also provides the guidelines needed to implement best practices and continuously monitor the environment, even as new patches or changes keep arriving.
Get in touch for more details around the #HCLCLOUDSMART and #HCLCARE framework.