Hybrid and multi-cloud strategies have been adopted heavily in the last few years because they give enterprises the flexibility to host each application where it best suits business requirements. The result is a digital foundation that ties legacy applications and infrastructure more closely to next-generation workloads such as containers, artificial intelligence, and the Internet of Things (IoT).
While this can reduce cost by using lower-cost infrastructure where possible, operating and managing workloads across public, private, and hybrid environments can be a daunting task. On one hand, IT teams may face integration and portability challenges; on the other, the mapping of infrastructure to business services for end-to-end visibility can get murky, leading to more incidents and to greater financial and business risk.
Operations teams have been monitoring IT infrastructure and applications since the start of IT systems, but over time the IT landscape has become increasingly complex. In recent years, IT infrastructure has become software-defined and cloud native.
For many enterprises, decades-old legacy systems and mainframes will coexist with virtualized and cloud-native environments (both on- and off-premises) and as-a-service offerings, creating a complex maze of operations that must be navigated when trying to monitor performance.
When the Operations group cannot efficiently trace business services as they move from customer-facing systems of engagement to back-end systems of record, it becomes difficult to find the exact root cause of issues.
As a result, performance suffers while mean-time-to-resolution increases, leading to chaos.
The fact of the matter is: “You can’t do today’s job with yesterday’s methods and be in business tomorrow.”
Meet AgileOps - The road to Nirvana
AgileOps is an IT operating model crafted for digital businesses that applies the principles of Agile to infrastructure operations. With AgileOps, infrastructure delivery and operations teams specialize in the deployment, operation, and ongoing support of digital services and applications that are created in continuous development and continuous delivery environments.
Many enterprises have accelerated application development and life cycle management by adopting agile methodology and principles, but bringing the same agility to infrastructure operations requires a structural and cultural shift. That change will only occur when enterprises adopt DevOps methodologies, in which the Development and Operations groups share a unified objective: delivering better-quality, higher-performing applications at greater speed while maximizing the stability and uptime of digital platforms.
Let us now explore the critical steps to achieve Agile Operations.
The first requirement for implementing AgileOps is a focus on people and cultural change. To operate and maintain a resilient digital infrastructure, people are as crucial as infrastructure and processes. As leading analysts have highlighted, in the hybrid world infrastructure operations teams will have to embrace product-team thinking.
In this model, a product team (say, product managers, UX designers, and developers) focuses on user-level feature enhancements for an application or digital service.
The operations team supports and delivers the underlying complex infrastructure. A variety of critical platforms still requires ongoing maintenance (networking, monitoring, storage, continuous delivery, and cloud operations), but these platforms now also need to be automated and to offer self-service provisioning interfaces, so that they deliver the user experience different customers expect.
As a result, we now need teams, or squads, with end-to-end accountability for the underlying platform they manage. These squads, referred to as “platform operators” or full-stack engineers for the underlying platform, bridge the technical and business worlds. They must own the strategy and roadmap for their platform and oversee delivery of Infrastructure as a Platform to their internal customers: developers and application teams.
Focus on Continuous Process Improvement with Collaboration
The Agile manifesto has continuous improvement as one of its core principles. For the sake of understanding, in this blog I will restrict my discussion to incident management as a process.
Organizations that have focused only on resolving incidents will now have to switch to a culture of continuous process improvement. Incident management, as we see it today, is largely a functionally oriented structure. Specialized resource groups (such as Windows, Linux, DBAs, and network admins) complete tasks in sequence, with many hand-offs between groups. Every time a request is passed from one resource group to another it adds a hop and a delay, so wait times keep increasing.
With AgileOps, enterprises can eliminate many of these delays by creating cross-functional teams. Such teams can minimize or even eliminate process hand-offs by managing incidents end to end, followed by SRE-led incident postmortems.
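To make the cost of hand-offs concrete, here is a minimal sketch that counts the hops in a single incident and separates waiting time from hands-on time. The assignment history, group names, and timestamps are all hypothetical; a real ticketing tool would expose this data in its own format.

```python
from datetime import datetime, timedelta

# Hypothetical assignment history for one incident:
# (resource group, time picked up, time handed off or resolved).
history = [
    ("Service Desk", datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 20)),
    ("Windows", datetime(2024, 1, 8, 10, 30), datetime(2024, 1, 8, 11, 0)),
    ("DBA", datetime(2024, 1, 8, 13, 0), datetime(2024, 1, 8, 13, 45)),
]


def handoff_summary(history):
    """Return (number of hand-offs, total wait time, total hands-on time)."""
    hops = len(history) - 1
    # Wait time is the gap between one group releasing the ticket
    # and the next group picking it up.
    wait = sum(
        (nxt[1] - cur[2] for cur, nxt in zip(history, history[1:])),
        timedelta(),
    )
    # Hands-on time is the time each group actually held the ticket.
    work = sum((end - start for _, start, end in history), timedelta())
    return hops, wait, work


hops, wait, work = handoff_summary(history)
print(f"hand-offs: {hops}, waiting: {wait}, working: {work}")
# prints: hand-offs: 2, waiting: 3:10:00, working: 1:35:00
```

In this made-up example the ticket spends twice as long waiting between groups as it does being worked on, which is the kind of delay a cross-functional team is meant to remove.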
Incorporate SRE Principles to Achieve Digital Infrastructure Resiliency
Site reliability engineering (SRE), a term coined at Google, is now prevalent in all types of enterprises. More than a role, it is a mindset that brings system resiliency. The core responsibilities of SRE fall under the following broad categories:
- Focus on availability
- Eliminate waste by reducing toil
- Ensure efficient incident response systems
And all this can be achieved by answering the following questions:
- Inspect and adapt: what happened to cause this incident, and why?
- What steps did we take to resolve the incident?
- How can we auto-remediate the incident?
- What can be done to prevent similar incidents or failures from recurring?
Once the incident has been resolved, SREs can use the retrospective and the nature of the problems identified to help squad leaders address the underlying issues proactively. In short, instead of viewing SRE as a role, think of applying SRE as a mindset within the operations group.
Improving Agility with a Lean-Agile Mindset – IT operations teams have long been infamous for a cumbersome ticketing culture. Ticketing systems create rigid, sequenced processes and hamper team productivity. While many would argue that a first-come, first-served approach ensures a seamless workflow, it creates unnecessary barriers and checkpoints.
An agile approach like Kanban can help the IT operations team. By putting each task or request on a Kanban board, we gain a high level of visibility, streamline the workflow, centralize requests, and get real-time status tracking. Beyond visibility, Kanban protects the operations team from the chaos and countless interruptions caused by unplanned work, because teams can prioritize tasks by criticality.
Similarly, imagine a team that works on both production issues and new feature development, and struggles with its product backlog and prioritization as a result. Here, a popular agile framework like Scrum can be very helpful. Scrum, a widely adopted framework, uses short iterations of work called sprints and daily meetings called stand-ups to tackle discrete portions of a project in succession until the project is complete.
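As a rough illustration of prioritizing by criticality while limiting interruptions, the sketch below models a tiny Kanban board with a work-in-progress (WIP) limit. The class, the criticality scale, and the sample requests are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field

# Lower number means more critical; this scale is an assumption for the sketch.
CRITICALITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class KanbanBoard:
    wip_limit: int  # maximum number of cards allowed in "in_progress"
    columns: dict = field(
        default_factory=lambda: {"backlog": [], "in_progress": [], "done": []}
    )

    def add(self, title, criticality):
        """Every request, planned or unplanned, lands in the backlog first."""
        self.columns["backlog"].append((CRITICALITY[criticality], title))
        self.columns["backlog"].sort()  # most critical first

    def pull(self):
        """Start the most critical card, unless the WIP limit is reached."""
        if len(self.columns["in_progress"]) >= self.wip_limit:
            return None
        if not self.columns["backlog"]:
            return None
        card = self.columns["backlog"].pop(0)
        self.columns["in_progress"].append(card)
        return card[1]

    def finish(self, title):
        """Move a card from in_progress to done; return False if not found."""
        for card in self.columns["in_progress"]:
            if card[1] == title:
                self.columns["in_progress"].remove(card)
                self.columns["done"].append(card)
                return True
        return False


board = KanbanBoard(wip_limit=2)
board.add("Patch test servers", "low")
board.add("Payment API outage", "critical")
board.add("Expand storage volume", "medium")

started = [board.pull(), board.pull(), board.pull()]
print(started)
# prints: ['Payment API outage', 'Expand storage volume', None]
```

The third pull returns nothing because the WIP limit of two is reached; the low-priority request stays visible in the backlog instead of interrupting work already in progress.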
Teams can assess the high-priority production issues with this agile framework by answering questions like:
- Are they being adequately prioritized?
- Are they evenly distributed throughout the team, or at least properly planned to accommodate everyone?
- And will they impact the completion of planned new features?
Finally, create a common way of working (WoW) – define a common operating model detailing how the Application and Operations groups should organize and interact with each other. This bridges the gap between Dev teams and Ops (Infra) teams and brings in a DevOps culture in which Ops is the enabler of this mindset, moving from application operations to application service management.
The objective of a common WoW is realized when these two groups work together to:
- Share responsibilities, metrics and goals
- Collaborate, distribute knowledge and learn from continuous feedback
- Trust in each other, the technology and the agile methodology
So, with Agile in infrastructure operations, what are the key takeaways?
Hybrid Cloud adoption will succeed in the long term, but only if IT enterprises make fundamental changes in their culture and adopt agile ways of working.
To take one example: with hybrid cloud in place, more and more enterprises will adopt the cloud service brokerage model over time, so it becomes prudent for operations teams to focus on service reliability.
In addition to “keeping the lights on”, they need to ensure that services are available and operating at peak performance, and they need to understand what is happening with a service at any given moment from an end-user perspective, an application perspective, and an infrastructure perspective. This requires a cross-functional team.
That means a team that collaborates and owns an outcome, rather than people who work in silos without regard to how their work interacts with that of others.
It also means a team that can deliver something working at the end of a time frame and can measure its throughput and cycle time reliably.
Second, shift to Agile Metrics to drive business value
In the digital world, technologies like infrastructure as code, service-catalog-based consumption, and API-led integration are challenging the IT operations world. The value of IT operations is more than the sum of its current tasks under siloed functional towers, and traditional metrics and KPIs such as Input/Output Operations Per Second (IOPS), bandwidth, and response time alone won’t help.
Operations teams now need to add digital, agile metrics that focus on business value. Sample metrics for digital products and services are: Net Promoter Score, MTBF (mean time between failures), number of transactions per hour, total service uptime, unplanned downtime, time to restore or resolve service (cycle time), and number of vulnerabilities remediated per period.
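A few of these metrics can be derived directly from outage records. The sketch below is one possible calculation, assuming a hypothetical list of unplanned outages over a 30-day period and using one common definition of MTBF (operational time divided by number of failures); the data and function name are illustrative only.

```python
from datetime import datetime, timedelta

# Hypothetical unplanned outages for one service in a 30-day reporting period:
# (outage start, service restored).
period = timedelta(days=30)
outages = [
    (datetime(2024, 3, 4, 2, 0), datetime(2024, 3, 4, 3, 30)),
    (datetime(2024, 3, 15, 14, 0), datetime(2024, 3, 15, 14, 45)),
    (datetime(2024, 3, 27, 22, 0), datetime(2024, 3, 27, 22, 45)),
]


def service_metrics(outages, period):
    """Compute downtime, uptime percentage, mean time to restore, and MTBF."""
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = period - downtime
    n = len(outages)
    return {
        "unplanned_downtime": downtime,
        "uptime_pct": 100 * uptime / period,
        # Mean time to restore: average outage duration.
        "mttr": downtime / n if n else None,
        # MTBF: operational time divided by the number of failures.
        "mtbf": uptime / n if n else None,
    }


metrics = service_metrics(outages, period)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With this sample data the service had three hours of unplanned downtime, a one-hour mean time to restore, and an MTBF of 239 hours. Business-facing measures such as Net Promoter Score come from other sources and are not covered by this sketch.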
Finally, explore and invest in the tools that enable Agile Operations – More than ever before, enterprises will have to invest in intelligent hybrid cloud automation tools that allow them to proactively manage IT infrastructure performance across traditional and cloud environments.
Such tools provide a single, unified view and architecture and deliver service-level insights. This reduces the manual burden on IT operations staff and continually uncovers new ways to optimize infrastructure, informed by user experience analytics across multiple platforms.