SRE: Definition & Origin
The need for digital transformation has shifted the focus of IT enterprises from on-premises-hosted applications to the SaaS model, from hardware to software-defined infrastructure (SDI), and from error-prone and inconsistent manual processes to reliable, repeatable, and consistent automated tasks.
Cloud adoption has changed the way we build and operate products (applications) and manage their workloads, whether on infrastructure as a service (IaaS) or platform as a service (PaaS). Even enterprises that have their own datacenters use them in the form of a private cloud managed through infrastructure as code.
Site reliability engineering (SRE) is a software engineering approach that involves the maintenance of programmable infrastructure and maximizes the availability of workloads running on that infrastructure. It is a discipline that integrates different components of software engineering and utilizes them to solve problems related to infrastructure and operations. The key goals of SRE are to create extremely reliable and highly scalable software systems.
The SRE paradigm was conceptualized at Google to help the company run the Google.com site reliably and efficiently. As its originators put it, SRE is what happens when a software engineer is asked to design an operations function.
SRE vs. DevOps: Differences
Although site reliability engineering and DevOps share the same foundational principles, there are subtle differences between the two. DevOps engineers tend to focus on supporting continuous delivery and developer velocity, whereas site reliability engineering is a specific function responsible for reliability and automation throughout the software life cycle, with emphasis on successful deployment, release monitoring, and keeping software-defined infrastructure (SDI) running smoothly. SRE evolved partly as a response to the division between product development and operations teams, and it predates DevOps. Even so, it is not uncommon to consider SRE a function within DevOps.
Development teams are required to launch new features and functionality in response to dynamic business requirements, whereas operations teams are averse to changes in services because most outages are caused by a change. This change could be of any kind, such as a new configuration, a new feature launch, or a new type of user traffic. The result is tension and a conflict of interest between the two teams.
This conflict is not inevitable and can be addressed through the SRE function. A few basic principles embraced by SRE to balance these conflicting interests are as follows:
- Operations is a software problem that can be solved with software engineering approaches.
- SRE is managed by service-level objectives (SLOs). Maintaining 100% availability is not the goal; the target is the level of reliability that the SLO specifies.
- Automation is key. The aim is to eliminate toil by automating repetitive tasks and to reduce operational overload.
- Sharing ownership with developers is important. It gives both sides a holistic view of the service and breaks down boundaries between teams.
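One common way to make the SLO principle concrete is an error budget, a term the list above does not use but which follows directly from accepting less than 100% availability. The sketch below is a minimal illustration only: the function names, the 30-day window, and the 99.9% figure are assumptions chosen for the example, not values from this article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction such as 0.999. A value of 1.0 is rejected because
    a 100% target leaves no budget at all for change.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical service with a 99.9% SLO over 30 days: about 43.2 minutes.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

A team could use a figure like this to decide when to slow down releases: while budget remains, development can ship changes; once it is spent, the focus shifts to reliability work.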
Transform IT Services Management (ITSM) Processes with SRE
Roles and Responsibilities of Different Teams in IT Services Management Processes
Figure 1: Functions and Inter-Relationships of a SaaS Delivery Model
Let us explore the responsibilities of each function in the model above, their focus areas, and their roles in IT service management processes:
DevOps Team- The DevOps team is responsible for developing software products, packaging them into deployable images, and maintaining the source code. DevOps engineers oversee primary functions such as defining the application architecture, application design and development, maintenance of the code base, and proactive application management. This team does not typically operate at an instance level.
SRE Team- The site reliability engineering team is responsible for individual instances of the software product, including the following at an instance level:
- Deployment architecture
- Availability design
- Load balancing
- Capacity and performance management
- Operation automation through scripting
The SRE team defines the deployment architecture for specific instances of the software product, designs for high availability using solutions such as Kubernetes, and manages capacity and performance by monitoring instances.
In addition, the team automates instance maintenance activities through scripting and self-healing solutions. SRE fundamentally does the work that has historically been done by operations teams. However, it staffs that work with engineers who have software expertise, on the premise that these engineers can design and implement automation software. This team has access only at the instance level and does not have access to the product's source code.
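The self-healing idea mentioned above can be sketched as a small probe-and-restart routine. This is an illustrative sketch, not a description of any particular tool: `is_healthy` and `restart` are hypothetical callables that a team would supply (for example, an HTTP health check and a restart command), and the limit of three attempts is an arbitrary assumption.

```python
from typing import Callable


def heal_if_unhealthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Probe an instance and restart it until it reports healthy.

    Returns True if the instance ends up healthy, or False once
    `max_attempts` restarts have failed and a human should be paged.
    """
    if is_healthy():
        return True  # nothing to do
    for _ in range(max_attempts):
        restart()
        if is_healthy():
            return True
    return False  # automation gave up; escalate


if __name__ == "__main__":
    # Simulated instance that recovers on the second restart.
    state = {"up": False, "restarts": 0}

    def probe() -> bool:
        return state["up"]

    def do_restart() -> None:
        state["restarts"] += 1
        state["up"] = state["restarts"] >= 2

    print(heal_if_unhealthy(probe, do_restart), state["restarts"])
```

Passing the probe and the restart action in as functions keeps the routine independent of the platform; capping the number of attempts ensures a persistent fault is escalated rather than restarted indefinitely.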
Implementation Team- The on-boarding (implementation) team is responsible for on-boarding customers onto the software product with organization-specific configuration and foundation data. For instance, an out-of-the-box IT services management ticketing system will be configured with customer-specific support groups, service-level agreements (SLAs), and workflows.
This team typically extends operational support during the hyper-care period before handing over operational and administrative tasks to the operations team. The implementation team can only access customer-specific domain-separated instances and not the source code or other instances.
App Admin Team (Customer Ops)- The customer's operations team is responsible for app administration activities and for addressing issues related to configuration changes or customizations introduced within its specific domain-separated instance. This role is performed by a system administrator who has access only to the customer-specific instance and not to the source code or other instances.
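The access boundaries described for the four teams can be summarized as a small lookup structure. The sketch below is one simplified reading of the descriptions above, not a real access-control system: the team keys, the `Scope` fields, and the `can_access_instance` helper are all hypothetical names, and treating the DevOps team as having no instance access is a simplification of "does not typically operate at an instance level".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    source_code: bool        # may work with the product's source code
    all_instances: bool      # may operate any deployed instance
    own_customer_only: bool  # limited to one customer's domain-separated instance


# Assumed mapping, derived from the team descriptions in the text.
SCOPES = {
    "devops": Scope(source_code=True, all_instances=False, own_customer_only=False),
    "sre": Scope(source_code=False, all_instances=True, own_customer_only=False),
    "implementation": Scope(source_code=False, all_instances=False, own_customer_only=True),
    "app_admin": Scope(source_code=False, all_instances=False, own_customer_only=True),
}


def can_access_instance(team: str, instance_customer: str,
                        team_customer: Optional[str] = None) -> bool:
    """Return True if `team` may operate the instance owned by `instance_customer`."""
    scope = SCOPES[team]
    if scope.all_instances:
        return True
    if scope.own_customer_only:
        return team_customer is not None and team_customer == instance_customer
    return False


if __name__ == "__main__":
    print(can_access_instance("sre", "acme"))                       # True
    print(can_access_instance("app_admin", "acme", "globex"))       # False
    print(can_access_instance("implementation", "acme", "acme"))    # True
```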
Figure 2: Traditional ITSM Process Mapping for Cloud-Native Operations
There are a few differences between the traditional model and the SRE model. The traditional model tends to set development against operations, whereas the SRE model acts as a bridge between the two and resolves the conflict.
The table below highlights the differences between these two models:
| Traditional Model | SRE Model |
| --- | --- |
| Conflict between Development and Operations teams | SRE Acts as a Bridge and Eliminates the Conflict |
| Application Hosted On-Premises | Hosted on Cloud, SaaS Model |
| Infrastructure Maintenance by Customer | Managed by Cloud Service Provider |
| Relatively Low Automation | Highly Automated Ways of Working |
Moving Forward with Renewed ITSM
With the increase in software-powered services, IT service teams within an organization need to prepare themselves to deliver value quickly as part of their digital transformation journey. As a result, IT service management plays a central role in the organization.
Irrespective of business size or scope, IT services management still ensures that core processes such as incidents, service requests, problems, changes, and IT assets are managed in a streamlined manner. However, these processes now work in a way that is significantly different from the past: automation and service intelligence are driving the transformation of service management. Automation and artificial intelligence (AI) have similarly transformed fields such as delivery, logistics, and supply chain management. Despite this, the fundamentals at the core remain the same in each of these businesses. Meaningful progress is therefore possible only when fundamentals and advancements go hand in hand.