Site reliability engineering (SRE) - A POV on how different it is from traditional IT operations | HCLTech

September 22, 2022

IT has evolved considerably. More and more applications are being re-architected as cloud native or built cloud native from the start (cloud-native first). Architectures continue to evolve, and IT infrastructure and platforms have been modernized.

The operating model for this evolved estate has transformed as well, and site reliability engineering (SRE) has become an integral part of it. SRE is a modern approach to running applications and services efficiently, in which software engineering principles are applied to increase the reliability of services. Key SRE practices include reducing toil by automating the environment, achieving complete observability with well-defined service-level objectives (SLOs), understanding service criticality through upstream and downstream dependencies, and codifying the resolution of operational issues, all with the aim of increasing reliability. SREs are highly user-centric, and many of their activities are aimed directly at improving the user experience of the services in scope. This engineering-led, user-focused approach is what distinguishes SRE from traditional IT operations.
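To make the idea of a well-defined service-level objective more concrete, here is a minimal sketch of how an error budget could be derived from an SLO. The function name, parameters, and the request-based availability measure are illustrative assumptions, not something prescribed by this article or by any particular tool.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget that is still unspent.

    Hypothetical helper for illustration only. It assumes availability is
    measured as the share of successful requests in a reporting window.
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The SLO permits this many failed requests in the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Share of the permitted failures that has not been used up yet.
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO, one million requests, 250 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

A team might use a figure like this to decide whether to prioritize new releases or reliability work; the thresholds for such decisions would depend on the service and are not covered here.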

Organizations define SRE in different ways, and you have probably come across several definitions as a result. These versions differ not because the basic concepts of SRE are hard to understand, but because most organizations are trying to define and adopt SRE while still operating within old organizational structures and archaic operating models.

Let me explain this with a couple of examples. An application operations team states that SRE is nothing but application operations and even coins the term Apps SRE. A Windows support team creates a new term, Windows SRE, and the list goes on. Because teams in organizations are still structured in traditional ways (Windows, Linux, Backup, Storage, Database, AppOps, and so on), new versions of SRE keep emerging. The original objective of SRE is not to create more silos but to dissolve some of them and bring in more efficiency.

Where do SREs contribute?

While SREs operate mainly in the support and operations phase, they also play a vital role during the planning, design, development, and transition phases, with the overall objective of creating reliable and efficient applications and services.

Therefore, when workloads and services are modernized and transformed, site reliability engineers (SREs) help make those services reliable and efficient. This also requires a change in operating models, which is leading service provider organizations to evolve theirs. HCLTech has been an early starter with a transformed operating model, so that SREs serve their intended purpose in the overall value chain and deliver optimum results. HCLTech has an evolved operating model under the CARE (Cloud operations reliability engineering) offering, which provides us an edge over our competitors in the maturity of SRE-related offerings, consulting, and implementation. Cloud Native Labs in HCLTech provides SRE consulting and SRE resource enablement services, which benefit our customers and delivery groups.
