Problem statement – Because of the way we currently monitor, alert on, and log our digital ecosystem, it takes longer than it should to detect, diagnose, and fix production incidents. Communication and status updates on production incidents are neither reliable nor timely. Business continuity and disaster recovery capabilities matter even more today as we rely more heavily on digital experiences.
SRE is causing disruption in quality engineering. In the past, the focus of quality engineering was on shift-left testing, especially requirements review, functional, and non-functional testing. With SRE, the focus is shifting towards shift-right, or production testing.
DevOps vs SRE
One obvious question concerns the crossover between SRE and DevOps, and rightly so. There is significant overlap between them, as both aim to break down the silos between “dev” and “ops.” There are also numerous parallels in the practices they follow. However, their approaches and objectives are quite different.
SRE is focused on four delivery pillars:
Incident management: The main goal of this pillar is to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) to target levels. Setting up a single pane of glass for telemetry and observability makes investigating and diagnosing problems easier. Defining service level objectives (SLOs) and service level indicators (SLIs) at each service layer and at the customer-experience website level is the key milestone. Availability, latency, and system throughput are the KPIs that need to be tracked.
Problem management: This pillar deals with root cause analysis, prevention, and self-healing mechanisms in the digital ecosystem. SRE dashboards and data-driven insights provide information about overall service health, which helps us determine service availability over a given period of production monitoring.
Environment management: Business continuity and disaster recovery are the core areas of focus for this pillar. Security, compliance, data management, and DevOps governance are also part of this pillar. Chaos engineering is a disciplined approach to identifying vulnerabilities in systems in the production environment. It is used to check the system’s reliability, stability, and ability to survive unstable and unexpected conditions.
Outage communications: Curated communication of production incidents, software releases, and scheduled and unscheduled maintenance. The goal of this pillar is to use all channels to communicate clearly about system health and incident resolution.
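To make the incident management metrics concrete, here is a minimal Python sketch of how MTTD, MTTR, and availability against an SLO might be computed from incident records. Everything in it is an illustrative assumption rather than part of any particular tool: the Incident record, the function names, the sample timestamps, and the 99.9% target are hypothetical, and MTTR is measured here from the start of the fault (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_time_to_detect(incidents):
    """Average gap between fault start and detection (MTTD)."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_resolve(incidents):
    """Average gap between fault start and resolution (MTTR)."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def availability(incidents, window):
    """Fraction of the reporting window during which the service was up."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return 1 - downtime / window


# Made-up sample data: two incidents in a 30-day reporting window.
incidents = [
    Incident(datetime(2024, 1, 3, 10, 0),
             datetime(2024, 1, 3, 10, 10),
             datetime(2024, 1, 3, 11, 0)),
    Incident(datetime(2024, 1, 17, 22, 0),
             datetime(2024, 1, 17, 22, 20),
             datetime(2024, 1, 17, 22, 40)),
]
window = timedelta(days=30)
slo_target = 0.999  # assumed availability SLO of 99.9%

mttd = mean_time_to_detect(incidents)
mttr = mean_time_to_resolve(incidents)
measured = availability(incidents, window)
meets_slo = measured >= slo_target

print("MTTD:", mttd)
print("MTTR:", mttr)
print("Availability: {:.4%}".format(measured))
print("SLO met:", meets_slo)
```

With this sample data, 100 minutes of downtime in a 30-day window exceeds the roughly 43 minutes that a 99.9% target allows, so the SLO is reported as missed. In practice these figures would come from the telemetry and incident tooling behind the dashboards rather than from hand-entered records.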
SRE can be implemented in phases. The diagram below depicts the three-stage approach:
SRE Jumpstart consists of aspects such as:
For more information on site reliability engineering, please contact Genesis Robinson at firstname.lastname@example.org.
Reference – Keep the lights on – The SRE way