Question 1

What does SRE mean?

Accepted Answer

SRE stands for Site Reliability Engineering — the discipline (from Google) of applying software engineering to operations: automating toil, defining reliability targets (SLOs), and treating uptime as an engineering problem rather than a heroics problem.

Question 2

What is a site reliability engineer?

Accepted Answer

An engineer responsible for a service's reliability: setting SLOs and error budgets, building automation and monitoring, running incident response, and reducing manual operational work through code.

Question 3

What is the difference between SRE and DevOps?

Accepted Answer

DevOps is a culture of shared ownership between development and operations; SRE is a concrete implementation of it with specific practices — SLOs, error budgets, blameless postmortems, and a cap on manual toil.

Question 4

Do SRE practices apply to physical infrastructure?

Accepted Answer

Increasingly, yes. Error budgets and automation assume you can measure and act on the whole stack — which for on-prem estates means extending telemetry below the OS: hardware health, power, thermals, and out-of-band remediation.

What Is SRE (Site Reliability Engineering)?

The Four Habits of SRE

SLOs & error budgets

Automate the toil

Blameless postmortems

Ops as engineering

SRE Below the Operating System

Common Questions About SRE

What does SRE mean?

What is a site reliability engineer?

What is the difference between SRE and DevOps?

Do SRE practices apply to physical infrastructure?

Give your SREs the bottom half of the stack