Resource · Glossary

    What Is SRE (Site Reliability Engineering)?

    SRE is what happens when you treat operations as a software problem. Born at Google, Site Reliability Engineering replaces heroic firefighting with engineering: measurable reliability targets, automation for everything repetitive, and the radical idea that some amount of failure is a budget to spend, not a sin to hide.

    Core Practices

    The Four Habits of SRE

    SLOs & error budgets

    Define how reliable is reliable enough; spend the remainder on shipping.

    Automate the toil

    Anything done twice by hand becomes a candidate for code.

    Blameless postmortems

    Incidents produce fixes and learning, not scapegoats.

    Ops as engineering

    Runbooks, monitoring, and remediation built and versioned like software.

    The On-Prem Gap

    SRE Below the Operating System

    SRE grew up in environments where hardware was someone else's problem. On-prem and data center teams adopting SRE hit the gap immediately: your error budget doesn't care whether the burn came from a bad deploy or a dying power supply — but your tooling usually only sees the first. Applying SRE to physical estates means extending the same discipline downward: component telemetry as SLIs, hardware alarms in the same correlation engine as application alerts, automated inspection replacing walkthroughs, and remediation through out-of-band control that works when SSH doesn't. That lower half of the stack is exactly the layer Sensaka instruments.

    Hardware health as first-class SLIs
    One correlation engine, all layers
    Automated inspection, zero toil
    OOB remediation for dead-OS states
    Error budgets that see the whole stack
    FAQ

    Common Questions About SRE

    What does SRE mean?

    SRE stands for Site Reliability Engineering — the discipline (from Google) of applying software engineering to operations: automating toil, defining reliability targets (SLOs), and treating uptime as an engineering problem rather than a heroics problem.

    What is a site reliability engineer?

    An engineer responsible for a service's reliability: setting SLOs and error budgets, building automation and monitoring, running incident response, and reducing manual operational work through code.

    What is the difference between SRE and DevOps?

    DevOps is a culture of shared ownership between development and operations; SRE is a concrete implementation of it with specific practices — SLOs, error budgets, blameless postmortems, and a cap on manual toil.

    Do SRE practices apply to physical infrastructure?

    Increasingly, yes. Error budgets and automation assume you can measure and act on the whole stack — which for on-prem estates means extending telemetry below the OS: hardware health, power, thermals, and out-of-band remediation.

    Give your SREs the bottom half of the stack