Monitoring: Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

White-box monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.
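To make the last form of white-box monitoring concrete, here is a minimal sketch of an HTTP handler that emits internal statistics, using only the Python standard library. The `Stats` class, the `/stats` path, the counter names, and the port are all hypothetical choices for illustration; a real service would more likely use its existing metrics library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Stats:
    """Thread-safe in-process counters (hypothetical helper, not a real library)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def incr(self, name, amount=1):
        # Add `amount` to the named counter, creating it on first use.
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def snapshot(self):
        # Return a copy so callers never see a half-updated dict.
        with self._lock:
            return dict(self._counters)


STATS = Stats()


class StatsHandler(BaseHTTPRequestHandler):
    """Serves the current counters as JSON on the assumed path /stats."""

    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = json.dumps(STATS.snapshot(), sort_keys=True).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Blocks forever; run it in a background thread inside a real service.
    HTTPServer(("127.0.0.1", port), StatsHandler).serve_forever()
```

Application code would call something like `STATS.incr("queries")` on each request, and a collector would poll `/stats` periodically.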

Black-box monitoring: Testing externally visible behavior as a user would see it.

A dashboard may have filters, selectors, and so on, but is prebuilt to expose the metrics most important to its users. The dashboard might also display team information such as ticket queue length, a list of high-priority bugs, the current on-call engineer for a given area of responsibility, or recent pushes.

Alert: A notification intended to be read by a human that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages.
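The mapping from destination to alert class can be sketched as a small lookup. The destination strings (`"ticket_queue"`, `"email_alias"`, `"pager"`) and the function name are assumptions made for this example, not names from any particular alerting system.

```python
from enum import Enum


class AlertKind(Enum):
    TICKET = "ticket"
    EMAIL_ALERT = "email alert"
    PAGE = "page"


# Hypothetical destination identifiers, one per kind of system named in the text.
_DESTINATION_TO_KIND = {
    "ticket_queue": AlertKind.TICKET,
    "email_alias": AlertKind.EMAIL_ALERT,
    "pager": AlertKind.PAGE,
}


def classify_alert(destination: str) -> AlertKind:
    """Classify an alert by the system it is pushed to."""
    try:
        return _DESTINATION_TO_KIND[destination]
    except KeyError:
        raise ValueError(f"unknown alert destination: {destination!r}") from None
```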

A given incident might have multiple root causes. Each contributing factor might stand alone as a root cause, and each should be repaired.

Node and machine: Used interchangeably to indicate a single instance of a running kernel in a physical server, virtual machine, or container. There might be multiple services worth monitoring on a single machine.

The services may either be: Related to each other:

There are many reasons to monitor a system, including:

Analyzing long-term trends: How big is my database and how fast is it growing? How quickly is my daily-active user count growing?

Comparing over time or experiment groups: Are queries faster with Acme Bucket of Bytes 2. How much better is my memcache hit rate with an extra node? Is my site slower than it was last week?

Alerting: Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon.

Building dashboards: Dashboards should answer basic questions about your service, and normally include some form of the four golden signals discussed later in this chapter.

Conducting ad hoc retrospective analysis i.

System monitoring is also helpful in supplying raw input into business analytics and in facilitating analysis of security breaches.

If an employee is at work, a page interrupts their workflow. If the employee is at home, a page interrupts their personal time, and perhaps even their sleep. Outages can be prolonged because other noise interferes with a rapid diagnosis and fix. Effective alerting systems have good signal and very low noise.

Setting Reasonable Expectations for Monitoring

Monitoring a complex application is a significant engineering endeavor in and of itself.

Even with substantial existing infrastructure for instrumentation, collection, display, and alerting in place, a Google SRE team with 10 to 12 members typically has one or sometimes two members whose primary assignment is to build and maintain monitoring systems for their service.

We avoid "magic" systems that try to learn thresholds or automatically detect causality. Rules that detect unexpected changes in end-user request rates are one counterexample; while these rules are still kept as simple as possible, they give a very quick detection of a very simple, specific, severe anomaly.
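As an illustration of how simple such a rule can be, the sketch below flags an unexpected change in end-user request rate by comparing two consecutive measurement windows against a fixed relative threshold. The function name, the window-over-window comparison, and the 50% default are assumptions chosen for the example; the text does not specify how these rules are actually written.

```python
def request_rate_changed(previous_rate, current_rate, max_relative_change=0.5):
    """Return True if the request rate moved by more than the allowed fraction.

    Rates are requests per second over two consecutive windows. The
    threshold is fixed and hand-chosen; nothing here is learned.
    """
    if previous_rate < 0 or current_rate < 0:
        raise ValueError("rates must be non-negative")
    if previous_rate == 0:
        # No baseline traffic: any traffic at all counts as a change.
        return current_rate > 0
    relative_change = abs(current_rate - previous_rate) / previous_rate
    return relative_change > max_relative_change
```

For example, a drop from 1000 to 400 requests per second is a 60% change and would trigger the rule, while a drop from 1000 to 980 would not.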

Other uses of monitoring data such as capacity planning and traffic prediction can tolerate more fragility, and thus, more complexity. Google SRE has experienced only limited success with complex dependency hierarchies. We seldom use rules such as, "If I know the database is slow, alert for a slow database; otherwise, alert for the website being generally slow."

