Monitoring is the act of observing a system’s performance over time. Monitoring tools collect and analyze system data and translate it into actionable insights. Fundamentally, monitoring technologies, such as application performance monitoring (APM), can tell you if a system is up or down or if there is a problem with application performance. Monitoring data aggregation and correlation can also help you to make larger inferences about the system. Load time, for example, can tell developers something about the user experience of a website or an app.
Vertical Relevance highly recommends that the following foundational best practices be implemented when creating a monitoring solution.
Measuring Environment Data
Using the four golden signals (latency, traffic, errors, and saturation) as a guideline, you can begin to look at how those metrics would be expressed throughout the hierarchy of your systems. Since services are often built by adding layers of abstraction on top of more basic components, metrics should be designed to add insight at each level of the deployment.
We will look at different levels of complexity present in common distributed application environments:
- Individual server components
- Applications and services
- Collections of servers
- Environmental dependencies
- End-to-end experience
The ordering above expands the scope and level of abstraction with each subsequent layer.
Individual Server Components
Towards the bottom of the hierarchy of primitive metrics are host-based indicators. These would be anything involved in evaluating the health or performance of an individual machine, disregarding for the moment its application stacks and services. These can give you a sense of factors that may impact a single computer’s ability to remain stable or perform work. These include:
- CPU
- Memory
- Disk space
To measure CPU, the following measurements might be appropriate:
- Latency: Average or maximum delay in CPU scheduler
- Traffic: CPU utilization
- Errors: Processor specific error events, faulted CPUs
- Saturation: Run queue length
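As a rough illustration of the CPU signals above, the sketch below derives utilization and run queue length from the text of /proc/stat and /proc/loadavg. It assumes a Linux host with the standard procfs layout; the function names are hypothetical and not part of any monitoring tool.

```python
def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # Guest time is already counted in user/nice, so only sum the first 8.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before_text, after_text):
    """Traffic signal: fraction of CPU time spent busy between two samples."""
    busy0, total0 = parse_cpu_times(before_text)
    busy1, total1 = parse_cpu_times(after_text)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def run_queue_length(loadavg_text):
    """Saturation signal: currently runnable tasks reported by /proc/loadavg."""
    # The fourth field looks like "2/345": runnable entities / total entities.
    return int(loadavg_text.split()[3].split("/")[0])
```

On a real host you would read /proc/stat twice, a few seconds apart, and pass both snapshots to cpu_utilization. Working on text rather than file handles keeps the parsing easy to test.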
For memory, the signals might look like this:
- Latency: (none – difficult to find a good method of measuring and not actionable)
- Traffic: Amount of memory being used
- Errors: Out of memory errors
- Saturation: OOM killer events, swap usage
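The memory traffic and saturation signals can be approximated in a similar way. This is a minimal sketch that assumes the Linux /proc/meminfo format and a kernel recent enough to report MemAvailable; the helper names are illustrative only.

```python
def parse_meminfo(meminfo_text):
    """Parse /proc/meminfo text into a dict of integer values (mostly KiB)."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_signals(meminfo_text):
    """Return memory in use (traffic) and swap in use (saturation)."""
    info = parse_meminfo(meminfo_text)
    used_kib = info["MemTotal"] - info["MemAvailable"]
    return {
        "used_kib": used_kib,
        "used_fraction": used_kib / info["MemTotal"],
        "swap_used_kib": info["SwapTotal"] - info["SwapFree"],
    }
```

OOM killer events are not visible in /proc/meminfo, so they would need a separate source such as the kernel log.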
For storage devices:
- Latency: average wait time (await) for reads and writes
- Traffic: read and write I/O levels
- Errors: filesystem errors, disk errors in /sys/devices
- Saturation: I/O queue depth
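For storage, the average wait time can be estimated from two samples of /proc/diskstats. The sketch below assumes the Linux field layout (reads completed, time reading, writes completed, time writing, I/Os in progress) and uses hypothetical function names. It is an approximation of what tools such as iostat report, not a reimplementation of them.

```python
def parse_diskstats(diskstats_text, device):
    """Extract cumulative I/O counters for one device from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        f = line.split()
        if len(f) >= 14 and f[2] == device:
            return {
                "ios": int(f[3]) + int(f[7]),      # reads + writes completed
                "io_ms": int(f[6]) + int(f[10]),   # ms spent reading + writing
                "in_flight": int(f[11]),           # I/Os currently in progress
            }
    raise KeyError(device)


def average_wait_ms(before, after):
    """Latency signal: mean milliseconds per completed I/O between two samples."""
    ios = after["ios"] - before["ios"]
    return (after["io_ms"] - before["io_ms"]) / ios if ios > 0 else 0.0
```

The in_flight value gives a point-in-time view of queue depth, which can serve as a simple saturation indicator.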
Network and Connectivity Metrics
Monitoring your networking layer can help you improve the availability and responsiveness of both your internal and external services. For most types of infrastructure, network and connectivity indicators will be another dataset worth exploring. These are important gauges of outward-facing availability, but are also essential in ensuring that services are accessible to other machines for any systems that span more than one machine. Like the other metrics we’ve discussed so far, networks should be checked for their overall functional correctness and their ability to deliver necessary performance.
The networking signals can look like this:
- Latency: Network driver queue
- Traffic: Incoming and outgoing bytes or packets per second
- Errors: Network device errors, dropped packets
- Saturation: overruns, dropped packets, retransmitted segments, bandwidth utilization
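The traffic and error signals for a network interface can be sketched from two samples of /proc/net/dev. This assumes the Linux column layout (eight receive counters followed by eight transmit counters per interface); the function names and the sampling interval are illustrative assumptions.

```python
def parse_net_dev(net_dev_text, interface):
    """Extract cumulative counters for one interface from /proc/net/dev text."""
    for line in net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == interface:
            f = [int(v) for v in rest.split()]
            return {
                "rx_bytes": f[0], "rx_errors": f[2], "rx_dropped": f[3],
                "tx_bytes": f[8], "tx_errors": f[10], "tx_dropped": f[11],
            }
    raise KeyError(interface)


def network_rates(before, after, interval_seconds):
    """Convert two cumulative samples into per-second rates."""
    return {
        key: (after[key] - before[key]) / interval_seconds
        for key in before
    }
```

Dividing the byte rates by the link speed would give a bandwidth utilization figure, though the link speed has to come from elsewhere.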
Operating System Metrics
Along with representations of physical resources, it is also a good idea to gather metrics related to operating system abstractions that have limits enforced. Some examples that fall into this category are file handles and thread counts. These are not physical resources, but instead constructs with ceilings set by the operating system to prevent processes from overextending themselves. Most can be adjusted and configured with commands like ulimit, but tracking changes in usage of these resources can help you detect potentially harmful changes in your software’s usage.
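As one example of tracking an operating system limit, the sketch below compares a process's open file descriptors against its soft limit. It assumes a Unix-like system where Python's resource module is available, and a Linux-style /proc/self/fd directory for the live count; the function names are hypothetical.

```python
import os
import resource


def fd_usage(open_count, soft_limit):
    """Fraction of the file descriptor limit in use (0.0 if unlimited)."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_count / soft_limit


def current_fd_usage():
    """Measure this process's descriptor usage against its soft limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Listing the directory briefly opens one extra descriptor, so the
    # count can be slightly high. Assumes a Linux /proc filesystem.
    open_count = len(os.listdir("/proc/self/fd"))
    return fd_usage(open_count, soft_limit)
```

Recording this ratio over time makes a slow descriptor leak visible well before the process hits its ceiling.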
Metrics to Collect for Applications and Services
Moving up a layer, we start to deal with the applications and services that run on the servers. These programs use the individual server components we dealt with earlier as resources to do work. Metrics at this level help us understand the health of our single-host applications and services. We’ve separated distributed, multi-host services into a separate section to clarify the factors most important in those configurations.
These are metrics concerned with units of processing or work that depend on the host-level resources, like services or applications. The specific types of metrics to look at depend on what the service is providing, what dependencies it has, and what other components it interacts with. Metrics at this level are indicators of the health, performance, or load of an application.
While the metrics in the last section detailed the capabilities and performance of individual components and the operating system, the metrics here will tell us how well applications are able to perform the work we ask of them. We also want to know what resources our applications depend on and how well they manage those constraints.
It is important to keep in mind that the metrics in this section represent a departure from the generalized approach we were able to use in the previous section. The metrics that are most important from this point on will be very dependent on your applications’ characteristics, your configuration, and the workloads that you are running on your machines. We can discuss ways of identifying your most important metrics, but your results will depend on what the server is specifically being asked to do.
For applications that serve clients, the four golden signals are often fairly straightforward to pick out:
- Latency: The time to complete requests
- Traffic: Number of requests per second served
- Errors: Application errors that occur when processing client requests or accessing resources, service failures and restarts
- Saturation: The percentage or amount of resources currently being used
These indicators help determine whether an application is functioning correctly and with efficiency.
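To make the four signals concrete for a request-serving application, here is a minimal in-process sketch. The RequestStats class, its capacity parameter (a hypothetical maximum number of concurrent requests), and the metric names are all assumptions for illustration. A production service would more likely use an established metrics library, and this version is not thread-safe.

```python
import time


class RequestStats:
    """Tracks latency, traffic, errors, and saturation for one application."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed max concurrent requests
        self.in_flight = 0
        self.count = 0
        self.errors = 0
        self.total_seconds = 0.0
        self.started = time.monotonic()

    def observe(self, handler, *args):
        """Run a request handler while recording its outcome and duration."""
        self.in_flight += 1
        begin = time.monotonic()
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.in_flight -= 1
            self.count += 1
            self.total_seconds += time.monotonic() - begin

    def snapshot(self):
        """Return the four golden signals as they stand right now."""
        uptime = max(time.monotonic() - self.started, 1e-9)
        return {
            "latency_avg_s": self.total_seconds / self.count if self.count else 0.0,
            "traffic_rps": self.count / uptime,
            "error_rate": self.errors / self.count if self.count else 0.0,
            "saturation": self.in_flight / self.capacity,
        }
```

Wrapping each request in observe keeps the measurement in one place, so the handler code itself does not need to know about metrics.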
The four golden signals were designed primarily for distributed microservices, so they assume a client-server architecture. For applications that do not use a client-server architecture, the same signals are still important, but the “traffic” signal might need to be reconsidered slightly. This is basically a measurement of busyness, so finding a metric that adequately represents that for your application will serve the same purpose. The specifics will depend on what your program is doing, but some general substitutes might be the number of operations or data processed per second.
Metrics to Measure Collections of Servers and Their Communication
When dealing with horizontally scaled infrastructure, pools of servers are another layer you will need to add metrics for. Most services, especially when operated in a production environment, will span multiple server instances to increase performance and availability. This added complexity increases the surface area that is important to monitor. Distributed computing and redundant systems can make your systems more flexible, but network-based coordination is more fragile than communication within a single host. Robust monitoring can help alleviate some of the difficulties of dealing with a less reliable communication channel.
Collecting data that summarizes the health of collections of servers is important for understanding the actual capabilities of your system to handle load and respond to changes.
While metrics about individual servers are useful, at scale a service is better represented as the ability of a collection of machines to perform work and respond adequately to requests. This type of metric is in many ways just a higher-level extrapolation of application and server metrics, but the resources in this case are homogeneous servers instead of machine-level components.
Beyond the network itself, for distributed services, the health and performance of the server group is more important than the same measures applied to any individual host. While services are intimately tied to the computer they run on when confined to a single host, redundant multi-host services rely on the resources of multiple hosts while remaining decoupled from direct dependency on any one computer.
The golden signals at this level look very similar to those measuring service health in the last section. They will, however, take into account the additional coordination required between group members:
- Latency: Time for the pool to respond to requests, time to coordinate or synchronize with peers
- Traffic: Number of requests processed by the pool per second, scaling adjustment indicators
- Errors: Application errors that occur when processing client requests, accessing resources, or reaching peers
- Saturation: The amount of resources currently being used, the number of servers currently operating at capacity, the number of servers available, degraded instances
While these have a definite resemblance to the important metrics for single-host services, each of the signals grows in complexity when distributed. Latency becomes a more complicated issue as processing can require communication between multiple hosts. Traffic is no longer a function of a single server’s abilities, but is instead a summary of the group’s capabilities and the efficiency of the routing algorithm used to distribute work. Additional error modes are introduced related to networking connectivity or host failure. Finally, saturation expands to include the combined resources available to the hosts, the networking link connecting each host, and the ability to properly coordinate access to the dependencies each computer needs.
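One way to picture pool-level signals is as a roll-up of per-host reports. The sketch below assumes each host reports a small dictionary with hypothetical keys (healthy, requests_per_second, errors_per_second, utilization) and that 90% utilization counts as "at capacity". Both are illustrative choices rather than recommendations.

```python
def pool_signals(hosts, capacity_threshold=0.9):
    """Summarize per-host reports into pool-level traffic, errors, and saturation."""
    healthy = [h for h in hosts if h["healthy"]]
    requests = sum(h["requests_per_second"] for h in healthy)
    errors = sum(h["errors_per_second"] for h in healthy)
    at_capacity = sum(1 for h in healthy if h["utilization"] >= capacity_threshold)
    mean_utilization = (
        sum(h["utilization"] for h in healthy) / len(healthy) if healthy else 0.0
    )
    return {
        "traffic_rps": requests,
        "error_rate": errors / requests if requests else 0.0,
        "servers_available": len(healthy),
        "servers_degraded": len(hosts) - len(healthy),
        "servers_at_capacity": at_capacity,
        "mean_utilization": mean_utilization,
    }
```

Reporting servers_available and servers_at_capacity alongside the mean helps distinguish a pool that is evenly loaded from one where a few hosts are absorbing most of the work.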
Metrics Related to External Dependencies and the Deployment Environment
Some of the more important metrics you’ll want to keep track of are those related to dependencies. These exist at the boundary of your application or service, outside of your direct control – external dependencies including those related to your hosting provider and any services your applications are built to rely on. These represent resources you are not able to administer directly, but which can compromise your ability to guarantee your own service.
Because external dependencies represent critical resources, one of the only mitigation strategies available in case of full outages is to switch operations to a different provider. This is only a viable strategy for commodity services, and even then only with prior planning and loose coupling with the provider. Even when mitigation is difficult, knowledge of external events affecting your application is incredibly valuable. The golden signals applied to external dependencies may look similar to this:
- Latency: Time it takes to receive a response from the service or to provision new resources from a provider
- Traffic: Amount of work being pushed to an external service, the number of requests being made to an external API
- Errors: Error rates for service requests
- Saturation: Amount of account-restricted resources used (instances, API requests, acceptable cost, etc.)
Beyond the golden signals, other items that might be helpful to track include:
- Service status and availability
- Success and error rates
- Run rate and operational costs
- Resource exhaustion
These metrics can help you identify problems with your dependencies, alert you to impending resource exhaustion, and help keep expenses under control. If the service has drop-in alternatives, this data can be used to decide whether to move work to a different provider when metrics indicate a problem is occurring. For situations with less flexibility, the metrics can at least be used to alert an operator to respond to the situation and implement any available manual mitigation options.
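A minimal sketch of how these dependency signals might feed an alerting decision is shown below. The call records, quota figures, and thresholds (a 5% error rate and 80% quota usage) are hypothetical values chosen for illustration. Real limits depend on your provider and your own tolerances.

```python
def dependency_signals(calls, quota_used, quota_limit):
    """Summarize calls to an external service.

    calls is a list of (duration_seconds, succeeded) tuples.
    """
    total = len(calls)
    failures = sum(1 for _, ok in calls if not ok)
    return {
        "latency_max_s": max((d for d, _ in calls), default=0.0),
        "traffic_calls": total,
        "error_rate": failures / total if total else 0.0,
        "quota_saturation": quota_used / quota_limit,
    }


def should_alert(signals, max_error_rate=0.05, max_quota=0.8):
    """Flag the dependency when errors or quota usage exceed assumed limits."""
    return (
        signals["error_rate"] > max_error_rate
        or signals["quota_saturation"] > max_quota
    )
```

Alerting on quota saturation before it reaches 100% leaves time to request a higher limit or shift work elsewhere.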
Metrics that Track Overall Functionality and End-to-End Experience
The highest level metrics track requests through the system in context of the outermost component that users interact with. This might be a load balancer or other routing mechanism that is responsible for receiving and coordinating requests to your service. Since this represents the first touch point with your system, collecting metrics at this level gives an approximation of the overall user experience.
While the previously described metrics are incredibly useful, the metrics in this section are often the most important to set up alerting for. To avoid response fatigue, alerts—especially pages—should be reserved for scenarios that have a recognizable negative effect on user experience. Problems surfaced with these metrics can be investigated by drilling down using the metrics collected at other levels.
The signals we look for here are similar to those of the individual services we described earlier. The primary difference is the scope and the importance of the data we gather here:
- Latency: The time to complete user requests
- Traffic: Number of user requests per second
- Errors: Errors that occur when processing client requests or accessing resources
- Saturation: The percentage or amount of resources currently being used
As these metrics parallel user requests, values that fall outside of acceptable ranges for these metrics likely indicate direct user impact. Latency that does not conform to customer-facing or internal SLAs (service level agreements), traffic that indicates a severe spike or drop off, increases in error rates, and an inability to serve requests due to resource constraints are all fairly straightforward to reason about at this level. Assuming that the metrics are accurate, the values here can be directly mapped against your availability, performance, and reliability goals.
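Checking latency against an SLA is often done on a percentile rather than an average, so that a small number of slow requests is not hidden. The sketch below uses the nearest-rank method and assumes a 95th percentile target. The function names and the choice of percentile are illustrative, not taken from any particular SLA.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breaches_sla(latencies_ms, sla_ms, pct=95):
    """True when the chosen latency percentile exceeds the SLA threshold."""
    return percentile(latencies_ms, pct) > sla_ms
```

For example, latency_breaches_sla(samples, sla_ms=300) would flag a window of request timings whose 95th percentile is above 300 ms, which is the kind of user-visible condition worth paging on.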
There are many other types of metrics that can be helpful to collect. Conceptualizing the most important information at varying levels of focus can help you identify indicators that are most useful for predicting or identifying problems. Keep in mind that the most valuable metrics on higher levels are likely to be resources provided by lower layers.
Factors That Affect What You Choose to Monitor
For peace of mind, in an ideal world you would track everything related to your systems from the beginning in case an item may one day be relevant to you. However, there are many reasons why this might not be possible or even desirable.
A few factors that can affect what you choose to collect and act on are:
- Resources available for tracking: Depending on your human resources, infrastructure, and budget, you will have to limit the scope of what you keep track of to what you can afford to implement and reasonably manage.
- The complexity and purpose of your application: The complexity of your application or systems can have a large impact on what you choose to track. Items that might be mission critical for some software might not be important at all in others.
- The deployment environment: While robust monitoring is most important for production systems, staging and testing systems also benefit from monitoring, though there may be differences in severity, granularity, and the overall metrics measured.
- The likelihood of the metric being useful: One of the most important factors affecting whether something is measured is its potential to help in the future. Each additional metric tracked increases the complexity of the system and takes up resources. The necessity of data can change over time as well, requiring reevaluation at regular intervals.
- How essential stability is: Simply put, stability and uptime might not be priorities for certain types of personal or early stage projects.
The factors that influence your decisions will depend on your available resources, the maturity of your project, and the level of service you require.