In today’s digital age, monitoring and observability are critical components of any software or application development process. Effective monitoring and observability can help developers identify and resolve issues quickly, improve performance, and optimize resource utilization. However, achieving these goals requires careful planning, implementation, and ongoing maintenance.
According to a survey by AppDynamics, 84% of organizations experienced an application failure in the past year, and the average cost of downtime is $5,600 per minute. In addition, a study by Gartner found that by 2023, 75% of large enterprises will have adopted a multi-cloud or hybrid IT strategy, increasing the complexity of application and infrastructure monitoring. These statistics highlight how important effective monitoring and observability are for preventing downtime and ensuring optimal performance.
In this blog, we will discuss the best practices for effective monitoring and observability.
1. Define your objectives and metrics:
To define your objectives and metrics, you need to understand what’s important for your application and business. For example, if you’re running an e-commerce website, you may want to track metrics such as the number of orders, revenue, and conversion rate. You can use tools like Google Analytics, Mixpanel, or Amplitude to track these metrics.
Example:
//Google Analytics (gtag.js) snippet to track pageviews and a custom event
<script async src="https://www.googletagmanager.com/gtag/js?id=GA_TRACKING_ID"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', 'GA_TRACKING_ID'); // pageviews are tracked automatically once configured

  // Example custom event, e.g. recording a completed order
  gtag('event', 'purchase', {
    'value': 49.99,
    'currency': 'USD'
  });
</script>
2. Use the right tools
There are many monitoring and observability tools available, and choosing the right one depends on your requirements. For example, if you’re running a Kubernetes cluster, you may want to use tools like Prometheus, Grafana, and Fluentd to monitor your infrastructure and applications.
Example:
//Prometheus Operator ServiceMonitor to scrape an application's /metrics endpoint
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    app: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: web
    path: /metrics
    interval: 15s
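For a ServiceMonitor like this to have anything to scrape, the application itself must expose a /metrics endpoint. Below is a minimal sketch using the prometheus_client Python library; the port, metric name, and request loop are illustrative assumptions rather than part of the original example.

# Minimal sketch: expose Prometheus metrics from a Python app (illustrative names).
from prometheus_client import Counter, start_http_server
import random
import time

# A counter the ServiceMonitor's scrape would pick up; the metric name is an assumption.
REQUESTS_TOTAL = Counter("example_app_requests_total",
                         "Total requests handled by example-app")

if __name__ == "__main__":
    # Serves the /metrics endpoint on port 8000 (map this to the "web" port of the Service).
    start_http_server(8000)
    while True:
        REQUESTS_TOTAL.inc()          # simulate handling one request
        time.sleep(random.random())   # simulate a variable request rate

Prometheus then scrapes this endpoint at the 15-second interval configured above.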
3. Monitor everything:
Monitoring only the application layer leaves blind spots. Tools like Nagios or Zabbix can cover your infrastructure, network devices, and application services, so failures anywhere in the stack are detected.
Example:
//Nagios configuration to monitor a network switch
define host {
    use                  generic-switch
    host_name            switch1
    address              192.168.1.1
}
define service {
    use                  generic-service
    host_name            switch1
    service_description  Ping
    check_command        check_ping!100.0,20%!500.0,60%
}
define service {
    use                  generic-service
    host_name            switch1
    service_description  SNMP Uptime
    check_command        check_snmp!-C public -o sysUpTime.0 -r 5 -m RFC1213-MIB
}
4. Automate as much as possible:
To reduce manual effort, you can use configuration-management tools like Puppet, Ansible, or Chef to automate the deployment and configuration of your monitoring stack.
Example:
//Puppet code to deploy and configure Prometheus
class { 'prometheus':
  version => '2.30.2',
}

prometheus::rule { 'disk_space':
  record => 'disk_space_available',
  expr   => 'node_filesystem_avail_bytes / node_filesystem_size_bytes',
  alert  => 'warning',
}

prometheus::alert { 'disk_space':
  expr        => 'disk_space_available < 0.2',
  for         => '1h',
  labels      => { severity => 'critical' },
  annotations => { summary => 'Disk space is running low' },
}
5. Monitor in real-time
Tools like Datadog or New Relic provide real-time insights into your applications and infrastructure, so you can react to issues as they happen rather than after the fact.
Example:
//Datadog Agent DaemonSet (simplified) to collect real-time container metrics, logs, and traces
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
    spec:
      containers:
      - name: datadog-agent
        image: gcr.io/datadoghq/agent:latest
        env:
        - name: DD_API_KEY
          value: YOUR_API_KEY_HERE
        - name: DD_LOGS_ENABLED
          value: "true"
        - name: DD_DOGSTATSD_ORIGIN_DETECTION
          value: "true"
        - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
          value: "true"
        - name: DD_LOGS_CONFIG_LOGS_DD_SERVICE
          value: "datadog-agent"
        - name: DD_APM_ENABLED
          value: "true"
        - name: DD_APM_NON_LOCAL_TRAFFIC
          value: "true"
        - name: DD_PROCESS_AGENT_ENABLED
          value: "true"
        - name: DD_CONTAINER_EXCLUDE
          value: "name:dd-agent, name:kube-proxy, name:istio-proxy"
        - name: DD_AC_INCLUDE
          value: "name:nginx, name:redis"
        - name: DD_KUBERNETES_COLLECT_EVENTS
          value: "true"
        - name: DD_KUBERNETES_KUBELET_TLS_VERIFY
          value: "false"
        volumeMounts:
        - name: dockersock
          mountPath: /var/run/docker.sock
        - name: procdir
          mountPath: /host/proc
          readOnly: true
        - name: cgroups
          mountPath: /host/sys/fs/cgroup
          readOnly: true
      volumes:
      - name: dockersock
        hostPath:
          path: /var/run/docker.sock
      - name: procdir
        hostPath:
          path: /proc
      - name: cgroups
        hostPath:
          path: /sys/fs/cgroup
6. Ensure scalability
As your applications and infrastructure grow, so will the volume of monitoring data. Make sure your monitoring and observability stack can scale with it: the infrastructure backing data collection, storage, and analysis must keep up, and the tools themselves must handle the increased query and alerting workload.
7. Monitor user behavior
Monitoring user behavior is critical to understanding how your applications are actually used and to catching issues before they affect a wider audience. Use tools that can track user actions and surface patterns, such as drop-offs in a checkout funnel, that may indicate problems with your application; a minimal tracking sketch follows below.
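As a rough illustration, here is a sketch using the Mixpanel Python SDK (Mixpanel was mentioned earlier as an option); the project token, user ID, event name, and properties are hypothetical placeholders.

# Minimal sketch: record a user action with the Mixpanel Python SDK (hypothetical values).
from mixpanel import Mixpanel

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # placeholder project token

# Track a checkout event so funnels and drop-off points can be analyzed later.
mp.track("user-123", "checkout_started", {
    "cart_value": 84.50,
    "items": 3,
    "platform": "web",
})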
8. Collaborate
Effective monitoring and observability require collaboration between developers, operations teams, and other stakeholders. Make sure that all stakeholders have access to the data and insights they need to make informed decisions and work together to resolve issues quickly.
9. Review and analyze data
Collecting data is only the first step. To get the most out of your monitoring and observability efforts, you need to review and analyze the data regularly. Use tools that can help you visualize and analyze the data, identify trends and patterns, and provide insights into performance and user behavior.
There are various tools available to help you visualize and analyze data, but one of the most popular tools is Grafana. Grafana is a free and open-source platform for data visualization, monitoring, and analysis.
To get started with Grafana, you need to first install it and configure it to connect to your data sources. Once you have done that, you can create dashboards that display your data in various formats, such as graphs, tables, and heatmaps.
Here’s an example of how to create a simple Grafana dashboard to visualize system metrics:
- First, install and configure Grafana to connect to your data sources. You can follow the instructions on the Grafana website to do this.
- Once you have installed Grafana and configured your data sources, log in to the Grafana web interface and create a new dashboard.
- In the dashboard, add a new panel and select the type of visualization you want to use. For example, you can use a graph to visualize CPU usage over time.
- Select the data source you want to use for the panel. For example, you can select your server monitoring tool as the data source.
- Choose the metric you want to visualize. For example, you can choose the CPU usage metric.
- Configure the panel settings to customize the visualization. For example, you can set the time range, add annotations, and adjust the graph style.
- Save the panel and add more panels to the dashboard as needed.
Here’s a simplified example of the JSON for a Grafana dashboard that displays CPU usage (the JSON Grafana actually exports is more verbose):
{
  "title": "Server Metrics",
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        {
          "query": "cpu.usage",
          "datasource": "server-monitoring-tool"
        }
      ],
      "time": {
        "from": "now-1h",
        "to": "now"
      },
      "annotations": {
        "list": [
          {
            "value": "Server rebooted",
            "time": "2023-03-20T13:30:00Z",
            "title": "Reboot"
          }
        ]
      }
    }
  ]
}
10. Continuously improve
Effective monitoring and observability are ongoing processes that require continuous improvement. Regularly review your monitoring and observability practices, and look for ways to optimize your processes, tools, and data collection.
To continuously improve, you can use tools like Grafana or Kibana to visualize your data and identify trends and patterns. You can also conduct post-incident reviews to identify areas for improvement.
Example:
//Grafana panel JSON (excerpt) to visualize application metrics
{
  "alias": "$tag_env - $tag_service",
  "bars": false,
  "datasource": "prometheus",
  "fill": 1,
  "id": 1,
  "legend": {
    "show": true
  }
}
11. Set up alerts and notifications:
To set up alerts and notifications, you can use tools like PagerDuty, OpsGenie, or VictorOps, which can send notifications via email, SMS, or chat.
Example:
//PagerDuty Events API v2 payload to trigger an alert for high CPU usage
{
  "routing_key": "YOUR_ROUTING_KEY",
  "event_action": "trigger",
  "payload": {
    "summary": "High CPU usage on server1",
    "source": "server1",
    "severity": "critical",
    "custom_details": {
      "cpu_usage": "95%"
    }
  }
}
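To actually send that payload, you POST it to PagerDuty's Events API v2 endpoint (https://events.pagerduty.com/v2/enqueue). Here is a minimal Python sketch; the routing key and details are placeholders.

# Minimal sketch: trigger a PagerDuty incident via the Events API v2 (placeholder values).
import requests

event = {
    "routing_key": "YOUR_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
        "summary": "High CPU usage on server1",
        "source": "server1",
        "severity": "critical",
        "custom_details": {"cpu_usage": "95%"},
    },
}

resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
resp.raise_for_status()
print(resp.json())  # includes a dedup_key you can reuse to acknowledge or resolve the alert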
12. Correlate data from different sources:
To correlate data from different sources, you can use tools like Splunk or ELK (Elasticsearch, Logstash, Kibana), which can aggregate and correlate data from different sources.
Example:
//Logstash pipeline to parse, enrich, and index nginx access logs shipped by Beats
input {
  beats {
    port => 5044
  }
}
filter {
  if [fields][type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}
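Once the logs are indexed, you can also query them programmatically and join them with data from other sources. Here is a rough sketch using the official Elasticsearch Python client (8.x style); the index pattern and field names are assumptions based on the pipeline above.

# Rough sketch: pull recent nginx access events from Elasticsearch for correlation
# (index pattern and field names are assumptions based on the Logstash pipeline above).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

resp = es.search(
    index="filebeat-*",
    query={
        "bool": {
            "must": [
                {"match": {"fields.type": "nginx-access"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    size=10,
)

for hit in resp["hits"]["hits"]:
    doc = hit["_source"]
    print(doc.get("clientip"), doc.get("response"), doc.get("@timestamp"))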
Effective monitoring and observability are critical for preventing downtime, optimizing performance, and ensuring the success of your business. By following best practices such as defining your objectives and metrics, using the right tools, monitoring everything, automating as much as possible, monitoring in real-time, and correlating data from different sources, you can gain real-time insights into your applications and infrastructure, and take proactive measures to ensure optimal performance and prevent failures.