milearning: Prometheus Monitoring

Saturday, 7 November 2020

Prometheus Monitoring - Part 3

Please see my previous posts, which cover installing and configuring scrape targets in detail.

https://sunlnx.blogspot.com/2020/11/prometheus-monitoring-part-2.html

This post covers the following Prometheus topics:

PromQL Queries
Saving/Recording Rules
Alerting Rules

PromQL Queries

The query language used in Prometheus is called PromQL (Prometheus Query Language). Query results can be viewed as a graph, as a table, or in external systems such as Grafana, Zabbix and others.

Simple examples

node_cpu_seconds_total{}

node_cpu_seconds_total{instance="promonode01.devtestlabs.in:80"}

Regular Expressions
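The =~ matcher selects label values by regular expression. The pattern is fully anchored, so it must match the whole value. To list only nodes in a specific domain: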

node_cpu_seconds_total{instance=~".*.devtestlabs.*"}

node_cpu_seconds_total{instance=~".*.devtestlabs.*",mode=~".*irq"}
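
Because PromQL regex matchers are fully anchored, a pattern has to cover the entire label value, which is why the examples above wrap the domain in .* on both sides. Below is a rough Python sketch of that behaviour. It uses re.fullmatch as a stand-in for Prometheus's own regex engine, and the select helper and the instance names are made up for illustration.

```python
import re

def select(label_values, pattern):
    """Return the label values that a PromQL =~ matcher would keep.

    re.fullmatch mimics PromQL's implicit anchoring (^pattern$).
    """
    compiled = re.compile(pattern)
    return [v for v in label_values if compiled.fullmatch(v)]

# Hypothetical instance labels, for illustration only.
instances = [
    "promonode01.devtestlabs.in:80",
    "promonode02.devtestlabs.in:80",
    "db01.example.org:9100",
]

print(select(instances, r".*.devtestlabs.*"))  # keeps the two devtestlabs nodes
print(select(instances, r"devtestlabs"))       # empty: partial matches are rejected
```

The second call returns nothing because the bare word does not match a whole instance value.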

Data Types

Scalar

A numeric floating point value

Instant vector

A set of time series containing a single sample for each time series.

Range Vector

A set of time series containing a range of data points over time for each time series.

node_cpu_seconds_total{instance=~".*.devtestlabs.*",mode=~".*irq"}[1m]
node_cpu_seconds_total{instance=~".*.devtestlabs.*",mode=~".*irq"}[5m]

Functions/Subfunctions

Start with this instant vector:
node_netstat_Tcp_InSegs{instance="localhost:9100"}

Convert it to a range vector with [1m], then back to an instant vector of per-second rates using rate:
rate(node_netstat_Tcp_InSegs{instance="localhost:9100"}[1m])

Round each value up by wrapping it in the ceil function:
ceil(rate(node_netstat_Tcp_InSegs{instance="localhost:9100"}[1m]))

Turn the result into a range vector with the subquery syntax [1m:] and get the per-second derivative of the time series using deriv:
deriv(ceil(rate(node_netstat_Tcp_InSegs{instance="localhost:9100"}[1m]))[1m:])
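
To make the rate step less abstract, here is a simplified Python sketch of what a per-second rate over a range of counter samples amounts to. It is not Prometheus's implementation: the real rate() also extrapolates to the edges of the window, which this sketch skips. The simple_rate function and the sample values are assumptions for illustration.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter over a window.

    samples: list of (unix_timestamp, counter_value) pairs, oldest first.
    Simplification: Prometheus's rate() also extrapolates to the window
    edges; this sketch only accounts for counter resets.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count from zero again.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Made-up samples scraped every 15 s inside a 1m window.
window = [(0, 1000), (15, 1030), (30, 1075), (45, 1120)]
print(simple_rate(window))
```

Here the counter grows by 120 segments over 45 seconds, so the sketch reports roughly 2.67 segments per second.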

More Info: https://prometheus.io/docs/prometheus/latest/querying/functions/

Saving Rules
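Typing a long query every time is tedious, so Prometheus lets you save it as a recording rule. A recording rule evaluates its expression at every evaluation interval and stores the result as a new time series under the name you choose. The rules are defined in a separate rules file.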

Examples

Memory free percentage: 100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes
Root disk free percentage: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

vim /etc/prometheus/prometheus_rules.yml
groups:
  - name: custom_rules
    rules:
      - record: node_memory_free_percent
        expr: 100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes
      - record: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
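
As a quick sanity check of the root disk expression, the same arithmetic can be written in a few lines of Python. The free_percent helper and the filesystem sizes are hypothetical and chosen only to make the numbers easy to follow.

```python
def free_percent(free_bytes, size_bytes):
    """Mirror of the PromQL expression 100 * free / size."""
    return 100 * free_bytes / size_bytes

# Hypothetical root filesystem: 50 GiB total, 4 GiB free.
GIB = 1024 ** 3
print(free_percent(4 * GIB, 50 * GIB))  # 8.0
```

A value of 8.0 means 8% of the root filesystem is still free.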

Check the rules file for syntax errors:
promtool check rules prometheus_rules.yml
Checking prometheus_rules.yml
SUCCESS: 2 rules found

Reference the rules file in the main Prometheus configuration:
vim /etc/prometheus/prometheus.yml
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "prometheus_rules.yml"

Check the syntax of the main config file:

promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found

Checking prometheus_rules.yml
SUCCESS: 2 rules found

Since both checks pass, restart the Prometheus server:
systemctl restart prometheus

You should now see the custom rules in the Prometheus web UI, and the recorded metric names can be queried like any other series.

Alerts
We will create a new group named alert_rules and add it to the same rules file that holds the recording rules, under the existing groups key.
vim /etc/prometheus/prometheus_rules.yml
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: DiskSpaceFree10Percent
        expr: node_filesystem_free_percent <= 10
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has 10% or less Free disk space"
          description: "{{ $labels.instance }} has only {{ $value }}% or less free."
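
The for clause in InstanceDown means the expression has to stay true for a full minute before the alert fires. Until then the alert is pending. A rule without for, like DiskSpaceFree10Percent, fires on the first evaluation that matches. The Python sketch below models that state progression under simplifying assumptions: a fixed evaluation interval and a hypothetical alert_states helper. It is an illustration, not Prometheus code.

```python
def alert_states(condition_by_tick, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_tick: list of bools, one per evaluation (True = expr matched).
    interval_s: seconds between evaluations (assumed constant).
    for_s: the rule's `for` duration in seconds (0 if the rule has none).
    """
    states = []
    active_for = None  # seconds the condition has held continuously
    for matched in condition_by_tick:
        if not matched:
            active_for = None  # any gap resets the timer
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states

# up == 0 evaluated every 15 s with for: 1m; the target recovers once.
checks = [True, True, False, True, True, True, True, True]
print(alert_states(checks, interval_s=15, for_s=60))
```

The brief recovery on the third evaluation resets the timer, so the alert only reaches firing once the condition has held for 60 uninterrupted seconds.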

Check the syntax of the config file again and restart the service once promtool reports SUCCESS.

promtool check config prometheus.yml
Checking prometheus.yml
SUCCESS: 1 rule files found

Checking prometheus_rules.yml
SUCCESS: 3 rules found

sudo service prometheus restart
sudo service prometheus status
