Grafana Dashboards

Overview

Kelvin uses Grafana dashboards to view many types of log and telemetry analytics about the Kelvin Infrastructure.

There are four main data sources where you can retrieve information to build dashboards and alerts in Grafana;

To learn more about creating the query for the different data sources, check out their documentation for more details.

Data Source	Description
OpenSearch	Log data for Cluster/Node/Workload
Thanos	Telemetry (health) data for Cluster/Node/Workload
InfluxDB	Data on Data Streams, Control Changes and Recommendations
PostgreSQL	Information on Assets, Apps, Workloads, etc.

Grafana Alerts

One of the advantages of using Grafana is the ability to create alerts on specific telemetry or log data.

This helps you react faster to emerging situations or to compile records of certain events for analysis at a later stage.

In this document we only show you some commonly used examples of alerts that can help Developers and Platform Administrators keep themselves informed about the operations on Kelvin.

Thanos Alerts

Thanos holds all the telemetry information about the Clusters, Nodes and Workloads on a Kelvin Platform.

Click here to see a full list of data source options

Not all items listed yet. This is still under development. You can see a full list in Grafana itself.

Data Source Name	Description
node_cpu_seconds_total	Total amount of CPU time consumed by a node
node_memory_MemAvailable_bytes	Amount of memory that is available for starting new Kelvin SmartApps™, without swapping, on a node
node_filesystem_size_bytes	Total size of a filesystem on a node, measured in bytes
node_filesystem_avail_bytes	Amount of disk space available on a filesystem on a node, measured in bytes
kube_pod_container_status_restarts_total	Counts the total number of restarts for all containers within a specific pod in a Kubernetes cluster

Here we will go through four different Grafana alert rules that monitor different aspects of workloads and nodes.

Workload Restarts AlertNode CPU AlertNode Memory AlertNode Disk Alert

This will initiate an alert if any workload restarts.

To create a new alert rule, you can open grafana by clicking on the Monitoring menu item in Administration.

You will then see a list of all the dashboards available in Grafana.

In the top right corner, click on the plus icon in the top right corner and select create alert rule.

In section 1 add a suitable name.

In section 2 we add the actual query and conditions for the alert.

In section A, you can use the builder to create the query or you can change from builder to code and just type in the metrics browser.

You can copy and paste this query code into your query for the alert.

sum(kube_pod_container_status_restarts_total{namespace="app"}) by (pod)

In section 3 we then link it to a folder and an evaluation group. If you don't have one, then you can create a new one here.

In section 4 we then add a standard summary text that will be sent in the alert. The curly brackets are the variables that will be replaced with the actual text of the workload and number of restarts.

And finally in section 5 we configure notifications.

When finished, click Save rule and exit at the top right hand corner.

Your rule is now active.

This will initiate an alert if any nodes CPU usage exceeds a threshold.

To create a new alert rule, you can open grafana by clicking on the Monitoring menu item in Administration.

You will then see a list of all the dashboards available in Grafana.

In the top right corner, click on the plus icon in the top right corner and select create alert rule.

In section 1 add a suitable name.

In section 2 we add the actual query and conditions for the alert.

In section A, you can use the builder to create the query or you can change from builder to code and just type in the metrics browser.

You can copy and paste this query code into your query for the alert.

(
    (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))) / ignoring(cpu) group_left
    count without (cpu. mode) (node_cpu_seconds_total{mode="idle"})
)

In section 3 we then link it to a folder and an evaluation group. If you don't have one, then you can create a new one here.

In section 4 we then add a standard summary text that will be sent in the alert. The curly brackets are the variables that will be replaced with the actual text of the workload and number of restarts.

And finally in section 5 we configure notifications.

When finished, click Save rule and exit at the top right hand corner.

Your rule is now active.

This will initiate an alert if any nodes Memory usage exceeds a threshold.

To create a new alert rule, you can open grafana by clicking on the Monitoring menu item in Administration.

You will then see a list of all the dashboards available in Grafana.

In the top right corner, click on the plus icon in the top right corner and select create alert rule.

In section 1 add a suitable name.

In section 2 we add the actual query and conditions for the alert.

In section A, you can use the builder to create the query or you can change from builder to code and just type in the metrics browser.

You can copy and paste this query code into your query for the alert.

100 -
(
    avg(node_memory_MemAvailable_bytes{}) by (kubernetes_node) / avg(node_memory_MemTotal_bytes{}) by (kubernetes_node)
* 100
)

In section 3 we then link it to a folder and an evaluation group. If you don't have one, then you can create a new one here.

In section 4 we then add a standard summary text that will be sent in the alert. The curly brackets are the variables that will be replaced with the actual text of the workload and number of restarts.

And finally in section 5 we configure notifications.

When finished, click Save rule and exit at the top right hand corner.

Your rule is now active.

This will initiate an alert if any nodes Disk usage exceeds a threshold.

To create a new alert rule, you can open grafana by clicking on the Monitoring menu item in Administration.

You will then see a list of all the dashboards available in Grafana.

In the top right corner, click on the plus icon in the top right corner and select create alert rule.

In section 1 add a suitable name.

In section 2 we add the actual query and conditions for the alert.

In section A, you can use the builder to create the query or you can change from builder to code and just type in the metrics browser.

You can copy and paste these query codes into your query for the alert.

max by (mountpoint, kubernetes_node) (node_filesystem_size_bytes{cluster!="aws-eks-01", fstype!="", mountpoint !="", device+~(/dev/)?(mmcblk.p.+|nvme.+|ebd.+|sd.+|vd.+|xvd.+|dm-.+|md..+|dasd.+)"} )

and

max by (mountpoint, kubernetes_node) (node_filesystem_avail_bytes{cluster!="aws-eks-01", fstype!="", mountpoint !="", device+~(/dev/)?(mmcblk.p.+|nvme.+|ebd.+|sd.+|vd.+|xvd.+|dm-.+|md..+|dasd.+)"} )

In section 3 we then link it to a folder and an evaluation group. If you don't have one, then you can create a new one here.

In section 4 we then add a standard summary text that will be sent in the alert. The curly brackets are the variables that will be replaced with the actual text of the workload and number of restarts.

And finally in section 5 we configure notifications.

When finished, click Save rule and exit at the top right hand corner.

Your rule is now active.

OpenSearch Alerts

OpenSearch holds all the log information about the Clusters, Nodes and Workloads on a Kelvin Platform.

Examples coming soon.

InfluxDB Alerts

InfluxDB holds all the data for Data Streams, Control Changes and Recommendations on a Kelvin Platform.

Examples coming soon.

PostgreSQL Alerts

PostgreSQL holds all the information on Assets, Apps, Workloads, etc. on a Kelvin Platform.

Examples coming soon.