Kubernetes Cluster Monitoring through Prometheus: External Monitoring
Are you tired of constantly checking the health of your Kubernetes cluster and struggling to identify performance issues before they become major problems? If so, it's time to take your cluster monitoring to the next level.
Kubernetes cluster monitoring is essential for keeping your cluster and applications performing well. By monitoring your cluster, you can detect and resolve issues before they impact your users. And with external monitoring, you can take that visibility a step further.
What is External Monitoring?
External monitoring is a method of monitoring your Kubernetes cluster from the outside: the monitoring system runs outside the cluster and scrapes the cluster's API server and exporters remotely. This approach provides a complete picture of your cluster's health and performance, including network traffic, resource usage, and more, and it keeps working even when the cluster itself is degraded.
Why External Monitoring is Essential for Your Cluster
External monitoring provides a comprehensive view of your cluster, including the health and performance of your nodes, pods, and containers. With this information, you can quickly identify potential issues and resolve them before they cause significant problems.
Additionally, external monitoring helps you detect performance bottlenecks, optimize your cluster’s resource usage, and ensure that your applications are running smoothly.
How to Set Up External Monitoring for Your Cluster
Setting up external monitoring for your Kubernetes cluster is straightforward. Here are the steps to follow:
- Choose a monitoring tool: There are several tools available for external monitoring, including Prometheus, Grafana, and Datadog. Choose one that meets your needs and fits your budget (a quick Prometheus example follows this list).
- Install the monitoring tool: Follow the instructions provided by the monitoring tool to install it on your cluster.
- Set up the monitoring tool: Configure the monitoring tool to collect data from your cluster and display it in an easy-to-read format.
- Monitor your cluster: Start monitoring your cluster and check the data to ensure that everything is working as expected.
- Continuously monitor and improve: Regularly check the data collected by the monitoring tool and make changes as necessary to ensure that your cluster is running optimally.
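For example, if you pick Prometheus, a minimal way to run it outside the cluster is a Docker container on your monitoring host. This is just a sketch; it assumes Docker is installed and that the prometheus.yml you build later in this guide sits in the current directory.
# Run an external Prometheus server on the monitoring host
docker run -d --name prometheus -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus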
Enabling Kubernetes Cluster Monitoring with Prometheus
Kubernetes cluster monitoring is a crucial aspect of ensuring the health and performance of your cluster and applications. With Prometheus, you gain a comprehensive, queryable view of your cluster's performance.
Prometheus is an open-source monitoring solution that is designed to collect and store data from your cluster, providing valuable insights into the health and performance of your nodes, pods, and containers. With its user-friendly interface and powerful features, Prometheus makes it easy to monitor your cluster and identify potential issues before they become major problems.
Enabling cluster monitoring with Prometheus takes just a few steps. Follow these to get started:
- Create a Kubernetes service account and token for Prometheus.
vim token.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-prometheus-monitoring
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - nodes/stats
      - nodes/metrics
      - services
      - endpoints
      - pods
      - namespaces
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-prometheus-monitoring
subjects:
  - kind: ServiceAccount
    name: external-prometheus-monitoring
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: external-prometheus-monitoring
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-prometheus-monitoring
  namespace: kube-system
kubectl create -f token.yaml
# After the resources are created, run the commands below to retrieve the token.
TOKEN_NAME=$(kubectl -n kube-system get serviceaccount external-prometheus-monitoring -o=jsonpath='{.secrets[0].name}')
TOKEN_VALUE=$(kubectl -n kube-system get secret/${TOKEN_NAME} -o=go-template='{{.data.token}}' | base64 --decode)
echo $TOKEN_VALUE
# Note: on Kubernetes v1.24+ a token secret is no longer auto-created for the service
# account; run "kubectl -n kube-system create token external-prometheus-monitoring" instead.
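Before wiring this token into Prometheus, it's worth verifying that it can actually reach the API server. A quick check from the monitoring host (replace cluster-endpoint with your cluster URL, as in the scrape configs below):
# Should return a JSON list of nodes if the token and RBAC rules are correct
curl -k -H "Authorization: Bearer ${TOKEN_VALUE}" https://cluster-endpoint/api/v1/nodes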
- Install exporters such as kube-state-metrics and node-exporter.
- Install the kube-state-metrics agent.
# kube-state-metrics install through helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-state-metrics [flags]
# Example
helm install test prometheus-community/kube-state-metrics -n monitoring
# Check Pods
kubectl get po -n monitoring
############### Output ####################
NAME READY STATUS RESTARTS AGE
test-kube-state-metrics-gnfytcuy-jhgjbj 1/1 Running 0 25s
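You can also confirm the exporter is serving metrics before pointing Prometheus at it. A quick sanity check, assuming the release name test and the monitoring namespace from the example above:
# Forward the kube-state-metrics service locally and fetch a few metrics
kubectl -n monitoring port-forward svc/test-kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | head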
- Install the node-exporter
# node-exporter install through helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/prometheus-node-exporter [flags]
# Example
helm install test-node-exporter prometheus-community/prometheus-node-exporter -n monitoring
# Check Pods
kubectl get po -n monitoring
############### Output ####################
NAME READY STATUS RESTARTS AGE
test-node-exporter-prometheus-node-exporter-nftz6 1/1 Running 0 10s
test-node-exporter-prometheus-node-exporter-tbx5t 1/1 Running 0 10s
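As with kube-state-metrics, a quick port-forward confirms node-exporter is serving data. The pod name below is taken from the sample output above; yours will differ:
# node-exporter listens on port 9100 by default
kubectl -n monitoring port-forward pod/test-node-exporter-prometheus-node-exporter-nftz6 9100:9100 &
curl -s http://localhost:9100/metrics | grep -m 5 node_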
- Add the following scrape configuration to your external Prometheus server's prometheus.yml.
- Enable cAdvisor discovery
- job_name: kubernetes-cadvisor
  scrape_interval: '15s'
  scrape_timeout: '10s'
  metrics_path: /metrics/cadvisor
  scheme: https
  kubernetes_sd_configs:
    - api_server: cluster-endpoint
      role: node
      bearer_token: "cluster-token"
      tls_config:
        insecure_skip_verify: true
  tls_config:
    insecure_skip_verify: true
  bearer_token: "cluster-token"
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__address__]
      target_label: Role
      replacement: "eks"
    - source_labels: [__address__]
      target_label: account
      replacement: "account-name"
      action: replace
    - source_labels: [__address__]
      target_label: cluster
      replacement: "cluster-name"
      action: replace
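Once this job is loaded, a quick PromQL query in the Prometheus UI confirms cAdvisor data is flowing; container_cpu_usage_seconds_total is a standard cAdvisor metric:
# Per-pod CPU usage over the last 5 minutes; returns series once scraping works
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))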
- Enable Kubernetes service discovery
Note: Make sure the annotation prometheus.io/scrape: "true" is present on each application Service you want scraped; the keep rule in the job below drops Services without it. An example is shown below for reference.
annotations:
  prometheus.io/scrape: "true"
- job_name: 'kubernetes-service-endpoints'
  sample_limit: 300000
  kubernetes_sd_configs:
    - api_server: cluster-endpoint
      role: endpoints
      bearer_token: "cluster-token"
      tls_config:
        insecure_skip_verify: true
  tls_config:
    insecure_skip_verify: true
  bearer_token: "cluster-token"
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - source_labels: [__address__]
      target_label: techteam
      replacement: ""
      action: replace
    - source_labels: [__address__]
      target_label: account
      replacement: "account-name"
      action: replace
    - source_labels: [__address__]
      target_label: environment
      replacement: "prod"
      action: replace
    - source_labels: [__address__]
      target_label: cluster
      replacement: "cluster-name"
      action: replace
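After reloading Prometheus (covered at the end of this guide), you can confirm that both jobs discovered targets via the HTTP API. This assumes Prometheus listens on localhost:9090 and jq is installed:
# List each active target's job and health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'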
Note: Find and replace these placeholder values to match your environment:
- cluster-endpoint with your cluster's API server URL
- cluster-token with the cluster token you created in the earlier steps
- cluster-name with your cluster name
- account-name (and similar labels such as environment) with your own values
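If you template these files, a simple substitution does the job. The values below are hypothetical; adjust them to your environment:
# Substitute placeholders in-place before starting Prometheus (GNU sed)
sed -i \
  -e 's|cluster-endpoint|https://my-cluster.example.com|g' \
  -e 's|cluster-token|<token from the earlier step>|g' \
  -e 's|cluster-name|prod-eks-1|g' \
  prometheus.yml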
- Create Prometheus rules for cluster monitoring. Save them in a rules file and reference it under rule_files in prometheus.yml.
groups:
  - name: kubernetes-alerts
    rules:
      - alert: pod_cpu_more_than_80
        expr: avg by(namespace, pod, cluster) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[1m]) * 100) / avg by(namespace, pod, cluster) (kube_pod_container_resource_requests{resource="cpu",container!="POD",container!=""}) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'cpu is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'cpu is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_cpu_more_than_70
        expr: avg by(namespace, pod, cluster) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[1m]) * 100) / avg by(namespace, pod, cluster) (kube_pod_container_resource_requests{resource="cpu",container!="POD",container!=""}) > 70 < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'cpu is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'cpu is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_memory_more_than_90
        expr: (sum(container_memory_working_set_bytes{image!="",container!="POD",container!=""}) by (pod, cluster, namespace) / sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, cluster, namespace)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'memory is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'memory is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_memory_more_than_80
        expr: (sum(container_memory_working_set_bytes{image!="",container!="POD",container!=""}) by (pod, cluster, namespace) / sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, cluster, namespace)) * 100 > 80 < 90
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'memory is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'memory is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_status_change_alert
        expr: min_over_time((sum by(pod, namespace, phase, cluster) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}))[5m:1m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: 'pod_name: {{ $labels.pod }}, namespace: {{ $labels.namespace }}, phase: {{ $labels.phase }}'
          summary: 'pod_name: {{ $labels.pod }}, Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}, phase: {{ $labels.phase }}'
      - alert: KubePodCrashLoopingReason
        expr: max_over_time(kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff|InvalidImageName|CreateContainerError|CreateContainerConfigError|CrashLoopBackOff|ContainerCreating"}[2m]) >= 1
        for: 30s
        labels:
          severity: critical
        annotations:
          description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "{{ $labels.reason }}").'
          summary: 'Cluster: {{ $labels.cluster }}, Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "{{ $labels.reason }}").'
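Assuming you saved these rules in a file such as kubernetes-alerts.yml (a name used here for illustration) and listed it under rule_files, validate the rules as well:
# Validate rule syntax and check that the expressions parse
./promtool check rules kubernetes-alerts.yml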
- Validate your Prometheus configuration file
./promtool check config prometheus.yml
- Reload Prometheus
systemctl reload prometheus
# or
kill -HUP $(pgrep prometheus)
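If Prometheus was started with the --web.enable-lifecycle flag, you can also trigger the reload over HTTP instead:
# POST to the lifecycle endpoint; Prometheus re-reads prometheus.yml and rule files
curl -X POST http://localhost:9090/-/reload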
By following these steps, you can start monitoring your cluster. Check the data to ensure that everything is running optimally.
By enabling cluster monitoring with Prometheus, you can gain valuable insights into your cluster’s performance and ensure that it is running smoothly. Start taking control of your cluster’s health and performance today!
Final Thoughts
External monitoring is a powerful way to ensure the optimal performance of your Kubernetes cluster. By monitoring your cluster from the outside, you can quickly identify and resolve issues, optimize resource usage, and keep your applications running smoothly. So, why wait? Start monitoring your cluster today and unlock its full potential!
Happy monitoring!