Kubernetes Cluster Monitoring through Prometheus: External Monitoring
Are you tired of constantly checking the health of your Kubernetes cluster and struggling to identify performance issues before they become major problems? If so, it's time to take your cluster monitoring to the next level.
Kubernetes cluster monitoring is essential for keeping your cluster and applications performing well. By monitoring your cluster, you can detect and resolve issues before they impact your users. And with external monitoring, you can take that visibility a step further.
What is External Monitoring?
External monitoring is a method of monitoring your Kubernetes cluster from the outside: the monitoring system runs outside the cluster and scrapes the cluster's API server and exporters remotely. This approach provides a complete picture of your cluster's health and performance, including network traffic, resource usage, and more, and it keeps working even when the cluster itself is degraded.
Why External Monitoring is Essential for Your Cluster
External monitoring provides a comprehensive view of your cluster, including the health and performance of your nodes, pods, and containers. With this information, you can quickly identify potential issues and resolve them before they cause significant problems.
Additionally, external monitoring helps you detect performance bottlenecks, optimize your cluster’s resource usage, and ensure that your applications are running smoothly.
How to Set Up External Monitoring for Your Cluster
Setting up external monitoring for your Kubernetes cluster is straightforward. Here are the steps to follow:
- Choose a monitoring tool: There are several tools available for external monitoring, including Prometheus, Grafana, and Datadog. Choose one that meets your needs and fits your budget (a quick Prometheus example follows this list).
- Install the monitoring tool: Follow the instructions provided by the monitoring tool to install it on your cluster.
- Set up the monitoring tool: Configure the monitoring tool to collect data from your cluster and display it in an easy-to-read format.
- Monitor your cluster: Start monitoring your cluster and check the data to ensure that everything is working as expected.
- Continuously monitor and improve: Regularly check the data collected by the monitoring tool and make changes as necessary to ensure that your cluster is running optimally.
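For example, if you pick Prometheus, a minimal way to run it outside the cluster is a Docker container on your monitoring host. This is just a sketch; it assumes Docker is installed and that the prometheus.yml you build later in this guide sits in the current directory.
# Run an external Prometheus server on the monitoring host
docker run -d --name prometheus -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus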
Enabling Kubernetes Cluster Monitoring with Prometheus
Kubernetes cluster monitoring is a crucial aspect of ensuring the health and performance of your cluster and applications. With Prometheus, you gain a comprehensive, queryable view of your cluster's performance.
Prometheus is an open-source monitoring solution that is designed to collect and store data from your cluster, providing valuable insights into the health and performance of your nodes, pods, and containers. With its user-friendly interface and powerful features, Prometheus makes it easy to monitor your cluster and identify potential issues before they become major problems.
Enabling cluster monitoring with Prometheus takes just a few steps. Follow these to get started:
- Create a Kubernetes service account and token for Prometheus.
vim token.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-prometheus-monitoring
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - nodes/stats
      - nodes/metrics
      - services
      - endpoints
      - pods
      - namespaces
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-prometheus-monitoring
subjects:
  - kind: ServiceAccount
    name: external-prometheus-monitoring
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: external-prometheus-monitoring
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-prometheus-monitoring
  namespace: kube-system
kubectl create -f token.yaml
# After the resources are created, run the commands below to retrieve the token.
TOKEN_NAME=$(kubectl -n kube-system get serviceaccount external-prometheus-monitoring -o=jsonpath='{.secrets[0].name}')
TOKEN_VALUE=$(kubectl -n kube-system get secret/${TOKEN_NAME} -o=go-template='{{.data.token}}' | base64 --decode)
echo $TOKEN_VALUE
# Note: on Kubernetes v1.24+ a token secret is no longer auto-created for the service
# account; run "kubectl -n kube-system create token external-prometheus-monitoring" instead.
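Before wiring this token into Prometheus, it's worth verifying that it can actually reach the API server. A quick check from the monitoring host (replace cluster-endpoint with your cluster URL, as in the scrape configs below):
# Should return a JSON list of nodes if the token and RBAC rules are correct
curl -k -H "Authorization: Bearer ${TOKEN_VALUE}" https://cluster-endpoint/api/v1/nodes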
- Install exporters such as kube-state-metrics and node-exporter.
- Install the kube-state-metrics agent.
# kube-state-metrics install through helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-state-metrics [flags]
# Example
helm install test prometheus-community/kube-state-metrics -n monitoring
# Check Pods
kubectl get po -n monitoring
############### Output ####################
NAME READY STATUS RESTARTS AGE
test-kube-state-metrics-gnfytcuy-jhgjbj 1/1 Running 0 25s
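You can also confirm the exporter is serving metrics before pointing Prometheus at it. A quick sanity check, assuming the release name test and the monitoring namespace from the example above:
# Forward the kube-state-metrics service locally and fetch a few metrics
kubectl -n monitoring port-forward svc/test-kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | head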
- Install the node-exporter
# node-exporter install through helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/prometheus-node-exporter [flags]
# Example
helm install test-node-exporter prometheus-community/prometheus-node-exporter -n monitoring
# Check Pods
kubectl get po -n monitoring
############### Output ####################
NAME READY STATUS RESTARTS AGE
test-node-exporter-prometheus-node-exporter-nftz6 1/1 Running 0 10s
test-node-exporter-prometheus-node-exporter-tbx5t 1/1 Running 0 10s
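As with kube-state-metrics, a quick port-forward confirms node-exporter is serving data. The pod name below is taken from the sample output above; yours will differ:
# node-exporter listens on port 9100 by default
kubectl -n monitoring port-forward pod/test-node-exporter-prometheus-node-exporter-nftz6 9100:9100 &
curl -s http://localhost:9100/metrics | grep -m 5 node_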
- Add the following scrape configuration to your external Prometheus server's prometheus.yml.
- Enable cAdvisor discovery
- job_name: kubernetes-cadvisor
  scrape_interval: '15s'
  scrape_timeout: '10s'
  metrics_path: /metrics/cadvisor
  scheme: https
  kubernetes_sd_configs:
    - api_server: cluster-endpoint
      role: node
      bearer_token: "cluster-token"
      tls_config:
        insecure_skip_verify: true
  tls_config:
    insecure_skip_verify: true
  bearer_token: "cluster-token"
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - source_labels: [__address__]
      target_label: Role
      replacement: "eks"
    - source_labels: [__address__]
      target_label: account
      replacement: "account-name"
      action: replace
    - source_labels: [__address__]
      target_label: cluster
      replacement: "cluster-name"
      action: replace
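Once this job is loaded, a quick PromQL query in the Prometheus UI confirms cAdvisor data is flowing; container_cpu_usage_seconds_total is a standard cAdvisor metric:
# Per-pod CPU usage over the last 5 minutes; returns series once scraping works
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))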
- Enable Kubernetes service discovery
Note: Make sure the annotation prometheus.io/scrape: "true" is present on each application Service you want scraped; the keep rule in the job below drops Services without it. An example is shown below for reference.
annotations:
  prometheus.io/scrape: "true"
- job_name: 'kubernetes-service-endpoints'
  sample_limit: 300000
  kubernetes_sd_configs:
    - api_server: cluster-endpoint
      role: endpoints
      bearer_token: "cluster-token"
      tls_config:
        insecure_skip_verify: true
  tls_config:
    insecure_skip_verify: true
  bearer_token: "cluster-token"
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - source_labels: [__address__]
      target_label: techteam
      replacement: ""
      action: replace
    - source_labels: [__address__]
      target_label: account
      replacement: "account-name"
      action: replace
    - source_labels: [__address__]
      target_label: environment
      replacement: "prod"
      action: replace
    - source_labels: [__address__]
      target_label: cluster
      replacement: "cluster-name"
      action: replace
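After reloading Prometheus (covered at the end of this guide), you can confirm that both jobs discovered targets via the HTTP API. This assumes Prometheus listens on localhost:9090 and jq is installed:
# List each active target's job and health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'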
Note: Find and replace these placeholder values to match your environment:
- cluster-endpoint with your cluster's API server URL
- cluster-token with the cluster token you created in the earlier steps
- cluster-name with your cluster name
- account-name (and similar labels such as environment) with your own values
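If you template these files, a simple substitution does the job. The values below are hypothetical; adjust them to your environment:
# Substitute placeholders in-place before starting Prometheus (GNU sed)
sed -i \
  -e 's|cluster-endpoint|https://my-cluster.example.com|g' \
  -e 's|cluster-token|<token from the earlier step>|g' \
  -e 's|cluster-name|prod-eks-1|g' \
  prometheus.yml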
- Create Prometheus rules for cluster monitoring. Save them in a rules file and reference it under rule_files in prometheus.yml.
groups:
  - name: kubernetes-alerts
    rules:
      - alert: pod_cpu_more_than_80
        expr: avg by(namespace, pod, cluster) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[1m]) * 100) / avg by(namespace, pod, cluster) (kube_pod_container_resource_requests{resource="cpu",container!="POD",container!=""}) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'cpu is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'cpu is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_cpu_more_than_70
        expr: avg by(namespace, pod, cluster) (rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[1m]) * 100) / avg by(namespace, pod, cluster) (kube_pod_container_resource_requests{resource="cpu",container!="POD",container!=""}) > 70 < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'cpu is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'cpu is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_memory_more_than_90
        expr: (sum(container_memory_working_set_bytes{image!="",container!="POD",container!=""}) by (pod, cluster, namespace) / sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, cluster, namespace)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'memory is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'memory is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_memory_more_than_80
        expr: (sum(container_memory_working_set_bytes{image!="",container!="POD",container!=""}) by (pod, cluster, namespace) / sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, cluster, namespace)) * 100 > 80 < 90
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'memory is high for {{ $labels.pod }} in namespace: {{ $labels.namespace }}, value is {{ humanize $value }}%'
          summary: 'memory is high for {{ $labels.pod }} in Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}'
      - alert: pod_status_change_alert
        expr: min_over_time((sum by(pod, namespace, phase, cluster) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}))[5m:1m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: 'pod_name: {{ $labels.pod }}, namespace: {{ $labels.namespace }}, phase: {{ $labels.phase }}'
          summary: 'pod_name: {{ $labels.pod }}, Cluster: {{ $labels.cluster }}, namespace: {{ $labels.namespace }}, phase: {{ $labels.phase }}'
      - alert: KubePodCrashLoopingReason
        expr: max_over_time(kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff|InvalidImageName|CreateContainerError|CreateContainerConfigError|CrashLoopBackOff|ContainerCreating"}[2m]) >= 1
        for: 30s
        labels:
          severity: critical
        annotations:
          description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "{{ $labels.reason }}").'
          summary: 'Cluster: {{ $labels.cluster }}, Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "{{ $labels.reason }}").'
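Assuming you saved these rules in a file such as kubernetes-alerts.yml (a name used here for illustration) and listed it under rule_files, validate the rules as well:
# Validate rule syntax and check that the expressions parse
./promtool check rules kubernetes-alerts.yml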
- Validate your Prometheus configuration file
./promtool check config prometheus.yml
- Reload Prometheus
systemctl reload prometheus
# or
kill -HUP $(pgrep prometheus)
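If Prometheus was started with the --web.enable-lifecycle flag, you can also trigger the reload over HTTP instead:
# POST to the lifecycle endpoint; Prometheus re-reads prometheus.yml and rule files
curl -X POST http://localhost:9090/-/reload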
By following these steps, you can start monitoring your cluster. Check the data to ensure that everything is running optimally.
By enabling cluster monitoring with Prometheus, you can gain valuable insights into your cluster’s performance and ensure that it is running smoothly. Start taking control of your cluster’s health and performance today!
Final Thoughts
External monitoring is a powerful way to ensure the optimal performance of your Kubernetes cluster. By monitoring your cluster from the outside, you can quickly identify and resolve issues, optimize resource usage, and keep your applications running smoothly. So, why wait? Start monitoring your cluster today and unlock its full potential!
Happy monitoring!