Skip to content

Phase 5: Observability Stack

What This Is

This runbook documents how I deployed the monitoring stack (Prometheus, Grafana, AlertManager with Slack) and the logging stack (Elasticsearch, Filebeat, Kibana) on the EKS cluster, and exposed all dashboards via HTTPRoutes on custom subdomains. This phase also hit a node capacity issue that required scaling the self-managed node group from 3 to 4 nodes.

This is Phase 5 of a 6-phase project.

Phase Title What It Covers
1 CI Pipeline and DevSecOps GitHub Actions workflows, Trivy scanning, GHCR image and chart publish
2 AWS Infrastructure DNS, ACM certificate, VPC, EKS cluster, bastion host, self-managed nodes
3 Cluster Add-ons and Gateway API ALB Controller, EBS CSI, Gateway API, ExternalDNS
4 GitOps with ArgoCD ArgoCD, Application manifest, Image Updater, CI-CD integration
5 Observability Stack (this runbook) kube-prometheus-stack, ELK stack, Slack alerting, HTTPRoutes
6 Autoscaling, Load Testing, and Final Verification Metrics Server, HPA, scaling validation, full cluster audit

At the end of this phase, the cluster has full metrics monitoring (Prometheus + Grafana), log aggregation (Elasticsearch + Filebeat + Kibana), and critical alert routing to Slack, all accessible via custom subdomains through the shared ALB.

What I Did

Step 1   Deployed kube-prometheus-stack via Helm with Slack AlertManager config
Step 2   Created Slack webhook secret, installed the chart
Step 3   Applied HTTPRoutes + TargetGroupConfigs for Grafana and Prometheus
Step 4   Verified Grafana at grafana.ibtisam.qzz.io, Prometheus at prometheus.ibtisam.qzz.io
Step 5   Deployed ELK stack: ECK operator, Elasticsearch, Filebeat, Kibana
Step 6   Hit node capacity: Kibana pod stuck in Pending (3 nodes full)
Step 7   Scaled ASG from 3 to 4 nodes, 4th node joined, Kibana scheduled
Step 8   Applied HTTPRoute + TargetGroupConfig for Kibana
Step 9   Verified Kibana at kibana.ibtisam.qzz.io with container logs from all pods
Item Value
Codebase addons/kube-prometheus/ and addons/elastic-logging/
Grafana grafana.ibtisam.qzz.io
Prometheus prometheus.ibtisam.qzz.io
Kibana kibana.ibtisam.qzz.io

Monitoring: kube-prometheus-stack

Files Used

addons/kube-prometheus/
├── patch-values.yaml              # AlertManager Slack config, email default receiver
├── httproute-grafana.yaml         # grafana.ibtisam.qzz.io -> Grafana Service
├── httproute-prometheus.yaml      # prometheus.ibtisam.qzz.io -> Prometheus Service
├── target-grp-grafana.yaml        # ALB target group for Grafana (targetType: ip)
└── target-grp-prometheus.yaml     # ALB target group for Prometheus (targetType: ip)

AlertManager Configuration

The patch-values.yaml configures AlertManager with two receivers: Slack for critical alerts and email as the default fallback. The Slack webhook URL is stored in a Kubernetes Secret, not in the values file.

cd addons/kube-prometheus/

# Create the Slack webhook secret before installing the chart
kubectl create namespace monitoring

kubectl create secret generic alertmanager-slack-webhook \
  --from-literal=slack-webhook-url="<REDACTED>" \
  -n monitoring

The AlertManager config in patch-values.yaml mounts this secret and routes critical alerts to the #alertmanager Slack channel:

alertmanager:
  alertmanagerSpec:
    secrets:
      - alertmanager-slack-webhook
  config:
    route:
      receiver: 'email-default'
      routes:
        - receiver: 'slack-notification'
          matchers:
            - severity = "critical"
    receivers:
      - name: 'slack-notification'
        slack_configs:
          - api_url_file: /etc/alertmanager/secrets/alertmanager-slack-webhook/slack-webhook-url
            channel: '#alertmanager'
            send_resolved: true
      - name: 'email-default'

Installation

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community

helm upgrade -i kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --version 86.2.3 \
  -f patch-values.yaml \
  -n monitoring --create-namespace

After the chart deployed, I retrieved the Grafana admin password:

kubectl get secret --namespace monitoring \
  -l app.kubernetes.io/component=admin-secret \
  -o jsonpath="{.items[0].data.admin-password}" | base64 --decode ; echo

Exposing Grafana and Prometheus

I applied the HTTPRoutes and TargetGroupConfigurations via kustomize:

kubectl apply -k .
# httproute.gateway.networking.k8s.io/grafana-route created
# httproute.gateway.networking.k8s.io/prometheus-route created
# targetgroupconfiguration.gateway.k8s.aws/grafana-tg-config created
# targetgroupconfiguration.gateway.k8s.aws/prometheus-tg-config created

ExternalDNS picked up the new HTTPRoutes and created DNS records automatically. Both dashboards were accessible within a few minutes:

kubectl get httproutes.gateway.networking.k8s.io -A
# NAMESPACE      NAME              HOSTNAMES                         AGE
# argocd         argocd-server     ["argocd.ibtisam.qzz.io"]         45m
# boutique-app   http-app-route    ["app.ibtisam.qzz.io"]            35m
# monitoring     grafana-route     ["grafana.ibtisam.qzz.io"]        2m15s
# monitoring     prometheus-route  ["prometheus.ibtisam.qzz.io"]     2m15s

Prometheus target health

Grafana pod metrics dashboard


Logging: Elastic Stack (ELK)

Files Used

addons/elastic-logging/
├── patch-values-elasticsearch.yaml    # Elasticsearch CR config
├── patch-values-filebeat.yaml         # Filebeat DaemonSet with Kubernetes autodiscover
├── patch-values-kibana.yaml           # Kibana CR referencing Elasticsearch
├── httproute-kibana.yaml              # kibana.ibtisam.qzz.io -> Kibana Service
├── target-grp-kibana.yaml             # ALB target group for Kibana (targetType: ip)
└── storage-class-gp3.yaml             # gp3 StorageClass (already created in Phase 3)

Components

Component What It Does How It Is Deployed
ECK Operator Manages Elasticsearch and Kibana CRs as Kubernetes-native resources Helm chart elastic/eck-operator
Elasticsearch Search and analytics engine, stores all log data Helm chart elastic/eck-elasticsearch (1 node, gp3 PVC)
Filebeat DaemonSet that runs on every node, collects container logs from /var/log/containers/ and ships to Elasticsearch Helm chart elastic/eck-beats with patch-values-filebeat.yaml
Kibana Web UI for searching and visualizing logs Helm chart elastic/eck-kibana

Installation

cd addons/elastic-logging/

kubectl create namespace logging

helm repo add elastic https://helm.elastic.co
helm repo update

# ECK Operator (manages ES and Kibana CRs)
helm upgrade -i eck-operator elastic/eck-operator \
  --version 3.4.0 -n logging

# Elasticsearch (1 node, uses gp3 PVC)
helm upgrade -i eck-elasticsearch elastic/eck-elasticsearch \
  --version 0.19.0 -n logging

# Filebeat DaemonSet (container log collection)
helm upgrade -i eck-beats elastic/eck-beats \
  --version 0.19.0 -f patch-values-filebeat.yaml -n logging

# Kibana (log visualization UI)
helm upgrade -i eck-kibana elastic/eck-kibana \
  --version 0.19.0 -f patch-values-kibana.yaml -n logging

Node Capacity Issue

After the ELK stack deployed, Filebeat ran on all 3 nodes (DaemonSet), Elasticsearch started (1 StatefulSet pod), but Kibana was stuck in Pending:

kubectl get pods -n logging
# eck-kibana-kb-65699f8b-sq8bq    0/1     Pending     0          3m18s

kubectl describe pod -n logging eck-kibana-kb-65699f8b-sq8bq
# Conditions:
#   PodScheduled   False
# Events:
#   FailedScheduling: 0/3 nodes are available: insufficient cpu/memory

The 3 self-managed nodes were at capacity. The boutique app (10 services), monitoring stack (Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics), and now Elasticsearch + Filebeat had consumed all available resources.

Bug: Kibana Pending Due to Node Capacity

3 nodes were not enough to run the full stack (10 app services + monitoring + logging). The Kibana pod requested 2Gi memory and could not be scheduled. I scaled the Auto Scaling Group from 3 to 4 nodes via the AWS Console. The 4th node joined the cluster within 2 minutes and Kibana was scheduled on it.

# After scaling ASG to 4 nodes
kubectl get nodes
# NAME                           STATUS   ROLES    AGE   VERSION
# ip-10-0-1-106.ec2.internal     Ready    <none>   81m   v1.36.1-eks-0de9cde
# ip-10-0-2-91.ec2.internal      Ready    <none>   81m   v1.36.1-eks-0de9cde
# ip-10-0-3-20.ec2.internal      Ready    <none>   81m   v1.36.1-eks-0de9cde
# ip-10-0-3-xxx.ec2.internal     Ready    <none>   2m    v1.36.1-eks-0de9cde

kubectl get pods -n logging
# eck-kibana-kb-65699f8b-sq8bq    1/1     Running   0          ...

Exposing Kibana

kubectl apply -f httproute-kibana.yaml
kubectl apply -f target-grp-kibana.yaml

Kibana was accessible at kibana.ibtisam.qzz.io. I navigated to Discover, created a data view for filebeat-*, and confirmed container logs were flowing from all nodes and pods across the cluster.

Kibana logs discover view


Final State

At the end of Phase 5, all observability components were operational:

monitoring namespace
  ├── Prometheus (scraping all targets)
  ├── Grafana (pod metrics dashboards)
  ├── AlertManager (Slack integration for critical alerts)
  ├── node-exporter, kube-state-metrics
  ├── HTTPRoute: grafana.ibtisam.qzz.io
  └── HTTPRoute: prometheus.ibtisam.qzz.io

logging namespace
  ├── ECK Operator
  ├── Elasticsearch (1 node, gp3 PVC)
  ├── Filebeat DaemonSet (4 pods, one per node)
  ├── Kibana (log search and visualization)
  ├── HTTPRoute: kibana.ibtisam.qzz.io
  └── All container logs flowing: /var/log/containers/* -> Filebeat -> Elasticsearch -> Kibana

Cluster nodes: 4 (scaled from 3 during this phase)

HTTPRoutes (all served by single shared ALB):
  - argocd.ibtisam.qzz.io      (Phase 4)
  - app.ibtisam.qzz.io          (Phase 4)
  - grafana.ibtisam.qzz.io      (this phase)
  - prometheus.ibtisam.qzz.io   (this phase)
  - kibana.ibtisam.qzz.io       (this phase)

All 5 subdomains served through a single ALB, DNS records auto-created by ExternalDNS, TLS terminated by the ACM wildcard certificate from Phase 2.

Decision: Monitoring and Logging Deployed via Helm, Not ArgoCD

The monitoring and logging stacks were installed directly via Helm from the bastion host, not managed by ArgoCD Applications. This was a deliberate choice for two reasons.

First, independence from the thing being monitored. If ArgoCD breaks or enters a crash loop, Prometheus, Grafana, and the ELK stack remain operational because they have no dependency on ArgoCD. Debugging a broken ArgoCD deployment requires observability tools that are not themselves managed by ArgoCD.

Second, platform vs. application separation. In production microservices architectures, companies typically split ownership: the platform/SRE team manages observability infrastructure (monitoring, logging, service mesh, ingress controllers) via Helm, Terraform, or a dedicated platform ArgoCD project. Application teams manage their workloads via ArgoCD. This project follows the same pattern: the boutique app is ArgoCD-managed (Phase 4), the platform stack is Helm-managed (this phase).

The alternative is the app-of-apps pattern where ArgoCD manages everything including itself. Both approaches are valid. For this project scope, the manual Helm approach was simpler and demonstrated the separation of concerns clearly.


Terminal Sessions and Evidence

# Session What It Covers Link
1 kube-prometheus-stack Monitoring Slack secret creation, Helm install, Grafana password retrieval, HTTPRoute/TargetGroup apply, verification 05_kube_prometheus_stack_monitoring.txt
2 Elastic Stack Logging ECK operator install, Elasticsearch/Filebeat/Kibana deploy, Kibana Pending debug, ASG scaling, HTTPRoute apply 06_elastic_stack_logging.txt
# Screenshot What It Shows Link
1 Prometheus Targets All kube-prometheus-stack targets healthy and scraping 05_prometheus_kube_prometheus_stack_target_health.png
2 Grafana Dashboard Boutique app pod metrics (CPU, memory, network) 06_grafana_boutique_app_pod_metrics_dashboard.png
3 Kibana Discover Container logs from all pods flowing through Filebeat 07_kibana_boutique_app_logs_discover_view.png

Next Phase

Phase 6: Autoscaling and Load Testing covers deploying Metrics Server, configuring HPA on the frontend deployment, generating load, and verifying horizontal scaling.