End-to-End DevOps: CI/CD, GitOps, and Observability on Amazon EKS
Overview
I deployed Google's Online Boutique (10-service polyglot monorepo) on Amazon EKS with a production-grade DevOps pipeline built from scratch: GitHub Actions CI with Trivy security scanning, Helm chart packaging to GHCR, ArgoCD GitOps with Image Updater for continuous delivery, full observability (Prometheus, Grafana, ELK, Slack alerting), Gateway API networking, and HPA autoscaling.
| Item | Value |
|---|---|
| Source repo (CI) | ibtisam-iq/microservices-demo |
| CD repo | ibtisam-iq/platform-engineering-systems |
| Live app | app.ibtisam.qzz.io |
| ArgoCD | argocd.ibtisam.qzz.io |
| Terminal sessions | terminal-session/ |
| Screenshots | assets/ |
Architecture at a Glance
Developer pushes code to src/
│
▼
GitHub Actions (CI)
├── Trivy filesystem + image scan
├── Docker build (BuildKit + GHA cache)
└── Push 3 tags to GHCR (:sha-<40>, :sha-<7>, :latest)
│
▼
ArgoCD Image Updater (on cluster)
├── Polls GHCR every 2 min
├── Detects new :latest digest
└── Patches ArgoCD Application
│
▼
ArgoCD syncs, pods roll
├── 10 microservices in boutique-app namespace
├── Gateway API HTTPRoute -> shared ALB -> app.ibtisam.qzz.io
└── ExternalDNS auto-creates Route 53 records
Observability
├── Prometheus + Grafana + AlertManager (Slack)
└── Elasticsearch + Filebeat + Kibana
5 subdomains, 1 ALB, 1 wildcard cert, 0 manual DNS records
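The three-tag scheme in the diagram (`:sha-<40>`, `:sha-<7>`, `:latest`) can be sketched in Python. This is only an illustration of how the tags relate to a commit SHA, not the actual workflow step. The registry path `ghcr.io/example-org` and the function names are assumptions made for the example.

```python
import re


def image_tags(commit_sha: str) -> list[str]:
    """Return the three tags described above for one commit: full SHA, short SHA, latest."""
    sha = commit_sha.strip().lower()
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError("expected a 40-character hexadecimal commit SHA")
    return [f"sha-{sha}", f"sha-{sha[:7]}", "latest"]


def image_refs(registry: str, service: str, commit_sha: str) -> list[str]:
    """Build full image references for one service (registry path is hypothetical)."""
    return [f"{registry}/{service}:{tag}" for tag in image_tags(commit_sha)]


if __name__ == "__main__":
    # Hypothetical commit SHA and registry, for illustration only.
    for ref in image_refs("ghcr.io/example-org", "frontend", "0123456789abcdef" * 2 + "01234567"):
        print(ref)
```

The two SHA tags are immutable and identify a build exactly, while `:latest` is the moving tag that Image Updater tracks by digest.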
Phases
The project is documented across 6 phases. Each phase has its own runbook with step-by-step commands, decisions, bugs encountered, and terminal session recordings as evidence.
- Phase 1: CI Pipeline and DevSecOps. GitHub Actions workflows, Trivy scanning, GHCR image and chart publishing.
- Phase 2: AWS Infrastructure. DNS, ACM certificate, VPC, EKS cluster, bastion host, self-managed nodes.
- Phase 3: Cluster Add-ons and Gateway API. ALB Controller, EBS CSI, Gateway API, ExternalDNS.
- Phase 4: GitOps with ArgoCD. ArgoCD, Application manifest, Image Updater, CI-CD integration.
- Phase 5: Observability Stack. kube-prometheus-stack, ELK stack, Slack alerting, HTTPRoutes.
- Phase 6: Autoscaling, Load Testing, and Final Verification. Metrics Server, HPA, scaling validation, full cluster audit.
Key Decisions
Architectural and engineering decisions made across all 6 phases. Each links to the runbook where it is documented in detail with full rationale.
Repository and Code Strategy
- Never modify upstream files. The fork stays pristine and syncable. All customization lives in files added alongside upstream. (Phase 1)
- Two-repo separation. CI repo owns code and pipelines. CD repo owns deployment intent. CI has zero knowledge of the cluster for code pushes. (Phase 1, Phase 4)
- All deployment manifests in CD repo, nothing in CI repo. The source repo contains no Kubernetes resources, no values files, no Helm overrides. (Phase 4)
- Deleted 2 upstream workflows, kept 3. `ci-main.yaml` and `ci-pr.yaml` were replaced with a monorepo-aware matrix pipeline. `helm-chart-ci.yaml` was kept for chart validation. (Phase 1)
Image Tagging and Continuous Delivery
- Image tagging evolved through 3 iterations. Chart version tags (broken for CD), immutable SHA tags (noisy), and finally ArgoCD Image Updater with digest strategy. (Phase 1)
- Chose Approach B (Image Updater) over SHA-based GitOps. CI pushes images and stops. Image Updater watches GHCR and handles deployments. Eliminated `reusable-gitops.yaml` entirely and removed the `GIT_TOKEN` dependency for code pushes. (Phase 1)
- Digest strategy, not newest-build. BuildKit sets the image config's `created` timestamp to epoch (1970-01-01) for reproducible builds, so `newest-build` cannot differentiate tags. `digest` tracks the `:latest` tag's digest directly, with no timestamps involved. (Phase 4)
- `tag: "latest"` in values, not empty string. `tag: ""` falls back to Chart.AppVersion, but CI never pushes images with the chart version tag. Image Updater needs `:latest` to track. (Phase 4)
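The reasoning behind choosing `digest` over `newest-build` can be illustrated with a small model. This is a simplified sketch, not Image Updater's actual implementation: the `TagInfo` type, the tie handling in `newest_build`, and the sample digests are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class TagInfo:
    name: str
    digest: str
    created: datetime  # the image config's "created" field


def newest_build(tags: list[TagInfo]) -> Optional[TagInfo]:
    """Pick the tag with the latest build time; return None when the choice is ambiguous.

    Simplified model: when every image reports the same timestamp,
    there is no meaningful "newest" tag to select.
    """
    newest = max(t.created for t in tags)
    winners = [t for t in tags if t.created == newest]
    return winners[0] if len(winners) == 1 else None


def digest_changed(running_digest: str, tracked: TagInfo) -> bool:
    """Digest strategy: compare the tracked tag's digest with what is running."""
    return tracked.digest != running_digest


if __name__ == "__main__":
    # Hypothetical registry state: two builds, both stamped with the epoch.
    tags = [
        TagInfo("sha-aaaaaaa", "sha256:old", EPOCH),
        TagInfo("sha-bbbbbbb", "sha256:new", EPOCH),
    ]
    latest = TagInfo("latest", "sha256:new", EPOCH)
    print("newest-build pick:", newest_build(tags))
    print("digest changed:", digest_changed("sha256:old", latest))
```

With identical `created` values the timestamp comparison has nothing to sort on, while the digest comparison still detects that `:latest` now points at a different image.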
Helm Chart and Values
- Chart packaged from upstream as-is. No upstream files modified. The chart on GHCR ships with Google's defaults. EKS overrides live in the CD repo's `values-eks.yaml`. (Phase 1)
- Patch-only values file. It sets only the 5 fields that differ from upstream: registry, tag, externalService, platform, loadGenerator. Everything else uses upstream defaults. (Phase 4)
- loadGenerator disabled. Excluded from CI matrix (no image in GHCR). Setting `create: false` in values prevents `ImagePullBackOff`. (Phase 4)
- chart-release.yaml absorbs CD repo update. `reusable-gitops.yaml` was deleted. Chart version writes moved into `chart-release.yaml`. Image tag writes eliminated entirely. (Phase 1)
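The patch-only approach relies on Helm merging an override file on top of the chart's defaults, with nested maps merged key by key. The sketch below imitates that merge in Python and also models the `tag: ""` fallback described earlier. The key layout, default values, and registry path are illustrative assumptions, not a copy of the real chart or of `values-eks.yaml`.

```python
from copy import deepcopy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides onto defaults; nested maps merge, other values replace."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def image_ref(repository: str, service: str, tag: str, app_version: str) -> str:
    """An empty tag falls back to the chart's appVersion, as described above."""
    return f"{repository}/{service}:{tag or app_version}"


# Hypothetical upstream defaults (structure assumed for illustration).
UPSTREAM_DEFAULTS = {
    "images": {"repository": "registry.example/upstream", "tag": ""},
    "frontend": {"externalService": True, "platform": "local"},
    "loadGenerator": {"create": True},
    "cartDatabase": {"type": "redis"},
}

# Hypothetical patch-only overrides: only the fields that differ.
EKS_OVERRIDES = {
    "images": {"repository": "ghcr.io/example-org", "tag": "latest"},
    "frontend": {"externalService": False, "platform": "aws"},
    "loadGenerator": {"create": False},
}

if __name__ == "__main__":
    values = merge_values(UPSTREAM_DEFAULTS, EKS_OVERRIDES)
    print(values)
    print(image_ref(values["images"]["repository"], "frontend", values["images"]["tag"], "v0.0.0"))
```

Keys absent from the override file, such as the hypothetical `cartDatabase` entry here, pass through unchanged, which keeps the override file small and easy to rebase when upstream changes its defaults.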
Networking and DNS
- Gateway API instead of Ingress. The Kubernetes project recommends Gateway and states the Ingress API has been frozen. (Phase 3)
- Single shared ALB via Gateway API. One Gateway, one ALB, 5 HTTPRoutes across namespaces. All subdomains served through one load balancer. (Phase 3)
- Wildcard ACM certificate. `*.ibtisam.qzz.io` covers all subdomains. No new certificate needed when a subdomain is added. (Phase 2)
- Free domain from digitalplat.org. No cost for the `qzz.io` domain used across the project. (Phase 2)
- ExternalDNS with Gateway API sources. Auto-creates Route 53 records from HTTPRoutes. Zero manual DNS records in the entire project. (Phase 3)
- HTTPRoute defaults included in Git. Gateway API controller injects `group`, `kind`, `weight`, and `matches` defaults post-creation. Including them in the manifest prevents ArgoCD OutOfSync drift. (Phase 4)
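The HTTPRoute drift issue can be modelled as a comparison between the manifest in Git and the live object after defaults are filled in. The sketch below is a deliberately simplified model: real ArgoCD diffing is more involved than a plain equality check, and the Gateway name `shared-gateway`, the backend `frontend`, and port 80 are assumptions for illustration. The default values themselves follow the Gateway API specification.

```python
from copy import deepcopy


def apply_httproute_defaults(spec: dict) -> dict:
    """Fill in the Gateway API defaults that appear on the live HTTPRoute object."""
    out = deepcopy(spec)
    for parent in out.get("parentRefs", []):
        parent.setdefault("group", "gateway.networking.k8s.io")
        parent.setdefault("kind", "Gateway")
    for rule in out.get("rules", []):
        rule.setdefault("matches", [{"path": {"type": "PathPrefix", "value": "/"}}])
        for ref in rule.get("backendRefs", []):
            ref.setdefault("group", "")  # empty string means the core API group
            ref.setdefault("kind", "Service")
            ref.setdefault("weight", 1)
    return out


def is_out_of_sync(git_spec: dict, live_spec: dict) -> bool:
    """Naive drift check: any difference between Git and live counts as drift."""
    return git_spec != live_spec


# Minimal manifest as one might first write it (names are hypothetical).
MINIMAL_SPEC = {
    "parentRefs": [{"name": "shared-gateway"}],
    "hostnames": ["app.ibtisam.qzz.io"],
    "rules": [{"backendRefs": [{"name": "frontend", "port": 80}]}],
}

if __name__ == "__main__":
    live = apply_httproute_defaults(MINIMAL_SPEC)
    print("minimal manifest drifts:", is_out_of_sync(MINIMAL_SPEC, live))
    explicit = apply_httproute_defaults(MINIMAL_SPEC)  # defaults written into Git
    print("explicit manifest drifts:", is_out_of_sync(explicit, live))
```

Writing the defaulted fields into the manifest makes the Git copy and the live copy identical under this model, which is the effect the decision above is after.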
Platform Architecture
- ArgoCD manages the app, not the platform. Monitoring and logging stacks deployed via Helm, not ArgoCD. If ArgoCD breaks, observability tools remain operational for debugging. (Phase 4, Phase 5)
- gp3 as default StorageClass. Cheaper and better baseline than gp2. Required for Elasticsearch PVCs. (Phase 3)
- Self-managed nodes (managed blocked by SCP). KodeKloud AWS Playground enforces SCPs that block managed node groups. Self-managed nodes via CloudFormation. (Phase 2)
- Scaled ASG from 3 to 4 nodes. 3 nodes could not fit the full stack (app + monitoring + logging). Kibana pod stuck Pending until the 4th node joined. (Phase 5)
Security and Scanning
- Trivy CRITICAL gate temporarily relaxed. Designed as exit-code 1 (hard gate). Currently exit-code 0 because upstream base images carry known CRITICAL CVEs. Restore once patched. (Phase 1)
- GIT_TOKEN scoped to CD repo only. Fine-grained PAT with Contents: Read+Write on `platform-engineering-systems` only. The token is never passed in process argv. (Phase 1)
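The difference between the designed hard gate and the currently relaxed gate comes down to the exit code the scan step returns when findings match the gated severity. The sketch below models that behaviour only. It does not invoke Trivy, and the function name and sample findings are assumptions for illustration.

```python
def scan_exit_code(finding_severities: list[str], gated: set[str], exit_code_on_findings: int) -> int:
    """Return the step's exit code: the configured value if any gated severity is found, else 0."""
    hit = any(sev in gated for sev in finding_severities)
    return exit_code_on_findings if hit else 0


if __name__ == "__main__":
    findings = ["HIGH", "CRITICAL", "MEDIUM"]  # hypothetical scan result
    # Hard gate as designed: a CRITICAL finding fails the job.
    print("hard gate:", scan_exit_code(findings, {"CRITICAL"}, 1))
    # Relaxed gate as currently configured: findings are reported, the job still passes.
    print("relaxed gate:", scan_exit_code(findings, {"CRITICAL"}, 0))
```

Restoring the hard gate is then a one-value change, from exit code 0 back to 1, once the upstream base images are patched.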
Terminal Sessions
Every phase was recorded. The terminal sessions capture the exact commands, outputs, and errors encountered.
| # | Session | Phase | Link |
|---|---|---|---|
| 01 | DNS and SSL Certificate Setup | 2 | 01_dns_and_ssl_certificate_setup.txt |
| 01a | Cluster Provisioning with Terraform | 2 | 01a_cluster_provisioning_with_terraform.txt |
| 02 | Bastion Access, Tools, Self-Managed Nodes | 2 | 02_bastion_access_tool_installation_and_self_managed_nodes.txt |
| 03 | Cluster Add-ons Installation | 3 | 03_cluster_addons_installation.txt |
| 04 | Application Deployment and CI Trigger | 4 | 04_application_deployment_and_ci_trigger.txt |
| 05 | kube-prometheus-stack Monitoring | 5 | 05_kube_prometheus_stack_monitoring.txt |
| 06 | Elastic Stack Logging | 5 | 06_elastic_stack_logging.txt |
| 07 | Scaling Behavior and Reliability | 6 | 07_observe_scaling_behavior_and_validate_reliability.txt |
| 08 | Full Cluster Verification | 6 | 08_verification_of_pods_services_and_resources.txt |
Screenshots
| # | Screenshot | Phase | Link |
|---|---|---|---|
| 01 | SilverStack Dev Machine | 2 | 01_silverstack_dev_machine.png |
| 02 | EKS Cluster Self-Managed Nodes | 2 | 02_aws_eks_cluster_compute_self_managed_nodes.png |
| 03 | Online Boutique Web View | 4 | 03_online_boutique_web_view.png |
| 04 | ArgoCD App Tree (Image Updater Revision) | 4 | 04_argo_app_tree_image_updater_frontend_revision.png |
| 05 | Prometheus Target Health | 5 | 05_prometheus_kube_prometheus_stack_target_health.png |
| 06 | Grafana Pod Metrics Dashboard | 5 | 06_grafana_boutique_app_pod_metrics_dashboard.png |
| 07 | Kibana Logs Discover View | 5 | 07_kibana_boutique_app_logs_discover_view.png |
| 08 | ArgoCD HPA Scale-Out | 6 | 08_argo_app_tree_hpa_frontend_scale_out.png |
| 09 | Route 53 ExternalDNS Records | 3 | 09_route53_records_externaldns_reconciliation.png |
| 10 | CloudFormation EKS Add-ons and Node Stacks | 2 | 10_cloudformation_eks_addons_and_nodes_stacks.png |
| 11 | ALB Listeners and Rules Overview | 3 | 11_aws_alb_listeners_and_rules_overview.png |
| 12 | ALB Resource Map Routing Targets | 3 | 12_aws_alb_resource_map_routing_targets.png |