Build a Monitoring Stack

This guide walks you through building a complete observability stack with Prometheus for metrics, Loki for logs, Grafana for visualization, and Alertmanager for notifications.

AI Prompts

The AI Prompts section at the end of this guide lists prompts you can give the AI Assistant (⌘+J) to get recommendations for building your stack.

What You’ll Build

A production-ready observability stack:

Component	Purpose
kube-prometheus-stack	Prometheus, Grafana, Alertmanager, and exporters in one chart
Loki	Log aggregation: like Prometheus, but for logs
Promtail	Ships logs from pods to Loki

Prerequisites

A cluster imported into Ankra with the agent connected
Helm registries added for:
- Prometheus Community (https://prometheus-community.github.io/helm-charts)
- Grafana (https://grafana.github.io/helm-charts)

Step 1: Create the Stack

Open Stack Builder

Navigate to your cluster → Stacks → Create Stack.

Name Your Stack

Name it observability or monitoring-and-logging.

Step 2: Add kube-prometheus-stack

This chart bundles everything you need for metrics: Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics.

Add the Chart

Click + Add → search for kube-prometheus-stack from the Prometheus Community repository.

Configure Prometheus

Click the component and set these values:

prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

Configure Grafana

grafana:
  adminPassword: "your-secure-password"  # Change this
  persistence:
    enabled: true
    size: 10Gi
  # Add Loki as a data source (we'll deploy it next)
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.monitoring.svc.cluster.local
      access: proxy
      isDefault: false

Configure Alertmanager for Slack

alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
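
The single route above sends every alert to one channel. As a minimal sketch of how the route tree could be extended, the values below add a child route that sends critical alerts to a second receiver. The slack-critical receiver name and the #alerts-critical channel are hypothetical, and the sketch assumes your alert rules set a severity label. Keep the global section and the grouping settings from the block above when merging this in.

```yaml
alertmanager:
  config:
    route:
      receiver: 'slack'              # default receiver, as configured above
      routes:
        - matchers:
            - severity="critical"    # assumes rules attach a severity label
          receiver: 'slack-critical'
          repeat_interval: 1h        # re-notify sooner for critical alerts
    receivers:
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
      - name: 'slack-critical'       # hypothetical receiver name
        slack_configs:
          - channel: '#alerts-critical'   # hypothetical channel
            send_resolved: true
```

Alerts that match no child route fall through to the default receiver, so the existing #alerts behavior is unchanged for warnings.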

Encrypt sensitive values with SOPS: In the manifest edit view, click the SOPS button to encrypt secrets like grafana.adminPassword and slack_api_url. This ensures sensitive values are stored encrypted in your GitOps repository. See SOPS Encryption for setup instructions.
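
For the Grafana password specifically, one alternative is to keep it out of the values entirely: the Grafana subchart can read admin credentials from an existing Kubernetes Secret. This is a sketch only. The Secret name grafana-admin is hypothetical, and the Secret is assumed to already exist in the same namespace as Grafana with the two keys shown.

```yaml
grafana:
  admin:
    existingSecret: grafana-admin   # hypothetical Secret, created separately
    userKey: admin-user             # key in the Secret that holds the username
    passwordKey: admin-password     # key in the Secret that holds the password
```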

Step 3: Add Loki for Logs

Loki is a log aggregation system designed to work seamlessly with Grafana. It’s lightweight because it indexes only the labels (metadata) attached to each log stream, not the full log content.

Add Loki

Click + Add → search for loki from the Grafana repository. Use the loki chart (not loki-distributed) for simpler setups.

Configure Loki

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h

# For production, configure object storage:
# storage:
#   type: s3
#   bucketNames:
#     chunks: loki-chunks
#     ruler: loki-ruler
#   s3:
#     endpoint: s3.amazonaws.com
#     region: us-east-1

singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
  persistence:
    enabled: true
    size: 20Gi

gateway:
  enabled: true

Connect Dependency

In the Stack Builder, draw a connection from loki to kube-prometheus-stack to ensure Loki deploys first (so Grafana can connect to it).

Step 4: Add Promtail for Log Collection

Promtail runs as a DaemonSet on every node, collecting logs from all pods and shipping them to Loki.

Add Promtail

Click + Add → search for promtail from the Grafana repository.

Configure Promtail

config:
  clients:
    - url: http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push
  
  snippets:
    # Add useful labels from pod metadata
    pipelineStages:
      - cri: {}
      - labeldrop:
          - filename
      - match:
          selector: '{app=~".+"}'
          stages:
            - json:
                expressions:
                  level: level
            - labels:
                level:

resources:
  requests:
    memory: 64Mi
    cpu: 50m
  limits:
    memory: 128Mi
    cpu: 100m

Connect Dependency

Draw a connection from promtail to loki: Promtail needs Loki running to ship logs.

Step 5: Deploy

Review the Stack

Your Stack Builder should show:

promtail → loki → kube-prometheus-stack

This ensures correct deployment order.

Save and Deploy

Click Save, then Deploy. Watch progress in Operations.

Verify Deployment

After 3-5 minutes, all pods should be running:

prometheus-*
grafana-*
alertmanager-*
loki-*
promtail-* (one per node)

Step 6: Explore in Grafana

Access Grafana

Port-forward to access locally:

kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

Or configure an ingress in the values.

Username: admin
Password: The value you set in grafana.adminPassword
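
For the ingress option mentioned above, a minimal sketch of the Grafana values might look like the following. The hostname and TLS Secret name are hypothetical, and the ingressClassName assumes an NGINX ingress controller is already installed in the cluster.

```yaml
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx          # assumption: NGINX ingress controller is installed
    hosts:
      - grafana.example.com          # hypothetical hostname
    tls:
      - secretName: grafana-tls      # hypothetical TLS Secret
        hosts:
          - grafana.example.com
```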

Query Metrics

Go to Explore → Select Prometheus → Try:

sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

Query Logs

Go to Explore → Select Loki → Try:

{namespace="default"} |= "error"

Correlate Metrics and Logs

The power of this stack: when you see a spike in metrics, click through to see logs from that exact time range.

Production Considerations

Scale Loki for High Volume

For clusters generating more than 100 GB of logs per day, use distributed mode:

# Use loki-distributed chart instead
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  storage:
    type: s3
    # Bucket names sit directly under storage, matching the commented example in Step 3
    bucketNames:
      chunks: your-loki-chunks-bucket
      ruler: your-loki-ruler-bucket
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1

Increase Prometheus Retention

prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 80GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi

Add Recording Rules

Pre-compute expensive queries:

additionalPrometheusRulesMap:
  recording-rules:
    groups:
      - name: resource-usage
        interval: 30s
        rules:
          - record: namespace:container_cpu_usage:sum_rate
            expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
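
Once a recording rule exists, other rules and dashboards can query the recorded series by name instead of re-evaluating the full expression. As a sketch, the alert below reuses the series recorded above. The alert name and the threshold of 8 CPU cores are hypothetical.

```yaml
additionalPrometheusRulesMap:
  cpu-alerts:
    groups:
      - name: namespace-cpu
        rules:
          - alert: NamespaceHighCpu                  # hypothetical alert name
            # Reads the pre-computed series rather than the raw counter
            expr: namespace:container_cpu_usage:sum_rate > 8   # hypothetical threshold, in cores
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Namespace {{ $labels.namespace }} is using more than 8 CPU cores"
```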

Configure Log Retention

Set how long Loki keeps logs:

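loki:
  limits_config:
    retention_period: 168h  # 7 days
  compactor:
    retention_enabled: true
    # Loki 3.x requires a delete request store when retention is enabled.
    # Assumption: filesystem, to match the storage type used in Step 3.
    delete_request_store: filesystem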
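
Retention can also differ per log stream. As a sketch, the values below keep the 7-day default but expire logs from one namespace after a day. The dev namespace is a hypothetical example, and compactor retention is assumed to be enabled as shown above.

```yaml
loki:
  limits_config:
    retention_period: 168h                # default for all streams: 7 days
    retention_stream:
      - selector: '{namespace="dev"}'     # hypothetical namespace
        priority: 1
        period: 24h                       # keep these streams for 1 day only
```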

Adding Custom Alerts

Pod Restart Alert

additionalPrometheusRulesMap:
  pod-alerts:
    groups:
      - name: pod-health
        rules:
          - alert: PodRestartingTooMuch
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} restarting frequently"
              description: "Pod has restarted {{ $value }} times in the last hour"

High Error Rate Alert

additionalPrometheusRulesMap:
  app-alerts:
    groups:
      - name: application
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m])) 
              / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High 5xx error rate"
              description: "Error rate is {{ $value | humanizePercentage }}"
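
The rule above computes one cluster-wide ratio, so it fires without saying which workload is failing. A possible variant, sketched below, evaluates the same ratio per namespace. It assumes your http_requests_total series carry a namespace label, and the alert name is hypothetical.

```yaml
additionalPrometheusRulesMap:
  app-alerts-by-namespace:
    groups:
      - name: application-by-namespace
        rules:
          - alert: HighErrorRateByNamespace      # hypothetical alert name
            expr: |
              sum by (namespace) (rate(http_requests_total{status=~"5.."}[5m]))
              / sum by (namespace) (rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High 5xx error rate in {{ $labels.namespace }}"
              description: "Error rate is {{ $value | humanizePercentage }}"
```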

Disk Space Alert

additionalPrometheusRulesMap:
  node-alerts:
    groups:
      - name: node-health
        rules:
          - alert: DiskSpaceLow
            expr: |
              (node_filesystem_avail_bytes{mountpoint="/"} 
              / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Less than 10% disk space remaining"
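
A fixed threshold only fires once space is already low. As a complementary sketch, the rule below uses the PromQL predict_linear function to warn when the last six hours of usage suggest the root filesystem will fill within 24 hours. The alert name and both time windows are hypothetical starting points.

```yaml
additionalPrometheusRulesMap:
  node-forecast-alerts:
    groups:
      - name: node-disk-forecast
        rules:
          - alert: DiskWillFillIn24Hours         # hypothetical alert name
            # Extrapolate the 6h trend 24h ahead; below zero means the disk would be full
            expr: |
              predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Disk on {{ $labels.instance }} is predicted to fill within 24 hours"
```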

Troubleshooting

Logs Not Appearing in Loki

Check Promtail pods are running on all nodes:

kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail

Check Promtail logs for errors:

kubectl logs -n monitoring -l app.kubernetes.io/name=promtail --tail=50

Verify Loki is reachable from Promtail:

kubectl exec -n monitoring -it $(kubectl get pod -n monitoring -l app.kubernetes.io/name=promtail -o name | head -1) -- wget -q -O- http://loki-gateway.monitoring.svc.cluster.local/ready

Grafana Can't Connect to Loki

Verify the Loki data source URL matches your service name

Check Loki gateway is running:

kubectl get svc -n monitoring | grep loki

Test from Grafana pod:

kubectl exec -n monitoring -it $(kubectl get pod -n monitoring -l app.kubernetes.io/name=grafana -o name | head -1) -- curl http://loki-gateway.monitoring.svc.cluster.local/ready

High Memory Usage

Prometheus: Reduce scrape frequency, shorten retention, drop unused metrics
Loki: Reduce retention period, use object storage instead of filesystem
Promtail: Limit which logs are collected using pipelineStages to drop verbose logs
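
As a sketch of the Promtail suggestion, a drop stage can discard matching lines before they are shipped to Loki. The pattern below is a hypothetical example that drops any line containing the word debug; adjust it to your own log format before using it.

```yaml
config:
  snippets:
    pipelineStages:
      - cri: {}
      - drop:
          expression: '.*debug.*'   # hypothetical pattern; lines matching it are not shipped
```

Dropped lines never reach Loki, so test the pattern on a non-critical namespace first.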

Slow Log Queries

Add more labels in Promtail for better filtering
Use time range filters in queries
For production, use Loki distributed mode with more queriers

AI Prompts

Press ⌘+J to open the AI Assistant and use these prompts to get recommendations for your stack:

Complete Observability Stack

Build an observability stack with:
- kube-prometheus-stack for metrics
- Loki for logs with 7 day retention
- Promtail to collect logs from all pods
- Configure Grafana with both data sources
- Send alerts to Slack

Production Stack with Object Storage

Create a production monitoring stack:
- Prometheus with 30 day retention on 100GB storage
- Loki configured for S3 storage in us-east-1
- Alertmanager with Slack notifications to #platform-alerts
- Include alerts for pod restarts, high CPU, and disk space

Lightweight Stack for Dev Clusters

I need a lightweight observability stack for a dev cluster.
Keep total memory under 2GB. Include Prometheus, Loki, and 
Grafana but with minimal retention (3 days for both).

Add Logging to Existing Prometheus

I already have kube-prometheus-stack running. Add Loki and 
Promtail to my stack and configure the Loki data source in 
my existing Grafana.

Debug Log Collection Issues

My Promtail isn't sending logs to Loki. Help me troubleshoot
and fix the configuration.

The AI provides recommendations for components, dependencies, and values. Just describe what you need, build based on the guidance, and deploy.

Next Steps

Configure Ankra Alerts

Set up Ankra alerts alongside Prometheus Alertmanager.

GitOps Sync

Store your observability stack configuration in Git.

Add Tracing

Complete the observability trifecta with Tempo for distributed tracing.

Explore with AI

Use the AI to query your logs and metrics in natural language.
