This guide walks you through building a complete observability stack with Prometheus for metrics, Loki for logs, Grafana for visualization, and Alertmanager for notifications.

AI Prompts

Skip the manual steps: use these prompts with the AI Assistant (⌘+J) to build your stack automatically.

What You’ll Build

A production-ready observability stack:
| Component | Purpose |
|---|---|
| kube-prometheus-stack | Prometheus, Grafana, Alertmanager, and exporters in one chart |
| Loki | Log aggregation: like Prometheus, but for logs |
| Promtail | Ships logs from pods to Loki |

Figure: Grafana with Prometheus and Loki. Metrics and logs unified in Grafana.


Prerequisites

  • A cluster imported into Ankra with the agent connected
  • Helm registries added for:
    • Prometheus Community (https://prometheus-community.github.io/helm-charts)
    • Grafana (https://grafana.github.io/helm-charts)

Step 1: Create the Stack

1. Open Stack Builder

Navigate to your cluster → Stacks → Create Stack.

2. Name Your Stack

Name it observability or monitoring-and-logging.

Step 2: Add kube-prometheus-stack

This chart bundles everything you need for metrics: Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics.
1. Add the Chart

Click + Add → search for kube-prometheus-stack from the Prometheus Community repository.

2. Configure Prometheus

Click the component and set these values:
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
3. Configure Grafana

grafana:
  adminPassword: "your-secure-password"  # Change this
  persistence:
    enabled: true
    size: 10Gi
  # Add Loki as a data source (we'll deploy it next)
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki-gateway.monitoring.svc.cluster.local
      access: proxy
      isDefault: false
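
Rather than keeping the admin password in plain text in the stack values, the Grafana subchart can read credentials from an existing Kubernetes Secret. The sketch below assumes a Secret named grafana-admin (a hypothetical name) already exists in the same namespace, with keys admin-user and admin-password; adjust all three to match your setup.

```yaml
grafana:
  admin:
    # Hypothetical Secret name; create the Secret before deploying
    existingSecret: grafana-admin
    # Keys inside the Secret that hold the username and password
    userKey: admin-user
    passwordKey: admin-password
```

If you use this approach, leave grafana.adminPassword out of the values.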
4. Configure Alertmanager for Slack

alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      # Keep a 'null' receiver: the chart's default Watchdog route may still refer to it
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
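
The alert rules later in this guide label alerts with severity: warning or severity: critical. One way to use those labels is to route critical alerts to a separate channel. This is a sketch rather than a required configuration: the slack-critical receiver and the #alerts-critical channel are placeholders, and the 'null' receiver is included on the assumption that you want the always-firing Watchdog alert kept out of Slack.

```yaml
alertmanager:
  config:
    route:
      receiver: 'slack'
      routes:
        # Keep the always-firing Watchdog alert out of Slack
        - receiver: 'null'
          matchers:
            - alertname = "Watchdog"
        # Send critical alerts to a dedicated channel
        - receiver: 'slack-critical'
          matchers:
            - severity = "critical"
    receivers:
      - name: 'null'
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'  # placeholder channel
            send_resolved: true
```

Keep the global.slack_api_url and grouping settings from the configuration above; alerts that match no child route fall through to the top-level slack receiver.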

Step 3: Add Loki for Logs

Loki is a log aggregation system designed to work seamlessly with Grafana. It’s lightweight because it only indexes metadata, not the full log content.
1. Add Loki

Click + Add → search for loki from the Grafana repository. Use the loki chart (not loki-distributed) for simpler setups.
2. Configure Loki

loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h

# For production, configure object storage:
# storage:
#   type: s3
#   bucketNames:
#     chunks: loki-chunks
#     ruler: loki-ruler
#   s3:
#     endpoint: s3.amazonaws.com
#     region: us-east-1

singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
  persistence:
    enabled: true
    size: 20Gi

gateway:
  enabled: true
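
Recent major versions of the loki chart select the topology through a top-level deploymentMode value and default to a scalable layout. If the chart version you install behaves this way, the single-binary setup above likely also needs the values below. Treat this as an assumption to check against the values reference for your chart version.

```yaml
# Assumes a loki chart version that supports deploymentMode
deploymentMode: SingleBinary

# Turn off the scalable-mode components so only the single binary runs
read:
  replicas: 0
write:
  replicas: 0
backend:
  replicas: 0
```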
3. Connect Dependency

In the Stack Builder, draw a connection from loki to kube-prometheus-stack to ensure Loki deploys first (so Grafana can connect to it).

Step 4: Add Promtail for Log Collection

Promtail runs as a DaemonSet on every node, collecting logs from all pods and shipping them to Loki.
1. Add Promtail

Click + Add → search for promtail from the Grafana repository.
2. Configure Promtail

config:
  clients:
    - url: http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push
  
  snippets:
    # Add useful labels from pod metadata
    pipelineStages:
      - cri: {}
      - labeldrop:
          - filename
      - match:
          selector: '{app=~".+"}'
          stages:
            - json:
                expressions:
                  level: level
            - labels:
                level:

resources:
  requests:
    memory: 64Mi
    cpu: 50m
  limits:
    memory: 128Mi
    cpu: 100m
3. Connect Dependency

Draw a connection from promtail to loki. Promtail needs Loki running to ship logs.

Step 5: Deploy

1. Review the Stack

Your Stack Builder should show:
promtail → loki → kube-prometheus-stack
This ensures correct deployment order.
2. Save and Deploy

Click Save, then Deploy. Watch progress in Operations.
3. Verify Deployment

After 3-5 minutes, all pods should be running:
  • prometheus-*
  • grafana-*
  • alertmanager-*
  • loki-*
  • promtail-* (one per node)

Step 6: Explore in Grafana

1. Access Grafana

Port-forward to access locally:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
Or configure an ingress in the values.
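
As a sketch of the ingress option, the Grafana subchart exposes an ingress block in its values. The hostname, ingress class, and TLS Secret name below are placeholders, and the example assumes an ingress controller is already installed in the cluster.

```yaml
grafana:
  ingress:
    enabled: true
    ingressClassName: nginx        # assumes an NGINX ingress controller
    hosts:
      - grafana.example.com        # placeholder hostname
    tls:
      - secretName: grafana-tls    # placeholder TLS Secret
        hosts:
          - grafana.example.com
```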
2. Log In

  • Username: admin
  • Password: The value you set in grafana.adminPassword
3. Query Metrics

Go to Explore → Select Prometheus → Try:
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
4. Query Logs

Go to Explore → Select Loki → Try:
{namespace="default"} |= "error"
5. Correlate Metrics and Logs

The power of this stack: when you see a spike in metrics, click through to see logs from that exact time range.

Production Considerations

For clusters generating >100GB/day of logs, use distributed mode:
# Use loki-distributed chart instead
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  storage:
    type: s3
    bucketNames:
      chunks: your-loki-chunks-bucket
      ruler: your-loki-ruler-bucket
    s3:
      endpoint: s3.amazonaws.com
      region: us-east-1
For longer metric retention, cap the on-disk size below the volume size and use fast storage:
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 80GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 100Gi
Pre-compute expensive queries:
additionalPrometheusRulesMap:
  recording-rules:
    groups:
      - name: resource-usage
        interval: 30s
        rules:
          - record: namespace:container_cpu_usage:sum_rate
            expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
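
Once a recording rule exists, alerts and dashboards can read the recorded series instead of re-evaluating the full expression. The sketch below builds an alert on the series recorded above; the alert name, the 4-core threshold, and the 15-minute window are illustrative assumptions, not recommendations.

```yaml
additionalPrometheusRulesMap:
  cpu-alerts:
    groups:
      - name: namespace-cpu
        rules:
          - alert: NamespaceHighCPU
            # Reads the series produced by the recording rule above
            expr: namespace:container_cpu_usage:sum_rate > 4
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Namespace {{ $labels.namespace }} is using more than 4 CPU cores"
```

If you use this together with the recording rule, place both entries under the same additionalPrometheusRulesMap key.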
Set how long Loki keeps logs:
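loki:
  limits_config:
    retention_period: 168h  # 7 days
  compactor:
    retention_enabled: true
    # Loki 3.x requires a delete request store when retention is enabled;
    # use the same store as your chunks (filesystem in the basic setup)
    delete_request_store: filesystem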
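
Retention can also vary per log stream. The sketch below keeps the 7-day default but shortens retention for one namespace; the dev namespace is a hypothetical example, and the selector assumes Promtail attaches a namespace label to each stream. Compactor retention must still be enabled for any of this to take effect.

```yaml
loki:
  limits_config:
    retention_period: 168h        # default: 7 days
    retention_stream:
      # Hypothetical namespace; keep its logs for 1 day only
      - selector: '{namespace="dev"}'
        priority: 1
        period: 24h
```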

Adding Custom Alerts

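Each example below defines one rule group. To use several together, list their entries (pod-alerts, app-alerts, node-alerts) under a single additionalPrometheusRulesMap key, because YAML does not allow the same key twice.

Alert on pods that restart repeatedly:
additionalPrometheusRulesMap: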
  pod-alerts:
    groups:
      - name: pod-health
        rules:
          - alert: PodRestartingTooMuch
            expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} restarting frequently"
              description: "Pod has restarted {{ $value }} times in the last hour"
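Alert on a high HTTP 5xx error rate (assumes your application exports http_requests_total with a status label):
additionalPrometheusRulesMap: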
  app-alerts:
    groups:
      - name: application
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m])) 
              / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High 5xx error rate"
              description: "Error rate is {{ $value | humanizePercentage }}"
Alert when a node's root filesystem is nearly full:
additionalPrometheusRulesMap:
  node-alerts:
    groups:
      - name: node-health
        rules:
          - alert: DiskSpaceLow
            expr: |
              (node_filesystem_avail_bytes{mountpoint="/"} 
              / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Less than 10% disk space remaining"

Troubleshooting

Logs are not appearing in Loki:
  1. Check Promtail pods are running on all nodes:
    kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
    
  2. Check Promtail logs for errors:
    kubectl logs -n monitoring -l app.kubernetes.io/name=promtail --tail=50
    
  3. Verify Loki is reachable from Promtail:
    kubectl exec -n monitoring -it $(kubectl get pod -n monitoring -l app.kubernetes.io/name=promtail -o name | head -1) -- wget -q -O- http://loki-gateway.monitoring.svc.cluster.local/ready
    
Grafana cannot query Loki:
  1. Verify the Loki data source URL matches your service name
  2. Check Loki gateway is running:
    kubectl get svc -n monitoring | grep loki

  3. Test from Grafana pod:
    kubectl exec -n monitoring -it $(kubectl get pod -n monitoring -l app.kubernetes.io/name=grafana -o name | head -1) -- curl http://loki-gateway.monitoring.svc.cluster.local/ready
    
Resource usage is too high:
  • Prometheus: Reduce scrape frequency, shorten retention, drop unused metrics
  • Loki: Reduce retention period, use object storage instead of filesystem
  • Promtail: Limit which logs are collected using pipelineStages to drop verbose logs
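
As a sketch of the Promtail suggestion above, pipeline stages can discard log lines before they are sent to Loki. The regular expression and the namespace below are placeholders chosen for illustration; pick patterns that match what is actually noisy in your cluster. Because pipelineStages is a list, this replaces the earlier list rather than merging with it, so include any stages you want to keep.

```yaml
config:
  snippets:
    pipelineStages:
      - cri: {}
      # Drop lines that match a noisy pattern (placeholder regex)
      - drop:
          expression: ".*(healthz|readyz).*"
          drop_counter_reason: noisy_probe
      # Drop everything from one namespace (placeholder name)
      - match:
          selector: '{namespace="kube-system"}'
          action: drop
          drop_counter_reason: system_namespace
```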
Log queries are slow:
  1. Add more labels in Promtail for better filtering
  2. Use time range filters in queries
  3. For production, use Loki distributed mode with more queriers

AI Prompts

Press ⌘+J to open the AI Assistant and use these prompts to build your stack:
Build an observability stack with:
- kube-prometheus-stack for metrics
- Loki for logs with 7 day retention
- Promtail to collect logs from all pods
- Configure Grafana with both data sources
- Send alerts to Slack

Create a production monitoring stack:
- Prometheus with 30 day retention on 100GB storage
- Loki configured for S3 storage in us-east-1
- Alertmanager with Slack notifications to #platform-alerts
- Include alerts for pod restarts, high CPU, and disk space

I need a lightweight observability stack for a dev cluster.
Keep total memory under 2GB. Include Prometheus, Loki, and
Grafana but with minimal retention (3 days for both).

I already have kube-prometheus-stack running. Add Loki and
Promtail to my stack and configure the Loki data source in
my existing Grafana.

My Promtail isn't sending logs to Loki. Help me troubleshoot
and fix the configuration.
The AI builds the entire Stack for you: components, dependencies, and values configured. Just describe what you need and deploy.

Next Steps