Ankra’s AI Assistant analyzes your cluster’s real-time state—pods, events, logs, and configurations—to diagnose issues and provide actionable solutions without leaving the platform.

What is AI Troubleshooting?

AI Troubleshooting is an integrated assistant that helps you debug Kubernetes issues by:
  • Analyzing live cluster data - Pods, events, logs, nodes, and resource configurations
  • Identifying root causes - Not just symptoms, but why problems occur
  • Providing actionable steps - Direct links to fix issues in the Ankra UI
  • Maintaining context - Follow-up questions understand your conversation history

Intelligent Analysis

The AI understands Kubernetes failure patterns such as CrashLoopBackOff, ImagePullBackOff, and OOMKilled, and provides targeted solutions.

Real-Time Data

Fetches current pod status, container logs, events, and node conditions for accurate diagnosis.

Context Aware

Knows what resource you’re viewing and adapts responses accordingly—no need to repeat context.

Platform Integrated

Provides clickable links to Ankra UI pages instead of kubectl commands.

How It Works

When you ask a question, the AI Assistant:

1. Intelligent Resource Planning

The AI first determines what information to gather based on your question:
Question Type                    Resources Fetched
“Why is my pod crashing?”        Pod status, container logs, events
“Are all deployments healthy?”   Deployments, pods, replica status
“What’s wrong with ingress?”     Ingress config, services, endpoints
“Node issues”                    Node conditions, capacity, pod distribution
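The planning step above can be sketched as a simple keyword-to-resources mapping. This is an illustrative sketch only, not Ankra’s actual implementation; all names are hypothetical:

```python
# Illustrative sketch of the planning step: map keywords in a question
# to the Kubernetes resources worth fetching (names are hypothetical).

PLAN_RULES = {
    "crash": ["pod_status", "container_logs", "events"],
    "deployment": ["deployments", "pods", "replica_status"],
    "ingress": ["ingress_config", "services", "endpoints"],
    "node": ["node_conditions", "capacity", "pod_distribution"],
}

def plan_resources(question: str) -> list:
    """Return the resources to fetch for a question, de-duplicated in order."""
    fetched = []
    for keyword, resources in PLAN_RULES.items():
        if keyword in question.lower():
            for resource in resources:
                if resource not in fetched:
                    fetched.append(resource)
    return fetched

print(plan_resources("Why is my pod crashing?"))
# → ['pod_status', 'container_logs', 'events']
```

A real planner would combine such signals with the page context, but the fetch plan it produces has the same shape.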

2. Data Collection

Based on the plan, Ankra fetches:
  • Pod information - Phase, restart count, container states, conditions
  • Events - Warning events, scheduling failures, image pull errors
  • Logs - Container stdout/stderr (last 50 lines by default)
  • Node status - Ready conditions, capacity, allocatable resources
  • Related resources - Deployments, services, configmaps as needed
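The “last 50 lines by default” truncation of logs can be sketched as follows (an illustrative helper, not Ankra’s code):

```python
# Sketch of the default log truncation: keep only the last 50 lines
# of a container's output before sending it for analysis.

DEFAULT_TAIL_LINES = 50

def tail_logs(raw_logs: str, tail: int = DEFAULT_TAIL_LINES) -> str:
    """Return at most the last `tail` lines of a log stream."""
    lines = raw_logs.splitlines()
    return "\n".join(lines[-tail:])

sample = "\n".join(f"line {i}" for i in range(200))
print(tail_logs(sample).splitlines()[0])  # → line 150
```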

3. AI Analysis

Claude AI analyzes the collected data to:
  • Identify the specific issue (e.g., exit code 137 = OOMKilled)
  • Explain the root cause (e.g., memory limit too low for workload)
  • Assess severity (critical, warning, info)
  • Suggest fixes with direct Ankra UI links
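The “exit code 137 = OOMKilled” example follows from a general Unix convention: exit codes above 128 mean the process was killed by signal (code − 128). A minimal decoder, offered as a sketch of that reasoning:

```python
# Decoding a container exit code: codes above 128 mean the process was
# killed by signal (code - 128); 137 -> SIGKILL, which the OOM killer uses.

import signal

def explain_exit_code(code: int) -> str:
    if code == 0:
        return "completed successfully"
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name}"
    return f"exited with application error code {code}"

print(explain_exit_code(137))  # → killed by SIGKILL
```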

Accessing AI Troubleshooting

AI Incidents (Alert-Triggered Analysis)

When alerts trigger, AI analysis results appear in the Alerts → AI Incidents tab. This shows all automatically generated analyses with root cause, affected resources, and recommended actions. Learn more about AI Incidents.

Global AI Assistant (On-Demand)

Press ⌘ + I (Mac) or Ctrl + I (Windows/Linux) to open the AI Assistant from anywhere in the platform for on-demand troubleshooting.

Resource Detail Pages

When viewing a specific resource (pod, deployment, etc.), the AI Assistant automatically knows what you’re looking at:
  • Pod Details → Ask “Why is this crashing?” without specifying the pod name
  • Logs Tab → Ask “What do these errors mean?”
  • Events Tab → Ask “What caused these warnings?”

Dedicated Troubleshooting Page

Navigate to Cluster → Troubleshooting for a full-screen AI chat experience with conversation history.

Example Questions

Pod Issues

Why is my nginx-pod-xyz crashing?
The AI will analyze:
  • Container exit codes and restart counts
  • Recent warning events
  • Container logs for error messages
  • Memory/CPU limits vs actual usage

Cluster Health

Are all pods healthy?
Response format:
Pod Health Summary: 43/46 pods healthy (3 with issues)

Pods with Issues:
- web-services/api-server-xyz: CrashLoopBackOff (11 restarts)
- monitoring/prometheus-abc: ImagePullBackOff
- database/postgres-def: Pending (insufficient memory)
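A summary line like the one above can be computed by partitioning pods into healthy and unhealthy states. The sketch below is illustrative; the field names are hypothetical:

```python
# Sketch of a pod-health summary: count pods whose state is not a
# healthy one (state names follow Kubernetes conventions).

HEALTHY_STATES = {"Running", "Succeeded"}

def summarize(pods: list) -> str:
    issues = [p for p in pods if p["state"] not in HEALTHY_STATES]
    healthy = len(pods) - len(issues)
    return f"{healthy}/{len(pods)} pods healthy ({len(issues)} with issues)"

pods = [
    {"name": "api-server-xyz", "state": "CrashLoopBackOff"},
    {"name": "web-abc", "state": "Running"},
    {"name": "postgres-def", "state": "Pending"},
]
print(summarize(pods))  # → 1/3 pods healthy (2 with issues)
```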

Resource Counting

How many namespaces are in my cluster?
The AI provides cluster-wide data, regardless of what page you’re on.

Debugging Specific Issues

Why is the customer-api deployment not starting?
The AI traces through:
  1. Deployment replica status
  2. ReplicaSet events
  3. Pod scheduling attempts
  4. Container startup errors
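This trace is possible because Kubernetes links Pods to ReplicaSets, and ReplicaSets to Deployments, through ownerReferences. A minimal sketch of walking that chain, with hypothetical object names and a simplified `owner` field standing in for the real ownerReferences structure:

```python
# Sketch: walk the ownership chain Pod -> ReplicaSet -> Deployment.
# The `owner` field is a simplification of Kubernetes ownerReferences.

def owner_chain(obj: dict, objects: dict) -> list:
    """Walk owner links from a pod back to its deployment."""
    chain = [obj["name"]]
    while obj.get("owner"):
        obj = objects[obj["owner"]]
        chain.append(obj["name"])
    return chain

objects = {
    "customer-api": {"name": "customer-api", "owner": None},                      # Deployment
    "customer-api-7d9f": {"name": "customer-api-7d9f", "owner": "customer-api"},  # ReplicaSet
    "customer-api-7d9f-x2k": {"name": "customer-api-7d9f-x2k",
                              "owner": "customer-api-7d9f"},                      # Pod
}
print(owner_chain(objects["customer-api-7d9f-x2k"], objects))
# → ['customer-api-7d9f-x2k', 'customer-api-7d9f', 'customer-api']
```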

Add-on Troubleshooting

For Helm add-ons, the AI provides specialized analysis. When troubleshooting add-ons, it also checks:
  • ArgoCD sync status - OutOfSync, Degraded, Healthy
  • Helm release state - Deployed, Failed, Pending
  • Configuration values - Misconfigurations in values.yaml
  • Latest job results - Installation/update failures
  • CRD dependencies - Missing Custom Resource Definitions
Example:
Why did cert-manager fail to install?
Response includes:
  • Helm chart version compatibility
  • Missing CRDs or prerequisites
  • RBAC permission issues
  • Specific error from the Helm job

Common Failure Patterns

The AI recognizes and explains these Kubernetes patterns:
Pattern                Cause                        AI Diagnosis
CrashLoopBackOff       App exits with error         Analyzes logs for exit code and error messages
ImagePullBackOff       Can’t pull container image   Checks image name, registry, and credentials
Pending                Can’t schedule pod           Reviews node resources, taints, tolerations
OOMKilled              Out of memory                Compares limits vs actual usage
Evicted                Node under pressure          Checks node conditions and pod priority
CreateContainerError   Container config issue       Examines volume mounts, secrets, configmaps
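These patterns surface in specific places in a Pod’s status: waiting reasons (CrashLoopBackOff, ImagePullBackOff) live under `state.waiting`, while termination reasons (OOMKilled) live under `lastState.terminated`. The status fields below are real Kubernetes fields; the helper itself is an illustrative sketch:

```python
# Sketch: extract the failure pattern from a container status dict.
# `state.waiting.reason` and `lastState.terminated.reason` are real
# Kubernetes Pod status fields; this helper is illustrative.

def failure_reason(container_status: dict):
    waiting = container_status.get("state", {}).get("waiting")
    if waiting:
        return waiting["reason"]        # CrashLoopBackOff, ImagePullBackOff, ...
    terminated = container_status.get("lastState", {}).get("terminated")
    if terminated:
        return terminated["reason"]     # OOMKilled, Error, ...
    return None

status = {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
print(failure_reason(status))  # → OOMKilled
```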

Response Format

AI responses follow a consistent structure:

Problem Summary

Brief overview of what’s happening.

Root Cause

Technical explanation of why it’s occurring, with specific details from logs/events.

Impact Assessment

Severity indicator:
  • 🔴 Critical - Service down, data loss risk
  • ⚠️ Warning - Degraded performance, needs attention
  • ℹ️ Info - Informational, no action needed

Suggested Actions

Numbered steps with direct links to Ankra UI:
  1. View pod logs at Pod Logs
  2. Check resource limits in pod configuration
  3. Update memory limit to 512Mi
  4. Restart the deployment
The AI prioritizes Ankra UI actions over kubectl commands. You can fix most issues directly in the platform.
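The “512Mi” in step 3 is a Kubernetes binary-suffix quantity. A minimal parser for such limits, covering only the common binary suffixes, as an illustration of what the value means in bytes:

```python
# Minimal sketch of parsing a Kubernetes binary-suffix quantity such as
# "512Mi" into bytes (only the common Ki/Mi/Gi suffixes are handled).

BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def quantity_to_bytes(q: str) -> int:
    for suffix, factor in BINARY_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    return int(q)  # plain bytes

print(quantity_to_bytes("512Mi"))  # → 536870912
```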

Conversation Context

The AI maintains context throughout your session:
You: Why is the api-server pod failing?

AI: The api-server-xyz pod is in CrashLoopBackOff due to missing 
    DATABASE_URL environment variable...

You: How do I add that variable?

AI: Since you're using the api-server deployment, you can add the 
    environment variable in the pod configuration:
    1. Navigate to the deployment at [link]
    2. Edit the container spec...
Follow-up questions reference the previous context automatically.

Stack-Based Fixes

When the solution requires creating Kubernetes resources, the AI guides you through Stack-based creation:
Benefits of Stack-based resource creation:
  • GitOps workflow - Version controlled, auditable changes
  • Declarative management - Resources defined as code
  • Rollback capability - Easy to revert if needed
  • Dependency tracking - Resources managed alongside related manifests
Instead of:
kubectl create secret generic db-creds --from-literal=password=xxx
The AI suggests:
  1. Navigate to Stacks page
  2. Create a new Stack or edit existing
  3. Add the Secret manifest
  4. Deploy the stack
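The Secret manifest added in step 3 would carry the same data as the kubectl command, with one wrinkle worth knowing: values under a Secret’s `data` field must be base64-encoded. A sketch of building that manifest (the helper is illustrative; the manifest fields are standard Kubernetes):

```python
# Sketch: build the Secret manifest for the Stack. Kubernetes requires
# `data` values to be base64-encoded; `stringData` would accept plain text.

import base64

def secret_manifest(name: str, literals: dict) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {k: base64.b64encode(v.encode()).decode()
                 for k, v in literals.items()},
    }

manifest = secret_manifest("db-creds", {"password": "xxx"})
print(manifest["data"])  # → {'password': 'eHh4'}
```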

Tips for Best Results

Be specific - “Why is nginx-pod-xyz crashing?” works better than “pods not working”
Use context - When on a resource detail page, ask “What’s wrong with this?” instead of repeating the name
Follow up - Ask clarifying questions like “How do I fix that?” or “Show me the logs”
Include logs - For crash issues, the AI automatically fetches logs, but you can specify “include logs” for other queries

Privacy & Data

  • AI analysis happens on Ankra’s secure infrastructure
  • Logs and configurations are processed in real-time, not stored for AI training
  • Conversation history is saved per-cluster for your convenience
  • You can start a new conversation at any time to clear context


Need help? Join our Slack community for support.