Runbook: High Error Rate

Alert: VectraHighErrorRate
Severity: Critical
Threshold: Error rate >5% for 5 minutes

Symptoms

Alert firing for high error rate
Users reporting failed operations
Increased latency alongside errors

Quick Diagnosis

# Check recent errors in logs
grep -i "vectra.*error" /var/log/app.log | tail -50

# Check error breakdown by type
curl -s localhost:9090/api/v1/query?query=sum(vectra_errors_total)by(error_type) | jq

Investigation Steps

1. Identify Error Type

# In Rails console
Vectra::Client.new.stats(index: "your-index")

Error Type	Likely Cause	Action
`AuthenticationError`	Invalid/expired API key	Check credentials
`RateLimitError`	Too many requests	Implement backoff
`ServerError`	Provider outage	Check provider status
`ConnectionError`	Network issues	Check connectivity
`ValidationError`	Bad request data	Check input validation

2. Check Provider Status

Pinecone: status.pinecone.io
Qdrant: Check self-hosted logs or cloud dashboard
pgvector: SELECT * FROM pg_stat_activity WHERE state = 'active';

3. Check Application Logs

# Filter by error class
grep "Vectra::RateLimitError" /var/log/app.log | wc -l
grep "Vectra::ServerError" /var/log/app.log | wc -l
grep "Vectra::AuthenticationError" /var/log/app.log | wc -l

Resolution Steps

Authentication Errors

# Verify API key is set
puts ENV['PINECONE_API_KEY'].nil? ? "MISSING" : "SET"

# Test connection
client = Vectra::Client.new
client.list_indexes

Rate Limit Errors

# Implement exponential backoff
Vectra.configure do |config|
  config.max_retries = 5
  config.retry_delay = 2  # Start with 2s delay
end

# Or use batch operations with concurrency limit
batch = Vectra::Batch.new(client, concurrency: 2)  # Reduce from 4

Server Errors

Check provider status page
If provider is down, enable fallback or circuit breaker
Consider failover to backup provider

# Simple circuit breaker
class VectraCircuitBreaker
  def self.call
    return cached_response if circuit_open?
    
    yield
  rescue Vectra::ServerError
    open_circuit!
    cached_response
  end
end

Connection Errors

# Test network connectivity
curl -I https://api.pinecone.io/health

# Check DNS resolution
nslookup api.pinecone.io

# Check firewall rules
iptables -L -n | grep -i pinecone

Prevention

Set up retry logic:

config.max_retries = 3
config.retry_delay = 1

Monitor error rate trends:
```
increase(vectra_errors_total[1h])
```
Implement circuit breakers for provider outages

Cache frequently accessed data:

cached_client = Vectra::CachedClient.new(client)

Escalation

Time	Action
5 min	Page on-call engineer
15 min	Escalate to team lead
30 min	Consider provider failover
1 hour	Engage provider support

Runbook: High Error Rate

Symptoms

Quick Diagnosis

Investigation Steps

1. Identify Error Type

2. Check Provider Status

3. Check Application Logs

Resolution Steps

Authentication Errors

Rate Limit Errors

Server Errors

Connection Errors

Prevention

Escalation

Related