Runbook: High Latency

Alert: VectraHighLatency
Severity: Warning
Threshold: P95 latency >2s for 5 minutes

Symptoms

Slow vector operations
Request timeouts
User-facing latency issues
Queue backlog building up

Quick Diagnosis

# Check current latency by operation
histogram_quantile(0.95, 
  sum(rate(vectra_request_duration_seconds_bucket[5m])) by (le, operation)
)

# Test latency in console
require 'benchmark'

time = Benchmark.realtime do
  client.query(index: "test", vector: [0.1] * 384, top_k: 10)
end
puts "Query latency: #{(time * 1000).round}ms"
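
A single timed call is noisy, while the alert fires on P95. The sketch below collects several samples and computes a nearest-rank percentile so the console check is comparable to the alert threshold. The helper names `percentile` and `sample_latencies` are hypothetical and not part of Vectra; `client` is assumed to be configured as in the snippet above.

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `pct` (0.0..1.0) of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Runs the block `runs` times, timing each run with a monotonic clock,
# and returns the individual durations in seconds.
def sample_latencies(runs)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Hypothetical console usage (requires a configured `client`):
#   latencies = sample_latencies(50) do
#     client.query(index: "test", vector: [0.1] * 384, top_k: 10)
#   end
#   puts "P95: #{(percentile(latencies, 0.95) * 1000).round}ms"
```

Fifty samples is an arbitrary choice; more samples give a steadier P95 at the cost of extra load on the provider.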

Investigation Steps

1. Identify Slow Operations

# Which operations are slow?
topk(5, 
  histogram_quantile(0.95, 
    sum(rate(vectra_request_duration_seconds_bucket[5m])) by (le, operation)
  )
)

Operation	Expected P95	Alert Threshold
query	<500ms	>2s
upsert (single)	<200ms	>1s
upsert (batch 100)	<2s	>5s
fetch	<100ms	>500ms
delete	<200ms	>1s
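
The table can be turned into a small lookup so a measured P95 is classified the same way every time. This is a sketch only: the operation keys (in particular `upsert_batch`) and the `:degraded` band between the expected value and the alert threshold are assumptions made for illustration.

```ruby
# Thresholds in seconds, copied from the table above.
LATENCY_THRESHOLDS = {
  "query"        => { expected: 0.5, alert: 2.0 },
  "upsert"       => { expected: 0.2, alert: 1.0 },
  "upsert_batch" => { expected: 2.0, alert: 5.0 }, # batch of 100
  "fetch"        => { expected: 0.1, alert: 0.5 },
  "delete"       => { expected: 0.2, alert: 1.0 }
}.freeze

# Returns :ok, :degraded or :alert for a measured P95 in seconds.
# Raises KeyError for an operation that is not in the table.
def classify_latency(operation, p95_seconds)
  limits = LATENCY_THRESHOLDS.fetch(operation)
  return :alert if p95_seconds > limits[:alert]
  return :degraded if p95_seconds >= limits[:expected]

  :ok
end
```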

2. Check Provider Status

# Test provider connectivity
curl -w "@curl-format.txt" -o /dev/null -s https://api.pinecone.io/health

# curl-format.txt:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
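
curl reports each timer as seconds elapsed since the start of the request, so the time spent in a phase is the difference between neighbouring timers. The sketch below parses output in the `name: seconds` format shown above and names the slowest phase. The function names and phase labels are hypothetical; for HTTPS the gap between `time_connect` and `time_starttransfer` covers the TLS handshake as well as the server's processing time.

```ruby
# Parses "name: seconds" lines as produced by the curl-format.txt above.
# Lines that do not match that shape are ignored.
def parse_curl_timings(output)
  output.each_line.with_object({}) do |line, timings|
    name, value = line.strip.split(/:\s*/, 2)
    seconds = Float(value, exception: false) if value
    timings[name] = seconds if seconds
  end
end

# Converts curl's cumulative timers into per-phase durations (seconds).
def phase_breakdown(timings)
  {
    dns: timings.fetch("time_namelookup"),
    tcp_connect: timings.fetch("time_connect") - timings.fetch("time_namelookup"),
    tls_and_server: timings.fetch("time_starttransfer") - timings.fetch("time_connect"),
    transfer: timings.fetch("time_total") - timings.fetch("time_starttransfer")
  }
end

# Returns the label of the phase that took the longest.
def slowest_phase(timings)
  phase_breakdown(timings).max_by { |_, seconds| seconds }.first
end
```

A dominant `dns` or `tcp_connect` phase points at the network (step 3), while a dominant `tls_and_server` phase points at the provider itself.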

3. Check Network Latency

# Ping provider endpoint
ping -c 10 api.pinecone.io

# Check for packet loss (report mode prints a per-hop summary and exits)
mtr --report --report-cycles 10 api.pinecone.io

# DNS resolution time
time nslookup api.pinecone.io

4. Check Vector Dimensions

# Large vectors = slower operations
client.describe_index(index: "my-index")
# => { dimension: 1536, ... }

# Consider using smaller embeddings:
# - text-embedding-3-small: 512-1536 dims
# - text-embedding-ada-002: 1536 dims
# - all-MiniLM-L6-v2: 384 dims (faster!)
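
To put numbers on the dimension trade-off, the sketch below estimates the raw size of the vector values. It assumes each component is stored as a 32-bit float and ignores metadata, IDs and index overhead, so treat the result as a lower bound rather than what any provider actually allocates. The function names are hypothetical.

```ruby
BYTES_PER_FLOAT32 = 4 # assumption: components are stored as 32-bit floats

# Raw size of one vector's values in bytes.
def vector_bytes(dimension)
  dimension * BYTES_PER_FLOAT32
end

# Raw size of all vector values in an index, in MiB.
def index_megabytes(dimension, vector_count)
  vector_bytes(dimension) * vector_count / (1024.0 * 1024.0)
end

puts vector_bytes(1536)                    # 6144 bytes per vector
puts vector_bytes(384)                     # 1536 bytes per vector
puts index_megabytes(384, 1_000_000).round # about 1465 MiB for 1M vectors
```

Under these assumptions a 384-dimension embedding moves and scans a quarter of the data of a 1536-dimension one.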

5. Check Index Size

stats = client.stats(index: "my-index")
puts "Vector count: #{stats[:total_vector_count]}"
puts "Index fullness: #{stats[:index_fullness]}"

# Large indexes may need optimization
# - Pinecone: Check pod type
# - pgvector: Check IVFFlat parameters
# - Qdrant: Check HNSW parameters

Resolution Steps

Immediate: Increase Timeouts

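# Stopgap: this stops requests from failing while you investigate;
# it does not make the underlying operations any faster.
Vectra.configure do |config|
  config.timeout = 60       # Increase from 30
  config.open_timeout = 20  # Increase from 10
end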

Enable Caching

cache = Vectra::Cache.new(ttl: 300, max_size: 1000)
cached_client = Vectra::CachedClient.new(client, cache: cache)

# Repeated identical queries are served from the cache until the TTL (300s) expires
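
To illustrate what `ttl` and `max_size` control, here is a minimal TTL cache with least-recently-used eviction. It is a sketch of the general technique, not Vectra's implementation; the class name `TtlLruCache` and the injectable `clock` are hypothetical and exist only to keep the example self-contained.

```ruby
# Illustrative TTL + LRU cache; not Vectra's actual implementation.
class TtlLruCache
  def initialize(ttl:, max_size:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @max_size = max_size
    @clock = clock
    @entries = {} # Ruby hashes keep insertion order, so the first key is the oldest
  end

  # Returns the cached value for `key`, or computes it with the block on a
  # miss or after the entry has outlived its TTL.
  def fetch(key)
    now = @clock.call
    entry = @entries.delete(key)
    if entry && now - entry[:stored_at] < @ttl
      @entries[key] = entry # re-insert to mark as most recently used
      return entry[:value]
    end

    value = yield
    @entries[key] = { value: value, stored_at: now }
    @entries.shift while @entries.size > @max_size # evict least recently used
    value
  end

  def size
    @entries.size
  end
end
```

A longer TTL raises the hit rate but serves staler results after upserts; `max_size` bounds memory use.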

Optimize Batch Operations

# Use smaller batches for faster responses
batch = Vectra::Batch.new(client, concurrency: 2)

result = batch.upsert_async(
  index: "my-index",
  vectors: vectors,
  chunk_size: 50  # Smaller chunks = faster individual operations
)
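
The effect of `chunk_size` and `concurrency` on request count can be worked out with simple arithmetic. The sketch below assumes chunks are sent in rounds of `concurrency` requests, which is a simplification for illustration and not a statement about how Vectra schedules work; `batch_plan` is a hypothetical helper.

```ruby
# Returns how many requests a batch produces and how many rounds are
# needed when at most `concurrency` requests are in flight at once.
def batch_plan(vector_count, chunk_size:, concurrency:)
  requests = (vector_count + chunk_size - 1) / chunk_size # integer ceiling
  rounds = (requests + concurrency - 1) / concurrency
  { requests: requests, rounds: rounds }
end

p batch_plan(1000, chunk_size: 50, concurrency: 2)  # 20 requests in 10 rounds
p batch_plan(1000, chunk_size: 100, concurrency: 2) # 10 requests in 5 rounds

# The same split done by hand with Enumerable#each_slice:
chunks = (1..1000).each_slice(50).to_a
puts chunks.length # 20
```

Smaller chunks shorten each individual request but increase the number of round trips, so total time for the whole batch can go up even as per-request latency goes down.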

Reduce top_k

# Fewer results = faster query
results = client.query(
  index: "my-index",
  vector: query_vec,
  top_k: 5  # Instead of 100
)

Provider-Specific Optimizations

Pinecone

# Use serverless for auto-scaling
# Or upgrade pod type for more capacity

pgvector

-- Check if index exists
SELECT indexname FROM pg_indexes WHERE tablename = 'your_table';

-- Create IVFFlat index for faster queries
CREATE INDEX ON your_table 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Tune probes: more probes improve recall but slow queries down,
-- so raise this only as far as result quality requires
SET ivfflat.probes = 10;  -- Default: 1
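
The values `lists = 100` and `probes = 10` above match the starting points suggested in the pgvector README: `rows / 1000` lists for up to 1M rows, `sqrt(rows)` above that, and `sqrt(lists)` probes. The sketch below encodes those rules of thumb; the function names are hypothetical and the results are starting points to benchmark, not tuned values.

```ruby
# Starting point for IVFFlat `lists`, following the pgvector README heuristic.
def suggested_lists(row_count)
  if row_count <= 1_000_000
    [row_count / 1000, 1].max
  else
    Math.sqrt(row_count).round
  end
end

# Starting point for `ivfflat.probes`: the square root of `lists`.
def suggested_probes(lists)
  [Math.sqrt(lists).round, 1].max
end

rows = 100_000
lists = suggested_lists(rows)
puts "lists = #{lists}, probes = #{suggested_probes(lists)}"
```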

Qdrant

# Optimize HNSW parameters
client.provider.create_index(
  name: "optimized",
  dimension: 384,
  metric: "cosine",
  hnsw_config: {
    m: 16,           # Connections per node
    ef_construct: 100 # Build-time accuracy
  }
)

Connection Pooling (pgvector)

# Warm up connections to avoid cold-start latency
client.provider.warmup_pool(5)

# Increase pool size for parallel queries
Vectra.configure do |config|
  config.pool_size = 20
end

Prevention

1. Monitor Latency Trends

# Alert when mean request latency over the last hour exceeds 1s
rate(vectra_request_duration_seconds_sum[1h]) /
rate(vectra_request_duration_seconds_count[1h]) > 1

2. Implement Request Timeouts

# Fail fast instead of hanging
Vectra.configure do |config|
  config.timeout = 10  # Strict timeout
end

3. Use Async Operations

# Don't block on upserts
Thread.new do
  batch.upsert_async(index: "bg-index", vectors: vectors)
end
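
A bare `Thread.new` per upsert has no backpressure, and an exception inside it never reaches the caller. The sketch below shows one way to bound the backlog and collect failures using only Ruby's built-in `SizedQueue` and `Queue`. The class name `BackgroundWorker` and its methods are hypothetical; in real use the handler block would wrap the upsert call shown above.

```ruby
# Illustrative single-threaded background worker with a bounded queue.
class BackgroundWorker
  STOP = Object.new.freeze # sentinel that tells the worker thread to exit

  def initialize(max_pending: 100, &handler)
    @queue = SizedQueue.new(max_pending) # push blocks when the backlog is full
    @errors = Queue.new
    @thread = Thread.new do
      until (job = @queue.pop).equal?(STOP)
        begin
          handler.call(job)
        rescue StandardError => e
          @errors << e # keep going; failures are reported at shutdown
        end
      end
    end
  end

  # Queues a job; blocks the caller if `max_pending` jobs are already waiting.
  def enqueue(job)
    @queue.push(job)
  end

  # Waits for queued jobs to finish and returns any errors raised by the handler.
  def shutdown
    @queue.push(STOP)
    @thread.join
    Array.new(@errors.size) { @errors.pop }
  end
end
```

Blocking in `enqueue` when the queue is full is a deliberate choice here: it slows producers down instead of letting the backlog grow without limit.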

4. Index Maintenance

-- pgvector: Reindex periodically
REINDEX INDEX your_ivfflat_index;

-- Analyze for query planner
ANALYZE your_table;

5. Geographic Optimization

# Use closest region to your servers
# Pinecone: us-east-1, us-west-2, eu-west-1
# Qdrant Cloud: Select nearest region

Benchmarking

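# Run benchmark to establish baseline
require 'benchmark'

# Sample data; assumes a 384-dimension "test" index and id/values vector hashes
vec = [0.1] * 384
vectors_100 = Array.new(100) { |i| { id: "bench-#{i}", values: vec } }

results = Benchmark.bm(7) do |x|
  # Total time for 100 queries; divide by 100 for the per-call average
  x.report("query") do
    100.times { client.query(index: "test", vector: vec, top_k: 10) }
  end

  # One batch upsert of 100 vectors
  x.report("upsert") do
    client.upsert(index: "test", vectors: vectors_100)
  end

  # Total time for 100 single-ID fetches
  x.report("fetch") do
    100.times { client.fetch(index: "test", ids: ["id1"]) }
  end
end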

Escalation

Time since alert	Action
5 min	Enable caching, increase timeouts
15 min	Check provider status, optimize queries
30 min	Scale up provider resources
1 hour	Engage provider support
