
Observability

Valter ships with structured JSON logging, 30+ Prometheus metrics, and OpenTelemetry tracing. The instrumentation is implemented across three modules in src/valter/observability/: logging.py, metrics.py, and tracing.py.

Valter uses structlog for structured JSON logging. Every log entry is machine-parseable and includes contextual fields for correlation.

```python
# From observability/logging.py
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.dev.set_exc_info,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stderr),
)
```

Key design decisions:

  • JSON written to standard streams — Railway captures log output automatically
  • Logs go to stderr so they do not interfere with MCP stdio transport (which uses stdout for JSON-RPC)
  • Context variables via structlog.contextvars.merge_contextvars inject trace_id and other request-scoped data into every log entry
  • Log level is configurable via VALTER_LOG_LEVEL (default: INFO)

Every incoming request generates a trace_id that is injected into context variables and propagated through all log entries for that request. This allows correlating logs across the full request lifecycle — from API handler through store queries to response serialization.

Valter defines 30+ Prometheus metrics using the prometheus_client library. Metrics are exposed via GET /metrics, with access restricted by VALTER_METRICS_IP_ALLOWLIST.

**HTTP requests**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_request_duration_seconds | Histogram | endpoint, method, status | Request duration with buckets from 10ms to 10s |
| valter_requests_total | Counter | endpoint, method, status | Total request count |

**MCP**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_mcp_rpc_requests_total | Counter | rpc_method, tool, status_class | Total MCP JSON-RPC requests |
| valter_mcp_rpc_duration_seconds | Histogram | rpc_method, tool, status_class | MCP request duration (buckets to 60s) |
| valter_mcp_tool_calls_total | Counter | tool, outcome | Tool call count by outcome |
| valter_mcp_tool_call_duration_seconds | Histogram | tool, outcome | Tool call duration |
| valter_mcp_auth_failures_total | Counter | reason | Authentication failures |
| valter_mcp_rate_limit_blocks_total | Counter | — | Rate-limited MCP requests |

**Rate limiting**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_api_rate_limit_blocks_total | Counter | reason | API requests blocked by rate limit |
| valter_api_rate_limit_failsafe_blocks_total | Counter | — | API requests blocked when rate limit backend is unavailable |
| valter_mcp_rate_limit_failsafe_blocks_total | Counter | — | MCP requests blocked when rate limit backend is unavailable |
| valter_rate_limit_redis_errors_total | Counter | surface | Redis errors in rate limiting middleware |

**Cache**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_cache_hits_total | Counter | store | Cache hit count |
| valter_cache_misses_total | Counter | store | Cache miss count |

**Health and queue**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_store_health | Gauge | store | Per-store health (1=up, 0=down) |
| valter_queue_depth | Gauge | queue | Pending jobs in the ARQ queue |

**Ingestion and artifacts**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_ingest_stage_duration_seconds | Histogram | stage, status | Duration per ingestion stage |
| valter_artifact_put_latency_ms | Histogram | backend | Artifact write latency |
| valter_artifact_get_latency_ms | Histogram | backend | Artifact read latency |
| valter_artifact_put_fail_total | Counter | backend, content_type | Failed artifact writes |
| valter_presign_issued_total | Counter | backend | Signed URL generation count |

**Knowledge graph boost**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_kg_boost_errors_total | Counter | source | KG boost errors |
| valter_kg_enrichment_total | Counter | outcome | KG enrichment attempts |
| valter_kg_boost_candidates_total | Counter | strategy | Search results eligible for KG boost |
| valter_kg_boost_enriched_total | Counter | strategy | Search results actually enriched by KG boost |
| valter_kg_boost_score | Histogram | — | Distribution of raw KG boost scores |

**Failures**

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| valter_operation_failures_total | Counter | surface, stage, error_class | Operational failures with low-cardinality taxonomy |
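A minimal sketch of how metrics like these are defined with prometheus_client — the metric names and label sets match the tables above, but the bucket boundaries, sample values, and use of a private registry are illustrative:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()  # isolated registry for this example

requests_total = Counter(
    "valter_requests_total",
    "Total request count",
    ["endpoint", "method", "status"],
    registry=registry,
)
request_duration = Histogram(
    "valter_request_duration_seconds",
    "Request duration in seconds",
    ["endpoint", "method", "status"],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0),  # 10ms .. 10s, illustrative
    registry=registry,
)

# Record one request (values are made up).
labels = {"endpoint": "/v1/health", "method": "GET", "status": "200"}
requests_total.labels(**labels).inc()
request_duration.labels(**labels).observe(0.042)

# generate_latest() produces the text exposition format that GET /metrics serves.
exposition = generate_latest(registry).decode()
```

Keeping label sets small and low-cardinality (endpoint, method, status rather than raw URLs) is what keeps the exposition and Prometheus storage bounded.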

OpenTelemetry tracing is configured in observability/tracing.py with FastAPI auto-instrumentation.

```python
# From observability/tracing.py
def setup_tracing(service_name: str = "valter") -> None:
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
```

The current configuration exports spans to the console via ConsoleSpanExporter. This is useful for development but not for production monitoring.

```python
# From observability/tracing.py
def instrument_fastapi_app(app: FastAPI) -> None:
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

    FastAPIInstrumentor.instrument_app(app)
```

This auto-instruments all FastAPI endpoints with trace context propagation. The call is idempotent — invoking it again on an already-instrumented app is a no-op, so repeated calls cause no duplicate spans.

Endpoint: GET /v1/health

The health endpoint checks connectivity to all backend stores with a 5-second timeout per store:

| Store | What It Checks |
| --- | --- |
| qdrant | Vector store connectivity |
| neo4j | Graph database connectivity |
| postgres | Document store connectivity |
| redis | Cache store connectivity |
| artifact_storage | Artifact backend (local or R2) |

Each store returns up or down with measured latency in milliseconds. The overall status is healthy when all stores are up, or degraded when any store is down. The response also includes the Valter version number and uptime.
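The aggregation described above can be sketched as follows — the store-check functions and response shape here are illustrative, not Valter's actual handler:

```python
import asyncio
import time
from typing import Awaitable, Callable

CHECK_TIMEOUT_S = 5.0  # per-store timeout, as on the health endpoint


async def check_store(name: str, ping: Callable[[], Awaitable[None]]) -> dict:
    """Run one store's connectivity check with a timeout, measuring latency."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(ping(), timeout=CHECK_TIMEOUT_S)
        status = "up"
    except Exception:  # timeout or connection error
        status = "down"
    latency_ms = (time.monotonic() - start) * 1000
    return {"name": name, "status": status, "latency_ms": round(latency_ms, 1)}


async def health(checks: dict[str, Callable[[], Awaitable[None]]]) -> dict:
    results = await asyncio.gather(*(check_store(n, p) for n, p in checks.items()))
    stores = {r["name"]: r for r in results}
    # healthy only when every store is up; any failure degrades the whole report
    overall = "healthy" if all(r["status"] == "up" for r in results) else "degraded"
    return {"status": overall, "stores": stores}


# Example with stub checks: one store up, one failing.
async def ok() -> None: ...
async def broken() -> None:
    raise ConnectionError


report = asyncio.run(health({"redis": ok, "neo4j": broken}))
print(report["status"])  # → degraded
```

Running the checks concurrently with asyncio.gather keeps the worst-case response time near the slowest single check rather than the sum of all of them.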

Store health is tracked by the valter_store_health Prometheus gauge, enabling monitoring systems to detect degradation over time.

Six alert rules are defined in the project documentation, targeting:

  • Latency p95 exceeding threshold
  • Error rate exceeding threshold
  • Store health degradation
  • Cache hit rate drops
  • Queue depth increases
  • KG boost error spikes
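As an illustration, the first of these could take the following shape as a Prometheus alerting rule — the threshold, evaluation window, and label values are assumptions, not the project's actual rule:

```yaml
groups:
  - name: valter
    rules:
      - alert: ValterHighLatencyP95
        # p95 per endpoint over a 5-minute window, from the request histogram
        expr: >
          histogram_quantile(0.95,
            sum by (le, endpoint) (rate(valter_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 request latency above 2s on {{ $labels.endpoint }}"
```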