BizFirst Observe Documentation

Full observability in minutes.

Getting Started with BizFirst Observe

BizFirst Observe is the unified observability stack for the BizFirstAi platform. It provides logs via Grafana Loki, metrics via Prometheus, distributed traces via Grafana Tempo, and health checks, all wired up with three method calls in your Program.cs.

Production-verified: BizFirst Observe entered production on March 24, 2026 running .NET 9.0 with OpenTelemetry SDK 1.11, Grafana v12.4.1, and Grafana Loki on port 3100. The configuration below reflects that verified stack.

3-Line Setup

Add the following three calls to your Program.cs to register the full observability stack:

C# — Program.cs
var builder = WebApplication.CreateBuilder(args);

// 1. Register OTEL SDK, Prometheus exporter, Loki exporter, Tempo exporter
builder.Services.RegisterService_Observability(builder.Configuration);

// 2. Register Serilog with Loki sink and enrichment
builder.Host.RegisterSerilog_Observability(builder.Configuration);

var app = builder.Build();

// 3. Register middleware: TelemetryEnrichmentMiddleware + /metrics + /health endpoints
app.RegisterApp_Observability();

app.Run();

appsettings.json

Add the Observability section to your appsettings.json. Adjust endpoint URLs for your environment:

JSON — appsettings.json
{
  "Observability": {
    "Tracing": {
      "OtlpEndpoint": "http://localhost:4317",
      "SamplingRate": 1.0,
      "AlwaysSampleErrors": true
    },
    "Metrics": {
      "PrometheusEndpoint": "http://localhost:9090",
      "ScrapeIntervalSeconds": 15
    },
    "Logging": {
      "LokiEndpoint": "http://localhost:3100",
      "MinimumLevel": "Information",
      "RetentionDays": 3
    },
    "Grafana": {
      "Endpoint": "http://localhost:3000"
    }
  }
}

Production values: Set SamplingRate to 0.01 (1%) in production and 0.10 (10%) in staging. Set RetentionDays to 30 for production and 7 for staging. AlwaysSampleErrors must always be true.
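
The recommended values above can be checked at startup or in a deployment test. The following is a minimal sketch using only System.Text.Json. ValidateObservability is a hypothetical helper, not a BizFirst Observe API, and the environment names "Production" and "Staging" are assumed to match your ASPNETCORE_ENVIRONMENT values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Development-style values, matching the appsettings.json sample above.
const string sample = """
{
  "Observability": {
    "Tracing": { "SamplingRate": 1.0, "AlwaysSampleErrors": true },
    "Logging": { "RetentionDays": 3 }
  }
}
""";

// Print every setting that deviates from the production recommendations.
foreach (var problem in ValidateObservability(sample, "Production"))
    Console.WriteLine(problem);

// Hypothetical helper: lists settings that differ from the recommended
// values for the given environment name.
static List<string> ValidateObservability(string json, string environment)
{
    var problems = new List<string>();
    using var doc = JsonDocument.Parse(json);
    var section = doc.RootElement.GetProperty("Observability");
    var tracing = section.GetProperty("Tracing");

    double samplingRate = tracing.GetProperty("SamplingRate").GetDouble();
    int retentionDays = section.GetProperty("Logging").GetProperty("RetentionDays").GetInt32();

    if (!tracing.GetProperty("AlwaysSampleErrors").GetBoolean())
        problems.Add("Tracing:AlwaysSampleErrors must be true in every environment");

    // Recommended values from the note above; other environments are not checked.
    double? expectedRate = null;
    int? expectedDays = null;
    if (environment == "Production") { expectedRate = 0.01; expectedDays = 30; }
    else if (environment == "Staging") { expectedRate = 0.10; expectedDays = 7; }

    if (expectedRate is double rate && Math.Abs(samplingRate - rate) > 1e-9)
        problems.Add($"Tracing:SamplingRate is {samplingRate}, expected {rate} for {environment}");
    if (expectedDays is int days && retentionDays != days)
        problems.Add($"Logging:RetentionDays is {retentionDays}, expected {days} for {environment}");

    return problems;
}
```

With the sample values, the production check reports the sampling rate and the retention period; the same input passes for any environment other than Production or Staging because only those two are checked.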

Verify Your Installation

After starting your application, verify each component is running:

Shell
# Check Prometheus /metrics endpoint on your service
curl http://localhost:5000/metrics

# Check Prometheus is scraping (replace with your Prometheus host)
curl http://localhost:9090/api/v1/targets

# Check Grafana is running
curl http://localhost:3000/api/health

# Check Loki is running
curl http://localhost:3100/ready

# Check your service health endpoints
curl http://localhost:5000/health
curl http://localhost:5000/health/live
curl http://localhost:5000/health/ready

Logging

Serilog Registration

BizFirst Observe registers Serilog with three sinks automatically: Console/File (L0, development), Loki HTTP push (L1, all environments), and the SQL Server SecurityAuditLog table (L2, compliance events). You do not need to configure Serilog manually.

C# — Manual Registration (if needed)
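// Normally handled by RegisterSerilog_Observability; shown for reference.
// Requires the Serilog.Sinks.Console and Serilog.Sinks.Grafana.Loki packages.
using Serilog;
using Serilog.Sinks.Grafana.Loki;
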
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console()
    .WriteTo.GrafanaLoki(
        "http://localhost:3100",
        labels: new[] { new LokiLabel { Key = "app", Value = "bizfirst" } })
    .CreateLogger();

Automatic Enrichment Fields

TelemetryEnrichmentMiddleware adds the following fields to every log entry automatically. These fields are also used as Loki labels:

Field | Source | Purpose
TenantId | JWT claim / request header | Tenant isolation: all queries filter by this field
ServerId | Machine name / pod name | Identify which instance produced the log
RequestId | ASP.NET Core TraceIdentifier | Correlate all logs within a single HTTP request
TraceId | OpenTelemetry Activity TraceId | Link log entries to the corresponding distributed trace in Tempo
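
To make the sources in the table above concrete, here is a rough sketch of how a middleware could attach these four fields through Serilog's LogContext. It is not the actual TelemetryEnrichmentMiddleware: the class name EnrichmentSketchMiddleware, the claim name "tenant_id", and the header name "X-Tenant-Id" are assumptions for illustration. It relies on the logger being configured with Enrich.FromLogContext().

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Serilog.Context;

// Illustrative only; the real middleware ships inside BizFirst Observe.
public sealed class EnrichmentSketchMiddleware
{
    private readonly RequestDelegate _next;

    public EnrichmentSketchMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // Assumed claim and header names; substitute whatever your tokens carry.
        string? tenantId = context.User.FindFirst("tenant_id")?.Value;
        if (string.IsNullOrEmpty(tenantId))
            tenantId = context.Request.Headers["X-Tenant-Id"].ToString();

        // Each PushProperty call adds the field to every log event written
        // until the returned handle is disposed at the end of the request.
        using (LogContext.PushProperty("TenantId", tenantId))
        using (LogContext.PushProperty("ServerId", Environment.MachineName))
        using (LogContext.PushProperty("RequestId", context.TraceIdentifier))
        using (LogContext.PushProperty("TraceId", Activity.Current?.TraceId.ToString()))
        {
            // Mirror the tenant onto the current span, if one is active.
            Activity.Current?.SetTag("tenant_id", tenantId);
            await _next(context);
        }
    }
}
```

A middleware like this would be added with app.UseMiddleware<EnrichmentSketchMiddleware>() after authentication, so that the JWT claims are already populated when it runs.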

LogQL Example Queries

Use these queries in Grafana Explore (Loki data source) to investigate logs:

LogQL
# All errors for a specific tenant
{tenant_id="acme"} |= "ERROR"

# Fatal logs from the payroll service
{service="payroll", tenant_id="acme"} | json | level="Fatal"

# Logs for a specific trace (OpenTelemetry trace IDs are 32 hex characters)
{app="bizfirst"} | json | TraceId="4bf92f3577b34da6a3ce929d0e0e4736"

# Error and Fatal logs from a specific server (set the time range to "Last 1 hour" in Grafana)
{server_id="prod-node-01"} | json | level=~"Error|Fatal"

# Security audit events for a tenant
{tenant_id="acme", service="auth"} | json | EventType="DENY"

Metrics

Custom Counter & Histogram in C#

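Create custom instruments with the System.Diagnostics.Metrics API (Meter, Counter, Histogram), which the OTEL SDK collects for registered meters. The example below uses a static Meter; services built with dependency injection can obtain one from IMeterFactory (.NET 8 and later) instead. All custom metrics must include a tenant_id tag: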

C#
using System.Diagnostics;          // Stopwatch
using System.Diagnostics.Metrics;  // Meter, Counter, Histogram

public class PayrollService
{
    private static readonly Meter _meter = new("BizFirst.Payroll", "1.0");

    // Counter — increment each time a payroll run completes
    private static readonly Counter<long> _payrollCounter =
        _meter.CreateCounter<long>(
            "bizfirst_payroll_processed_total",
            description: "Total number of payroll runs processed");

    // Histogram — record duration of each payroll run
    private static readonly Histogram<double> _payrollDuration =
        _meter.CreateHistogram<double>(
            "bizfirst_payroll_duration_seconds",
            unit: "s",
            description: "Duration of payroll run processing");

    public async Task ProcessPayroll(string tenantId)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            // ... payroll processing logic ...

            _payrollCounter.Add(1,
                new KeyValuePair<string, object?>("tenant_id", tenantId),
                new KeyValuePair<string, object?>("status", "success"));
        }
        finally
        {
            stopwatch.Stop();
            _payrollDuration.Record(stopwatch.Elapsed.TotalSeconds,
                new KeyValuePair<string, object?>("tenant_id", tenantId));
        }
    }
}
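
As an alternative to the static Meter in the class above, the same counter can be created from an injected IMeterFactory, which is available in .NET 8 and later. This is a sketch under assumptions: the class name PayrollMetrics, the RecordRun method, and the "failure" status value are illustrative and not part of BizFirst Observe.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch: the Meter is created through IMeterFactory so its lifetime is
// managed by the dependency injection container.
public sealed class PayrollMetrics
{
    private readonly Counter<long> _processed;

    public PayrollMetrics(IMeterFactory meterFactory)
    {
        Meter meter = meterFactory.Create("BizFirst.Payroll", "1.0");
        _processed = meter.CreateCounter<long>(
            "bizfirst_payroll_processed_total",
            description: "Total number of payroll runs processed");
    }

    // tenant_id is always attached, in line with the tagging rule for custom metrics.
    public void RecordRun(string tenantId, bool succeeded) =>
        _processed.Add(1,
            new KeyValuePair<string, object?>("tenant_id", tenantId),
            new KeyValuePair<string, object?>("status", succeeded ? "success" : "failure"));
}
```

The class could then be registered with builder.Services.AddSingleton<PayrollMetrics>() and injected wherever payroll runs are recorded. Whether the "BizFirst.Payroll" meter is already subscribed by RegisterService_Observability is not stated here, so confirm that the series appears on /metrics.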

PromQL Example Queries

Use these queries in Grafana Explore (Prometheus data source) to build dashboards and alerts:

PromQL
# Request rate (requests per second over last 5 minutes)
rate(http_server_request_duration_seconds_count[5m])

# P99 latency for a single tenant (here: acme)
histogram_quantile(0.99,
  sum(rate(http_server_request_duration_seconds_bucket{tenant_id="acme"}[5m]))
  by (le))

# 5xx error rate (fraction of all requests; sum() on both sides so the label sets match)
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))

# Health status for Kafka (2=Healthy, 1=Degraded, 0=Unhealthy)
bizfirst_health_check_status{component="kafka"}

# Kafka consumer lag
edgestream_kafka_consumer_lag_messages{consumer_group="payroll-processor"}

# Active HTTP requests right now
http_server_active_requests

Tracing

Custom Span in C#

Use ActivitySource to create custom spans. BizFirst Observe pre-registers ActivitySources for all BizFirstAi products:

C#
using System.Diagnostics;
using OpenTelemetry.Trace;  // RecordException extension method

public class WorkflowExecutor
{
    private static readonly ActivitySource _activitySource =
        new("BizFirst.ProcessEngine");

    public async Task ExecuteWorkflow(string workflowId, string tenantId)
    {
        using var activity = _activitySource.StartActivity("workflow.execute");

        // Add span attributes — tenant_id is required
        activity?.SetTag("tenant_id", tenantId);
        activity?.SetTag("workflow.id", workflowId);
        activity?.SetTag("workflow.version", "1.0");

        try
        {
            // Execute each node as a child span.
            // 'workflow' is assumed to have been loaded for workflowId elsewhere (loading code omitted).
            foreach (var node in workflow.Nodes)
            {
                using var nodeActivity = _activitySource.StartActivity(
                    "node.execute",
                    ActivityKind.Internal,
                    activity?.Context ?? default);

                nodeActivity?.SetTag("node.id", node.Id);
                nodeActivity?.SetTag("node.type", node.Type);
                nodeActivity?.SetTag("tenant_id", tenantId);

                await node.ExecuteAsync();
            }
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}

TraceQL Example Queries

Use these queries in Grafana Explore (Tempo data source) to find traces:

TraceQL
# Slow traces (duration over 2 seconds)
{duration > 2s}

# Error traces
{status=error}

# Tenant-specific traces
{.tenant_id="acme"}

# Database query spans over 500ms
{span.db.system="mssql" && duration > 500ms}

# Workflow execution traces for a specific tenant
{name="workflow.execute" && .tenant_id="acme"}

# All traces containing an error span from the payroll service
{resource.service.name="bizfirst-payroll" && status=error}

Health Checks

Registration

BizFirst Observe registers health checks for all six core dependencies automatically via RegisterService_Observability. The equivalent manual registration looks like this:

C# — Manual Registration (reference)
// 'config' is the Kafka ProducerConfig; the connection strings come from your configuration.
builder.Services.AddHealthChecks()
    .AddKafka(config, name: "kafka", tags: new[] { "ready" })
    .AddRedis(redisConnectionString, name: "redis", tags: new[] { "ready" })
    .AddSqlServer(sqlConnectionString, name: "sqlserver", tags: new[] { "ready" })
    .AddUrlGroup(new Uri("http://localhost:3100/ready"), name: "loki", tags: new[] { "ready" })
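    // Port 4317 is Tempo's OTLP gRPC ingestion port. If this HTTP probe fails against it,
    // point the check at Tempo's HTTP /ready endpoint instead (port depends on your Tempo config).
    .AddUrlGroup(new Uri("http://localhost:4317"), name: "tempo", tags: new[] { "ready" })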
    .AddUrlGroup(new Uri("http://localhost:3000/api/health"), name: "grafana", tags: new[] { "ready" });

Health Response Format

GET /health returns a JSON body with the overall status and a breakdown per component:

JSON — /health response
{
  "status": "Degraded",
  "totalDuration": "00:00:00.1234567",
  "entries": {
    "kafka":     { "status": "Healthy",   "duration": "00:00:00.0120000" },
    "redis":     { "status": "Healthy",   "duration": "00:00:00.0030000" },
    "sqlserver": { "status": "Healthy",   "duration": "00:00:00.0450000" },
    "loki":      { "status": "Healthy",   "duration": "00:00:00.0080000" },
    "tempo":     { "status": "Degraded",  "duration": "00:00:00.3210000",
                   "description": "Connection timeout" },
    "grafana":   { "status": "Healthy",   "duration": "00:00:00.0110000" }
  }
}

Kubernetes Probe Configuration

Use the dedicated liveness and readiness endpoints in your Kubernetes pod spec:

YAML — Kubernetes pod spec
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 2

Multi-Tenancy

TelemetryEnrichmentMiddleware is registered automatically by app.RegisterApp_Observability(). It reads the TenantId from the authenticated user's JWT claims (or from the request header, as listed under Automatic Enrichment Fields) and attaches it to all telemetry for the request: as a Serilog log-context property on every log entry, as a label on every Loki log, as a tag on every OpenTelemetry span, and as a label on every Prometheus metric.

Enforcement rule: Every custom metric you create must include a tenant_id label. The platform enforces this at code review. Metrics without tenant_id cannot be used in per-tenant dashboards and will fail tenant isolation audits.
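
Code review can be complemented by an automated check. The sketch below uses the standard MeterListener API to flag any instrument that records a measurement without a tenant_id tag. It assumes that custom meters share the "BizFirst." name prefix used in this guide; the demo meter and instrument names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Names of instruments that recorded a measurement without tenant_id.
var offenders = new HashSet<string>();

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Assumption: custom meters share the "BizFirst." prefix.
    if (instrument.Meter.Name.StartsWith("BizFirst.", StringComparison.Ordinal))
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => Check(instrument, tags));
listener.SetMeasurementEventCallback<double>(
    (instrument, value, tags, state) => Check(instrument, tags));
listener.Start();

// Demo meter standing in for application code.
using var meter = new Meter("BizFirst.Demo");
var tagged = meter.CreateCounter<long>("demo_tagged_total");
var untagged = meter.CreateCounter<long>("demo_untagged_total");

tagged.Add(1, new KeyValuePair<string, object?>("tenant_id", "acme"));
untagged.Add(1, new KeyValuePair<string, object?>("status", "success"));

Console.WriteLine(string.Join(", ", offenders)); // prints "demo_untagged_total"

// Records the instrument name when no tenant_id tag is present.
void Check(Instrument instrument, ReadOnlySpan<KeyValuePair<string, object?>> tags)
{
    foreach (var tag in tags)
        if (tag.Key == "tenant_id") return;
    offenders.Add(instrument.Name);
}
```

In a test suite, the same listener could wrap the code under test and fail the test when the offenders set is not empty.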

Alerting

BizFirst Observe ships with four pre-configured AlertManager rules. These are activated when you deploy the included docker-compose.observability.yml:

Rule | Condition | Duration | Routing
HighErrorRate | 5xx rate > 5% | 5 minutes | PagerDuty (Critical)
HighLatency | P95 > 1 second | 10 minutes | Slack (Warning)
HighKafkaLag | Lag > 10,000 messages | 5 minutes | Slack (Warning)
ComponentDown | health_check_status == 0 | 2 minutes | PagerDuty (Critical)

Deployment

For local development and single-server deployments, use the included Docker Compose file to run the full observability stack alongside your application:

Shell
# Start the full observability stack (Prometheus, Loki, Tempo, Grafana, AlertManager)
docker compose -f docker-compose.observability.yml up -d

# Verify all services are healthy
docker compose -f docker-compose.observability.yml ps

# Service port map:
# Grafana:      http://localhost:3000  (dashboards + alerting UI)
# Prometheus:   http://localhost:9090  (metrics query + targets)
# Loki:         http://localhost:3100  (log storage)
# Tempo OTLP:   grpc://localhost:4317  (trace ingestion)
# AlertManager: http://localhost:9093  (alert routing UI)
