Features — BizFirst Observe by BizFirstAi

Structured Logs via Grafana Loki

Serilog-powered structured logging with multi-tier storage, automatic context enrichment, and LogQL querying.

Three-Tier Logging

L0: console and file sinks for local development. L1: Loki HTTP push in all environments, giving structured, queryable, retained logs. L2: the SecurityAuditLog table for compliance, where every ALLOW/DENY/ERROR RBAC decision is persisted to SQL Server and indexed by TenantId and CreatedAt.
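As a rough illustration of the L1 tier, the sketch below builds the JSON body that Loki's HTTP push endpoint (/loki/api/v1/push) accepts. It is written in Python purely for illustration and is not the product's Serilog pipeline; the label names and the build_loki_push_body helper are assumptions.

```python
import json
import time


def build_loki_push_body(labels, lines, now_ns=None):
    """Return a JSON string in the shape Loki's push API expects.

    labels: dict of stream labels (the names used below are illustrative).
    lines:  list of already-formatted log lines.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [<unix epoch in nanoseconds, as a string>, <log line>].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": dict(labels), "values": values}]})


body = build_loki_push_body(
    {"tenant_id": "acme", "service": "payroll"},
    ['{"level":"Error","message":"payroll run failed"}'],
    now_ns=1_700_000_000_000_000_000,
)
```

The stream labels are what a LogQL selector such as {tenant_id="acme"} matches against, which is why tenant context belongs in the labels rather than only in the log line.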

Automatic Context Enrichment

Every log entry is automatically enriched with TenantId, ServerId, RequestId, and TraceId by TelemetryEnrichmentMiddleware — no manual instrumentation needed. Five log levels: Debug, Information, Warning, Error, Fatal.
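The enrichment idea can be sketched in a few lines. This is a Python analogy using the standard logging module, not the actual TelemetryEnrichmentMiddleware; the handle_request helper, the "unknown" fallback, and the sample values are assumptions made for the example.

```python
import contextvars
import logging

# Per-request context; a middleware would set this once per incoming request.
request_context = contextvars.ContextVar("request_context", default={})

ENRICHED_FIELDS = ("TenantId", "ServerId", "RequestId", "TraceId")


class ContextEnrichmentFilter(logging.Filter):
    """Copy the current request context onto every log record."""

    def filter(self, record):
        ctx = request_context.get()
        for name in ENRICHED_FIELDS:
            setattr(record, name, ctx.get(name, "unknown"))
        return True  # never drop the record, only enrich it


class ListHandler(logging.Handler):
    """Collect records in memory so the example has no external sink."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def handle_request(logger, context):
    """Hypothetical request handler: set context, log, then restore it."""
    token = request_context.set(context)
    try:
        logger.info("request handled")  # no tenant fields passed by hand
    finally:
        request_context.reset(token)


logger = logging.getLogger("enrichment_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addFilter(ContextEnrichmentFilter())
captured = ListHandler()
logger.addHandler(captured)

handle_request(logger, {"TenantId": "acme", "ServerId": "web-01",
                        "RequestId": "req-42", "TraceId": "abc123"})
```

The calling code logs a plain message; the four context fields arrive on the record because the filter reads them from the ambient request context.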

LogQL Queries & Retention

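Query logs directly in Grafana using LogQL. Retention is configurable per environment: 3 days in Development, 7 days in Staging, 30 days in Production. Example queries: {tenant_id="acme"} |= "ERROR" matches lines containing ERROR for one tenant, and {service="payroll"} | json | level="Fatal" parses JSON lines and keeps Fatal entries for one service.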

Metrics via Prometheus

50+ built-in metrics exposed at /metrics. Custom BizFirst metrics plus full ASP.NET Core auto-instrumentation. Scrape every 15 seconds.

50+ Built-In Metrics

Custom BizFirst metrics: bizfirst_health_check_status (gauge), bizfirst_payroll_processed_total (counter), bizfirst_form_submission_duration_seconds (histogram). ASP.NET Core auto-instrumented: http_server_request_duration_seconds P50/P95/P99, http_server_active_requests, http_client_request_duration_seconds. Metric naming convention: {product}_{entity}_{operation}_{unit}.
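A naming convention is easiest to keep when it can be checked mechanically. The following Python sketch is one loose way to do that; the list of unit suffixes and product prefixes is inferred from the metric names quoted on this page and is an assumption, as is the follows_naming_convention helper itself.

```python
import re

# Assumed trailing unit words, taken from the metric names quoted on this page.
KNOWN_UNITS = ("total", "seconds", "status", "messages")
_SEGMENT = re.compile(r"^[a-z][a-z0-9]*$")


def follows_naming_convention(name, products=("bizfirst", "edgestream")):
    """Loose check for {product}_{entity}_{operation}_{unit}.

    Requires at least four lowercase snake_case segments, a known product
    prefix, and a known unit suffix. Multi-word units such as
    duration_seconds are accepted because only the last segment is checked.
    """
    parts = name.split("_")
    if len(parts) < 4:
        return False
    if not all(_SEGMENT.match(part) for part in parts):
        return False
    return parts[0] in products and parts[-1] in KNOWN_UNITS
```

A check like this could run in a unit test over the registered metric names, so that a misnamed metric fails the build rather than reaching a dashboard.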

Kafka Metrics

EdgeStream Kafka metrics built in: edgestream_kafka_consumer_lag_messages, edgestream_kafka_consume_errors_total, and edgestream_kafka_messages_produced_total. All labelled by topic, partition, and consumer_group for granular PromQL filtering.

Dashboard Creation

Build Grafana dashboards in minutes with PromQL. Request rate: rate(http_server_request_duration_seconds_count[5m]). P99 latency: histogram_quantile(0.99, ...). Error rate: rate(http_server_request_duration_seconds_count{status=~"5.."}[5m]). Health status: bizfirst_health_check_status{component="kafka"}.
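For readers wondering what a P99 panel actually computes, here is a simplified Python sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation, in the spirit of PromQL's histogram_quantile. It omits edge cases that Prometheus handles, and the bucket bounds and counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
             ending with (math.inf, total_count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Sample data: request durations in seconds, cumulative counts per bucket.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies that matter.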

Distributed Traces via Grafana Tempo

Full-chain tracing via OTLP gRPC with adaptive sampling, automatic service dependency graphs, and TraceQL querying.

Full-Chain Tracing

Traces span the full request path: HTTP request → service layer → database queries → Kafka → SignalR. Service dependency graph auto-built from span relationships. TraceId is automatically correlated to Loki logs for one-click log navigation from any trace.

Adaptive Sampling

Environment-aware sampling rates: 100% in development, 10% in staging, 1% in production. Always-sample-errors policy ensures no error trace is ever dropped regardless of environment. Configurable via appsettings.json without code changes.
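The sampling policy can be sketched as a small decision function. This Python example is an illustration under stated assumptions, not the product's sampler: the trace-ID ratio technique is a common approach borrowed for the sketch, and the should_sample helper assumes the error flag is already known when the decision is made, which the page does not specify.

```python
# Rates taken from the text above; the dictionary keys are assumed names.
SAMPLING_RATES = {"development": 1.0, "staging": 0.10, "production": 0.01}


def should_sample(trace_id_hex, environment, is_error=False):
    """Decide whether to keep a trace.

    Errors are always kept. Otherwise the low 64 bits of the trace ID are
    treated as a uniform random value and compared against the rate, so
    every service that sees the same trace ID makes the same decision.
    """
    if is_error:
        return True
    rate = SAMPLING_RATES[environment]
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_bits < int(rate * 2**64)
```

Deriving the decision from the trace ID rather than from a fresh random number keeps a distributed trace either fully sampled or fully dropped.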

TraceQL Queries

Query traces in Grafana with TraceQL. Slow traces: {duration > 2s}. Error traces: {status=error}. Tenant-specific: {.tenant_id="acme"}. Database query spans: {span.db.system="mssql" && duration > 500ms}.

Health Checks & Readiness

Live component status with Kubernetes-compatible probes and automatic Prometheus publication.

6 Component Checks

Checks cover all critical dependencies: Kafka, Redis, SQL Server, Loki, Tempo, and Grafana. Each check reports Healthy, Degraded, or Unhealthy status. GET /health returns the full status report with detail per component.

Kubernetes Probes

GET /health/live serves the liveness probe and returns 200 when the process is alive. GET /health/ready serves the readiness probe and returns 200 only when all critical dependencies are Healthy. Both are drop-in compatible with Kubernetes liveness and readiness probe configuration.
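The difference between the two probes comes down to what each one inspects. A minimal Python sketch of that logic follows; the 503 status for a failed readiness check and the choice of which components count as critical are assumptions, since the page only states when 200 is returned.

```python
from enum import Enum


class HealthStatus(Enum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


def liveness_status_code():
    """The process answered, so it is alive; dependencies are not consulted."""
    return 200


def readiness_status_code(component_statuses, critical):
    """Return 200 only when every critical dependency is Healthy.

    component_statuses: dict of component name -> HealthStatus.
    critical: names of the dependencies that gate readiness (assumed set).
    A missing component is treated as not ready.
    """
    ready = all(
        component_statuses.get(name) is HealthStatus.HEALTHY for name in critical
    )
    return 200 if ready else 503  # 503 is an assumed failure code


statuses = {
    "kafka": HealthStatus.HEALTHY,
    "redis": HealthStatus.DEGRADED,
    "sqlserver": HealthStatus.HEALTHY,
}
```

Keeping liveness independent of dependencies avoids restart loops when a downstream system such as Redis is degraded; readiness simply takes the instance out of rotation until the dependency recovers.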

Prometheus Integration

Health check results published automatically as the bizfirst_health_check_status gauge: 2 = Healthy, 1 = Degraded, 0 = Unhealthy. Alert on bizfirst_health_check_status == 0 to trigger ComponentDown alerting rules.
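To make the gauge mapping concrete, the Python sketch below renders health results in the Prometheus text exposition format. The component label matches the PromQL example on this page; the render_health_gauge helper and the sample component states are assumptions for illustration.

```python
# Mapping stated above: 2 = Healthy, 1 = Degraded, 0 = Unhealthy.
GAUGE_VALUES = {"Healthy": 2, "Degraded": 1, "Unhealthy": 0}


def render_health_gauge(statuses):
    """Render component health as Prometheus text exposition lines.

    statuses: dict of component name -> "Healthy" | "Degraded" | "Unhealthy".
    """
    lines = ["# TYPE bizfirst_health_check_status gauge"]
    for component in sorted(statuses):
        value = GAUGE_VALUES[statuses[component]]
        lines.append(
            f'bizfirst_health_check_status{{component="{component}"}} {value}'
        )
    return "\n".join(lines) + "\n"


exposition = render_health_gauge({"kafka": "Unhealthy", "redis": "Healthy"})
```

With this output, an expression such as bizfirst_health_check_status == 0 would match the kafka series and leave the redis series alone.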

Multi-Tenant Observability

Every signal — every log, metric, and trace — automatically carries tenant context. Per-tenant dashboards out of the box.

Automatic Enrichment

TelemetryEnrichmentMiddleware intercepts every request and adds TenantId and ServerId to every log entry, metric label, and trace span — automatically, with no per-service instrumentation required. All custom metrics must include a tenant_id label to enforce per-tenant isolation.

Per-Tenant Dashboards

Grafana dashboards use the $TenantId template variable to filter all panels to a single tenant. Per-tenant PromQL: rate(http_server_request_duration_seconds_count{tenant_id="$TenantId"}[5m]). Per-tenant LogQL: {tenant_id="$TenantId"} |= "ERROR".

Tenant Isolation Audit

Compliance checklist enforced by the platform: every log entry includes TenantId, every metric includes tenant_id label, every trace span includes tenant_id attribute, every database query is tenant-scoped. SecurityAuditLog indexed by (TenantId, CreatedAt) for sub-second compliance queries.

Security & Compliance Logging

A complete RBAC decision trail, GDPR-ready data handling, and configurable compliance event tracking.

SecurityAuditLog

Every RBAC decision is recorded in the SecurityAuditLog table: EventType (ALLOW/DENY/ERROR), PolicyId, PrivilegeKey, PrincipalId, ResourceNodeId, ActionType, Reason, and TenantId. Full audit trail for every authorisation decision across the platform.
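The shape of the audit trail can be sketched with an in-memory SQLite database standing in for SQL Server. The column names come from the list above; the column types, the Id and index names, the helper functions, and the sample rows are all assumptions made for the example.

```python
import sqlite3

AUDIT_COLUMNS = (
    "EventType", "PolicyId", "PrivilegeKey", "PrincipalId",
    "ResourceNodeId", "ActionType", "Reason", "TenantId", "CreatedAt",
)


def open_audit_store():
    """Create an in-memory stand-in for the SecurityAuditLog table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SecurityAuditLog ("
        " Id INTEGER PRIMARY KEY,"
        " EventType TEXT NOT NULL CHECK (EventType IN ('ALLOW', 'DENY', 'ERROR')),"
        " PolicyId TEXT, PrivilegeKey TEXT, PrincipalId TEXT,"
        " ResourceNodeId TEXT, ActionType TEXT, Reason TEXT,"
        " TenantId TEXT NOT NULL, CreatedAt TEXT NOT NULL)"
    )
    # Composite index so per-tenant, time-ranged queries avoid a full scan.
    conn.execute(
        "CREATE INDEX IX_SecurityAuditLog_TenantId_CreatedAt"
        " ON SecurityAuditLog (TenantId, CreatedAt)"
    )
    return conn


def record_decision(conn, **fields):
    """Insert one RBAC decision; missing optional fields are stored as NULL."""
    row = {column: fields.get(column) for column in AUDIT_COLUMNS}
    placeholders = ", ".join(":" + column for column in AUDIT_COLUMNS)
    conn.execute(
        f"INSERT INTO SecurityAuditLog ({', '.join(AUDIT_COLUMNS)})"
        f" VALUES ({placeholders})",
        row,
    )


def denied_decisions(conn, tenant_id, since_iso):
    """Return DENY decisions for one tenant since an ISO-8601 timestamp."""
    cursor = conn.execute(
        "SELECT PrincipalId, PrivilegeKey, CreatedAt FROM SecurityAuditLog"
        " WHERE TenantId = ? AND CreatedAt >= ? AND EventType = 'DENY'"
        " ORDER BY CreatedAt",
        (tenant_id, since_iso),
    )
    return cursor.fetchall()


conn = open_audit_store()
record_decision(conn, EventType="ALLOW", PrincipalId="p1", PrivilegeKey="payroll.read",
                TenantId="acme", CreatedAt="2026-01-10T09:00:00Z")
record_decision(conn, EventType="DENY", PrincipalId="p2", PrivilegeKey="payroll.run",
                Reason="missing privilege", TenantId="acme", CreatedAt="2026-01-10T09:05:00Z")
record_decision(conn, EventType="DENY", PrincipalId="p3", PrivilegeKey="payroll.run",
                TenantId="globex", CreatedAt="2026-01-10T09:06:00Z")
```

Every query filters on TenantId first, which is what lets the (TenantId, CreatedAt) index serve compliance lookups.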

GDPR Compliance

User IDs are hashed before logging — no PII ever reaches Loki or Tempo. Configurable TTL retention per environment. Structured log fields are schema-validated to prevent accidental PII leakage. Data residency guaranteed — all storage runs in your own infrastructure.
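One common way to hash identifiers before they reach a log pipeline is a keyed hash. The Python sketch below uses HMAC-SHA256 as an example; the page does not say which algorithm BizFirst Observe uses, so the algorithm, the pseudonymise_user_id helper, and the key handling are assumptions.

```python
import hashlib
import hmac


def pseudonymise_user_id(user_id, secret_key):
    """Return a stable, non-reversible token for a user ID.

    A keyed hash (HMAC) is used rather than a bare hash so that the token
    cannot be recomputed by someone who only knows the original ID.
    secret_key: bytes, kept outside the logging pipeline (assumed setup).
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


token = pseudonymise_user_id("jane.doe@example.com", b"example-secret-key")
```

The same input always yields the same token, so log lines for one user can still be correlated without the raw identifier ever being written.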

Compliance Event Tracking

Five compliance event categories automatically tracked: authentication events (login/logout/failed), data access (reads of sensitive resources), data modifications (creates/updates/deletes), authorisation changes (privilege grants/revocations), and admin actions (tenant config, user management).

Alerting & Incident Response

Pre-built alert rules, intelligent routing, and a structured escalation policy — from first alert to war room in 30 minutes.

Pre-Built Alert Rules

Four production-ready rules are included: HighErrorRate (more than 5% 5xx responses for 5 minutes), HighLatency (P95 above 1 second for 10 minutes), HighKafkaLag (more than 10,000 messages for 5 minutes), and ComponentDown (bizfirst_health_check_status == 0 for 2 minutes). All configurable via Alertmanager.
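Each rule pairs a threshold with a duration, so a brief spike does not page anyone. The Python sketch below models that "held for N minutes" behaviour with one sample per minute; it is a simplification of how a real rule evaluator works, and the AlertRule type and is_firing helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float
    for_minutes: int
    fires_when_equal: bool = False  # ComponentDown uses ==, the others use >


# Thresholds and durations as listed above.
RULES = {
    rule.name: rule
    for rule in (
        AlertRule("HighErrorRate", 0.05, 5),   # 5xx ratio
        AlertRule("HighLatency", 1.0, 10),     # P95 in seconds
        AlertRule("HighKafkaLag", 10_000, 5),  # messages
        AlertRule("ComponentDown", 0, 2, fires_when_equal=True),
    )
}


def is_firing(rule, samples):
    """Return True when the condition held for the whole `for` window.

    samples: one reading per minute, oldest first (a simplifying assumption).
    """
    if len(samples) < rule.for_minutes:
        return False
    recent = samples[-rule.for_minutes:]
    if rule.fires_when_equal:
        return all(value == rule.threshold for value in recent)
    return all(value > rule.threshold for value in recent)
```

A single reading back under the threshold resets the window, which is what keeps short-lived blips from firing.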

Alert Routing

Critical severity routes to PagerDuty immediately. Warning severity routes to Slack. Alert grouping and inhibition rules are configured in Alertmanager. Routing is environment-aware: production critical alerts escalate faster than staging warnings.

Escalation Policy

Structured 30-minute escalation ladder: at 5 minutes, log the alert and notify on-call; at 15 minutes, page via PagerDuty; at 20 minutes, escalate to the engineering manager; at 30 minutes, open a war room with all stakeholders. All timings are configurable per alert rule.
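The ladder is effectively a lookup from elapsed time to the actions that should already have happened. A small Python sketch, using the default timings listed above; the escalation_actions_due helper is a hypothetical name introduced for the example.

```python
# (minutes since the alert fired, action), using the default timings.
ESCALATION_LADDER = [
    (5, "log alert and notify on-call"),
    (15, "page via PagerDuty"),
    (20, "escalate to engineering manager"),
    (30, "open a war room with all stakeholders"),
]


def escalation_actions_due(minutes_since_alert, ladder=ESCALATION_LADDER):
    """Return every action whose deadline has passed, in ladder order."""
    return [
        action for deadline, action in ladder if minutes_since_alert >= deadline
    ]
```

Passing a different ladder per rule is one way to model the per-rule timing overrides.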

Product Integrations

Deep observability built into every BizFirstAi product — not bolted on after the fact.

FormMaker

Form validation counters and submission duration histograms. Track submission rates, validation failure rates, and end-to-end form processing latency per tenant and per form definition.

EdgeStream

Kafka consumer lag metrics and SignalR connected client count gauge. Monitor real-time event processing health, detect consumer stalls before they become incidents, and track active client connections per hub.

ExecutorNodes

Individual node execution tracing via node.execute spans. Success and failure counters per node type. Trace the full execution lifecycle of every automation node, including duration histograms and error breakdowns by node category.

ProcessEngine

Workflow execution tracing with parent/child span hierarchy: workflow.execute spans contain node.execute child spans for every node in the workflow. Visualise the complete execution graph of any workflow run in Grafana Tempo.
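The parent/child relationship can be pictured with a toy span model. This Python sketch is not the OpenTelemetry API; the Span class, the trace_workflow helper, and the attribute keys workflow.name and node.name are assumptions, while the span names workflow.execute and node.execute come from the text above.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

_span_ids = itertools.count(1)


@dataclass
class Span:
    name: str
    parent_id: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    span_id: int = field(default_factory=lambda: next(_span_ids))
    children: List["Span"] = field(default_factory=list)

    def child(self, name, attributes=None):
        """Create a span whose parent is this span."""
        span = Span(name, parent_id=self.span_id, attributes=attributes or {})
        self.children.append(span)
        return span


def trace_workflow(workflow_name, node_names):
    """Build one workflow.execute span with a node.execute child per node."""
    root = Span("workflow.execute", attributes={"workflow.name": workflow_name})
    for node_name in node_names:
        root.child("node.execute", {"node.name": node_name})
    return root


def render(span, depth=0):
    """Return an indented outline of the span tree, one line per span."""
    label = span.attributes.get("node.name") or span.attributes.get("workflow.name", "")
    lines = ["  " * depth + f"{span.name} ({label})"]
    for child_span in span.children:
        lines.extend(render(child_span, depth + 1))
    return lines


workflow_span = trace_workflow("monthly-payroll", ["validate", "calculate", "notify"])
outline = render(workflow_span)
```

Because each child records its parent's span ID, a trace backend can rebuild the execution graph from the spans alone.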

Octopus

Agent function execution spans covering every tool call and LLM invocation. Token usage metrics — input and output tokens per tenant — for cost attribution. Full trace from agent request to response, including any sub-agent delegations.

Competitive Advantages

Open standards, open-source, and your data in your infrastructure. A different approach from the big vendors.

vs Datadog

BizFirst Observe is free and open source: $0 versus Datadog's $0.50+/GB ingestion pricing. No vendor lock-in: switch or add backends without re-instrumenting. On-prem capable, so your data never leaves your infrastructure. Datadog requires sending all telemetry to their cloud.

vs New Relic

The OpenTelemetry SDK is lightweight, with no heavy APM agent to deploy and maintain. Standard OpenTelemetry protocols mean your instrumentation works with any OTLP-compatible backend. Flexible sampling is configured in your own application settings, not locked to New Relic's proprietary sampling policies.

vs Splunk

Standard LogQL and TraceQL — not Splunk's proprietary SPL. Engineers already know these query languages from the broader Grafana ecosystem. Implementation takes days, not weeks. No expensive Splunk licensing or infrastructure to maintain.

Roadmap

Where BizFirst Observe is going — from adaptive sampling to Observability Hub.

Q2 2026

Adaptive trace sampling (head-based decisions before span creation). Slack and Microsoft Teams native alert templates. Pre-built dashboard library with 10+ ready-to-use dashboards for all BizFirstAi products. Multi-tenant comparison views for SaaS operators.

Q3–Q4 2026

Istio service mesh integration for network-level observability. eBPF kernel-level tracing without SDK changes. ML-based anomaly detection for automatic baseline learning and alert suppression during known maintenance windows. Observability as Code — define dashboards and alerts in YAML/HCL.

2027+

Observability Hub for all 10 BizFirstAi products — one pane of glass across the entire platform. Automatic HIPAA and SOC2 compliance reporting generated from existing audit logs. Cost optimization engine with per-tenant cost attribution. Observability marketplace for community dashboards and alert templates.

Start observing.

Three lines of C#. Full-stack observability. Production-verified.


Every observability signal your platform needs.