Go: OpenTelemetry Observability in Practice

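I integrated OpenTelemetry into our e-commerce system, which ended the pain of troubleshooting purely by grepping logs. This post records the full process of adding OTel to a Go project, covering the three pillars: Traces, Metrics, and Logs.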

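OpenTelemetry Core Concepts

OpenTelemetry (OTel) is the CNCF observability standard. It unifies the collection and export of three signals: Traces, Metrics, and Logs.

  • Traces: all the services and operations a request passes through from entry to exit, forming a call tree. Each operation is a Span.
  • Metrics: numeric measurements such as request count, latency percentiles, and error rate.
  • Logs: structured logs that can be correlated with a specific Trace.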
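
To make the "call tree" idea concrete, here is a minimal sketch using only the standard library. The Span struct and renderTree function are hypothetical simplifications, not OTel SDK types: a real span also carries a trace ID, timestamps, attributes, and a status. The sketch only shows how parent links turn a flat list of spans into a tree, which is roughly what a tracing UI does when it draws a trace.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a simplified, hypothetical stand-in for an OTel span. It keeps just
// enough fields to show how parent/child links form a tree within one trace.
type Span struct {
	SpanID   string
	ParentID string // empty for the root span
	Name     string
}

// renderTree returns an indented outline of the spans, with children listed
// under their parents in the order the spans were supplied.
func renderTree(spans []Span) string {
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}

	var b strings.Builder
	var walk func(parentID string, depth int)
	walk = func(parentID string, depth int) {
		for _, s := range children[parentID] {
			b.WriteString(strings.Repeat("  ", depth) + s.Name + "\n")
			walk(s.SpanID, depth+1)
		}
	}
	walk("", 0) // root spans have an empty ParentID
	return b.String()
}

func main() {
	// Span names are illustrative; they mimic one order request that queries
	// the database and calls a product service.
	spans := []Span{
		{SpanID: "a", Name: "GET /api/orders"},
		{SpanID: "b", ParentID: "a", Name: "db.query.order"},
		{SpanID: "c", ParentID: "a", Name: "GET product-service"},
		{SpanID: "d", ParentID: "c", Name: "db.query.product"},
	}
	fmt.Print(renderTree(spans))
}
```

Each level of indentation in the output corresponds to one parent/child hop in the trace.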

OTel architecture: the application collects data through the SDK → sends it to the OTel Collector → the Collector exports it to backends (Jaeger, Prometheus, Loki, etc.).

Integrating the Go SDK

First, install the dependencies:

go get go.opentelemetry.io/otel \
       go.opentelemetry.io/otel/sdk \
       go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp \
       go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp \
       go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

Initialize the TracerProvider and MeterProvider:

package otel

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func InitOTel(ctx context.Context, serviceName, endpoint string) (func(), error) {
    // Resource attributes identify this service in every exported span and metric.
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Trace exporter (OTLP over HTTP; endpoint is host:port without a scheme)
    traceExporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(traceExporter,
            sdktrace.WithBatchTimeout(5*time.Second),
        ),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // sample 10% of traces
    )
    otel.SetTracerProvider(tp)

    // Register a propagator. The global default is a no-op, so without this
    // otelhttp would not inject or extract trace context headers.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    // Metric exporter
    metricExporter, err := otlpmetrichttp.New(ctx,
        otlpmetrichttp.WithEndpoint(endpoint),
        otlpmetrichttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    mp := metric.NewMeterProvider(
        metric.WithReader(metric.NewPeriodicReader(metricExporter,
            metric.WithInterval(15*time.Second),
        )),
        metric.WithResource(res),
    )
    otel.SetMeterProvider(mp)

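    cleanup := func() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        // Shutdown flushes buffered spans and metrics before the process exits.
        // Errors are deliberately ignored here to keep the example short.
        _ = tp.Shutdown(ctx)
        _ = mp.Shutdown(ctx)
    }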

    return cleanup, nil
}
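
The 10% figure passed to sdktrace.TraceIDRatioBased deserves a word of explanation: ratio-based sampling is usually derived from the trace ID itself rather than from a random draw per span, so every span of one trace gets the same verdict. The sketch below illustrates that idea with the standard library only. The function shouldSample is hypothetical and is a simplified illustration, not the SDK's implementation; the exact bit layout the SDK uses may differ.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample is a simplified illustration of ratio-based sampling. It maps
// the trace ID to a number and keeps the trace when that number falls below a
// threshold proportional to ratio. The same trace ID always gets the same answer.
func shouldSample(traceIDHex string, ratio float64) (bool, error) {
	id, err := hex.DecodeString(traceIDHex)
	if err != nil {
		return false, err
	}
	if len(id) != 16 {
		return false, fmt.Errorf("trace ID must be 16 bytes, got %d", len(id))
	}
	if ratio >= 1 {
		return true, nil
	}
	if ratio <= 0 {
		return false, nil
	}

	// Treat the low 8 bytes as a 63-bit unsigned value in [0, 2^63).
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	threshold := uint64(ratio * float64(1<<63))
	return x < threshold, nil
}

func main() {
	// Example trace IDs; any 32-character hex string works.
	ids := []string{
		"4bf92f3577b34da6a3ce929d0e0e4736",
		"4bf92f3577b34da60000000000000001",
	}
	for _, id := range ids {
		keep, err := shouldSample(id, 0.1)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s sampled=%v\n", id, keep)
	}
}
```

Because the decision depends only on the trace ID and the ratio, services configured with the same ratio tend to agree on which traces to keep, which avoids traces with missing segments.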

Call it from main.go:

func main() {
    ctx := context.Background()
    cleanup, err := otel.InitOTel(ctx, "order-service", "otel-collector:4318")
    if err != nil {
        log.Fatal(err)
    }
    defer cleanup()
    // ... start the service
}

HTTP Middleware

The otelhttp package provides ready-to-use HTTP middleware that automatically creates a Span for every request:

import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

func setupRouter() http.Handler {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/orders", handleOrders)
    mux.HandleFunc("/api/products", handleProducts)

    // Wrap the whole handler so that every request is traced automatically
    return otelhttp.NewHandler(mux, "http-server")
}

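For outbound calls, wrap the HTTP client with otelhttp.NewTransport so that the trace context is propagated automatically. Propagation relies on a global propagator being registered with otel.SetTextMapPropagator; the default is a no-op, and without a real one no headers are injected: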

var httpClient = &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

func callProductService(ctx context.Context, productID string) (*Product, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        fmt.Sprintf("http://product-service/api/products/%s", productID),
        nil,
    )
    if err != nil {
        return nil, err
    }
    // The trace context is injected into the request headers automatically
    resp, err := httpClient.Do(req)
    // ...
}
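
What actually travels between services is, with the W3C Trace Context propagator, a traceparent request header of the form version-traceid-spanid-flags. The sketch below parses and formats that header using only the standard library, to show what the downstream service receives. TraceParent and ParseTraceParent are hypothetical names for illustration, the sketch only accepts version 00, and in a real service the OTel propagator does this work.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a W3C traceparent header (version 00).
type TraceParent struct {
	TraceID string // 32 lowercase hex characters, shared by the whole trace
	SpanID  string // 16 lowercase hex characters, the caller's span
	Sampled bool   // lowest bit of the trace-flags field
}

var traceParentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// String renders the header value that a client would send.
func (tp TraceParent) String() string {
	flags := "00"
	if tp.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", tp.TraceID, tp.SpanID, flags)
}

// ParseTraceParent validates and decodes a traceparent header value.
func ParseTraceParent(header string) (TraceParent, error) {
	m := traceParentRe.FindStringSubmatch(header)
	if m == nil {
		return TraceParent{}, fmt.Errorf("malformed traceparent: %q", header)
	}
	// All-zero IDs are invalid according to the specification.
	if m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return TraceParent{}, fmt.Errorf("traceparent has an all-zero ID: %q", header)
	}
	flags, err := strconv.ParseUint(m[3], 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{TraceID: m[1], SpanID: m[2], Sampled: flags&1 == 1}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled)
	fmt.Println(tp.String())
}
```

The receiving service starts its own span with the same trace ID and uses the incoming span ID as the parent, which is how spans from different processes end up in one tree.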

Database Tracing

Create Spans manually for database operations:

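import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/trace"
)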

var tracer = otel.Tracer("order-service")

func GetOrderByID(ctx context.Context, id string) (*Order, error) {
    // Share one constant so the recorded statement always matches the executed SQL.
    const query = "SELECT id, user_id, total, status FROM orders WHERE id = $1"

    ctx, span := tracer.Start(ctx, "db.query.order",
        trace.WithAttributes(
            attribute.String("db.system", "postgresql"),
            attribute.String("db.statement", query),
            attribute.String("db.operation", "SELECT"),
            attribute.String("order.id", id),
        ),
    )
    defer span.End()

    var order Order
    // Pass ctx so the query is cancelled together with the request.
    err := db.QueryRowContext(ctx, query, id).
        Scan(&order.ID, &order.UserID, &order.Total, &order.Status)

    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    span.SetAttributes(attribute.String("order.status", order.Status))
    return &order, nil
}

Custom Metrics

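import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("order-service")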

var (
    orderCounter    metric.Int64Counter
    orderDuration   metric.Float64Histogram
    activeOrders    metric.Int64UpDownCounter
)

// initMetrics creates the instruments once at startup and returns the first error.
func initMetrics() error {
    var err error
    orderCounter, err = meter.Int64Counter("orders.created",
        metric.WithDescription("Number of orders created"),
    )
    if err != nil {
        return err
    }

    orderDuration, err = meter.Float64Histogram("orders.duration",
        metric.WithDescription("Order processing duration in ms"),
        metric.WithUnit("ms"),
    )
    if err != nil {
        return err
    }

    activeOrders, err = meter.Int64UpDownCounter("orders.active",
        metric.WithDescription("Number of active orders"),
    )
    return err
}

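func CreateOrder(ctx context.Context, req CreateOrderRequest) (order *Order, err error) {
    start := time.Now()
    defer func() {
        // Label the sample by outcome so failed orders are not recorded as successes.
        status := "success"
        if err != nil {
            status = "error"
        }
        duration := float64(time.Since(start).Milliseconds())
        orderDuration.Record(ctx, duration,
            metric.WithAttributes(attribute.String("status", status)),
        )
    }()

    // ... order creation logic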

    orderCounter.Add(ctx, 1,
        metric.WithAttributes(attribute.String("source", req.Source)),
    )
    activeOrders.Add(ctx, 1)

    return order, nil
}
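
The orders.duration instrument is a histogram, and it helps to know what that means for the P99 numbers shown later on a dashboard: the SDK does not keep every sample, it only counts how many samples fall into each bucket, so any percentile read back is an estimate. The sketch below shows the idea with the standard library only. The Histogram type and the bucket boundaries are hypothetical and chosen for illustration; they are not the SDK's defaults or its implementation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Histogram is a minimal explicit-bucket histogram. bounds holds the inclusive
// upper bound of each bucket; one extra bucket catches values above the last bound.
type Histogram struct {
	bounds []float64
	counts []uint64
	total  uint64
}

func NewHistogram(bounds []float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Record increments the bucket whose upper bound is the first one >= v.
func (h *Histogram) Record(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.total++
}

// QuantileUpperBound returns the upper bound of the bucket that contains the
// q-th quantile. It is a coarse estimate: the true value lies somewhere in
// that bucket. Values in the overflow bucket are reported as +Inf.
func (h *Histogram) QuantileUpperBound(q float64) float64 {
	if h.total == 0 {
		return math.NaN()
	}
	target := q * float64(h.total)
	var cum uint64
	for i, c := range h.counts {
		cum += c
		if c > 0 && float64(cum) >= target {
			if i == len(h.bounds) {
				return math.Inf(1)
			}
			return h.bounds[i]
		}
	}
	return math.Inf(1)
}

func main() {
	// Illustrative boundaries in milliseconds.
	h := NewHistogram([]float64{5, 10, 25, 50, 100, 250, 500, 1000})
	for i := 0; i < 95; i++ {
		h.Record(8) // most orders take about 8 ms
	}
	for i := 0; i < 5; i++ {
		h.Record(40) // a few slow ones
	}
	fmt.Println("P50 is at most", h.QuantileUpperBound(0.50), "ms")
	fmt.Println("P99 is at most", h.QuantileUpperBound(0.99), "ms")
}
```

The practical consequence is that percentile accuracy depends on where the bucket boundaries sit relative to the latencies being measured, so it is worth checking the boundaries if P99 values look suspiciously round.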

Backend Deployment: Jaeger + Grafana

Deploy the OTel Collector, Jaeger, Prometheus, and Grafana with Docker Compose:

# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml

  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686"  # UI
      - "4317"         # receives OTLP data from the Collector
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

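  prometheus:
    image: prom/prometheus:v2.48.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # Needed so the Collector's prometheusremotewrite exporter can push to /api/v1/write
      - --web.enable-remote-write-receiver
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml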

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"

OTel Collector configuration:

# otel-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

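Results

After adopting OTel, troubleshooting became noticeably faster:

  1. Traces in Jaeger: which services a request passed through, how long each step took, and where it failed are all visible at a glance.
  2. Trends in Grafana: request volume, error rate, and P99 latency go on a dashboard, so anomalies stand out immediately.
  3. Trace and log correlation: print the trace_id in logs, then jump from a log line to the full trace in Jaeger.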
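
For the third point, one way to get a trace_id into every log line is to wrap the logger's handler. The sketch below uses the standard library's log/slog. To stay self-contained it reads the ID from a hypothetical context key (traceIDKey); in a service instrumented with OTel, the ID would instead come from the active span, for example via trace.SpanContextFromContext(ctx).TraceID().String(). The handler and function names are assumptions for illustration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

// traceIDKey is a hypothetical context key used only in this sketch.
type traceIDKey struct{}

// traceHandler wraps another slog.Handler and adds a trace_id attribute to
// every record whose context carries one.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.inner.Handle(ctx, r)
}

func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{inner: h.inner.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{inner: h.inner.WithGroup(name)}
}

// logOrderCreated writes one JSON log line for the given trace ID and returns
// it. The timestamp is dropped so that the output is reproducible.
func logOrderCreated(traceID string) string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{} // an empty Attr removes the field
			}
			return a
		},
	}
	logger := slog.New(traceHandler{inner: slog.NewJSONHandler(&buf, opts)})

	ctx := context.WithValue(context.Background(), traceIDKey{}, traceID)
	logger.InfoContext(ctx, "order created", "order_id", "42")
	return buf.String()
}

func main() {
	fmt.Print(logOrderCreated("4bf92f3577b34da6a3ce929d0e0e4736"))
}
```

The important habit is to log through the context-aware methods (InfoContext, ErrorContext, and so on), since the handler can only find the trace ID if the request context is passed along.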

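Previously, a cross-service slow-query problem took half a day to track down from logs. With tracing in place, it took 5 minutes to pin it on an N+1 query in the product service. We should have adopted this much earlier.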