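I recently integrated OpenTelemetry into our e-commerce system, which ended the pain of troubleshooting by grepping through logs. This post records the full process of adding OTel to a Go project, covering the three pillars: traces, metrics, and logs (the last one through trace ID correlation).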
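OpenTelemetry core concepts
OpenTelemetry (OTel) is the CNCF observability standard. It unifies how three kinds of signals (traces, metrics, and logs) are collected and exported.
- Traces: every service and operation a request passes through from entry to exit, forming a call tree. Each operation is a Span.
- Metrics: numeric measurements such as request counts, latency quantiles, and error rates.
- Logs: structured logs that can be linked to a specific trace.
The OTel architecture: the application collects data through the SDK → sends it to the OTel Collector → the Collector exports it to a backend (Jaeger, Prometheus, Loki, and so on).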
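To make the "call tree" idea concrete, here is a minimal sketch using only the standard library. The `Span` struct and the sample data are hypothetical stand-ins, not the OTel SDK types; the point is only that each span records its parent's ID, which is enough to rebuild the tree.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Span is a simplified, hypothetical stand-in for an OTel span.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for the root span
	Name     string
	Duration time.Duration
}

// sampleSpans returns made-up spans for one request that crosses two services.
func sampleSpans() []Span {
	return []Span{
		{TraceID: "t1", SpanID: "a", ParentID: "", Name: "GET /api/orders", Duration: 120 * time.Millisecond},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "db.query.order", Duration: 15 * time.Millisecond},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "GET /api/products/42", Duration: 80 * time.Millisecond},
		{TraceID: "t1", SpanID: "d", ParentID: "c", Name: "db.query.product", Duration: 60 * time.Millisecond},
	}
}

// buildTree groups spans by parent ID so the trace can be walked as a tree.
func buildTree(spans []Span) map[string][]Span {
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}
	return children
}

// render returns an indented text view of the spans below parentID.
func render(children map[string][]Span, parentID string, depth int) string {
	var b strings.Builder
	for _, s := range children[parentID] {
		fmt.Fprintf(&b, "%s%s (%s)\n", strings.Repeat("  ", depth), s.Name, s.Duration)
		b.WriteString(render(children, s.SpanID, depth+1))
	}
	return b.String()
}

func main() {
	fmt.Print(render(buildTree(sampleSpans()), "", 0))
}
```

A tracing backend such as Jaeger does essentially this reconstruction when it draws the waterfall view of a trace.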
Go SDK integration
Install the dependencies first:
go get go.opentelemetry.io/otel \
    go.opentelemetry.io/otel/sdk \
    go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp \
    go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp \
    go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
Initialize the TracerProvider and MeterProvider:
package otel

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitOTel registers the global tracer and meter providers and returns a
// cleanup function that flushes and shuts them down.
func InitOTel(ctx context.Context, serviceName, endpoint string) (func(), error) {
	// The resource describes this service and is attached to every signal.
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName(serviceName),
			semconv.ServiceVersion("1.0.0"),
		),
	)
	if err != nil {
		return nil, err
	}

	// Trace exporter
	traceExporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(endpoint),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(traceExporter,
			sdktrace.WithBatchTimeout(5*time.Second),
		),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // sample 10% of traces
	)
	otel.SetTracerProvider(tp)

	// The global propagator defaults to a no-op. Without this call, otelhttp
	// does not inject or extract trace context across services.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))

	// Metric exporter
	metricExporter, err := otlpmetrichttp.New(ctx,
		otlpmetrichttp.WithEndpoint(endpoint),
		otlpmetrichttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	mp := metric.NewMeterProvider(
		metric.WithReader(metric.NewPeriodicReader(metricExporter,
			metric.WithInterval(15*time.Second),
		)),
		metric.WithResource(res),
	)
	otel.SetMeterProvider(mp)

	// cleanup flushes buffered spans and metrics; shutdown errors are ignored here.
	cleanup := func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = tp.Shutdown(ctx)
		_ = mp.Shutdown(ctx)
	}
	return cleanup, nil
}
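Call it from main.go:
func main() {
	ctx := context.Background()
	// otel here is the project's own package defined above, not go.opentelemetry.io/otel.
	cleanup, err := otel.InitOTel(ctx, "order-service", "otel-collector:4318")
	if err != nil {
		log.Fatal(err)
	}
	defer cleanup()
	// ... start the service
}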
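The `TraceIDRatioBased(0.1)` sampler configured in `InitOTel` decides from the trace ID alone, so every service that sees the same trace ID reaches the same decision. The sketch below shows the general idea with the standard library only: treat part of the trace ID as a number and compare it against a threshold derived from the ratio. The function `shouldSample` is a hypothetical name, and the SDK's actual implementation may differ in detail.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID.
// It reads the last 8 bytes as a big-endian integer, drops one bit so the
// value fits in 63 bits, and keeps the trace if it falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	const total = 100000
	sampled := 0
	for i := 0; i < total; i++ {
		var id [16]byte
		if _, err := rand.Read(id[:]); err != nil {
			panic(err)
		}
		if shouldSample(id, 0.1) {
			sampled++
		}
	}
	// With random trace IDs the sampled share should land close to 10%.
	fmt.Printf("sampled %d of %d traces\n", sampled, total)
}
```

Because the decision depends only on the trace ID, a 10% ratio keeps whole traces rather than 10% of the spans in every trace.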
HTTP middleware
The otelhttp package provides ready-to-use HTTP middleware that creates a Span for every request automatically:
import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func setupRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/orders", handleOrders)
	mux.HandleFunc("/api/products", handleProducts)
	// Wrap the whole handler so every request is traced automatically.
	return otelhttp.NewHandler(mux, "http-server")
}
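For readers unfamiliar with the wrapping pattern, the following standard-library sketch shows roughly what a handler wrapper does: it runs before and after the inner handler and reports what it observed. The names `wrap`, `observed`, and `probe` are hypothetical; `otelhttp.NewHandler` records far more (span context, attributes, metrics) and this is not its implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// observed is what this sketch reports per request.
type observed struct {
	Name     string
	Path     string
	Status   int
	Duration time.Duration
}

// wrap decorates next and calls report once per request, after it completes.
func wrap(next http.Handler, name string, report func(observed)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		report(observed{Name: name, Path: r.URL.Path, Status: rec.status, Duration: time.Since(start)})
	})
}

// probe sends one request through a wrapped mux and returns what was observed.
func probe(path string) observed {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/orders", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	var got observed
	h := wrap(mux, "http-server", func(o observed) { got = o })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return got
}

func main() {
	fmt.Printf("%+v\n", probe("/api/orders"))
	fmt.Printf("%+v\n", probe("/missing"))
}
```

Wrapping the mux once, rather than each route, is what lets a single line cover every endpoint, including requests that end in a 404.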
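For outbound calls, wrap the HTTP client with otelhttp.NewTransport so the trace context is propagated automatically. Propagation relies on a global propagator having been registered with otel.SetTextMapPropagator; the default is a no-op.
var httpClient = &http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}

func callProductService(ctx context.Context, productID string) (*Product, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("http://product-service/api/products/%s", productID),
		nil,
	)
	if err != nil {
		return nil, err
	}
	// The transport injects the trace context into the request headers.
	resp, err := httpClient.Do(req)
	// ...
}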
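With the W3C Trace Context propagator, the "trace context in the request headers" is a single `traceparent` header of the form `version-traceid-spanid-flags`. The sketch below formats and parses that header with the standard library. The type and function names are hypothetical, only version `00` is handled, and in a real service the OTel propagator does this work.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// spanContext holds the fields carried by a traceparent header.
type spanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// formatTraceparent renders the header value for an outgoing request.
func formatTraceparent(sc spanContext) string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceparent validates and decodes a version-00 header value.
func parseTraceparent(h string) (spanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return spanContext{}, errors.New("unsupported traceparent format")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(spanID, 16) || !isLowerHex(flags, 2) {
		return spanContext{}, errors.New("malformed traceparent field")
	}
	// All-zero IDs are invalid.
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return spanContext{}, errors.New("all-zero trace or span ID")
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return spanContext{}, err
	}
	return spanContext{TraceID: traceID, SpanID: spanID, Sampled: f&0x01 == 1}, nil
}

func main() {
	header := formatTraceparent(spanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	})
	fmt.Println("traceparent:", header)
	sc, err := parseTraceparent(header)
	fmt.Println(sc, err)
}
```

The downstream service parses this header, reuses the trace ID, and records the incoming span ID as the parent of its own server span, which is how spans from different services end up in one tree.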
Database tracing
Create a Span manually around database operations:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("order-service")

// GetOrderByID loads one order and records the query as a child span.
// db is assumed to be a package-level *sql.DB.
func GetOrderByID(ctx context.Context, id string) (*Order, error) {
	const query = "SELECT id, user_id, total, status FROM orders WHERE id = $1"
	ctx, span := tracer.Start(ctx, "db.query.order",
		trace.WithAttributes(
			attribute.String("db.system", "postgresql"),
			// Record the parameterized statement, never the bound values.
			attribute.String("db.statement", query),
			attribute.String("db.operation", "SELECT"),
			attribute.String("order.id", id),
		),
	)
	defer span.End()

	var order Order
	err := db.QueryRowContext(ctx, query, id).
		Scan(&order.ID, &order.UserID, &order.Total, &order.Status)
	if err != nil {
		// Mark the span as failed so it stands out in the trace view.
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	span.SetAttributes(attribute.String("order.status", order.Status))
	return &order, nil
}
Custom metrics
import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric" // the API package, not sdk/metric
)

var meter = otel.Meter("order-service")

var (
	orderCounter  metric.Int64Counter
	orderDuration metric.Float64Histogram
	activeOrders  metric.Int64UpDownCounter
)

// initMetrics creates the instruments once at startup.
func initMetrics() error {
	var err error
	orderCounter, err = meter.Int64Counter("orders.created",
		metric.WithDescription("Number of orders created"),
	)
	if err != nil {
		return err
	}
	orderDuration, err = meter.Float64Histogram("orders.duration",
		metric.WithDescription("Order processing duration in ms"),
		metric.WithUnit("ms"),
	)
	if err != nil {
		return err
	}
	activeOrders, err = meter.Int64UpDownCounter("orders.active",
		metric.WithDescription("Number of active orders"),
	)
	return err
}

func CreateOrder(ctx context.Context, req CreateOrderRequest) (order *Order, err error) {
	start := time.Now()
	defer func() {
		// Label the duration with the actual outcome rather than always "success".
		status := "success"
		if err != nil {
			status = "error"
		}
		duration := float64(time.Since(start).Milliseconds())
		orderDuration.Record(ctx, duration,
			metric.WithAttributes(attribute.String("status", status)),
		)
	}()
	// ... order creation logic (assigns order, returns early on error)
	orderCounter.Add(ctx, 1,
		metric.WithAttributes(attribute.String("source", req.Source)),
	)
	activeOrders.Add(ctx, 1)
	return order, nil
}
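A histogram such as `orders.duration` does not store every value. It counts values per bucket, and quantiles like P99 are later estimated from those counts. The sketch below models that with the standard library; the `histogram` type is hypothetical, the boundaries are the default explicit bucket boundaries from the OTel specification as far as I know, and the quantile estimate is deliberately coarse (it returns the bucket's upper bound, whereas Prometheus interpolates within the bucket).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// defaultBounds are upper bucket boundaries in the metric's unit (ms here).
var defaultBounds = []float64{0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000}

// histogram counts recorded values per bucket.
type histogram struct {
	bounds []float64
	counts []uint64 // len(bounds)+1; the last entry is the overflow bucket
	total  uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Record increments the first bucket whose upper bound is >= v.
func (h *histogram) Record(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
	h.total++
}

// QuantileUpperBound returns the upper bound of the bucket that contains
// the q-th quantile, or +Inf if it falls into the overflow bucket.
func (h *histogram) QuantileUpperBound(q float64) float64 {
	if h.total == 0 {
		return math.NaN()
	}
	rank := uint64(math.Ceil(q * float64(h.total)))
	if rank == 0 {
		rank = 1
	}
	var seen uint64
	for i, c := range h.counts {
		seen += c
		if seen >= rank {
			if i == len(h.bounds) {
				return math.Inf(1)
			}
			return h.bounds[i]
		}
	}
	return math.Inf(1)
}

func main() {
	h := newHistogram(defaultBounds)
	for _, ms := range []float64{3, 7, 12, 40, 300} {
		h.Record(ms)
	}
	fmt.Println(h.QuantileUpperBound(0.5), h.QuantileUpperBound(0.99)) // prints: 25 500
}
```

The practical consequence is that quantile accuracy depends on the bucket boundaries, so it is worth checking that they bracket the latencies the service actually produces.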
Backend deployment: Jaeger + Grafana
Deploy the OTel Collector, Jaeger, Prometheus, and Grafana with Docker Compose:
# docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    ports:
      - "4317:4317" # gRPC
      - "4318:4318" # HTTP
    volumes:
      - ./otel-config.yaml:/etc/otelcol-contrib/config.yaml
  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686" # UI
      - "4317" # receives OTLP data from the Collector
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
  prometheus:
    image: prom/prometheus:v2.48.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      # Needed so Prometheus accepts the Collector's remote-write requests.
      - --web.enable-remote-write-receiver
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
OTel Collector configuration:
# otel-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
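The `batch` processor's two settings, `timeout` and `send_batch_size`, describe a common pattern: buffer items and flush when the buffer is full or when a timer fires, whichever comes first. The sketch below is a simplified standard-library model of that pattern, not the Collector's code. `runBatcher` and `batchSizes` are hypothetical names, and a plain ticker stands in for the Collector's timer handling.

```go
package main

import (
	"fmt"
	"time"
)

// runBatcher reads items from in and emits batches on out. A batch is flushed
// when it reaches maxSize items, when the ticker fires, or when in is closed.
func runBatcher(in <-chan string, out chan<- []string, maxSize int, timeout time.Duration) {
	defer close(out)
	ticker := time.NewTicker(timeout)
	defer ticker.Stop()

	var buf []string
	flush := func() {
		if len(buf) > 0 {
			out <- buf
			buf = nil
		}
	}
	for {
		select {
		case item, ok := <-in:
			if !ok {
				flush() // send whatever is left on shutdown
				return
			}
			buf = append(buf, item)
			if len(buf) >= maxSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

// batchSizes pushes n items through the batcher and returns each batch's size.
// The timeout is set to an hour so that only the size limit triggers flushes.
func batchSizes(n, maxSize int) []int {
	in := make(chan string)
	out := make(chan []string)
	go runBatcher(in, out, maxSize, time.Hour)
	go func() {
		for i := 0; i < n; i++ {
			in <- fmt.Sprintf("span-%d", i)
		}
		close(in)
	}()
	var sizes []int
	for batch := range out {
		sizes = append(sizes, len(batch))
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes(7, 3)) // prints: [3 3 1]
}
```

A larger batch size reduces the number of export requests, while the timeout bounds how long a span or data point can sit in the buffer during quiet periods.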
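Results
After adopting OTel, troubleshooting became noticeably faster:
- Traces in Jaeger: which services a request passed through, how long each step took, and where it failed are all visible at a glance.
- Trends in Grafana: request volume, error rate, and P99 latency sit on a dashboard, so anomalies stand out immediately.
- Trace and log correlation: print the trace_id in log lines, then jump from a log entry to the full trace in Jaeger.
A cross-service slow-request issue once took half a day to track down from logs. With tracing in place, it took 5 minutes to pin it on an N+1 query in the product service. We should have done this much sooner.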
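The trace and log correlation mentioned above can be done with a small `log/slog` handler that adds `trace_id` to every record. This is a minimal sketch under stated assumptions: `traceIDKey`, `traceHandler`, `demoLine`, and `hasTraceID` are hypothetical names, and the trace ID is read from a plain context value so the example runs with the standard library alone. In a real service the ID would come from `trace.SpanContextFromContext(ctx).TraceID().String()` in `go.opentelemetry.io/otel/trace`.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// traceIDKey is a hypothetical context key standing in for the OTel span context.
type traceIDKey struct{}

// traceHandler decorates another slog.Handler and adds trace_id when present.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{inner: h.inner.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{inner: h.inner.WithGroup(name)}
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.inner.Handle(ctx, r)
}

// demoLine logs one message with the given trace ID in the context (none if
// id is empty) and returns the JSON line that was written.
func demoLine(id string) string {
	ctx := context.Background()
	if id != "" {
		ctx = context.WithValue(ctx, traceIDKey{}, id)
	}
	var buf bytes.Buffer
	logger := slog.New(traceHandler{inner: slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(ctx, "order created")
	return strings.TrimSpace(buf.String())
}

// hasTraceID reports whether a JSON log line carries the given trace ID.
func hasTraceID(line, id string) bool {
	return strings.Contains(line, `"trace_id":"`+id+`"`)
}

func main() {
	fmt.Println(demoLine("4bf92f3577b34da6a3ce929d0e0e4736"))
	fmt.Println(demoLine(""))
}
```

The handler only sees the context when the context-aware logging methods are used, so call sites need `InfoContext`, `ErrorContext`, and so on with the request's context.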