January 25, 2026 · 11 min read

Microservices Observability with Spring Cloud: Beyond Logs and Metrics

Production-grade observability for Spring Boot microservices. Distributed tracing with Micrometer, OpenTelemetry, and Grafana for enterprise systems.

Spring Cloud · Microservices · Observability · OpenTelemetry · Distributed Tracing

It's 3:00 AM. Your checkout flow is returning 500 errors. You check the logs:

  • Payment Service: 200 OK
  • Inventory Service: 200 OK
  • Order Service: 200 OK

Every service claims success, yet customers can't complete purchases.

In monoliths, stack traces reveal the exact line of failure. In microservices, the failure is distributed across network calls, message queues, and database transactions. Traditional logs tell you what happened inside each service. Metrics tell you how much CPU or memory was used. Neither tells you where the request died across service boundaries.

Distributed Tracing reconstructs the complete request journey, turning isolated logs into a connected narrative. This guide implements production-grade observability using Spring Boot 3, Micrometer, OpenTelemetry, and Grafana Tempo.

The Vocabulary of Observability

Before we write code, we must agree on the language. Distributed tracing introduces three core concepts that act as the "connective tissue" for your request logs.

1. Trace

A Trace represents the entire journey of a request through your distributed system. Every service that touches the request logs the same Trace ID (e.g. 65b8e6a09ca6343), allowing you to reconstruct the full path.

Example: POST /checkout → API Gateway → Order Service → Payment Service → Inventory Service. Every service in the chain logs the same Trace ID.

2. Span

A Span is a single unit of work within that trace. It could be an HTTP request, a database query, or a Kafka message processing event. Spans have a start time, an end time, and metadata (tags).

3. Context Propagation

This is the mechanism that passes the Trace ID and Span ID from one service to another. In the past, we used headers like X-B3-TraceId. Today, the industry has standardized on the W3C Trace Context (traceparent header), ensuring that a Spring Boot service can talk to a .NET service and maintain the trace.
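
For reference, the traceparent header carries a version, the 32-hex-character trace ID, the 16-hex-character parent span ID, and trace flags; the example value below is the one used in the W3C specification:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01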

The Stack: Spring Boot 3's Observability Layer

Spring Boot 3 introduced a massive overhaul to observability, moving away from Spring Cloud Sleuth and embracing the Micrometer ecosystem.

  • Micrometer Tracing: This is the facade. Just as SLF4J is a facade for logging (Logback, Log4j2), Micrometer Tracing is a facade for tracing. You code against this API.
  • OpenTelemetry (OTel): This is the standard. It defines the protocol (OTLP) for how trace data is formatted and transmitted.
  • Grafana (Tempo/Loki): This is the storage and visualization layer. Tempo ingests the OTLP trace data, and Grafana renders the waterfall views that let you pinpoint latency.

Implementation Guide

Let's set up a production-grade tracing configuration.

1. Dependencies

We need the Micrometer facade, the bridge to OpenTelemetry, and the exporter to send data out.

<dependencies>
    <!-- The Facade -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-tracing-bridge-otel</artifactId>
    </dependency>
    <!-- The Exporter -->
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-exporter-otlp</artifactId>
    </dependency>
    <!-- Actuator for Metrics/Health -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
</dependencies>

2. Configuration

Configure application.yml to enable tracing and define where to send the data.

management:
  tracing:
    sampling:
      probability: 1.0 # Trace 100% of requests (use 0.1 or lower in prod)
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces # Your OTel Collector or Tempo
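
One related setting worth double-checking: the service name attached to each span comes from spring.application.name, so give every service a distinct name, or all spans will show up under the same default service name in Tempo.

spring:
  application:
    name: order-service # becomes the service name shown on spans in Grafana/Tempo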

3. Automatic vs. Manual Instrumentation

Spring Boot automatically instruments RestTemplate, WebClient, JdbcTemplate, and KafkaTemplate. However, sometimes you need to trace complex internal logic.

Manual Span Example:

Spring Boot auto-instruments HTTP calls, but internal business logic needs manual spans. Here's how to trace a multi-step payment flow:

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;

@Service
public class PaymentService {

    private final Tracer tracer;
    private final FraudDetectionService fraudService;
    private final StripeClient stripeClient;

    public PaymentService(Tracer tracer, 
                          FraudDetectionService fraudService,
                          StripeClient stripeClient) {
        this.tracer = tracer;
        this.fraudService = fraudService;
        this.stripeClient = stripeClient;
    }

    public PaymentResult processPayment(String orderId, BigDecimal amount) {
        // Create a custom span for the entire payment flow
        Span span = tracer.nextSpan().name("payment.process");
        span.tag("order.id", orderId);
        span.tag("payment.amount", amount.toString());
        
        try (var ws = tracer.withSpan(span.start())) {
            // This internal logic isn't auto-traced
            fraudService.checkForFraud(orderId); // Custom span inside
            var result = stripeClient.charge(amount); // HTTP call auto-traced
            span.tag("payment.status", result.getStatus());
            return result;
        } catch (FraudException e) {
            span.tag("fraud.detected", "true");
            span.error(e);
            throw e;
        } catch (Exception e) {
            span.error(e);
            throw new PaymentException("Payment failed", e);
        } finally {
            span.end();
        }
    }
}
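
If injecting a Tracer feels too low-level for simple cases, Micrometer also offers a declarative option. The sketch below is illustrative: it assumes spring-boot-starter-aop is on the classpath and registers an ObservedAspect bean explicitly (depending on your Spring Boot version this may not be auto-configured), and RefundService is a hypothetical example class.

import io.micrometer.observation.ObservationRegistry;
import io.micrometer.observation.annotation.Observed;
import io.micrometer.observation.aop.ObservedAspect;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
class ObservabilityConfig {

    // Enables @Observed; requires spring-boot-starter-aop on the classpath
    @Bean
    ObservedAspect observedAspect(ObservationRegistry registry) {
        return new ObservedAspect(registry);
    }
}

@Service
class RefundService {

    // Wraps the method in an observation, exported as a span (and a timer) named "refund.process"
    @Observed(name = "refund.process")
    public void processRefund(String orderId) {
        // business logic
    }
}

The trade-off: @Observed is less code, while the programmatic Tracer API gives you fine-grained control over tags and error handling, as in the PaymentService example above.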

Production Considerations

Sampling Strategies

Tracing every request is expensive. In production, you must balance visibility with cost and performance.

Probability-Based Sampling:

management:
  tracing:
    sampling:
      probability: 0.1  # Trace 10% of requests

When to sample 100%:

  • Development and staging environments
  • Low-traffic services (<100 requests/minute)
  • During incident investigation (temporarily)

When to sample 1-10%:

  • High-throughput services (>1000 requests/minute)
  • Production environments with cost constraints
  • Services with predictable behavior

Head-Based vs. Tail-Based Sampling:

  • Head-Based (what we configured): Decision made at the first service. Fast but may miss rare errors.
  • Tail-Based: Decision made after seeing the full trace. Requires an intermediary that supports tail sampling, typically an OpenTelemetry Collector in front of Grafana Tempo; a minimal collector sketch follows below.
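
As a rough sketch (assuming the OpenTelemetry Collector contrib distribution, which ships the tail_sampling processor), a policy that keeps every trace containing an error plus 10% of the rest could look like this:

processors:
  tail_sampling:
    decision_wait: 10s # buffer the full trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-10-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 10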

Performance Impact

With proper configuration, tracing adds <1ms of latency per request. The overhead comes from:

  • Creating span objects (negligible)
  • Adding HTTP headers (negligible)
  • Exporting trace data (use async exporters)

Advanced Patterns

Log Correlation

The true power of tracing is unlocked when you link it to your logs. By default, Spring Boot 3 injects the traceId and spanId into the MDC (Mapped Diagnostic Context).

Your logs will transform from this: INFO: Processing order 123

To this: INFO [order-service,65b8e6a09ca6343,65b8e6a09ca6343]: Processing order 123

This allows you to copy a Trace ID from Grafana Tempo, paste it into your log aggregator (Splunk, ELK, Loki), and see every log line generated by that specific request across all services.
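
If the IDs don't show up in your logs, the logging pattern may not include them. On Spring Boot 3.0/3.1 a common approach was to extend the level pattern as below (newer Spring Boot versions add a dedicated correlation pattern out of the box); treat this as a sketch to adapt to your own logging setup:

logging:
  pattern:
    # Prepends [app-name,traceId,spanId] to each log line, reading the IDs from the MDC
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"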

Baggage: Cross-Service Business Context

Baggage propagates business metadata across the entire trace without polluting method signatures. Common use cases: multi-tenancy, user context, feature flags.

// API Gateway: Set baggage from the auth token
@RestController
public class GatewayController {

    // Constructor-injected; JwtParser and OrderClient stand in for your own components
    private final Tracer tracer;
    private final JwtParser jwtParser;
    private final OrderClient orderService;

    public GatewayController(Tracer tracer, JwtParser jwtParser, OrderClient orderService) {
        this.tracer = tracer;
        this.jwtParser = jwtParser;
        this.orderService = orderService;
    }

    @GetMapping("/api/orders")
    public Orders getOrders(@RequestHeader("Authorization") String token) {
        var userId = jwtParser.extractUserId(token);
        var tenantId = jwtParser.extractTenantId(token);

        // This propagates to ALL downstream services
        tracer.createBaggage("user.id", userId);
        tracer.createBaggage("tenant.id", tenantId);

        return orderService.fetchOrders();
    }
}

// Order Service: Read baggage without explicit parameters
@Service
public class OrderService {

    private final Tracer tracer;
    private final OrderRepository orderRepository;

    public OrderService(Tracer tracer, OrderRepository orderRepository) {
        this.tracer = tracer;
        this.orderRepository = orderRepository;
    }

    public Orders fetchOrders() {
        // No userId parameter needed!
        String tenantId = tracer.getBaggage("tenant.id").get();
        return orderRepository.findByTenantId(tenantId);
    }
}

Warning: Baggage increases trace size. Only propagate essential business context (user ID, tenant ID). Avoid large JSON payloads.
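
One detail that is easy to miss: baggage keys typically need to be listed in configuration before they are propagated downstream and copied into the MDC. In Spring Boot 3 this is done with the management.tracing.baggage properties, roughly:

management:
  tracing:
    baggage:
      remote-fields: user.id, tenant.id # sent to downstream services as headers
      correlation:
        fields: user.id, tenant.id # also copied into the MDC for log correlation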

Common Pitfalls

1. Forgetting to close spans

// BAD: The span is started but never ended, so it is never reported (and its scope can leak)
var span = tracer.nextSpan().name("work").start();
doWork();

// GOOD: Scope the span with try-with-resources and always end it in a finally block
var span = tracer.nextSpan().name("work").start();
try (var ws = tracer.withSpan(span)) {
    doWork();
} finally {
    span.end();
}

2. Blocking request threads on span export

Spring Boot's OTLP auto-configuration exports spans through an asynchronous, batching span processor, so export latency stays off the request thread. The pitfall is replacing it with a synchronous SimpleSpanProcessor, or flushing the exporter inside request handling, which ties every user-facing request to the latency of your tracing backend. Keep the default batching behavior unless you have a specific reason not to.

3. Not propagating context to async tasks

// BAD: Loses trace context in @Async methods
@Async
public void processAsync() { /* No trace context */ }

// GOOD: Propagate the context to the async executor (see the sketch below);
// @NewSpan then gives the async work its own span within the same trace
@Async
@NewSpan
public void processAsync() { /* Trace context preserved */ }
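
A minimal sketch of the manual-propagation route, assuming Spring Framework 6.1+ (the version bundled with Spring Boot 3.2): configure the @Async executor with a ContextPropagatingTaskDecorator so the current observation context follows the task onto the worker thread.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.support.ContextPropagatingTaskDecorator;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncTracingConfig {

    @Bean(name = "taskExecutor")
    public ThreadPoolTaskExecutor taskExecutor() {
        var executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        // Copies the current trace/observation context onto the thread running the task
        executor.setTaskDecorator(new ContextPropagatingTaskDecorator());
        executor.initialize();
        return executor;
    }
}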

4. Over-instrumenting hot paths Don't create spans for methods called thousands of times per request (e.g., getters, simple validation). Reserve spans for meaningful operations: HTTP calls, database queries, message publishing.

OneCube Insight: At OneCube, we've seen engineering teams reduce their Mean Time To Resolution (MTTR) by over 60% simply by enforcing strict Trace ID propagation. When a developer can click a button and see the exact SQL query that caused a 2-second latency spike in a downstream service, the "blame game" stops, and fixing begins.

Conclusion

Distributed tracing transforms debugging from guesswork into precise investigation. When a production issue occurs, you can:

  1. Copy the Trace ID from the error log
  2. Paste it into Grafana Tempo
  3. See the exact service, method, and database query that failed

This reduces MTTR from hours to minutes.

Next Steps

  1. Add the dependencies to your Spring Boot 3 project
  2. Set up Grafana Tempo locally via Docker (e.g. docker run -d -p 4318:4318 grafana/tempo:latest plus a mounted Tempo configuration file, or start from the docker-compose example in the Tempo repository)
  3. Configure sampling to 100% in development
  4. Generate traffic and explore the traces in Grafana with Tempo added as a data source (Tempo's own API listens on port 3200)
  5. Implement custom spans for your business-critical operations

Start small. Enable tracing in one service, verify it works, then propagate to dependent services. Observability is not an all-or-nothing migration.

Frequently Asked Questions

Does distributed tracing hurt performance?

There is a small overhead, but it is generally negligible for most applications. The key is to use asynchronous exporters (like the OTLP exporter) and configure an appropriate sampling rate. For high-throughput systems, sampling 1-10% of requests is usually sufficient to detect patterns.

Can I use this with Spring Boot 2?

It is significantly harder. Spring Boot 2 relies on Spring Cloud Sleuth, which is no longer maintained. We strongly recommend upgrading to Spring Boot 3 to leverage the unified Micrometer Tracing ecosystem.

What is the difference between Zipkin and OTLP?

Zipkin is an older tracing format and collector. OTLP (OpenTelemetry Protocol) is the modern, vendor-neutral industry standard. While Spring Boot supports Zipkin, OTLP is preferred as it allows you to switch backends (Tempo, Jaeger, Honeycomb) without changing your code.

How do I trace database calls?

If you use Spring Data or JdbcTemplate, basic tracing is automatic. For deeper insights (like seeing the actual SQL query in the span), you may need to use a datasource proxy library or an OpenTelemetry-instrumented JDBC driver.
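
For example, the Datasource Micrometer project (a community library, not part of Spring Boot itself) ships a Spring Boot starter that wraps your DataSource and emits query-level spans. The coordinates below are illustrative; check the project for the current version and add it to the dependency:

<dependency>
    <groupId>net.ttddyy.observation</groupId>
    <artifactId>datasource-micrometer-spring-boot</artifactId>
</dependency>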

