
fix: propagate tracing context across async boundaries to prevent orphan spans #6528

Open
sinianluoye wants to merge 8 commits into lance-format:main from TheDeltaLab:fix/propagate-trace-context-upstream

Conversation

@sinianluoye

Summary

When Lance queries are executed through external callers (e.g. NAPI bindings) with distributed tracing enabled, many internal Rust spans appear as disconnected root traces in backends like Tempo instead of nesting under the caller's span tree.

Root causes:

  1. execute_plan() did not wrap the DataFusion plan with TracedExec, so lazy plan nodes captured Span::current() at poll time (often <none>) instead of at plan creation time.
  2. Numerous async boundaries (boxed futures, boxed streams, spawned tasks) did not propagate the current span.
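The first root cause can be modelled without any tracing library. The sketch below is std-only and illustrative: `CURRENT`, `in_span`, and both `lazy_node_*` functions are hypothetical stand-ins, not Lance or DataFusion APIs. It contrasts a lazy node that looks up the current span only when it is finally run with one that captures the span when it is built.

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical stand-in for the tracing crate's per-thread "current span".
    static CURRENT: RefCell<Option<&'static str>> = RefCell::new(None);
}

/// Runs `f` with `span` installed as current, then restores the previous span.
fn in_span<T>(span: Option<&'static str>, f: impl FnOnce() -> T) -> T {
    let prev = CURRENT.with(|c| c.replace(span));
    let out = f();
    CURRENT.with(|c| *c.borrow_mut() = prev);
    out
}

/// Reads whatever span is current right now.
fn current() -> Option<&'static str> {
    CURRENT.with(|c| *c.borrow())
}

/// Lazy node that consults the current span only when it is run (the buggy pattern).
fn lazy_node_poll_time() -> impl FnOnce() -> Option<&'static str> {
    || current()
}

/// Lazy node that captures the span when it is built and re-enters it when run.
fn lazy_node_creation_time() -> impl FnOnce() -> Option<&'static str> {
    let captured = current();
    move || in_span(captured, current)
}

/// Builds both nodes inside the caller's span, then runs them after it has exited.
fn demo_capture() -> (Option<&'static str>, Option<&'static str>) {
    let (late, early) = in_span(Some("caller_query"), || {
        (lazy_node_poll_time(), lazy_node_creation_time())
    });
    // The caller's span is no longer current at this point.
    (late(), early())
}
```

Calling `demo_capture()` yields `(None, Some("caller_query"))`: the node that waits until run time sees no span, which corresponds to the `<none>` parent described above.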

Changes:

  • Add TracedExec wrapper in execute_plan() to capture and propagate the caller's span through lazy DataFusion plan execution
  • Register LanceJoinSetTracer via set_join_set_tracer() so DataFusion's internal spawned tasks also carry the current span
  • Add FutureTracingExt trait (future_in_current_span, boxed_in_current_span, boxed_in_span) alongside the existing StreamTracingExt in lance-core
  • Add spawn_in_current_span / spawn_in_span helpers for tokio::spawn
  • Convert all tracing-sensitive execution paths to use these shared helpers: decoder, scheduler, fragment reader, scan, filtered read, take, FTS, pushdown scan, cache
  • Add tracing context propagation documentation
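A rough idea of what a future-wrapping helper does, as a std-only sketch. The method name `future_in_current_span` is borrowed from the list above, but the trait name `FutureSpanSketch`, the `InSpan` type, the thread-local `CURRENT`, and the tiny `block_on` executor are all assumptions made for illustration. The actual helpers in lance-core may be built quite differently, for example on top of the `tracing` crate's instrumentation support.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

thread_local! {
    // Hypothetical stand-in for the per-thread current span.
    static CURRENT: RefCell<Option<&'static str>> = RefCell::new(None);
}

fn current() -> Option<&'static str> {
    CURRENT.with(|c| *c.borrow())
}

/// Future wrapper that installs `span` as current for the duration of each poll.
struct InSpan<F> {
    span: Option<&'static str>,
    inner: Pin<Box<F>>,
}

impl<F: Future> Future for InSpan<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let span = self.span;
        let prev = CURRENT.with(|c| c.replace(span));
        let out = self.inner.as_mut().poll(cx);
        // Restore whatever span the polling thread had before.
        CURRENT.with(|c| *c.borrow_mut() = prev);
        out
    }
}

/// Illustrative extension trait: capture the span when the wrapper is created.
trait FutureSpanSketch: Future + Sized {
    fn future_in_current_span(self) -> InSpan<Self> {
        InSpan { span: current(), inner: Box::pin(self) }
    }
}
impl<F: Future> FutureSpanSketch for F {}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, sufficient for futures that never park.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

/// Creates two futures inside a span, then polls both after the span has exited.
fn demo_future() -> (Option<&'static str>, Option<&'static str>) {
    let prev = CURRENT.with(|c| c.replace(Some("caller_query")));
    let bare = async { current() };
    let inner = async { current() };
    let wrapped = inner.future_in_current_span();
    CURRENT.with(|c| *c.borrow_mut() = prev);
    (block_on(bare), block_on(wrapped))
}
```

Here `demo_future()` returns `(None, Some("caller_query"))`. Only the wrapped future still sees the caller's span once it is polled outside the scope where it was created.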

Test plan

  • 7 regression tests added in rust/lance/tests/query/tracing.rs using a SpanTreeRecorder layer that asserts every span is reachable from the test root (no orphans):
    • Multi-fragment scan (memory + local filesystem)
    • FTS with filter
    • Vector search (IVF-PQ)
    • Hybrid FTS + vector reranker
    • Local filesystem variants for scan, vector, and hybrid paths
  • cargo test -p lance --features slow_tests --test integration_tests query::tracing:: — 7/7 pass
  • cargo fmt --all -- --check — clean
  • cargo check -p lance --features slow_tests --tests — clean
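The "no orphans" assertion comes down to a reachability check over recorded parent links. The sketch below shows only that check, under the assumption that the recorder stores each span's parent by name. `find_orphans` and the span names are hypothetical, and the real `SpanTreeRecorder` layer is not reproduced here.

```rust
use std::collections::HashMap;

/// Returns, in sorted order, the spans that cannot reach `root` via parent links.
fn find_orphans<'a>(
    parents: &HashMap<&'a str, Option<&'a str>>,
    root: &str,
) -> Vec<&'a str> {
    let mut orphans = Vec::new();
    for &span in parents.keys() {
        let mut cur = Some(span);
        let mut steps = 0;
        let mut reached = false;
        while let Some(name) = cur {
            if name == root {
                reached = true;
                break;
            }
            // Guard against accidental cycles in the recorded links.
            if steps > parents.len() {
                break;
            }
            steps += 1;
            cur = parents.get(name).copied().flatten();
        }
        if !reached {
            orphans.push(span);
        }
    }
    orphans.sort();
    orphans
}

/// Example recording in which one subtree has lost its link to the test root.
fn demo_orphans() -> Vec<&'static str> {
    let mut parents = HashMap::new();
    parents.insert("test_root", None);
    parents.insert("scan", Some("test_root"));
    parents.insert("read_fragment", Some("scan"));
    parents.insert("decode_batch", None); // lost its parent across an async boundary
    parents.insert("decode_page", Some("decode_batch"));
    find_orphans(&parents, "test_root")
}
```

A regression test in this style would assert that the returned list is empty. In the example, `decode_batch` and `decode_page` are reported because neither can reach `test_root`.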

🤖 Generated with Claude Code

…han spans

When Lance queries execute through NAPI or other external callers with
distributed tracing enabled, many internal spans appear as disconnected
root traces in backends like Tempo instead of nesting under the caller's
span tree.

Root causes:
1. `execute_plan()` did not wrap the DataFusion plan with `TracedExec`,
   so lazy plan nodes captured `Span::current()` at poll time (often
   `<none>`) instead of at plan creation time.
2. Numerous async boundaries (boxed futures, boxed streams, spawned
   tasks) did not propagate the current span, causing work on those
   boundaries to lose its parent context.

Changes:
- Add `TracedExec` wrapper in `execute_plan()` to capture and propagate
  the caller's span through lazy DataFusion plan execution
- Register `LanceJoinSetTracer` via `set_join_set_tracer()` so DataFusion
  internal spawned tasks also carry the current span
- Add `FutureTracingExt` trait (`future_in_current_span`,
  `boxed_in_current_span`, `boxed_in_span`) alongside the existing
  `StreamTracingExt` in lance-core
- Add `spawn_in_current_span` / `spawn_in_span` helpers for
  `tokio::spawn`
- Convert all tracing-sensitive execution paths to use these helpers:
  decoder, scheduler, fragment reader, scan, filtered read, take, FTS,
  pushdown scan, cache
- Add 7 regression tests covering scan, FTS+filter, vector search,
  hybrid FTS+vector reranker, and local filesystem variants
- Add tracing context propagation documentation

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
github-actions bot added the bug label on Apr 15, 2026
chuang and others added 2 commits April 15, 2026 20:46
…ropagation

SpawnedTask::spawn (from DataFusion) aborts its inner task when dropped,
which is critical for scan cleanup — when a scan is abandoned, dropping
the stream must abort running fragment reads so ScanScheduler instances
can be freed. Our earlier change to spawn_in_current_span used a raw
JoinHandle that does NOT abort on drop, causing 999 tasks to deadlock
in test_scan_finishes_all_tasks.

Restore SpawnedTask::spawn with .future_in_current_span() to get both
abort-on-drop semantics AND proper tracing context propagation.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
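The abort-on-drop property can be illustrated with a std-only analogy. This is not how `SpawnedTask` or tokio works internally: the sketch uses an OS thread and a cooperative cancel flag, and `AbortOnDrop` is a hypothetical type. It only shows why dropping the handle has to stop the work, where a plain join handle would leave it running.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle that cancels its worker when dropped (thread-based analogy).
struct AbortOnDrop {
    cancel: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}

impl AbortOnDrop {
    /// Spawns a worker that runs until cancelled, then marks `finished`.
    fn spawn(finished: Arc<AtomicBool>) -> Self {
        let cancel = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&cancel);
        let handle = thread::spawn(move || {
            while !flag.load(Ordering::SeqCst) {
                thread::sleep(Duration::from_millis(1));
            }
            finished.store(true, Ordering::SeqCst);
        });
        AbortOnDrop { cancel, handle: Some(handle) }
    }
}

impl Drop for AbortOnDrop {
    fn drop(&mut self) {
        // Signal cancellation, then wait so resources are released deterministically.
        self.cancel.store(true, Ordering::SeqCst);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }
}

/// Drops the handle immediately and reports whether the worker has stopped.
fn demo_abort() -> bool {
    let finished = Arc::new(AtomicBool::new(false));
    let task = AbortOnDrop::spawn(Arc::clone(&finished));
    drop(task);
    finished.load(Ordering::SeqCst)
}
```

`demo_abort()` returns `true` because the drop both cancels and joins the worker. A bare `JoinHandle` dropped at the same point would detach the thread and leave its loop running.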
The span guard in spawn_cpu was held across the oneshot send, creating a
race where the caller could observe the span as "still active" after
receiving the result. Scope the guard so it drops before the send.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
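The scoping fix can be sketched with std primitives. `std::sync::mpsc` stands in for the oneshot channel, and `SpanGuard` is a hypothetical guard that flips a shared flag, so none of this is the actual `spawn_cpu` code. The point is that the guard lives in an inner block that ends before the send.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical guard: marks the span active on entry and inactive on drop.
struct SpanGuard(Arc<AtomicBool>);

impl SpanGuard {
    fn enter(active: &Arc<AtomicBool>) -> Self {
        active.store(true, Ordering::SeqCst);
        SpanGuard(Arc::clone(active))
    }
}

impl Drop for SpanGuard {
    fn drop(&mut self) {
        self.0.store(false, Ordering::SeqCst);
    }
}

/// Runs CPU work on another thread; returns the result and whether the span
/// still looked active at the moment the caller received it.
fn run_scoped() -> (u64, bool) {
    let active = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let worker_flag = Arc::clone(&active);
    let worker = thread::spawn(move || {
        let result = {
            let _guard = SpanGuard::enter(&worker_flag);
            (1..=10u64).sum::<u64>()
        }; // guard drops here, before the result is sent
        tx.send(result).unwrap();
    });
    let value = rx.recv().unwrap();
    let still_active = active.load(Ordering::SeqCst);
    worker.join().unwrap();
    (value, still_active)
}
```

`run_scoped()` returns `(55, false)`. Because the guard is dropped before `send`, the receiver can never observe the span as active once it holds the result.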
chuang and others added 4 commits April 15, 2026 22:54
Cover the previously untested TracedExec display/debug/with_new_children
methods and LanceJoinSetTracer trace_future/trace_block methods to
improve patch coverage from ~57% to ~95%+ in exec.rs.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Labels

bug, python


1 participant