Skip to content

refactor!: add NamespaceClientTableContext for cached namespace info#6523

Open
jackye1995 wants to merge 13 commits intolance-format:mainfrom
jackye1995:ns-cached-info
Open

refactor!: add NamespaceClientTableContext for cached namespace info#6523
jackye1995 wants to merge 13 commits intolance-format:mainfrom
jackye1995:ns-cached-info

Conversation

@jackye1995
Copy link
Copy Markdown
Contributor

@jackye1995 jackye1995 commented Apr 15, 2026

This is a second part of the namespace SDK refactoring.

When user uses namespace client to get table information via namespace.describe_table or namesapce.declare_table, it is very important that the information can be reused, so that you don't have to call the API repeatedly to gain the same information. Instead, you can get info in driver, distribute it, and use it directly in worker.

There are 3 information today falls in this category:

  1. table location
  2. catalog-provided storage options
  3. managed-versioning boolean

Today, this is implemented by passing the known location, managed_versioning boolean, and merge storage_options with user provided ones. And this creates 2 problems:

  1. code repetition: we have to pass these information repeatedly for all information, today it is done for all SDKs and for all engine integrations. This is prune to error for engineers and code agents that don't have much context on this feature.
  2. missing intent: we don't know if the caller has really passed in cached info, or if it is just passed in by mistake. e.g. if I have passed in location, but not managed verisoning, do we call the API again to reconfirm?

This PR does the following:

  • Introduce NamespaceClientTableContext (Rust struct, Python class, Java class) that holds cached describe_table/declare_table response data (location, storage_options, managed_versioning)
  • Context is passed all the way to Rust layer where all decisions about storage options merging and managed versioning are made — Python and Java make no decisions

We also remove deprecated Dataset.create/open overloads and createWithFfiSchema JNI path in Java along the way.

…s Rust, Python, Java

Introduce NamespaceClientTableContext struct/class that holds cached
describe_table/declare_table response data (location, storage_options,
managed_versioning). This is passed all the way to Rust where all
decisions about storage options merging and managed versioning are made.

- Rust: NamespaceClientTableContext in lance-namespace crate, with
  from_describe_table_response/from_declare_table_response constructors.
  DatasetBuilder::from_namespace_context and
  Dataset::write_into_namespace_context accept the context.
- Python: NamespaceClientTableContext class in namespace module. All APIs
  (dataset, write_dataset, fragments, file reader/writer/session, TF)
  accept namespace_client_table_context parameter. PyO3 binding extracts
  context fields in Rust.
- Java: NamespaceClientTableContext class with static factory methods.
  All builders (OpenDatasetBuilder, WriteDatasetBuilder, CommitBuilder,
  WriteFragmentBuilder) accept the context. JNI binding extracts context
  fields in Rust.

Also removes deprecated Dataset.create/open overloads and
createWithFfiSchema JNI path in Java.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added enhancement New feature or request python java labels Apr 15, 2026
@jackye1995 jackye1995 changed the title feat: add NamespaceClientTableContext for cached namespace info refactor!: add NamespaceClientTableContext for cached namespace info Apr 15, 2026
jackye1995 and others added 7 commits April 15, 2026 00:41
…_namespace

- Rename all `ctx` variables to `namespace_client_table_context` across
  Rust, Python bindings, Java JNI, and Java API
- Merge from_namespace_context into from_namespace with optional context
  parameter
- Merge write_into_namespace_context into write_into_namespace with
  optional context parameter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python and Java no longer call describe_table/declare_table. They pass
namespace_client + table_id + optional context to Rust, which resolves
via from_namespace/write_into_namespace internally.

- Python binding: _Dataset::new uses DatasetBuilder::from_namespace
  when namespace_client is provided. _write_dataset uses
  Dataset::write_into_namespace. URI is now optional.
- Python API: dataset() and write_dataset() no longer resolve location
- Java API: OpenDatasetBuilder and WriteDatasetBuilder no longer call
  describeTable/declareTable, pass null path to Rust
- Java JNI: still needs update to use from_namespace/write_into_namespace
  (next commit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BlockingDataset::open uses DatasetBuilder::from_namespace when
namespace_client is provided. create_dataset uses
Dataset::write_into_namespace. Storage options provider and commit
handler setup is now fully handled by the Rust core functions.

Removes manual LanceNamespaceStorageOptionsProvider and
ExternalManifestCommitHandler setup from JNI layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cargo fix removed LanceNamespaceStorageOptionsProvider and
StorageOptionsAccessor imports that are no longer needed since
from_namespace/write_into_namespace handle them internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ruff format

- Add None context parameter to from_namespace/write_into_namespace
  callers in lance-namespace-impls and lance-namespace-datafusion
- Re-add deprecated Dataset.create/open Java convenience methods that
  delegate to new paths (tests depend on them)
- Re-add createWithFfiSchema JNI for schema-only deprecated path
- Fix ruff formatting in Python
- Fix Java checkstyle line length

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove Preconditions.checkNotNull(path) in Dataset.create since path
  can be null when namespace is used
- Remove unused DescribeTableRequest import from Python __init__.py
- Move LanceNamespace/NamespaceClientTableContext to TYPE_CHECKING block
- Fix ruff import sorting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Path can be null when namespace_client resolves location. Remove
checkNotNull(path) from Dataset.open and handle null JString in
inner_open_native.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 84.55882% with 21 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset.rs 76.92% 12 Missing and 3 partials ⚠️
rust/lance/src/dataset/builder.rs 84.00% 2 Missing and 2 partials ⚠️
rust/lance-namespace/src/context.rs 90.00% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

jackye1995 and others added 2 commits April 15, 2026 08:16
…r, fix schema-only params

- Always create StorageOptionsProvider when namespace is provided, not
  only when storage_options are present (fixes credential refresh test)
- Set commit_handler on DatasetBuilder for APPEND/OVERWRITE path so
  the opened dataset has the correct handler (fixes managed versioning
  for non-CREATE modes)
- Pass user's WriteParams to schema-only create path (fixes data
  storage version not being forwarded)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Schema-only empty dataset creation is only available through the
deprecated Dataset.create(allocator, path, schema, params) method.
The WriteDatasetBuilder now requires reader() or stream() as data
source.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jackye1995 and others added 3 commits April 15, 2026 08:32
These are valid API methods, not deprecated. Remove @deprecated
annotations and fix testGetLanceFileFormatVersion to use
Dataset.create directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
with_read_params overwrites builder.options including the
storage_options_accessor set by from_namespace. For namespace path,
apply read params individually to avoid overwriting the accessor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…open

- PyLanceNamespace was missing declare_table delegation, causing
  "not implemented" error when Rust calls it on custom Python namespaces
- JNI BlockingDataset::open applies read params individually to avoid
  overwriting storage_options_accessor set by from_namespace

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review (from Claude Code)

This PR introduces NamespaceClientTableContext — a lightweight struct that caches namespace metadata (location, storage_options, managed_versioning) to avoid redundant describe_table()/declare_table() calls. It threads this context through Rust, Python, and Java layers.

Overall: The core design is sound — extracting a cacheable struct from namespace responses and threading it through is a clean approach. The refactoring significantly reduces duplication in write_into_namespace() (the old code duplicated the entire credential setup between Create and Append/Overwrite branches). There are a few issues worth addressing before merge.

P0: Bugs

1. namespace_client_table_context not forwarded to Rust in write_fragments()
python/python/lance/fragment.py:1196-1211 — The namespace_client_table_context parameter is accepted (line 1060) but never passed to _write_fragments() / _write_fragments_transaction(). This means the caching optimization is silently broken for the write_fragments API.

2. Wrong sentinel type in LanceDataset.__init__
python/python/lance/dataset.py:670self._namespace_client_table_context = False should be None. The type hint is Optional[NamespaceClientTableContext] but the sentinel is a boolean. This False will propagate through __copy__ and could cause subtle type-confusion bugs downstream (e.g., if context: vs if context is not None:).

P1: Logic Issues

3. Java builder validation has a gap
java/src/main/java/org/lance/OpenDatasetBuilder.java:171:

boolean hasNamespaceClient = namespaceClient != null || tableId != null;

This is OR not AND, so hasNamespaceClient is true if either is set. Then at line 177:

if (hasContext && !hasNamespaceClient)

This means providing a context with neither namespaceClient nor tableId is rejected, but providing context with only one of them passes validation. The subsequent null checks at lines 185-194 catch this at runtime, but the error path is confusing — you get "tableId is set but namespaceClient is missing" instead of a message about context requiring both. Same issue in WriteDatasetBuilder.

4. with_read_params workaround is fragile
python/src/dataset.rs (commit c3d409b) — The fix manually applies individual ReadParams fields to avoid with_read_params clobbering the namespace accessor. But this means any future ReadParams fields added to with_read_params won't be applied on the namespace path unless someone remembers to update this block too. Consider instead fixing with_read_params in the Rust builder to preserve the existing accessor when the incoming ReadParams doesn't set one (i.e., merge rather than overwrite), which would eliminate this divergence.

P2: Minor

5. NamespaceClientTableContext constructor allows empty location
java/src/main/java/org/lance/NamespaceClientTableContext.java:60 — The constructor checks location != null but not !location.isEmpty(), while the factory methods (fromDescribeTableResponse, fromDeclareTableResponse) do check for empty. An empty location string would silently pass the constructor and fail later with a confusing error.

6. Always-create provider when no storage options are vended
rust/lance/src/dataset/builder.rs:135-142 — The new code always creates a LanceNamespaceStorageOptionsProvider and StorageOptionsAccessor even when the namespace doesn't vend credentials (storage_options is None). The old code only did this when credentials were present. This is a minor behavioral change — it means every open will call describe_table() on credential refresh even for namespaces that don't vend credentials. Probably fine functionally but worth confirming intent.

7. Missing validation in fragment.py
python/python/lance/fragment.py:1166-1173 — No validation that namespace_client_table_context requires namespace_client and table_id to also be set. Compare to dataset.py:197-201 which does validate this.

Nits

  • The owned_context / shadowed namespace_client_table_context pattern in from_namespace() and write_into_namespace() is a clean way to handle the owned-vs-borrowed duality, nice.
  • Commit c3d409b is a fix-on-fix — consider squashing into the parent commit before merge for a cleaner history.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants