refactor!: add NamespaceClientTableContext for cached namespace info#6523
refactor!: add NamespaceClientTableContext for cached namespace info#6523jackye1995 wants to merge 13 commits intolance-format:mainfrom
Conversation
…s Rust, Python, Java Introduce NamespaceClientTableContext struct/class that holds cached describe_table/declare_table response data (location, storage_options, managed_versioning). This is passed all the way to Rust where all decisions about storage options merging and managed versioning are made. - Rust: NamespaceClientTableContext in lance-namespace crate, with from_describe_table_response/from_declare_table_response constructors. DatasetBuilder::from_namespace_context and Dataset::write_into_namespace_context accept the context. - Python: NamespaceClientTableContext class in namespace module. All APIs (dataset, write_dataset, fragments, file reader/writer/session, TF) accept namespace_client_table_context parameter. PyO3 binding extracts context fields in Rust. - Java: NamespaceClientTableContext class with static factory methods. All builders (OpenDatasetBuilder, WriteDatasetBuilder, CommitBuilder, WriteFragmentBuilder) accept the context. JNI binding extracts context fields in Rust. Also removes deprecated Dataset.create/open overloads and createWithFfiSchema JNI path in Java. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_namespace - Rename all `ctx` variables to `namespace_client_table_context` across Rust, Python bindings, Java JNI, and Java API - Merge from_namespace_context into from_namespace with optional context parameter - Merge write_into_namespace_context into write_into_namespace with optional context parameter Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python and Java no longer call describe_table/declare_table. They pass namespace_client + table_id + optional context to Rust, which resolves via from_namespace/write_into_namespace internally. - Python binding: _Dataset::new uses DatasetBuilder::from_namespace when namespace_client is provided. _write_dataset uses Dataset::write_into_namespace. URI is now optional. - Python API: dataset() and write_dataset() no longer resolve location - Java API: OpenDatasetBuilder and WriteDatasetBuilder no longer call describeTable/declareTable, pass null path to Rust - Java JNI: still needs update to use from_namespace/write_into_namespace (next commit) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BlockingDataset::open uses DatasetBuilder::from_namespace when namespace_client is provided. create_dataset uses Dataset::write_into_namespace. Storage options provider and commit handler setup is now fully handled by the Rust core functions. Removes manual LanceNamespaceStorageOptionsProvider and ExternalManifestCommitHandler setup from JNI layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cargo fix removed LanceNamespaceStorageOptionsProvider and StorageOptionsAccessor imports that are no longer needed since from_namespace/write_into_namespace handle them internally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ruff format - Add None context parameter to from_namespace/write_into_namespace callers in lance-namespace-impls and lance-namespace-datafusion - Re-add deprecated Dataset.create/open Java convenience methods that delegate to new paths (tests depend on them) - Re-add createWithFfiSchema JNI for schema-only deprecated path - Fix ruff formatting in Python - Fix Java checkstyle line length Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove Preconditions.checkNotNull(path) in Dataset.create since path can be null when namespace is used - Remove unused DescribeTableRequest import from Python __init__.py - Move LanceNamespace/NamespaceClientTableContext to TYPE_CHECKING block - Fix ruff import sorting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Path can be null when namespace_client resolves location. Remove checkNotNull(path) from Dataset.open and handle null JString in inner_open_native. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…r, fix schema-only params - Always create StorageOptionsProvider when namespace is provided, not only when storage_options are present (fixes credential refresh test) - Set commit_handler on DatasetBuilder for APPEND/OVERWRITE path so the opened dataset has the correct handler (fixes managed versioning for non-CREATE modes) - Pass user's WriteParams to schema-only create path (fixes data storage version not being forwarded) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Schema-only empty dataset creation is only available through the deprecated Dataset.create(allocator, path, schema, params) method. The WriteDatasetBuilder now requires reader() or stream() as data source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3c374c5 to
f3c6676
Compare
These are valid API methods, not deprecated. Remove @deprecated annotations and fix testGetLanceFileFormatVersion to use Dataset.create directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
with_read_params overwrites builder.options including the storage_options_accessor set by from_namespace. For namespace path, apply read params individually to avoid overwriting the accessor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…open - PyLanceNamespace was missing declare_table delegation, causing "not implemented" error when Rust calls it on custom Python namespaces - JNI BlockingDataset::open applies read params individually to avoid overwriting storage_options_accessor set by from_namespace Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
westonpace
left a comment
There was a problem hiding this comment.
PR Review (from Claude Code)
This PR introduces NamespaceClientTableContext — a lightweight struct that caches namespace metadata (location, storage_options, managed_versioning) to avoid redundant describe_table()/declare_table() calls. It threads this context through Rust, Python, and Java layers.
Overall: The core design is sound — extracting a cacheable struct from namespace responses and threading it through is a clean approach. The refactoring significantly reduces duplication in write_into_namespace() (the old code duplicated the entire credential setup between Create and Append/Overwrite branches). There are a few issues worth addressing before merge.
P0: Bugs
1. namespace_client_table_context not forwarded to Rust in write_fragments()
python/python/lance/fragment.py:1196-1211 — The namespace_client_table_context parameter is accepted (line 1060) but never passed to _write_fragments() / _write_fragments_transaction(). This means the caching optimization is silently broken for the write_fragments API.
2. Wrong sentinel type in LanceDataset.__init__
python/python/lance/dataset.py:670 — self._namespace_client_table_context = False should be None. The type hint is Optional[NamespaceClientTableContext] but the sentinel is a boolean. This False will propagate through __copy__ and could cause subtle type-confusion bugs downstream (e.g., if context: vs if context is not None:).
P1: Logic Issues
3. Java builder validation has a gap
java/src/main/java/org/lance/OpenDatasetBuilder.java:171:
boolean hasNamespaceClient = namespaceClient != null || tableId != null;This is OR not AND, so hasNamespaceClient is true if either is set. Then at line 177:
if (hasContext && !hasNamespaceClient)This means providing a context with neither namespaceClient nor tableId is rejected, but providing context with only one of them passes validation. The subsequent null checks at lines 185-194 catch this at runtime, but the error path is confusing — you get "tableId is set but namespaceClient is missing" instead of a message about context requiring both. Same issue in WriteDatasetBuilder.
4. with_read_params workaround is fragile
python/src/dataset.rs (commit c3d409b) — The fix manually applies individual ReadParams fields to avoid with_read_params clobbering the namespace accessor. But this means any future ReadParams fields added to with_read_params won't be applied on the namespace path unless someone remembers to update this block too. Consider instead fixing with_read_params in the Rust builder to preserve the existing accessor when the incoming ReadParams doesn't set one (i.e., merge rather than overwrite), which would eliminate this divergence.
P2: Minor
5. NamespaceClientTableContext constructor allows empty location
java/src/main/java/org/lance/NamespaceClientTableContext.java:60 — The constructor checks location != null but not !location.isEmpty(), while the factory methods (fromDescribeTableResponse, fromDeclareTableResponse) do check for empty. An empty location string would silently pass the constructor and fail later with a confusing error.
6. Always-create provider when no storage options are vended
rust/lance/src/dataset/builder.rs:135-142 — The new code always creates a LanceNamespaceStorageOptionsProvider and StorageOptionsAccessor even when the namespace doesn't vend credentials (storage_options is None). The old code only did this when credentials were present. This is a minor behavioral change — it means every open will call describe_table() on credential refresh even for namespaces that don't vend credentials. Probably fine functionally but worth confirming intent.
7. Missing validation in fragment.py
python/python/lance/fragment.py:1166-1173 — No validation that namespace_client_table_context requires namespace_client and table_id to also be set. Compare to dataset.py:197-201 which does validate this.
Nits
- The
owned_context/ shadowednamespace_client_table_contextpattern infrom_namespace()andwrite_into_namespace()is a clean way to handle the owned-vs-borrowed duality, nice. - Commit c3d409b is a fix-on-fix — consider squashing into the parent commit before merge for a cleaner history.
🤖 Generated with Claude Code
This is a second part of the namespace SDK refactoring.
When user uses namespace client to get table information via
namespace.describe_tableornamesapce.declare_table, it is very important that the information can be reused, so that you don't have to call the API repeatedly to gain the same information. Instead, you can get info in driver, distribute it, and use it directly in worker.There are 3 information today falls in this category:
Today, this is implemented by passing the known location, managed_versioning boolean, and merge storage_options with user provided ones. And this creates 2 problems:
This PR does the following:
NamespaceClientTableContext(Rust struct, Python class, Java class) that holds cacheddescribe_table/declare_tableresponse data (location, storage_options, managed_versioning)We also remove deprecated
Dataset.create/openoverloads andcreateWithFfiSchemaJNI path in Java along the way.