feat: add Dataset.sample() with optional seed for deterministic sampling #6502

Open
beinan wants to merge 2 commits into lance-format:main from beinan:beinan/deterministic-sample-v2

Contributor

@beinan beinan commented Apr 13, 2026

Summary

  • Add a Dataset.sample() API to the Java JNI bindings for random row sampling, backed by Dataset::sample() in the Rust core
  • Add an optional seed parameter in both Rust and Java for deterministic sampling: the same seed on the same dataset state yields the same result
  • Useful for reproducible ML training/validation splits

Changes

  • rust/lance/src/dataset.rs: Add seed: Option<u64> param to Dataset::sample(); use StdRng::seed_from_u64 when seeded, rand::rng() otherwise
  • rust/lance/src/dataset/tests/dataset_io.rs: Add test_sample_deterministic_with_seed verifying same seed = same result, different seed = different result
  • rust/lance/src/index/vector/utils.rs, rust/lance/src/dataset/write/merge_insert.rs, rust/lance/benches/take.rs: Pass None for existing callers (backward compatible)
  • java/lance-jni/src/blocking_dataset.rs: Add nativeSample JNI function
  • java/src/main/java/org/lance/Dataset.java: Add sample(n, columns), sample(n, columns, fragmentIds), and sample(n, columns, fragmentIds, seed) overloads
  • java/src/test/java/org/lance/DatasetTest.java: Test basic sampling, fragment-filtered sampling, over-sampling, and deterministic sampling with seed
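
A minimal sketch of the seeded/unseeded split and the overload delegation described in the changes above. It is not the Lance implementation: `SampleOverloadsSketch` is a hypothetical class, `java.util.Random` stands in for the Rust `StdRng`, the `columns` and `fragmentIds` parameters are left out because they do not affect seed handling, and the `Optional<Long>` seed type and the clamping of over-sized requests are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical stand-in for a dataset: it samples row indices, not Lance record batches. */
final class SampleOverloadsSketch {
    private final int numRows;

    SampleOverloadsSketch(int numRows) {
        if (numRows < 0) throw new IllegalArgumentException("numRows must be >= 0");
        this.numRows = numRows;
    }

    /** Unseeded overload: forwards an empty seed, much as existing Rust callers pass None. */
    List<Integer> sample(int n) {
        return sample(n, Optional.empty());
    }

    /** Seeded overload: a present seed fixes the RNG stream, an empty one leaves it unseeded. */
    List<Integer> sample(int n, Optional<Long> seed) {
        if (n < 0) throw new IllegalArgumentException("n must be >= 0");
        // java.util.Random plays the role of StdRng::seed_from_u64 / rand::rng() here.
        Random rng = seed.isPresent() ? new Random(seed.get()) : new Random();

        List<Integer> indices = new ArrayList<>(numRows);
        for (int i = 0; i < numRows; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rng);

        // Assumption: asking for more rows than exist returns every row.
        List<Integer> picked = new ArrayList<>(indices.subList(0, Math.min(n, numRows)));
        Collections.sort(picked);
        return picked;
    }

    public static void main(String[] args) {
        SampleOverloadsSketch dataset = new SampleOverloadsSketch(100);
        System.out.println(dataset.sample(5, Optional.of(42L))); // same list on every run
        System.out.println(dataset.sample(5));                   // varies from run to run
    }
}
```

Keeping the unseeded overload as a thin wrapper means existing call sites keep their behavior, which mirrors passing `None` from the existing Rust callers.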

Test plan

  • Rust: test_sample_deterministic_with_seed — same seed = identical RecordBatch, different seed = different result
  • Rust: existing test_sample_with_fragment_ids, test_sample_with_empty_fragment_ids_rejected, test_sample_with_unknown_fragment_ids_rejected pass with None seed
  • Java: testSample — basic sample, fragment-filtered sample, over-sample, deterministic seed verification
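
The two properties in the test plan can be illustrated with a small standalone check. This is a sketch only: `SeedDeterminismCheck` and its `sample` helper are hypothetical, they use `java.util.Random` rather than the Lance API, and they compare index arrays instead of RecordBatches.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedDeterminismCheck {
    /** Draws n distinct indices from [0, numRows) with a seeded RNG, sorted ascending. */
    static int[] sample(int numRows, int n, long seed) {
        // Clamp so that distinct() can always be satisfied by the infinite stream.
        int limit = Math.min(n, numRows);
        if (limit <= 0) {
            return new int[0];
        }
        return new Random(seed).ints(0, numRows).distinct().limit(limit).sorted().toArray();
    }

    public static void main(String[] args) {
        int[] first = sample(1000, 10, 42L);
        int[] again = sample(1000, 10, 42L);
        int[] other = sample(1000, 10, 43L);
        System.out.println("same seed matches:  " + Arrays.equals(first, again));
        System.out.println("other seed matches: " + Arrays.equals(first, other));
    }
}
```

The same-seed check holds by construction. The different-seed check only holds with high probability, and it cannot hold when the requested count covers the whole dataset, since every seed then returns all rows; a test of this kind needs a dataset noticeably larger than the sample size.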

🤖 Generated with Claude Code

Add Dataset.sample() API to both Rust core and Java JNI, with an
optional seed parameter for reproducible random sampling. When a seed
is provided, uses StdRng::seed_from_u64 for deterministic index
selection. Without a seed, sampling is non-deterministic using OS
entropy. Useful for reproducible ML training/validation splits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement and java labels Apr 13, 2026

codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset.rs | 87.50% | 1 Missing ⚠️ |
| rust/lance/src/index/vector/utils.rs | 50.00% | 1 Missing ⚠️ |
