
Add MemorySegment-based RAVV for Fvec files #601

Closed
ashkrisk wants to merge 3 commits into datastax:main from ashkrisk:fvec-ravv


@ashkrisk ashkrisk commented Jan 21, 2026

JVector provides methods for reading fvec and ivec files in jvector-examples/.../SiftLoader. These methods load every vector into memory and return them as a List, so they fail when the total size of the vectors exceeds the available memory. As a result, consuming applications like BenchYAML cannot work with larger-than-memory datasets.

This PR adds FvecSegmentReader to jvector-native as an experimental API. In the future, it's possible to integrate this with MultiFileDataSource to allow benchmarking of larger-than-memory datasets.
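To illustrate the general idea, here is a minimal sketch of random access over a memory-mapped fvec file using the `java.lang.foreign` API. It is not the `FvecSegmentReader` from this PR: the class name `FvecSegmentSketch` and its methods are hypothetical, it assumes JDK 22 or newer, and it assumes the usual fvec layout in which every record is a 4-byte little-endian dimension followed by that many 4-byte floats, with all records sharing one dimension.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Objects;

// Hypothetical sketch (requires JDK 22+); not the FvecSegmentReader added by this PR.
final class FvecSegmentSketch implements AutoCloseable {
    // fvec data is little-endian; unaligned layouts avoid alignment assumptions.
    private static final ValueLayout.OfInt INT_LE =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);
    private static final ValueLayout.OfFloat FLOAT_LE =
            ValueLayout.JAVA_FLOAT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    private final Arena arena;
    private final MemorySegment segment;
    private final int dimension;
    private final long recordBytes;
    private final int size;

    private FvecSegmentSketch(Arena arena, MemorySegment segment) {
        if (segment.byteSize() < Integer.BYTES) {
            throw new IllegalArgumentException("file too small to hold a vector");
        }
        int dim = segment.get(INT_LE, 0); // dimension header of the first record
        if (dim <= 0) {
            throw new IllegalArgumentException("invalid dimension: " + dim);
        }
        long rec = Integer.BYTES + (long) dim * Float.BYTES;
        if (segment.byteSize() % rec != 0) {
            throw new IllegalArgumentException("file size is not a multiple of the record size");
        }
        this.arena = arena;
        this.segment = segment;
        this.dimension = dim;
        this.recordBytes = rec;
        this.size = Math.toIntExact(segment.byteSize() / rec);
    }

    static FvecSegmentSketch open(Path path) throws IOException {
        // A shared arena lets any thread read the mapping until close() is called.
        Arena arena = Arena.ofShared();
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            return new FvecSegmentSketch(arena, seg);
        } catch (IOException | RuntimeException e) {
            arena.close(); // release the mapping if validation or mapping failed
            throw e;
        }
    }

    int size() { return size; }

    int dimension() { return dimension; }

    // Copies one vector out of the mapping; only the touched pages need to be resident.
    float[] getVector(int index) {
        Objects.checkIndex(index, size);
        float[] out = new float[dimension];
        long offset = index * recordBytes + Integer.BYTES; // skip the per-record header
        MemorySegment.copy(segment, FLOAT_LE, offset, out, 0, dimension);
        return out;
    }

    @Override
    public void close() { arena.close(); }
}
```

Because every read uses an absolute offset, the sketch keeps no cursor state, so concurrent calls to `getVector` do not interfere with each other.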

Alternate approaches

  • Keep using SiftLoader.readFvecs. This is fundamentally limited by being unable to process larger-than-memory datasets.
  • Use the existing MemorySegmentReader for IO, through ReaderSupplierFactory.open(). This re-uses existing code and will automatically fall back to MMapReader on lower JDK versions. However, this makes the implementation a bit clunky since the RandomAccessReader interface is not thread-safe, which would force us to use a thread-safe MemorySegment in a defensive manner. Using the MemorySegment API directly makes the code cleaner and more self-contained.

Possible next steps

  • Use this in BenchYAML to add support for larger-than-memory datasets.
  • Add a fallback implementation that works on lower JDK versions.
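
One way a fallback for lower JDK versions could look is sketched below, purely as an illustration. It uses positional `FileChannel.read(ByteBuffer, long)` calls, which do not move the channel's position, instead of the MemorySegment API. The class name `FvecChannelSketch` is hypothetical, and the sketch assumes the same fixed-dimension, little-endian fvec layout; it is not part of this PR and does not use JVector's MMapReader.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Objects;

// Hypothetical fallback sketch for JDKs without java.lang.foreign; not part of this PR.
final class FvecChannelSketch implements AutoCloseable {
    private final FileChannel channel;
    private final int dimension;
    private final long recordBytes;
    private final int size;

    private FvecChannelSketch(FileChannel channel, int dimension, long recordBytes, int size) {
        this.channel = channel;
        this.dimension = dimension;
        this.recordBytes = recordBytes;
        this.size = size;
    }

    static FvecChannelSketch open(Path path) throws IOException {
        FileChannel ch = FileChannel.open(path, StandardOpenOption.READ);
        try {
            // The first 4 bytes hold the dimension of the first (and, by assumption, every) record.
            int dim = readFully(ch, 0, Integer.BYTES).getInt();
            if (dim <= 0) {
                throw new IllegalArgumentException("invalid dimension: " + dim);
            }
            long rec = Integer.BYTES + (long) dim * Float.BYTES;
            if (ch.size() % rec != 0) {
                throw new IllegalArgumentException("file size is not a multiple of the record size");
            }
            return new FvecChannelSketch(ch, dim, rec, Math.toIntExact(ch.size() / rec));
        } catch (IOException | RuntimeException e) {
            ch.close(); // do not leak the channel if the header is invalid
            throw e;
        }
    }

    // Reads exactly `length` bytes starting at `position` without touching the channel's own position.
    private static ByteBuffer readFully(FileChannel ch, long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining()) {
            if (ch.read(buf, position + buf.position()) < 0) {
                throw new EOFException("unexpected end of file");
            }
        }
        buf.flip();
        return buf;
    }

    int size() { return size; }

    int dimension() { return dimension; }

    float[] getVector(int index) throws IOException {
        Objects.checkIndex(index, size);
        long offset = index * recordBytes + Integer.BYTES; // skip the per-record header
        float[] out = new float[dimension];
        readFully(channel, offset, dimension * Float.BYTES).asFloatBuffer().get(out);
        return out;
    }

    @Override
    public void close() throws IOException { channel.close(); }
}
```

Each lookup costs a system call here, so this trades speed for portability; unlike a single `MappedByteBuffer`, it is not limited to mappings under 2 GiB.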

@ashkrisk ashkrisk marked this pull request as ready for review January 22, 2026 12:04
@ashkrisk ashkrisk changed the title Add MemorySegment-based reader for Fvec files Add MemorySegment-based RAVV for Fvec files Jan 22, 2026

@marianotepper marianotepper left a comment


LGTM. @jshook you should comment on this one as it intersects your own work

jshook commented Jan 26, 2026

@ashkrisk This is related to the work I showed you last week. I would like to get together on Tuesday or Wednesday to form a plan. I believe we can combine forces on common code.

I don't think there is any harm in merging this, but we may want to decide together what the best combined approach is. The library that I shared tackles this issue and several others in a cohesive way, including type-safe ways of accessing all vector data facets. Let's get together and hash out the details.


@jshook jshook left a comment


Can we have this integrated via the new DataSet and DataSetLoader path? This is a recently introduced modularization for this type of change.

@ashkrisk

After talking to @jshook I'm going to close this PR for now. The main reason for adding this in was to alleviate some of the issues I've been having manipulating large fvec files with JVector, but a more targeted solution like the one provided by the vectordata module in https://github.com/nosqlbench/nbdatatools might be more appropriate for inclusion into JVector's benchmarking system long term.

@ashkrisk ashkrisk closed this Jan 29, 2026