Add MemorySegment-based RAVV for Fvec files #601
ashkrisk wants to merge 3 commits into datastax:main
marianotepper
left a comment
LGTM. @jshook you should comment on this one as it intersects your own work
@ashkrisk This is related to the work I showed you last week. I would like to get together Tuesday or Wednesday to form a plan. I believe we can combine forces on common code. I don't think there is any harm in merging this, but we may want to decide together what the best combined approach is. The library that I shared tackles this issue and several others in a cohesive way, including type-safe ways of accessing all vector data facets. Let's get together and hash out the details.
jshook
left a comment
Can we have this integrated via the new DataSet and DataSetLoader path? This is a recently introduced modularization for this type of change.
After talking to @jshook I'm going to close this PR for now. The main reason for adding this in was to alleviate some of the issues I've been having manipulating large fvec files with JVector, but a more targeted solution like the one provided by the
JVector provides methods for reading fvec and ivec files in jvector-examples/.../SiftLoader. These methods return a List by loading all the vectors into memory. This doesn't work in cases where the total size of the vectors exceeds the available memory, which in turn means that consuming applications like BenchYAML cannot work with larger-than-memory datasets.

This PR adds FvecSegmentReader to jvector-native as an experimental API. In the future, it's possible to integrate this with MultiFileDataSource to allow benchmarking of larger-than-memory datasets.

Alternate approaches
- SiftLoader.readFvecs. This is fundamentally limited by being unable to process larger-than-memory datasets.
- ReaderSupplierFactory.open(). This re-uses existing code and will automatically fall back to MMapReader on lower JDK versions. However, this makes the implementation a bit clunky since the RandomAccessReader interface is not thread-safe, which would force us to use a thread-safe MemorySegment in a defensive manner. Using the MemorySegment API directly makes the code cleaner and more self-contained.

Possible next steps
- BenchYAML to add support for larger-than-memory datasets.
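
For readers who have not seen the approach, here is a minimal sketch of what reading an fvec file through the MemorySegment API can look like. It is not the FvecSegmentReader from this PR: the class name FvecSegmentSketch and its methods are hypothetical, it assumes JDK 22 or newer (where java.lang.foreign is final), and it assumes every vector in the file has the same dimension. It relies only on the standard fvec layout, in which each record is a 4-byte little-endian dimension followed by that many 4-byte little-endian floats.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Objects;

/** Hypothetical sketch of mmap-backed fvec access; not the class added by this PR. */
final class FvecSegmentSketch implements AutoCloseable {
    // fvec values are little-endian; unaligned layouts avoid alignment checks.
    private static final ValueLayout.OfInt LE_INT =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);
    private static final ValueLayout.OfFloat LE_FLOAT =
            ValueLayout.JAVA_FLOAT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    private final Arena arena;
    private final MemorySegment segment;
    private final int dimension;
    private final long recordBytes;
    private final int size;

    private FvecSegmentSketch(Arena arena, MemorySegment segment, int dimension) {
        this.arena = arena;
        this.segment = segment;
        this.dimension = dimension;
        this.recordBytes = Integer.BYTES + (long) dimension * Float.BYTES;
        this.size = Math.toIntExact(segment.byteSize() / recordBytes);
    }

    static FvecSegmentSketch open(Path path) throws IOException {
        Arena arena = Arena.ofShared(); // shared arena: any thread may read the mapping
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping outlives the channel; its lifetime is tied to the arena.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            if (seg.byteSize() < Integer.BYTES) {
                throw new IOException("empty or truncated fvec file");
            }
            int dim = seg.get(LE_INT, 0); // dimension of the first record
            if (dim <= 0) {
                throw new IOException("invalid dimension: " + dim);
            }
            long recordBytes = Integer.BYTES + (long) dim * Float.BYTES;
            if (seg.byteSize() % recordBytes != 0) {
                throw new IOException("file size is not a multiple of the record size");
            }
            return new FvecSegmentSketch(arena, seg, dim);
        } catch (IOException | RuntimeException e) {
            arena.close(); // release the mapping if validation fails
            throw e;
        }
    }

    int size() { return size; }

    int dimension() { return dimension; }

    /** Copies one vector out of the mapping; only the touched pages are loaded by the OS. */
    float[] getVector(int ordinal) {
        Objects.checkIndex(ordinal, size);
        long offset = ordinal * recordBytes + Integer.BYTES; // skip the per-record dimension
        float[] out = new float[dimension];
        MemorySegment.copy(segment, LE_FLOAT, offset, out, 0, dimension);
        return out;
    }

    @Override
    public void close() { arena.close(); }
}
```

Every read in this sketch uses an absolute offset and keeps no cursor state, so concurrent calls to getVector do not interfere with each other. That is the property the description contrasts with the RandomAccessReader route. A caller would open the file once with FvecSegmentSketch.open(path), fetch vectors by ordinal, and close the reader when done.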