
RFC #2 : Index Serialization and Backward Compatibility#807

Open
suhasjs wants to merge 1 commit into main from users/suhasjs/rfc-idx-version

Conversation

@suhasjs

@suhasjs suhasjs commented Mar 2, 2026

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

What does this implement/fix? Briefly explain your changes.

Summary

Introduces a unified serialization contract for DiskANN indices, replacing the current per-crate ad hoc approaches (SaveWith/LoadWith) with a single context-first, manifest-driven design.

Key Design Points

  • Version primitive in diskann-utils: simple (major, minor, patch) struct with SemVer compatibility checks.
  • Load / Save traits (async, context-first): each component declares a static VERSION, and load() dispatches to load_compatible() or load_legacy() based on the persisted version.
  • SerializationContext / DeserializationContext: orchestrate manifest navigation, per-component versioning, and artifact resolution via scoped/RAII APIs. Callers never deal with file paths directly.
  • Recursive UUID-metadata.json manifest: single JSON file with per-component versioning and files arrays. Avoids cascading version bumps when only one component evolves.
  • N-1 backward compatibility: guaranteed within the same major version. Dual-path loading (new manifest vs. legacy) with no required one-time migration.
  • File-per-artifact: each binary payload gets its own file with deterministic naming (UUID-lineage-key.data).
  • SaveValue / LoadValue: separate value-level traits for primitives, keeping generic containers (Vec<T>) uniform across primitive and composite types.
  • Optional preflight/summary: lightweight can_load and summary passes for runtime sizing before full materialization.

Any other comments?

Link to issue -- #737

Contributor

Copilot AI left a comment


Pull request overview

This PR adds an RFC proposing a unified, manifest-driven (context-first) serialization contract for DiskANN indices, including per-component versioning and an N-1 backward compatibility policy, to replace today’s ad hoc SaveWith/LoadWith patterns across crates.

Changes:

  • Introduces RFC #2 describing the proposed serialization architecture (manifest schema, context APIs, artifact naming, and versioning rules).
  • Defines proposed Version semantics and load/save dispatch model (load_compatible vs load_legacy) at a design/spec level.
  • Documents example manifests for both in-memory and disk-based indices, including per-component versions and file registries.


Comment on lines +3 to +8
| | |
|---|---|
| **Authors** | Suhas Jayaram Subramanya |
| **Contributors** | Mark Hildebrand |
| **Created** | 2026-01-15 |
| **Updated** | 2026-03-02 |

Copilot AI Mar 2, 2026


The RFC header table appears malformed: the first two rows are || | | / ||---|---| with leading ||, which renders inconsistently in Markdown and makes the metadata hard to read. Consider rewriting this as a standard 2-column Markdown table (single leading | per row) so it renders reliably on GitHub.

Comment on lines +61 to +73
/// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
/// Used to validate required N-1 compatibility.
pub fn is_prev_of(&self, other: &Version) -> bool {
    if self.major != other.major {
        return false;
    }

    if self.minor == other.minor {
        self.patch + 1 == other.patch
    } else {
        self.minor + 1 == other.minor && other.patch == 0
    }
}

Copilot AI Mar 2, 2026


Version::is_prev_of is underspecified/incorrect for an “immediate predecessor” concept. As written, it would report that 0.2.0 is the predecessor of 0.3.0 (it ignores self.patch entirely when minor increments), but SemVer doesn’t define a unique immediate predecessor across a minor bump. If the intent is “previous minor series” (N-1 minor) or “previous patch within the same minor”, consider renaming and redefining this predicate accordingly to avoid encoding an unimplementable concept.

Suggested change

Replace:

    /// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
    /// Used to validate required N-1 compatibility.
    pub fn is_prev_of(&self, other: &Version) -> bool {
        if self.major != other.major {
            return false;
        }
        if self.minor == other.minor {
            self.patch + 1 == other.patch
        } else {
            self.minor + 1 == other.minor && other.patch == 0
        }
    }

With:

    /// at the patch level within the same major and minor
    /// (e.g., 1.4.2 -> 1.4.3).
    /// Used to validate required N-1 patch compatibility.
    pub fn is_prev_of(&self, other: &Version) -> bool {
        self.major == other.major
            && self.minor == other.minor
            && self.patch + 1 == other.patch
    }

Unchanged context:

    /// Returns true if `self` is compatible with `other` by SemVer major.
    /// This allows loading any version <= `other` within the same major.
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self.major == other.major && self <= other
    }

Comment on lines +75 to +78
/// Returns true if `self` is compatible with `other` by SemVer major.
/// This allows loading any version <= `other` within the same major.
pub fn is_compatible_with(&self, other: &Version) -> bool {
    self.major == other.major && self <= other
}

Copilot AI Mar 2, 2026


The compatibility / dispatch rules are internally inconsistent: is_compatible_with (same major and <=) implies all older versions in the same major are “compatible”, which would route them to load_compatible, but Section 8 defines an N-1 policy and says older versions should go through load_legacy. Clarify whether load_compatible is meant to handle the entire supported backward range, or whether dispatch should treat versions older than N-1 as legacy even if same-major.

Suggested change

Replace:

    /// Returns true if `self` is compatible with `other` by SemVer major.
    /// This allows loading any version <= `other` within the same major.
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self.major == other.major && self <= other

With:

    /// Returns true if `self` is within the required compatibility window
    /// relative to `other` (the "current" version), i.e., either equal to
    /// `other` (N) or the immediate predecessor (N-1) within the same major.
    ///
    /// Older versions within the same major that are not N or N-1 are *not*
    /// considered compatible by this helper and should be handled via
    /// explicit legacy conversion (e.g., `load_legacy`).
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self == other || self.is_prev_of(other)

Comment on lines +678 to +687
### 8. Backward Compatibility Policy

This RFC proposes **minimum N-1 compatibility** within the same major:

| Code Version | Required Support | Optional Support |
|--------------|------------------|------------------|
| `0.2.x` | `0.2.x`, `0.1.x` | none |
| `0.3.x` | `0.3.x`, `0.2.x` | `0.1.x` via explicit converters |
| `1.4.x` | `1.4.x`, `1.3.x` | older `1.x` via explicit converters |


Copilot AI Mar 2, 2026


The document repeatedly refers to “SemVer major compatibility” (e.g., is_compatible_with and the N-1 policy), but the proposed rules treat 0.x as a stable major line (Section 8 explicitly supports 0.2.x loading 0.1.x). That’s a reasonable project-specific policy, but it diverges from SemVer’s usual pre-1.0 guidance where minor bumps may be breaking. Consider explicitly stating DiskANN’s compatibility policy for major=0 so readers don’t assume standard SemVer behavior.

@suhasjs
Author

suhasjs commented Mar 2, 2026

@microsoft-github-policy-service agree company="Microsoft"

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (9b2fd20) to head (47753eb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   90.66%   89.45%   -1.21%     
==========================================
  Files         432      432              
  Lines       79433    79452      +19     
==========================================
- Hits        72019    71075     -944     
- Misses       7414     8377     +963     
| Flag | Coverage Δ |
|------|------------|
| miri | 89.45% <ø> (-1.21%) ⬇️ |
| unittests | 89.30% <ø> (-1.34%) ⬇️ |

Flags with carried forward coverage won't be shown.
see 38 files with indirect coverage changes


@arrayka arrayka self-requested a review March 3, 2026 16:43
@harsha-simhadri
Contributor

At a high level, we should consider 3 aspects.

  • Backward compat to avoid having to regenerate existing indices at scale. @arrayka and @Yujia to sign off on this.
  • Whether it's easy to support new incoming features like attributes, filtered diskann, diversity, multi-vectors. @suri-kumkaran and Ravi to sign off on this.
  • Whether it is easy to work across Data products. Harsha and Suhas to look into this.

Contributor

@hildebrandmw hildebrandmw left a comment


Thanks Suhas! I sketched out a potential way to unify the saving of values and components in a way that hopefully helps keep the implementation easily auto-generatable.

One orthogonal issue brought up in offline discussion is how to approach a "standardized" representation for indices. This design provides a lot of flexibility, but schemas are largely ad-hoc. Can you expand on what you were thinking regarding JSON Schema (or any other ideas about using a shareable format).

7. If `T` is primitive, each element uses the value contract (`SaveValue` / `LoadValue`) under the element key.
8. If `T` is composite, each element uses the component contract (`Save` / `Load`) with normal version dispatch.
9. Field/key names SHOULD be supplied by the caller or derive macro expansion (for example, from field identifiers), not inferred at runtime from primitive values.

Contributor


I have a few concerns I will try to list and I think I have solutions for most of them.

Value vs Component Split

The split between Save and SaveValue is going to be problematic. First, it will require code to look something like

struct Foo {
    a: usize,
    b: usize,
}

struct Bar {
    foo: Foo,
    baz: usize,
}

impl Save for Bar {
    const VERSION = /*...*/;
    fn save(&self, ctx: ComponentScope<'_>) {
        self.foo.save(ctx);
        self.baz.save_value(ctx); // Different API
    }
}

The difference in calling API means that it will be impossible for a relatively simple macro to generate this code - it will need to have some awareness of types and then guess whether save or save_value should be used (macros do not have type information, just syntax).

Second, I don't know how we'd implement

impl<T> Save for Vec<T>
where
    T: ?

We cannot have one implementation for T: Save and another for T: SaveValue as this will run afoul of coherence. Rust does not supply sufficient metaprogramming tools to introspect which definition applies and dispatch to that.

My main point is that primitive, sequence, and aggregate types should all use the same API.

There are also potential issues with mutating the context when creating component scopes: (1) what happens if a scope is created but then nothing is ever inserted, and (2) if the solution to (1) is to use Drop on the component, can we protect against std::mem::forget from ruining our day?

Potential Solution

This is what I was able to come up with to solve this problem.

Common Interface

Using saving as an example, we take a three-layered approach. The first layer entry point for saving any object is a free function:

fn save<T>(object: &T, context: Context<'_>) -> Result<Value>
where
    T: Persist,
{
    object.persist(context)
}

The free function uses the second level trait Persist (feel free to bike-shed - I am terrible at naming):

pub trait Persist {
    fn persist(&self, context: Context<'_>) -> Result<Value>;
}

The Persist trait is where the choice between "components" and "values" resides.

Components use the third-level Save trait:

pub trait Save {
    const VERSION: Version;
    fn save(&self, component: Component<'_>) -> Result<Value>;
}

Note that this takes a Component instead of a Context, and has the VERSION associated constant. The third layer is glued to the second layer with a blanket implementation:

impl<T> Persist for T
where
    T: Save,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        // Consume `context` to create a `Component`,
        // supplying the `VERSION` upon creation.
        self.save(context.component(T::VERSION))
    }
}

Concrete primitive types can implement Persist directly:

impl Persist for usize {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}

impl Persist for String {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}

Additionally, containers like Vec can implement Persist directly:

// Note that we use just a single bound `Persist` for all savable types.
//
// Now whether `T` is a "value" or a "component" - it will just work.
impl<T> Persist for Vec<T>
where
    T: Persist,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.sequence().set(self.iter())
    }
}

Now, the recursive implementation for components can look something like

impl Save for Bar {
    fn save(&self, component: Component<'_>) -> Result<Value> {
        // Note that the interface is now the same for all fields.
        //
        // This makes it *much* easier to automatically generate code
        // from a macro.
        let value = Value::new([
            ("foo", save(&self.foo, component.field("foo"))?),
            ("baz", save(&self.baz, component.field("baz"))?),
        ]);
        Ok(value)
    }
}

Importantly, the blanket impl<T: Save> Persist for T does not conflict with primitives nor with Vec, so they can coexist.

Dealing with Scopes

In the above design, instead of aggregating the work-in-progress serialized representation on the context - we instead return Values up the stack, where a Value represents the serialized state for some sub-object. This completely side-steps potential issues around accumulating the object mutably in the context.

Here is a compiling sketch generated with the help of AI tooling, so take it with a grain of salt. But the high-level ideas are mostly captured there.


| Crate | Mechanism | Notes |
|-------|-----------|-------|
| `diskann-disk` | `GraphHeader` with `GraphLayoutVersion` | Sector-aligned; validates disk layout. |
Contributor


There are two flavors of DiskANN indices:

  1. Disk-based index:
    The graph is stored on disk as a single blob and is accessed from the hot path during search. This enables reading all required data - vectors, adjacency lists, and associated metadata - in a single read operation. It’s designed for efficient random access with minimal I/O overhead.

  2. Generic index:
    Data is typically accessed via a get/set API, backed by either in-memory or persistent storage. Loading happens outside the hot path, and the format is not designed for random access. The load/save functionality here is mainly a convenience feature.

I’m not fully convinced yet that we need to converge the formats for these two flavors, since their intended usage differs. Put differently: if we do want to converge them, why wouldn’t we converge on the existing disk-based format?

### Goals

1. Reliably detect index format versions from root + component metadata
2. Load indices saved by the immediate SemVer predecessor (major version) using current code. May also support older versions via explicit legacy converters.
Contributor


I’m not sure why we need to promise support for the immediate SemVer predecessor (the previous major version) by default. This should be decided on a case‑by‑case basis.

In general, a DiskANN library with major version X should only guarantee that it can load data produced by version X. Support for loading data produced by older major versions can be added when it makes sense, but it should be an explicit, deliberate decision rather than a blanket promise.


### Dual-Path Legacy Loading vs. One-Time Migration

**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.
Contributor


We need to ensure that M365 is capable of detecting the format to support dual-path loading.

**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.

- *Pro:* No forced migration step; legacy indices remain loadable indefinitely.
- *Pro:* Writers always emit the new format, so the ecosystem naturally converges.
Contributor


M365 currently owns the logic for extracting existing raw vectors from the graph blob and using them as input for graph updates. We need to make sure this proposal does not break or regress that functionality.


| | |
|---|---|
| **Authors** | Suhas Jayaram Subramanya |
Contributor


@suhasjs , @harsha-simhadri , let’s iterate on this internally a bit, and then loop in the M365 tech leads for review.


Can we briefly describe how to handle the migration from the original index?
And what is the risk if the original non-versioned index stays?
