
RFC #2 : Index Serialization and Backward Compatibility#807

Open
suhasjs wants to merge 1 commit into main from users/suhasjs/rfc-idx-version

Conversation

@suhasjs

@suhasjs suhasjs commented Mar 2, 2026

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

What does this implement/fix? Briefly explain your changes.

Summary

Introduces a unified serialization contract for DiskANN indices, replacing the current per-crate ad hoc approaches (SaveWith/LoadWith) with a single context-first, manifest-driven design.

Key Design Points

  • Version primitive in diskann-utils: simple (major, minor, patch) struct with SemVer compatibility checks.
  • Load / Save traits (async, context-first): each component declares a static VERSION, and load() dispatches to load_compatible() or load_legacy() based on the persisted version.
  • SerializationContext / DeserializationContext: orchestrate manifest navigation, per-component versioning, and artifact resolution via scoped/RAII APIs. Callers never deal with file paths directly.
  • Recursive UUID-metadata.json manifest: single JSON file with per-component versioning and files arrays. Avoids cascading version bumps when only one component evolves.
  • N-1 backward compatibility: guaranteed within the same major version. Dual-path loading (new manifest vs. legacy) with no required one-time migration.
  • File-per-artifact: each binary payload gets its own file with deterministic naming (UUID-lineage-key.data).
  • SaveValue / LoadValue: separate value-level traits for primitives, keeping generic containers (Vec<T>) uniform across primitive and composite types.
  • Optional preflight/summary: lightweight can_load and summary passes for runtime sizing before full materialization.

Any other comments?

Link to issue -- #737

Contributor

Copilot AI left a comment


Pull request overview

This PR adds an RFC proposing a unified, manifest-driven (context-first) serialization contract for DiskANN indices, including per-component versioning and an N-1 backward compatibility policy, to replace today’s ad hoc SaveWith/LoadWith patterns across crates.

Changes:

  • Introduces RFC #2 describing the proposed serialization architecture (manifest schema, context APIs, artifact naming, and versioning rules).
  • Defines proposed Version semantics and load/save dispatch model (load_compatible vs load_legacy) at a design/spec level.
  • Documents example manifests for both in-memory and disk-based indices, including per-component versions and file registries.


Comment on lines +3 to +8
| | |
|---|---|
| **Authors** | Suhas Jayaram Subramanya |
| **Contributors** | Mark Hildebrand |
| **Created** | 2026-01-15 |
| **Updated** | 2026-03-02 |

Copilot AI Mar 2, 2026


The RFC header table appears malformed: the first two rows are || | | / ||---|---| with leading ||, which renders inconsistently in Markdown and makes the metadata hard to read. Consider rewriting this as a standard 2-column Markdown table (single leading | per row) so it renders reliably on GitHub.

Comment on lines +61 to +73
/// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
/// Used to validate required N-1 compatibility.
pub fn is_prev_of(&self, other: &Version) -> bool {
    if self.major != other.major {
        return false;
    }

    if self.minor == other.minor {
        self.patch + 1 == other.patch
    } else {
        self.minor + 1 == other.minor && other.patch == 0
    }
}

Copilot AI Mar 2, 2026


Version::is_prev_of is underspecified/incorrect for an “immediate predecessor” concept. As written, it would report that 0.2.0 is the predecessor of 0.3.0 (it ignores self.patch entirely when minor increments), but SemVer doesn’t define a unique immediate predecessor across a minor bump. If the intent is “previous minor series” (N-1 minor) or “previous patch within the same minor”, consider renaming and redefining this predicate accordingly to avoid encoding an unimplementable concept.

Suggested change

Replace:

    /// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
    /// Used to validate required N-1 compatibility.
    pub fn is_prev_of(&self, other: &Version) -> bool {
        if self.major != other.major {
            return false;
        }
        if self.minor == other.minor {
            self.patch + 1 == other.patch
        } else {
            self.minor + 1 == other.minor && other.patch == 0
        }
    }

With:

    /// at the patch level within the same major and minor
    /// (e.g., 1.4.2 -> 1.4.3).
    /// Used to validate required N-1 patch compatibility.
    pub fn is_prev_of(&self, other: &Version) -> bool {
        self.major == other.major
            && self.minor == other.minor
            && self.patch + 1 == other.patch
    }

Unchanged context:

    /// Returns true if `self` is compatible with `other` by SemVer major.
    /// This allows loading any version <= `other` within the same major.
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self.major == other.major && self <= other
    }

Comment on lines +75 to +78
/// Returns true if `self` is compatible with `other` by SemVer major.
/// This allows loading any version <= `other` within the same major.
pub fn is_compatible_with(&self, other: &Version) -> bool {
    self.major == other.major && self <= other
}

Copilot AI Mar 2, 2026


The compatibility / dispatch rules are internally inconsistent: is_compatible_with (same major and <=) implies all older versions in the same major are “compatible”, which would route them to load_compatible, but Section 8 defines an N-1 policy and says older versions should go through load_legacy. Clarify whether load_compatible is meant to handle the entire supported backward range, or whether dispatch should treat versions older than N-1 as legacy even if same-major.

Suggested change

Replace:

    /// Returns true if `self` is compatible with `other` by SemVer major.
    /// This allows loading any version <= `other` within the same major.
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self.major == other.major && self <= other

With:

    /// Returns true if `self` is within the required compatibility window
    /// relative to `other` (the "current" version), i.e., either equal to
    /// `other` (N) or the immediate predecessor (N-1) within the same major.
    ///
    /// Older versions within the same major that are not N or N-1 are *not*
    /// considered compatible by this helper and should be handled via
    /// explicit legacy conversion (e.g., `load_legacy`).
    pub fn is_compatible_with(&self, other: &Version) -> bool {
        self == other || self.is_prev_of(other)

Comment on lines +678 to +687
### 8. Backward Compatibility Policy

This RFC proposes **minimum N-1 compatibility** within the same major:

| Code Version | Required Support | Optional Support |
|--------------|------------------|------------------|
| `0.2.x` | `0.2.x`, `0.1.x` | none |
| `0.3.x` | `0.3.x`, `0.2.x` | `0.1.x` via explicit converters |
| `1.4.x` | `1.4.x`, `1.3.x` | older `1.x` via explicit converters |


Copilot AI Mar 2, 2026


The document repeatedly refers to “SemVer major compatibility” (e.g., is_compatible_with and the N-1 policy), but the proposed rules treat 0.x as a stable major line (Section 8 explicitly supports 0.2.x loading 0.1.x). That’s a reasonable project-specific policy, but it diverges from SemVer’s usual pre-1.0 guidance where minor bumps may be breaking. Consider explicitly stating DiskANN’s compatibility policy for major=0 so readers don’t assume standard SemVer behavior.

@suhasjs
Author

suhasjs commented Mar 2, 2026

@microsoft-github-policy-service agree company="Microsoft"

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.45%. Comparing base (9b2fd20) to head (47753eb).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   90.66%   89.45%   -1.21%     
==========================================
  Files         432      432              
  Lines       79433    79452      +19     
==========================================
- Hits        72019    71075     -944     
- Misses       7414     8377     +963     
| Flag | Coverage Δ |
|------|------------|
| miri | 89.45% <ø> (-1.21%) ⬇️ |
| unittests | 89.30% <ø> (-1.34%) ⬇️ |

Flags with carried forward coverage won't be shown.
see 38 files with indirect coverage changes


@arrayka arrayka self-requested a review March 3, 2026 16:43
@harsha-simhadri
Contributor

At a high level, we should consider 3 aspects.

  • Backward compat to avoid having to regenerate existing indices at scale. @arrayka and @Yujia to sign off on this.
  • Whether it's easy to support new incoming features like attributes, filtered diskann, diversity, multi-vectors. @suri-kumkaran and Ravi to sign off on this.
  • Whether it is easy to work across Data products. Harsha and Suhas to look into this.

Contributor

@hildebrandmw hildebrandmw left a comment


Thanks Suhas! I sketched out a potential way to unify the saving of values and components in a way that hopefully helps keep the implementation easily auto-generatable.

One orthogonal issue brought up in offline discussion is how to approach a "standardized" representation for indices. This design provides a lot of flexibility, but schemas are largely ad-hoc. Can you expand on what you were thinking regarding JSON Schema (or any other ideas about using a shareable format).

7. If `T` is primitive, each element uses the value contract (`SaveValue` / `LoadValue`) under the element key.
8. If `T` is composite, each element uses the component contract (`Save` / `Load`) with normal version dispatch.
9. Field/key names SHOULD be supplied by the caller or derive macro expansion (for example, from field identifiers), not inferred at runtime from primitive values.

Contributor


I have a few concerns I will try to list and I think I have solutions for most of them.

Value vs Component Split

The split between Save and SaveValue is going to be problematic. First, it will require code to look something like

struct Foo {
    a: usize,
    b: usize,
}

struct Bar {
    foo: Foo,
    baz: usize,
}

impl Save for Bar {
    const VERSION = /*...*/;
    fn save(&self, ctx: ComponentScope<'_>) {
        self.foo.save(ctx);
        self.baz.save_value(ctx); // Different API
    }
}

The difference in calling API means that it will be impossible for a relatively simple macro to generate this code - it will need to have some awareness of types and then guess whether save or save_value should be used (macros do not have type information, just syntax).

Second, I don't know how we'd implement

impl<T> Save for Vec<T>
where
    T: ?

We cannot have one implementation for T: Save and another for T: SaveValue as this will run afoul of coherence. Rust does not supply sufficient metaprogramming tools to introspect which definition applies and dispatch to that.

My main point is that primitive, sequence, and aggregate types should all use the same API.

There are also potential issues with mutating the context when creating component scopes: (1) what happens if a scope is created but then nothing is ever inserted, and (2) if the solution to (1) is to use Drop on the component, can we protect against std::mem::forget from ruining our day?

Potential Solution

This is what I was able to come up with to solve this problem.

Common Interface

Using saving as an example, we take a three-layered approach. The first layer entry point for saving any object is a free function:

fn save<T>(object: &T, context: Context<'_>) -> Result<Value>
where
    T: Persist,
{
    object.persist(context)
}

The free function uses the second level trait Persist (feel free to bike-shed - I am terrible at naming):

pub trait Persist {
    fn persist(&self, context: Context<'_>) -> Result<Value>;
}

The Persist trait is where the choice between "components" and "values" resides.

Components use the third-level Save trait:

pub trait Save {
    const VERSION: Version;
    fn save(&self, component: Component<'_>) -> Result<Value>;
}

Note that this takes a Component instead of a Context, and has the VERSION associated constant. The third layer is glued to the second layer with a blanket implementation:

impl<T> Persist for T
where
    T: Save,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        // Consume `context` to create a `Component`,
        // supplying the `VERSION` upon creation.
        self.save(context.component(T::VERSION))
    }
}

Concrete primitive types can implement Persist directly:

impl Persist for usize {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}

impl Persist for String {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}

Additionally, containers like Vec can implement Persist directly:

// Note that we use just a single bound `Persist` for all savable types.
//
// Now whether `T` is a "value" or a "component" - it will just work.
impl<T> Persist for Vec<T>
where
    T: Persist,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.sequence().set(self.iter())
    }
}

Now, the recursive implementation for components can look something like

impl Save for Bar {
    fn save(&self, component: Component<'_>) -> Result<Value> {
        // Note that the interface is now the same for all fields.
        //
        // This makes it *much* easier to automatically generate code
        // from a macro.
        let value = Value::new([
            ("foo", save(&self.foo, component.field("foo"))?),
            ("baz", save(&self.baz, component.field("baz"))?),
        ]);
        Ok(value)
    }
}

Importantly, the blanket impl<T: Save> Persist for T does not conflict with primitives nor with Vec, so they can coexist.

Dealing with Scopes

In the above design, instead of aggregating the work-in-progress serialized representation on the context - we instead return Values up the stack, where a Value represents the serialized state for some sub-object. This completely side-steps potential issues around accumulating the object mutably in the context.

Here is a compiling sketch generated with the help of AI tooling, so take it with a grain of salt. But the high-level ideas are mostly captured there.


| Crate | Mechanism | Notes |
|-------|-----------|-------|
| `diskann-disk` | `GraphHeader` with `GraphLayoutVersion` | Sector-aligned; validates disk layout. |
Contributor


There are two flavors of DiskANN indices:

  1. Disk-based index:
    The graph is stored on disk as a single blob and is accessed from the hot path during search. This enables reading all required data - vectors, adjacency lists, and associated metadata - in a single read operation. It’s designed for efficient random access with minimal I/O overhead.

  2. Generic index:
    Data is typically accessed via a get/set API, backed by either in-memory or persistent storage. Loading happens outside the hot path, and the format is not designed for random access. The load/save functionality here is mainly a convenience feature.

I’m not fully convinced yet that we need to converge the formats for these two flavors, since their intended usage differs. Put differently: if we do want to converge them, why wouldn’t we converge on the existing disk-based format?

### Goals

1. Reliably detect index format versions from root + component metadata
2. Load indices saved by the immediate SemVer predecessor (major version) using current code. May also support older versions via explicit legacy converters.
Contributor


I’m not sure why we need to promise support for the immediate SemVer predecessor (the previous major version) by default. This should be decided on a case‑by‑case basis.

In general, a DiskANN library with major version X should only guarantee that it can load data produced by version X. Support for loading data produced by older major versions can be added when it makes sense, but it should be an explicit, deliberate decision rather than a blanket promise.


### Dual-Path Legacy Loading vs. One-Time Migration

**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.
Contributor


We need to ensure that M365 is capable of detecting the format to support dual-path loading.

**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.

- *Pro:* No forced migration step; legacy indices remain loadable indefinitely.
- *Pro:* Writers always emit the new format, so the ecosystem naturally converges.
Contributor


M365 currently owns the logic for extracting existing raw vectors from the graph blob and using them as input for graph updates. We need to make sure this proposal does not break or regress that functionality.


| | |
|---|---|
| **Authors** | Suhas Jayaram Subramanya |
Contributor


@suhasjs , @harsha-simhadri , let’s iterate on this internally a bit, and then loop in the M365 tech leads for review.


Can we briefly describe how to handle the migration from the original index?
And what is the risk if the original non-versioned index stays?
