RFC #2: Index Serialization and Backward Compatibility #807
Conversation
Pull request overview
This PR adds an RFC proposing a unified, manifest-driven (context-first) serialization contract for DiskANN indices, including per-component versioning and an N-1 backward compatibility policy, to replace today’s ad hoc SaveWith/LoadWith patterns across crates.
Changes:
- Introduces RFC #2 describing the proposed serialization architecture (manifest schema, context APIs, artifact naming, and versioning rules).
- Defines proposed `Version` semantics and load/save dispatch model (`load_compatible` vs `load_legacy`) at a design/spec level.
- Documents example manifests for both in-memory and disk-based indices, including per-component versions and file registries.
```markdown
|| | |
||---|---|
| | **Authors** | Suhas Jayaram Subramanya |
| | **Contributors** | Mark Hildebrand |
| | **Created** | 2026-01-15 |
| | **Updated** | 2026-03-02 |
```
The RFC header table appears malformed: the first two rows are `|| | |` / `||---|---|`, with a leading `||`, which renders inconsistently in Markdown and makes the metadata hard to read. Consider rewriting this as a standard 2-column Markdown table (single leading `|` per row) so it renders reliably on GitHub.
```rust
/// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
/// Used to validate required N-1 compatibility.
pub fn is_prev_of(&self, other: &Version) -> bool {
    if self.major != other.major {
        return false;
    }

    if self.minor == other.minor {
        self.patch + 1 == other.patch
    } else {
        self.minor + 1 == other.minor && other.patch == 0
    }
}
```
`Version::is_prev_of` is underspecified/incorrect for an "immediate predecessor" concept. As written, it would report that `0.2.0` is the predecessor of `0.3.0` (it ignores `self.patch` entirely when the minor increments), but SemVer doesn't define a unique immediate predecessor across a minor bump. If the intent is "previous minor series" (N-1 minor) or "previous patch within the same minor", consider renaming and redefining this predicate accordingly to avoid encoding an unimplementable concept.
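To make the ambiguity concrete, here is a small self-contained sketch (a stand-in `Version` with plain integer fields, and `is_prev_of` copied from the proposed implementation above) showing that two different `0.2.x` versions both pass `is_prev_of` against `0.3.0`:

```rust
// Minimal stand-in for the RFC's Version type; `is_prev_of` matches the
// proposed implementation quoted above.
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

impl Version {
    fn is_prev_of(&self, other: &Version) -> bool {
        if self.major != other.major {
            return false;
        }
        if self.minor == other.minor {
            self.patch + 1 == other.patch
        } else {
            // `self.patch` is ignored here, so *every* 0.2.x matches 0.3.0.
            self.minor + 1 == other.minor && other.patch == 0
        }
    }
}

fn main() {
    let v030 = Version { major: 0, minor: 3, patch: 0 };
    // Both 0.2.0 and 0.2.5 claim to be the immediate predecessor of 0.3.0.
    assert!(Version { major: 0, minor: 2, patch: 0 }.is_prev_of(&v030));
    assert!(Version { major: 0, minor: 2, patch: 5 }.is_prev_of(&v030));
    println!("ok");
}
```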
Suggested change:

```diff
-    /// within the same major (e.g., 0.2.5 -> 0.3.0 or 1.4.2 -> 1.4.3).
-    /// Used to validate required N-1 compatibility.
-    pub fn is_prev_of(&self, other: &Version) -> bool {
-        if self.major != other.major {
-            return false;
-        }
-        if self.minor == other.minor {
-            self.patch + 1 == other.patch
-        } else {
-            self.minor + 1 == other.minor && other.patch == 0
-        }
-    }
+    /// at the patch level within the same major and minor
+    /// (e.g., 1.4.2 -> 1.4.3).
+    /// Used to validate required N-1 patch compatibility.
+    pub fn is_prev_of(&self, other: &Version) -> bool {
+        self.major == other.major
+            && self.minor == other.minor
+            && self.patch + 1 == other.patch
+    }

     /// Returns true if `self` is compatible with `other` by SemVer major.
     /// This allows loading any version <= `other` within the same major.
     pub fn is_compatible_with(&self, other: &Version) -> bool {
         self.major == other.major && self <= other
     }
 }
```
```rust
/// Returns true if `self` is compatible with `other` by SemVer major.
/// This allows loading any version <= `other` within the same major.
pub fn is_compatible_with(&self, other: &Version) -> bool {
    self.major == other.major && self <= other
```
The compatibility/dispatch rules are internally inconsistent: `is_compatible_with` (same major and `<=`) implies all older versions in the same major are "compatible", which would route them to `load_compatible`, but Section 8 defines an N-1 policy and says older versions should go through `load_legacy`. Clarify whether `load_compatible` is meant to handle the entire supported backward range, or whether dispatch should treat versions older than N-1 as legacy even if same-major.
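A small sketch of the inconsistency, assuming dispatch simply branches on `is_compatible_with` (the `Path` enum and `dispatch` function are illustrative stand-ins, not part of the RFC):

```rust
// Stand-in Version with the proposed same-major `<=` compatibility rule.
#[derive(PartialEq, PartialOrd)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

impl Version {
    fn is_compatible_with(&self, other: &Version) -> bool {
        self.major == other.major && self <= other
    }
}

// Hypothetical dispatch mirroring the load() model described in the RFC.
#[derive(Debug, PartialEq)]
enum Path {
    Compatible, // load_compatible()
    Legacy,     // load_legacy()
}

fn dispatch(persisted: &Version, current: &Version) -> Path {
    if persisted.is_compatible_with(current) {
        Path::Compatible
    } else {
        Path::Legacy
    }
}

fn main() {
    let current = Version { major: 0, minor: 3, patch: 0 };
    // 0.1.0 is older than the N-1 window (0.2.x), yet it still routes to
    // load_compatible under the same-major `<=` rule.
    let old = Version { major: 0, minor: 1, patch: 0 };
    assert_eq!(dispatch(&old, &current), Path::Compatible);
    println!("ok");
}
```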
Suggested change:

```diff
-/// Returns true if `self` is compatible with `other` by SemVer major.
-/// This allows loading any version <= `other` within the same major.
-pub fn is_compatible_with(&self, other: &Version) -> bool {
-    self.major == other.major && self <= other
+/// Returns true if `self` is within the required compatibility window
+/// relative to `other` (the "current" version), i.e., either equal to
+/// `other` (N) or the immediate predecessor (N-1) within the same major.
+///
+/// Older versions within the same major that are not N or N-1 are *not*
+/// considered compatible by this helper and should be handled via
+/// explicit legacy conversion (e.g., `load_legacy`).
+pub fn is_compatible_with(&self, other: &Version) -> bool {
+    self == other || self.is_prev_of(other)
```
```markdown
### 8. Backward Compatibility Policy

This RFC proposes **minimum N-1 compatibility** within the same major:

| Code Version | Required Support | Optional Support |
|--------------|------------------|------------------|
| `0.2.x` | `0.2.x`, `0.1.x` | none |
| `0.3.x` | `0.3.x`, `0.2.x` | `0.1.x` via explicit converters |
| `1.4.x` | `1.4.x`, `1.3.x` | older `1.x` via explicit converters |
```
The document repeatedly refers to "SemVer major compatibility" (e.g., `is_compatible_with` and the N-1 policy), but the proposed rules treat `0.x` as a stable major line (Section 8 explicitly supports `0.2.x` loading `0.1.x`). That's a reasonable project-specific policy, but it diverges from SemVer's usual pre-1.0 guidance, where minor bumps may be breaking. Consider explicitly stating DiskANN's compatibility policy for major = 0 so readers don't assume standard SemVer behavior.
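To illustrate the divergence, here is a sketch contrasting the RFC's same-major rule with Cargo-style caret semantics for `0.x` versions (`caret_same_series` is an illustrative approximation of Cargo's rule, not part of the RFC, and ignores the `0.0.x` case where even patch bumps are breaking):

```rust
#[derive(PartialEq, PartialOrd)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

// RFC rule: anything <= current within the same major is "compatible".
fn rfc_compatible(persisted: &Version, current: &Version) -> bool {
    persisted.major == current.major && persisted <= current
}

// Cargo-style caret rule (approximated): the leftmost nonzero component
// is the breaking one, so for 0.x a minor bump is breaking.
fn caret_same_series(a: &Version, b: &Version) -> bool {
    if a.major != b.major {
        return false;
    }
    if a.major == 0 {
        a.minor == b.minor // pre-1.0: minor bumps may be breaking
    } else {
        true
    }
}

fn main() {
    let old = Version { major: 0, minor: 1, patch: 9 };
    let current = Version { major: 0, minor: 2, patch: 0 };
    // The RFC treats 0.1.9 as loadable by 0.2.x code...
    assert!(rfc_compatible(&old, &current));
    // ...but standard pre-1.0 SemVer guidance treats 0.1 -> 0.2 as breaking.
    assert!(!caret_same_series(&old, &current));
    println!("ok");
}
```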
@microsoft-github-policy-service agree company="Microsoft"
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #807    +/-   ##
=========================================
- Coverage   90.66%   89.45%   -1.21%
=========================================
  Files         432      432
  Lines       79433    79452      +19
=========================================
- Hits        72019    71075     -944
- Misses       7414     8377     +963
```

Flags with carried forward coverage won't be shown.
At a high level, we should consider three aspects.
hildebrandmw
left a comment
Thanks Suhas! I sketched out a potential way to unify the saving of values and components in a way that hopefully helps keep the implementation easily auto-generatable.
One orthogonal issue brought up in offline discussion is how to approach a "standardized" representation for indices. This design provides a lot of flexibility, but schemas are largely ad hoc. Can you expand on what you were thinking regarding JSON Schema (or any other ideas about using a shareable format)?
```markdown
7. If `T` is primitive, each element uses the value contract (`SaveValue` / `LoadValue`) under the element key.
8. If `T` is composite, each element uses the component contract (`Save` / `Load`) with normal version dispatch.
9. Field/key names SHOULD be supplied by the caller or derive macro expansion (for example, from field identifiers), not inferred at runtime from primitive values.
```
I have a few concerns I will try to list and I think I have solutions for most of them.
Value vs Component Split
The split between Save and SaveValue is going to be problematic. First, it will require code to look something like
```rust
struct Foo {
    a: usize,
    b: usize,
}

struct Bar {
    foo: Foo,
    baz: usize,
}

impl Save for Bar {
    const VERSION: Version = /* ... */;

    fn save(&self, ctx: ComponentScope<'_>) {
        self.foo.save(ctx);
        self.baz.save_value(ctx); // Different API
    }
}
```

The difference in calling API means that it will be impossible for a relatively simple macro to generate this code - it will need to have some awareness of types and then guess whether `save` or `save_value` should be used (macros do not have type information, just syntax).
Second, I don't know how we'd implement

```rust
impl<T> Save for Vec<T>
where
    T: ?
```

We cannot have one implementation for `T: Save` and another for `T: SaveValue`, as this will run afoul of coherence. Rust does not supply sufficient metaprogramming tools to introspect which definition applies and dispatch to that.
My main point is that primitive, sequence, and aggregate types should all use the same API.
- There are potential issues with mutating the `context` when creating component scopes: (1) what happens if a scope is created but then nothing is ever inserted, and (2) if the solution to (1) is to use `Drop` on the component, can we protect against `std::mem::forget` from ruining our day?
Potential Solution
This is what I was able to come up with to solve this problem.
Common Interface
Using saving as an example, we take a three-layered approach. The first-layer entry point for saving any object is a free function:

```rust
fn save<T>(object: &T, context: Context<'_>) -> Result<Value>
where
    T: Persist,
{
    object.persist(context)
}
```

The free function uses the second-level trait `Persist` (feel free to bike-shed - I am terrible at naming):
```rust
pub trait Persist {
    fn persist(&self, context: Context<'_>) -> Result<Value>;
}
```

The `Persist` trait is where the choice between "components" and "values" resides.
Components use the third-level `Save` trait:

```rust
pub trait Save {
    const VERSION: Version;

    fn save(&self, component: Component<'_>) -> Result<Value>;
}
```

Note that this takes a `Component` instead of a `Context`, and has the `VERSION` associated constant. The third layer is glued to the second layer with a blanket implementation:
```rust
impl<T> Persist for T
where
    T: Save,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        // Consume `context` to create a `Component`,
        // supplying the `VERSION` upon creation.
        self.save(context.component(T::VERSION))
    }
}
```

Concrete primitive types can implement `Persist` directly:
```rust
impl Persist for usize {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}

impl Persist for String {
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.primitive().set(self)
    }
}
```

Additionally, containers like `Vec` can implement `Persist` directly:
```rust
// Note that we use just a single bound `Persist` for all savable types.
//
// Now whether `T` is a "value" or a "component" - it will just work.
impl<T> Persist for Vec<T>
where
    T: Persist,
{
    fn persist(&self, context: Context<'_>) -> Result<Value> {
        context.sequence().set(self.iter())
    }
}
```

Now, the recursive implementation for components can look something like:
```rust
impl Save for Bar {
    fn save(&self, component: Component<'_>) -> Result<Value> {
        // Note that the interface is now the same for all fields.
        //
        // This makes it *much* easier to automatically generate code
        // from a macro.
        let value = Value::new([
            ("foo", save(&self.foo, component.field("foo"))?),
            ("baz", save(&self.baz, component.field("baz"))?),
        ]);
        Ok(value)
    }
}
```

Importantly, the blanket `impl<T: Save> Persist for T` does not conflict with primitives nor with `Vec`, so they can coexist.
Dealing with Scopes
In the above design, instead of aggregating the work-in-progress serialized representation on the context - we instead return Values up the stack, where a Value represents the serialized state for some sub-object. This completely side-steps potential issues around accumulating the object mutably in the context.
Here is a compiling sketch generated with the help of AI tooling, so take it with a grain of salt. But the high-level ideas are mostly captured there.
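As a rough stand-in for the linked sketch, here is a minimal self-contained version of the layering with toy `Context`, `Component`, and `Value` types (all hypothetical; the real APIs would carry manifest and path state). One caveat: as written on stable Rust, a blanket `impl<T: Save> Persist for T` can conflict with the direct primitive impls under coherence (which performs no negative reasoning), so this sketch spells out the glue for `Bar` by hand:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

// Toy serialized representation; the real `Value` would be richer.
#[derive(Debug, PartialEq)]
enum Value {
    Primitive(String),
    Sequence(Vec<Value>),
    Map(Vec<(String, Value)>),
}

struct Context;   // stand-in: would carry manifest / path state
struct Component; // stand-in: a scope that records a VERSION

impl Context {
    fn component(self, _version: Version) -> Component {
        Component
    }
}

trait Persist {
    fn persist(&self, context: Context) -> Value;
}

trait Save {
    const VERSION: Version;
    fn save(&self, component: Component) -> Value;
}

// Primitives and containers implement `Persist` directly, so callers use
// one API regardless of whether a field is a "value" or a "component".
impl Persist for usize {
    fn persist(&self, _context: Context) -> Value {
        Value::Primitive(self.to_string())
    }
}

impl<T: Persist> Persist for Vec<T> {
    fn persist(&self, _context: Context) -> Value {
        Value::Sequence(self.iter().map(|t| t.persist(Context)).collect())
    }
}

struct Bar {
    foo: usize,
    baz: Vec<usize>,
}

impl Save for Bar {
    const VERSION: Version = Version { major: 0, minor: 1, patch: 0 };

    fn save(&self, _component: Component) -> Value {
        // The calling convention is identical for every field.
        Value::Map(vec![
            ("foo".to_string(), self.foo.persist(Context)),
            ("baz".to_string(), self.baz.persist(Context)),
        ])
    }
}

// Hand-written glue standing in for `impl<T: Save> Persist for T`.
impl Persist for Bar {
    fn persist(&self, context: Context) -> Value {
        self.save(context.component(Bar::VERSION))
    }
}

fn main() {
    let bar = Bar { foo: 7, baz: vec![1, 2] };
    println!("{:?}", bar.persist(Context));
}
```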
```markdown
| Crate | Mechanism | Notes |
|-------|-----------|-------|
| `diskann-disk` | `GraphHeader` with `GraphLayoutVersion` | Sector-aligned; validates disk layout. |
```
There are two flavors of DiskANN indices:

- **Disk-based index:** The graph is stored on disk as a single blob and is accessed from the hot path during search. This enables reading all required data - vectors, adjacency lists, and associated metadata - in a single read operation. It's designed for efficient random access with minimal I/O overhead.
- **Generic index:** Data is typically accessed via a get/set API, backed by either in-memory or persistent storage. Loading happens outside the hot path, and the format is not designed for random access. The load/save functionality here is mainly a convenience feature.
I’m not fully convinced yet that we need to converge the formats for these two flavors, since their intended usage differs. Put differently: if we do want to converge them, why wouldn’t we converge on the existing disk-based format?
```markdown
### Goals

1. Reliably detect index format versions from root + component metadata
2. Load indices saved by the immediate SemVer predecessor (major version) using current code. May also support older versions via explicit legacy converters.
```
I’m not sure why we need to promise support for the immediate SemVer predecessor (the previous major version) by default. This should be decided on a case‑by‑case basis.
In general, a DiskANN library with major version X should only guarantee that it can load data produced by version X. Support for loading data produced by older major versions can be added when it makes sense, but it should be an explicit, deliberate decision rather than a blanket promise.
```markdown
### Dual-Path Legacy Loading vs. One-Time Migration

**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.
```
We need to ensure that M365 is capable of detecting the format to support dual-path loading.
```markdown
**Chosen: dual-path loading with no required migration.** Readers detect the root format and dispatch to either the new context-first path or the legacy path.

- *Pro:* No forced migration step; legacy indices remain loadable indefinitely.
- *Pro:* Writers always emit the new format, so the ecosystem naturally converges.
```
M365 currently owns the logic for extracting existing raw vectors from the graph blob and using them as input for graph updates. We need to make sure this proposal does not break or regress that functionality.
```markdown
| | |
|---|---|
| **Authors** | Suhas Jayaram Subramanya |
```
@suhasjs , @harsha-simhadri , let’s iterate on this internally a bit, and then loop in the M365 tech leads for review.
Can we briefly describe how to handle the migration from the original index?
And what is the risk if the original non-versioned index stays?
Reference Issues/PRs
What does this implement/fix? Briefly explain your changes.
Summary
Introduces a unified serialization contract for DiskANN indices, replacing the current per-crate ad hoc approaches (`SaveWith` / `LoadWith`) with a single context-first, manifest-driven design.

Key Design Points

- `diskann-utils`: simple `(major, minor, patch)` `Version` struct with SemVer compatibility checks.
- `Load` / `Save` traits (async, context-first): each component declares a static `VERSION`, and `load()` dispatches to `load_compatible()` or `load_legacy()` based on the persisted version.
- `SerializationContext` / `DeserializationContext`: orchestrate manifest navigation, per-component versioning, and artifact resolution via scoped/RAII APIs. Callers never deal with file paths directly.
- `UUID-metadata.json` manifest: single JSON file with per-component versioning and `files` arrays. Avoids cascading version bumps when only one component evolves.
- N-1 backward compatibility: guaranteed within the same major version. Dual-path loading (new manifest vs. legacy) with no required one-time migration.
- Artifact naming: `UUID-lineage-key.data`.
- `SaveValue` / `LoadValue`: separate value-level traits for primitives, keeping generic containers (`Vec<T>`) uniform across primitive and composite types.
- `can_load` and summary passes for runtime sizing before full materialization.

Any other comments?

Link to issue -- #737