Skip to content

RFC: Session-Scoped Architecture for vMCP#38

Open
yrobla wants to merge 7 commits intomainfrom
session-scoped-mcp-client
Open

RFC: Session-Scoped Architecture for vMCP#38
yrobla wants to merge 7 commits intomainfrom
session-scoped-mcp-client

Conversation

@yrobla
Copy link

@yrobla yrobla commented Feb 4, 2026

RFC: Session-Scoped Architecture for vMCP

This RFC refactors vMCP session management to fix per-request client creation and scattered session concerns.

Problem

  • Per-request overhead: Backend clients created/closed on every tool call, causing connection overhead and state loss for stateful
    backends (Playwright, databases)
  • Scattered architecture: Session creation tangled across middleware, adapters, and SDK hooks with no clear ownership
  • Hard to extend: Current design makes it difficult to add features (optimizer-in-vmcp) or write integration tests without spinning up
    the full server

Solution

Introduce two core interfaces:

  1. Session - Domain object that owns backend clients, encapsulates routing logic, manages lifecycle
  2. SessionFactory - Creates fully-formed sessions (capability discovery + client initialization)
  3. SessionManager - Bridges domain logic to SDK protocol concerns

Key improvement: Clients created once during session init, reused for all requests, closed on expiration.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an RFC proposing a shift from per-request/lazy MCP backend client creation to session-scoped client lifecycle management to simplify the current pooling approach while preserving backend state across tool calls.

Changes:

  • Introduces a new RFC describing a session-scoped backend client map owned by VMCPSession.
  • Specifies eager client initialization during session setup (AfterInitialize) and cleanup during session close.
  • Outlines phased implementation steps, testing strategy, and alternatives considered.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch from f935166 to 6e503d4 Compare February 4, 2026 13:40
The RFC captures the complete architectural vision from the discussion,
providing a roadmap for simplifying the client pooling implementation
while maintaining all current functionality.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch 3 times, most recently from 840b512 to ddf4273 Compare February 4, 2026 15:26
@yrobla yrobla requested a review from Copilot February 4, 2026 15:26
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch 2 times, most recently from 50f4742 to ec70d08 Compare February 4, 2026 15:51
@yrobla yrobla requested a review from Copilot February 4, 2026 15:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch 2 times, most recently from 54dd1b4 to d714819 Compare February 4, 2026 16:28
Copy link

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two bigger comments elaborating on my thoughts. To summarize, I think there are larger problems with how the session concept is handled within vMCP. Namely, we have session concerns scattered throughout the codebase and all of these concerns are tightly coupled. As a solution, I think we should create an interface that encapsulates these concerns:

// Session represents an active MCP session with all its capabilities and resources.
// This is a pure domain object - no protocol concerns and can be run without being spun up in a server.
type Session interface {
    ID() string

    // Capabilities - returns discovered tools/resources for this session
    // Perhaps these could be combined into one "GetCapabilities" method.
    Tools() []Tool
    Resources() []Resource

    // MCP operations - routing logic is encapsulated here
    CallTool(ctx context.Context, name string, arguments map[string]any) (*ToolResult, error)
    ReadResource(ctx context.Context, uri string) (*ResourceResult, error)
    GetPrompt(ctx context.Context, name string, arguments map[string]any) (*PromptResult, error)

    // Lifecycle
    Close() error
}

That would make solving this client-lifetime problem trivial.

yrobla added a commit that referenced this pull request Feb 9, 2026
Address architectural feedback from PR #38 review, specifically jerm-dro's
comments about the root cause being scattered session concerns rather than
just client lifecycle management.

Major changes:

1. **Expanded Problem Statement**:
   - Identified root cause: session concerns scattered across middleware,
     adapters, and hooks
   - Current VMCPSession is a passive data container, not a domain object
   - Session construction tangled with SDK lifecycle callbacks
   - Hard to create session-scoped resources (clients, optimizer features)

2. **Introduced Core Interfaces**:
   - Session interface: Domain object that owns clients, encapsulates routing,
     manages lifecycle (not just a data container)
   - SessionFactory interface: Creates fully-formed sessions with all
     dependencies (discovery, client init, routing setup)
   - SessionManager: Bridges between domain (Session) and protocol (SDK)

3. **Updated Goals**:
   - Primary goal: Encapsulate session behavior in clean interfaces
   - Decouple session creation from SDK wiring
   - Enable testing without full server (addresses toolhive#2852)
   - Enable future features through interface decoration

4. **Revised Implementation Plan**:
   - Phase 1: Introduce Session/SessionFactory interfaces
   - Phase 2: Wire SessionManager to SDK
   - Phase 3: Migrate existing code to use Session interface
   - Phase 4: Cleanup and observability

5. **Added Benefits Documentation**:
   - Concrete examples of decorator pattern (optimizer-in-vmcp)
   - Simplified capability refresh approach
   - Clear testing strategy without server dependency

6. **Fixed Formatting Issues** (from Copilot review):
   - Split Goals & Non-Goals into separate sections
   - Changed "Security" to "Security Considerations"
   - Restructured Compatibility section
   - Fixed References to use full repo-qualified format
   - Added credential refresh considerations

The RFC now addresses the architectural root cause (scattered session
concerns) rather than just the symptom (per-request client creation).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
yrobla added a commit that referenced this pull request Feb 9, 2026
Refactor RFC to address scattered session architecture concerns

Address architectural feedback from PR #38 review, specifically jerm-dro's
comments about the root cause being scattered session concerns rather than
just client lifecycle management.

Major changes:

1. **Expanded Problem Statement**:
   - Identified root cause: session concerns scattered across middleware,
     adapters, and hooks
   - Current VMCPSession is a passive data container, not a domain object
   - Session construction tangled with SDK lifecycle callbacks
   - Hard to create session-scoped resources (clients, optimizer features)

2. **Introduced Core Interfaces**:
   - Session interface: Domain object that owns clients, encapsulates routing,
     manages lifecycle (not just a data container)
   - SessionFactory interface: Creates fully-formed sessions with all
     dependencies (discovery, client init, routing setup)
   - SessionManager: Bridges between domain (Session) and protocol (SDK)

3. **Updated Goals**:
   - Primary goal: Encapsulate session behavior in clean interfaces
   - Decouple session creation from SDK wiring
   - Enable testing without full server (addresses toolhive#2852)
   - Enable future features through interface decoration

4. **Revised Implementation Plan**:
   - Phase 1: Introduce Session/SessionFactory interfaces
   - Phase 2: Wire SessionManager to SDK
   - Phase 3: Migrate existing code to use Session interface
   - Phase 4: Cleanup and observability

5. **Added Benefits Documentation**:
   - Concrete examples of decorator pattern (optimizer-in-vmcp)
   - Simplified capability refresh approach
   - Clear testing strategy without server dependency

6. **Fixed Formatting Issues** (from Copilot review):
   - Split Goals & Non-Goals into separate sections
   - Changed "Security" to "Security Considerations"
   - Restructured Compatibility section
   - Fixed References to use full repo-qualified format
   - Added credential refresh considerations

The RFC now addresses the architectural root cause (scattered session
concerns) rather than just the symptom (per-request client creation).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@yrobla yrobla force-pushed the session-scoped-mcp-client branch from d698c3e to 9654f19 Compare February 9, 2026 15:16
yrobla added a commit that referenced this pull request Feb 9, 2026
Refactor RFC to address scattered session architecture concerns

Address architectural feedback from PR #38 review, specifically jerm-dro's
comments about the root cause being scattered session concerns rather than
just client lifecycle management.

Major changes:

1. **Expanded Problem Statement**:
   - Identified root cause: session concerns scattered across middleware,
     adapters, and hooks
   - Current VMCPSession is a passive data container, not a domain object
   - Session construction tangled with SDK lifecycle callbacks
   - Hard to create session-scoped resources (clients, optimizer features)

2. **Introduced Core Interfaces**:
   - Session interface: Domain object that owns clients, encapsulates routing,
     manages lifecycle (not just a data container)
   - SessionFactory interface: Creates fully-formed sessions with all
     dependencies (discovery, client init, routing setup)
   - SessionManager: Bridges between domain (Session) and protocol (SDK)

3. **Updated Goals**:
   - Primary goal: Encapsulate session behavior in clean interfaces
   - Decouple session creation from SDK wiring
   - Enable testing without full server (addresses toolhive#2852)
   - Enable future features through interface decoration

4. **Revised Implementation Plan**:
   - Phase 1: Introduce Session/SessionFactory interfaces
   - Phase 2: Wire SessionManager to SDK
   - Phase 3: Migrate existing code to use Session interface
   - Phase 4: Cleanup and observability

5. **Added Benefits Documentation**:
   - Concrete examples of decorator pattern (optimizer-in-vmcp)
   - Simplified capability refresh approach
   - Clear testing strategy without server dependency

6. **Fixed Formatting Issues** (from Copilot review):
   - Split Goals & Non-Goals into separate sections
   - Changed "Security" to "Security Considerations"
   - Restructured Compatibility section
   - Fixed References to use full repo-qualified format
   - Added credential refresh considerations

The RFC now addresses the architectural root cause (scattered session
concerns) rather than just the symptom (per-request client creation).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Update Last Updated date to 2026-02-09
Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, appreciate the work on this RFC. The per-request client creation problem is very real and worth solving (I filed stacklok/toolhive#2417 about exactly this pain point). The session-scoped client lifecycle and the Session-as-domain-object approach are solid improvements. That said, I have some concerns about how this interacts with the existing session storage architecture we built specifically for externalizing sessions to Redis/Valkey.

Let me dig in...

What we already have in place

Over PRs #1677, #1989, #1770, and #1771, we built a deliberate, phased session management architecture:

  • Storage interface (pkg/transport/session/storage.go): A 5-method pluggable abstraction (Store, Load, Delete, DeleteExpired, Close) designed specifically to swap between in-memory and distributed backends (Redis/Valkey)
  • LocalStorage: In-memory sync.Map implementation behind the Storage interface
  • Manager: Uses Storage + Factory, with NewManagerWithStorage() for injecting custom backends
  • serialization.go: JSON ser/deser already prepared for Redis/Valkey (currently nolint:unused, ready for when we implement a distributed backend)

The whole point of this layering was: when we're ready for distributed sessions, we implement a RedisStorage (or equivalent) satisfying the Storage interface, and nothing else changes. Not the Manager, not the proxies, not the vMCP server.

With this in mind, here are my concerns:

The parallel storage path problem

The RFC's SessionManager introduces its own sync.Map for session storage:

// From RFC: SessionManager
type sessionManager struct {
    sessions sync.Map  // bypasses the Storage interface
    factory  SessionFactory
}

So, basically, this creates a parallel session management path that doesn't go through transportsession.Manager or the Storage interface. The RFC says "TTL management and storage are handled by the existing session storage layer", but the proposed implementation stores sessions in its own sync.Map, not through the existing Storage interface.

This means when we implement Redis/Valkey for transport sessions, the vMCP session data managed by this new SessionManager won't be externalized. We'd need to separately refactor this sync.Map later, which is exactly the kind of rework the Storage interface was designed to prevent.

Suggestion: The SessionManager should delegate session storage to transportsession.Manager (which already uses the pluggable Storage interface) rather than maintaining its own sync.Map. The existing Manager.AddSession(), Manager.Get(), and Manager.Delete() methods already provide exactly what's needed here.

How do the two Session interfaces compose?

The RFC proposes a new Session interface (behavior-oriented: CallTool(), ReadResource(), etc.) while we already have transportsession.Session (storage-oriented: ID(), Type(), Touch(), GetData()). These serve different purposes, but the RFC doesn't show how they compose.

Note that currently VMCPSession embeds StreamableSession which implements transportsession.Session. That's what makes it compatible with the Storage interface and Manager. The new behavior-oriented Session needs to either:

  1. Embed/implement transportsession.Session so it remains compatible with the existing storage layer, or
  2. Clearly document the two-layer model: transport session (storage) wraps/references domain session (behavior)

Without this, the new Session type won't be storable through the Storage interface, which breaks the externalization path.

Partial initialization and serializability

The RFC allows sessions to continue with partial backends (some failed during MakeSession()). If we later add Redis/Valkey persistence, a session restored on a different instance wouldn't know which backends originally failed vs. which were never attempted. The session state model needs to account for this if it's going to be serializable.

What I think should change

  1. Wire through transportsession.Manager instead of a standalone sync.Map. The Manager already supports custom factories and pluggable storage.
  2. Show the composition model: How does the new behavior-oriented Session relate to transportsession.Session? Does it embed it? Wrap it? Replace it?
  3. Add a section on session serializability: What part of the new Session is metadata (storable in Redis/Valkey) vs. runtime state (client connections, goroutines that must stay in-memory)? Issue stacklok/toolhive#2417 already sketched this dual-layer model.

The RFC solves a real problem and the domain-object approach is the right direction. I just want to make sure we don't accidentally create architectural debt that blocks the Redis/Valkey externalization path we've been building toward.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch 2 times, most recently from b05416f to 63c5d62 Compare February 12, 2026 10:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla force-pushed the session-scoped-mcp-client branch from 6d03d71 to 6bcdb92 Compare February 12, 2026 10:53
@jerm-dro
Copy link

@JAORMX Thanks for the review and context around transportsession.Manager!

I'm not opposed to preserving the existing session storage mechanism as part of this work. The main thrust of this RFC is encapsulating the session concept and behavior into a coherent domain object, and, based on my read of the code, that can still be accomplished without changing the storage mechanism.

That said, it's worth acknowledging that we're currently paying a cognitive cost to maintain the storage abstraction layer ahead of a distributed backend that doesn't exist yet. I'd be curious to hear your thoughts on the timeline for that work, since it would help us calibrate how much we invest in preserving compatibility with it right now vs. keeping things simple.

In the future, I also think it's worth exploring whether we can simplify the transportsession.Manager API. For example, rather than the current get/set pattern on the Manager, an alternative factoring could have the Session object become the sole owner of its state and expose a Marshal() method that produces a serialized representation on demand. That would make the ownership of data clearer and the storage layer thinner. But that's beyond the scope of this RFC and something we can think through at a later date. When we design this in earnest, we should think deeper about what state can change, when it changes how we signal to update its stored representation, etc.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +341 to +343
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) to avoid resource exhaustion
- **Per-backend timeout**: Apply context timeout (e.g., 5s per backend) so one slow backend doesn't block session creation
- **Partial initialization**: If some backends fail, log warnings and continue with successfully initialized backends (failed backends not added to clients map)
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parallel client initialization uses bounded concurrency (e.g., 10 concurrent initializations) and per-backend timeouts (e.g., 5s), but these specific values are provided as examples without justification. The optimal values depend on expected network latency, backend responsiveness, and resource constraints. Consider providing guidance on how to tune these parameters based on deployment characteristics, or making them configurable. Additionally, document the trade-offs: higher concurrency reduces session creation latency but increases resource usage and potential backend load spikes.

Suggested change
- **Bounded concurrency**: Limit parallel goroutines (e.g., 10 concurrent initializations) to avoid resource exhaustion
- **Per-backend timeout**: Apply context timeout (e.g., 5s per backend) so one slow backend doesn't block session creation
- **Partial initialization**: If some backends fail, log warnings and continue with successfully initialized backends (failed backends not added to clients map)
- **Bounded concurrency**: Limit parallel goroutines (for example, 10 concurrent initializations) to avoid resource exhaustion.
- These values **MUST be configurable** (e.g., via server configuration or environment variables), not hard-coded constants in the implementation.
- Tuning guidance:
- Start with a baseline such as `min(numBackends, 8–16)` and adjust based on CPU usage, memory pressure, backend rate limits, and startup latency SLOs.
- Higher concurrency reduces end-to-end session creation latency but increases CPU/memory usage and can cause load spikes or connection storms against backends.
- Lower concurrency smooths backend load and reduces resource contention at the cost of slower session initialization when many backends are configured.
- **Per-backend timeout**: Apply a per-backend context timeout (for example, 5s per backend) so one slow or hung backend does not block overall session creation.
- This timeout SHOULD be deployment-specific: shorter in low-latency, co-located deployments; longer in high-latency or intermittently slow environments.
- Consider setting the per-backend timeout relative to observed p95–p99 connection + handshake latency (e.g., `2–3x p99`) and revisiting it as telemetry is collected in production.
- Trade-off: shorter timeouts improve resilience and responsiveness to unhealthy backends but may prematurely drop slow-yet-healthy backends; longer timeouts improve tolerance for slow backends but increase worst-case session startup time.
- **Partial initialization**: If some backends fail (e.g., time out or error during client creation), log warnings and continue with successfully initialized backends (failed backends are not added to the clients map).

Copilot uses AI. Check for mistakes.
Comment on lines +538 to +555
1. **`Generate()` phase**: Creates empty session (UUID only), always succeeds
- Stores empty session in map
- Returns session ID to SDK
- SDK sends `Mcp-Session-Id` header in `InitializeResult`

2. **`OnRegisterSession` hook phase**: Calls `SessionFactory.MakeSession()` - failure possible here
- If `MakeSession()` fails completely (e.g., all backends unreachable):
- Log error with session ID and failure details
- Create session with **empty capabilities** (no tools, resources, or prompts)
- SDK registers empty capability list, client sees "no tools available"
- Keep session in map (allows graceful handling without special error checks)
- Client experience:
- Session exists but has no capabilities
- Client can query session but sees empty tool list
- Client may delete session and re-initialize, or backend recovery may populate capabilities later

**Rationale**: The two-phase creation pattern (empty session + populate via hook) is necessary because the SDK's `Generate()` must return a session ID. Additionally, the SDK's `OnRegisterSession` hook does not allow returning an error, so failures during `MakeSession()` cannot be propagated directly. Instead of storing initialization errors that require checks in every method (`InitError()` pattern is error-prone - easy to forget checks when adding new methods), we create sessions with empty capabilities. Failed backends don't advertise tools/resources, so clients never try to call them. This is simpler and safer than requiring defensive `InitError()` checks throughout the codebase.

Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling section describes creating sessions with empty capabilities when MakeSession fails, but this approach could lead to confusing user experience. A client successfully creates a session but then finds no tools available, without clear indication that initialization failed. Consider enhancing the error handling to provide explicit feedback mechanisms, such as adding a session status field that clients can query, or including initialization errors in the session metadata that can be retrieved via a dedicated endpoint.

Suggested change
1. **`Generate()` phase**: Creates empty session (UUID only), always succeeds
- Stores empty session in map
- Returns session ID to SDK
- SDK sends `Mcp-Session-Id` header in `InitializeResult`
2. **`OnRegisterSession` hook phase**: Calls `SessionFactory.MakeSession()` - failure possible here
- If `MakeSession()` fails completely (e.g., all backends unreachable):
- Log error with session ID and failure details
- Create session with **empty capabilities** (no tools, resources, or prompts)
- SDK registers empty capability list, client sees "no tools available"
- Keep session in map (allows graceful handling without special error checks)
- Client experience:
- Session exists but has no capabilities
- Client can query session but sees empty tool list
- Client may delete session and re-initialize, or backend recovery may populate capabilities later
**Rationale**: The two-phase creation pattern (empty session + populate via hook) is necessary because the SDK's `Generate()` must return a session ID. Additionally, the SDK's `OnRegisterSession` hook does not allow returning an error, so failures during `MakeSession()` cannot be propagated directly. Instead of storing initialization errors that require checks in every method (`InitError()` pattern is error-prone - easy to forget checks when adding new methods), we create sessions with empty capabilities. Failed backends don't advertise tools/resources, so clients never try to call them. This is simpler and safer than requiring defensive `InitError()` checks throughout the codebase.
1. **`Generate()` phase**: Creates an empty session shell (UUID only), always succeeds
- Stores session in map with status `initializing`
- Returns session ID to SDK
- SDK sends `Mcp-Session-Id` header in `InitializeResult`
2. **`OnRegisterSession` hook phase**: Calls `SessionFactory.MakeSession()` - failure possible here
- On **success**:
- Populate session capabilities (tools, resources, prompts, clients map, etc.)
- Update session status to `ready`
- Optionally store a brief "initialized at <timestamp>" metadata record
- If `MakeSession()` fails completely (e.g., all backends unreachable):
- Log error with session ID and failure details
- Do **not** attempt to fabricate capabilities; leave capabilities empty
- Update session status to `failed`
- Store an initialization error record in session metadata, e.g.:
- `error_message` (human-readable summary)
- optional `error_code` / `backend_errors` for structured inspection
- Keep session in map (so the ID remains valid and can be inspected/deleted)
**Client experience for failed initialization**:
- Session exists but is marked as `failed`
- Capabilities list is empty (no tools, resources, or prompts)
- A dedicated "session inspection" surface exposes status + error metadata, e.g.:
- via an existing `GetSession`/`ListSessions` API augmented with `status` and `init_error`
- or via a small MCP-specific endpoint / tool that returns session status + error info
- SDK clients / UIs are expected to:
- Check the session status when they observe an empty capability list
- Surface the initialization error message to users
- Allow users (or automation) to delete and re-initialize the session once backends recover
**Rationale**: The two-phase creation pattern (empty session shell + populate via hook) is necessary because the SDK's `Generate()` must return a session ID, and the SDK's `OnRegisterSession` hook does not allow returning an error, so failures during `MakeSession()` cannot be propagated directly in-band. Instead of silently treating initialization failures as "normal" sessions with empty capabilities, we centralize failure handling in the session's `status` and initialization error metadata. This avoids the `InitError()` pattern (which requires checks in every method and is easy to forget when adding new methods), while still giving clients an explicit, queryable signal that initialization failed. Failed backends don't advertise tools/resources, so clients never try to call them, and the additional status+metadata allow UIs and automated clients to distinguish "no tools configured" from "initialization failed."

Copilot uses AI. Check for mistakes.
Comment on lines +632 to +636
- Store a cryptographic hash of the original authentication token (e.g., `SHA256(bearerToken)`) in the session during creation
- On each request, validate that the current auth token hash matches the session's bound token hash
- If mismatch, reject with "session authentication mismatch" error and terminate session
- This prevents stolen session IDs from being used with different credentials

Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The session hijacking prevention section recommends storing a cryptographic hash of the authentication token (e.g., SHA256(bearerToken)), but doesn't address token rotation scenarios. If a client's authentication token is legitimately refreshed (e.g., OAuth token refresh), the hash would no longer match, causing session validation to fail even for the legitimate client. Consider designing the token binding mechanism to support token refresh patterns, such as storing multiple valid token hashes or implementing a token family concept.

Suggested change
- Store a cryptographic hash of the original authentication token (e.g., `SHA256(bearerToken)`) in the session during creation
- On each request, validate that the current auth token hash matches the session's bound token hash
- If mismatch, reject with "session authentication mismatch" error and terminate session
- This prevents stolen session IDs from being used with different credentials
- Store a **token binding record** in the session during creation, which includes a cryptographic hash of the initial authentication token (e.g., `SHA256(bearerToken)`)
- Support **token refresh/rotation** by allowing the token binding record to track a small set of non-expired token hashes (e.g., the current token plus one or two recently rotated tokens with short grace periods)
- On each request, validate that the hash of the current auth token matches one of the non-expired hashes in the token binding record; if it matches a new token, update the record to rotate out old hashes
- If no match is found, reject with a "session authentication mismatch" error and terminate the session
- This prevents stolen session IDs from being used with different credentials while remaining compatible with OAuth-style token refresh and rotation

Copilot uses AI. Check for mistakes.
Comment on lines +524 to +525
- **Note**: Phase 1 will be done incrementally across multiple PRs if necessary, reusing existing implementation pieces. This allows us to introduce the bulk of the code without having to worry about refactoring the existing system.

Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Phase 1 implementation mentions "reusing existing implementation pieces" and doing work "incrementally across multiple PRs if necessary," but doesn't provide clear guidance on what can be reused versus what needs to be written fresh. This ambiguity could lead to implementation inconsistencies or unnecessary refactoring. Consider providing a more detailed breakdown of which existing components can be leveraged (e.g., aggregator, client factory) and which require new implementations.

Suggested change
- **Note**: Phase 1 will be done incrementally across multiple PRs if necessary, reusing existing implementation pieces. This allows us to introduce the bulk of the code without having to worry about refactoring the existing system.
- **Execution model**: Phase 1 will be done incrementally across multiple PRs. Each PR should be strictly additive (no behavior changes for existing session types) and should keep the old and new paths side‑by‑side.
- **Reused components**:
- Existing session storage implementations used by `transportsession.Manager` (e.g., in‑memory/local, Redis), including their TTL/`Touch()` behavior and expiration cleanup.
- Existing client creation code paths where possible (e.g., HTTP/backend client factory, aggregators, or other shared client wiring logic), wrapped or adapted inside the new `SessionFactory` rather than re‑implemented.
- Existing logging, metrics, and configuration loading mechanisms for sessions and backends.
- **New components**:
- Domain‑level `Session` type, `SessionFactory`, and `SessionManager` interfaces plus their initial concrete implementations.
- Wiring from `SessionManager` into the existing MCP SDK integration points (e.g., hooks that call `SessionFactory.MakeSession()` and map SDK session IDs to stored domain sessions).
- Any glue code needed to adapt existing client factories/aggregators into the new session‑scoped lifecycle (without changing their external behavior).

Copilot uses AI. Check for mistakes.
- **Owns backend clients**: Session stores pre-initialized MCP clients in an internal map
- **Encapsulates routing**: `CallTool()` looks up tool in routing table, routes to correct backend client
- **Manageable lifetime**: `Close()` cleans up clients and any other resources. The SessionManager/caller is decoupled from what exactly happens on close()
- **Thread-safe**: All methods protected by internal RWMutex for concurrent access to prevent data races on internal state. Read operations (Tools, Resources, Prompts, CallTool, ReadResource, GetPrompt) use read locks; write operations (Close) use write locks. Methods like `Tools()`, `Resources()`, `Prompts()` return defensive copies (not references to internal slices) to prevent caller mutations. Internal maps and slices are never exposed directly. Note: The RWMutex prevents data races but does not prevent `CallTool()` from using a closed client after `Close()` completes—the session implementation must track closed state explicitly (see "Client Closed Mid-Request" in Error Handling). Proper encapsulation may allow removal of mutexes elsewhere (e.g., SessionManager, current VMCPSession) since state is now owned by the Session
Copy link

Copilot AI Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thread safety section (line 299) states that "Proper encapsulation may allow removal of mutexes elsewhere (e.g., SessionManager, current VMCPSession)," but this is speculative without concrete analysis. Removing mutexes in other components could introduce race conditions if those components have concurrent access patterns not covered by Session's internal locking. This statement should either be removed, made conditional ("may enable future investigation into..."), or backed by specific analysis of where mutex removal is safe.

Suggested change
- **Thread-safe**: All methods protected by internal RWMutex for concurrent access to prevent data races on internal state. Read operations (Tools, Resources, Prompts, CallTool, ReadResource, GetPrompt) use read locks; write operations (Close) use write locks. Methods like `Tools()`, `Resources()`, `Prompts()` return defensive copies (not references to internal slices) to prevent caller mutations. Internal maps and slices are never exposed directly. Note: The RWMutex prevents data races but does not prevent `CallTool()` from using a closed client after `Close()` completes—the session implementation must track closed state explicitly (see "Client Closed Mid-Request" in Error Handling). Proper encapsulation may allow removal of mutexes elsewhere (e.g., SessionManager, current VMCPSession) since state is now owned by the Session
- **Thread-safe**: All methods protected by internal RWMutex for concurrent access to prevent data races on internal state. Read operations (Tools, Resources, Prompts, CallTool, ReadResource, GetPrompt) use read locks; write operations (Close) use write locks. Methods like `Tools()`, `Resources()`, `Prompts()` return defensive copies (not references to internal slices) to prevent caller mutations. Internal maps and slices are never exposed directly. Note: The RWMutex prevents data races but does not prevent `CallTool()` from using a closed client after `Close()` completes—the session implementation must track closed state explicitly (see "Client Closed Mid-Request" in Error Handling). Proper encapsulation confines concurrency concerns within the Session and may enable future investigation into whether mutex usage in other components (e.g., SessionManager, current VMCPSession) can be simplified, but this RFC does not prescribe or assume any such changes.

Copilot uses AI. Check for mistakes.
JAORMX
JAORMX previously approved these changes Feb 13, 2026
Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback, Yolanda! The integration with transportsession.Manager (instead of a parallel sync.Map), the dual-layer storage model (metadata vs. runtime), and the composition model (defaultSession embedding transportsession.Session) all look good. This preserves the Redis/Valkey externalization path we built out in #1677/#1989/#1770/#1771 while delivering the session encapsulation improvements.

LGTM!

Copy link

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes motivated by others' feedback LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants