Skip to content

feat(disttae,tae/logtail): per-account catalog memory isolation for CN nodes (#23777)#24262

Open
aptend wants to merge 11 commits intomatrixorigin:mainfrom
aptend:feat/account-mem-isolation
Open

feat(disttae,tae/logtail): per-account catalog memory isolation for CN nodes (#23777)#24262
aptend wants to merge 11 commits intomatrixorigin:mainfrom
aptend:feat/account-mem-isolation

Conversation

@aptend
Copy link
Copy Markdown
Contributor

@aptend aptend commented Apr 30, 2026

What type of PR is this?

  • Feature

Which issue(s) this PR fixes:

issue #23777

What this PR does / why we need it:

This PR introduces per-account (tenant-aware) catalog memory isolation for CN
nodes, addressing #23777. Previously every CN replayed and cached mo_database
/ mo_tables / mo_columns rows for all tenants, which scales poorly with
tenant count. With this change, each CN only materialises the catalog of the
sys account plus the tenants it actually serves.

Highlights

  • Lazy catalog activation protocol – the logtail stream now carries
    per-account subscribe / activation request / response messages dedicated to
    the three system catalog tables (mo_database, mo_tables, mo_columns).
    Normal-table subscribe / unsubscribe / publish flows are untouched.
  • TN-side row filtering – the TN holds a CN-scoped active-account snapshot
    and filters subscribe / activation / publish payloads by account_id (or
    decoded cpkey for deletes) before sending. Row-level filtering uses a
    selection-list + Batch.Union(sels) and lazily allocates so the all-pass
    case stays zero-copy.
  • CN-side per-account state machine – CNs track per-account (seq, ready, readyTS, wantedAccounts, accountDCA) and replay each newly activated tenant
    at a single replayTS snapshot. A delayed-cache-apply (DCA) buffer per
    account ensures live push-stream entries don't escape catalog cache before
    replay rebuilds the baseline.
  • Per-account serve gatestxnDatabase/stats consumers now check both
    CatalogCache.CanServe(ts) and pClient.CanServeAccount(account, ts).
    Bare/unit-test engines that never enable lazy catalog keep their pre-existing
    ungated behavior.
  • Reconnect / restart restore – on reconnect, CN re-subscribes the three
    catalog tables with wantedAccounts ∪ {0} and replays them at one snapshot
    TS, so previously-active tenants come back ready without per-account
    re-activation. Newly cold tenants stay not-ready until first access.
  • Frontend / restore hooks – frontend auth activates the tenant catalog
    immediately after SetTenantID. RESTORE CLUSTER and single-account restore
    also activate the impacted accounts so restored databases are immediately
    visible. GetSubscriptionMeta activates the publisher's account so
    subscription paths see publisher metadata even on cold CNs.
  • Debug surfaces/debug/status/catalog, /debug/status/catalog-cache,
    /debug/status/catalog-activation, /debug/status/partitions expose
    per-account readiness, catalog-cache contents, recent activation lifecycle
    events, and partition state for diagnosis.

Validation

  • Unit tests: go test -tags matrixone_test ./pkg/catalog ./pkg/vm/engine/disttae/... ./pkg/vm/engine/tae/logtail/... all green.
  • A dedicated embed-cluster reconnect integration test
    (pkg/tests/issues/lazy_catalog_reconnect_test.go) exercises real
    disconnect / reconnect / per-account activation isolation.
  • Multi-CN live harness (docs/design/lazy_catalog_load_for_cn.{plan,flow}.md
    describes the design) validated independent activation, cross-CN cold
    activation, CN-local restart rebuild, TN reconnect with divergent ready sets,
    and checkpoint/in-memory mixed-state recovery.

Notes

  • Only the three lazy catalog tables are affected; normal-table data plane is
    not modified.
  • EntireEngine now forwards engine.TenantCatalogActivator so frontend can
    trigger activation through the wrapper.
  • Diagnostic instrumentation used during development was renamed to a
    consistent catalog-load log prefix; transient debug DIAG- Warn and a
    panic in loadDatabaseFromStorage were removed in favor of an
    InternalError return.

Special notes for your reviewer:

  • The TN row-filtering helpers live in
    pkg/vm/engine/tae/logtail/service/catalog_filter.go with focused tests in
    catalog_filter_test.go.
  • CN-side state and activation flow live in
    pkg/vm/engine/disttae/lazy_catalog_cn.go /
    logtail_consumer.go (functions tagged Activation /
    replayCatalogTableCache).
  • Proto changes are in proto/api.proto and proto/logtail.proto; the
    regenerated pb files are included.

aptend and others added 7 commits April 30, 2026 15:58
- show_account.go: Change INNER JOIN to LEFT JOIN from mo_account to
  db_tbl_counts so freshly-created but not-yet-activated accounts still
  appear in 'show accounts'. TN-side lazy catalog publish filtering
  strips mo_tables/mo_database entries for non-activated accounts, so
  the INNER JOIN was hiding them. COALESCE handles NULL counts.
- snapshot.go: Add activateAccountCatalogIfNeeded() before the first
  internal SQL in restore flows (doRestoreSnapshot, restoreAccountUsingClusterSnapshotToNew)
  so background SQL sessions that bypass AuthenticateUser/ActivateTenantCatalog
  can see catalog entries for the target account.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When restoring a dropped account (via cluster snapshot or account-level
snapshot), the source account may have never been activated on this CN
(e.g., no tenant session was created before the drop). Without activation,
the source account's catalog entries are absent from both the catalog cache
and the mo_tables partition state (filtered by TN publish), causing table
resolution for mo_foreign_keys to fail with 'table does not exist'.

Fix: activate the source (from) account's catalog before FK table
resolution when from != to, in both the cluster restore path
(restoreAccountUsingClusterSnapshotToNew) and the account-level restore
path (doRestoreSnapshot).

BVT: snapshot/cluster 3772/3772 (100%), snapshot/ 17597/17598 (99.99%,
the 1 remaining failure is a pre-existing stale expected-result in
snapshotRead.sql unrelated to this change)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Regenerated proto-derived files after rebasing onto origin/main.
Also includes gofmt formatting fixes from make pb.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously visitObjMeta used the global HasDropCommitted() flag to decide
whether to skip an appendable object's in-memory data scan. This races
with concurrent compaction: if the drop commits AFTER the scan-end of the
current logtail window, the aobj was still alive at scan-end and its rows
must be included in the response, but the global flag causes us to skip
the scan. The newly created merged nobj has CreateNode > scan-end, so
both objects fall outside the window and the rows are silently lost.

This was the root cause of flaky failures in the high-concurrent CREATE/
DROP ACCOUNT tests: mo_columns aobj drops committed a few ms after the
activation-pull scan-end, causing CN to receive an empty-column TableDef
for information_schema.character_sets, which later surfaced as
'no such table information_schema.character_sets' during drop-account
FK cleanup.

CI evidence: mo-auto-test run 24869137344 job 72813171794 step 7 shows
logtail.pull.skip.just.dropped.aobj firing on mo_columns ~1.5s before
the client-visible 'no such table' error.

Fix: skip the data scan only if DeleteNode is committed AND its End <=
scan-end, so the decision is ts-relative to the scan window, matching
the semantics already used by ForeachMVCCNodeInRange for metadata.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When TN's HandleSyncLogTailReq hits a tail larger than Size90M it calls
FlushTable and retries the same (from, end) range. The flush upgrades
the appendable object's mnode to a pnode and writes a DeleteNode at
flushTS > end. On retry visitObjMeta still emits the aobj meta (its
CreateNode falls in range, its DeleteNode does not), but
baseObject.ScanInMemory returned nil for any persisted node, dropping
all of the aobj's row data from the retry tail.

For lazy catalog (mo_database / mo_tables / mo_columns) this silently
produced 0-row activation replays on CN, which later manifested as
'delete table ... failed' / 'no such table' panics during concurrent
account creation.

Fix: add persistedNode.ScanDataInRange, modeled after
CollectObjectTombstoneInRange, that reads per-row data + commit_ts
from the flushed aobj's block and filters by [start, end]. Wire it
into ScanInMemory for appendable, non-tombstone persisted objects.
Non-appendable persisted objects stay meta-only (unchanged).

Lock the behavior down with TestCollectInsertAfterFlush: append three
transactions into an aobj, flush it, then verify
RangeScanInMemoryByObject still returns the same 6 / 9 / 3 / 6 row
windows for [0,p1] / [0,p2] / (p1,p2] / (p1,p3] that
TestCollectInsert checks before the flush.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve DIAG- noise

- Rename the diagnostic FIND_TABLE prefix used across loadTableFromStorage,
  replayCatalog* and a few error-injection sites to the more descriptive
  catalog-load tag, so the logs read as production observability rather than
  ad-hoc debug markers.
- Drop the noisy DIAG-consumeEntry per-account DCA delayed Warn: the path
  fires on every per-account catch-up entry by design and added no signal
  beyond what the activation lifecycle logs already cover.
- Replace the panic in loadDatabaseFromStorage's row-count guard with a
  proper InternalError return; the caller already handles errors and the
  panic was added for diagnosis, not as a real invariant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 30, 2026 08:11
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements tenant-aware (“lazy”) catalog loading so CN nodes only materialize mo_database/mo_tables/mo_columns for sys + activated accounts, reducing per-CN catalog memory and replay work in high-tenant deployments.

Changes:

  • Extends logtail protocol to support lazy-catalog subscription metadata and per-account activation request/response.
  • Adds TN-side row filtering for the three catalog tables and CN-side per-account activation/replay + serve gating.
  • Adds frontend hooks (auth/restore/clone/subscription) and debug/status surfaces for catalog activation/cache/partition introspection.

Reviewed changes

Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/distributed/cases/git4data/branch/metadata/branch_metadata_tenant.sql Activates tenant catalog on CN for cross-account catalog queries in test case.
test/distributed/cases/git4data/branch/metadata/branch_metadata_tenant.result Updates expected output for added activation steps.
proto/logtail.proto Adds lazy-catalog subscribe fields + activation request/response messages.
proto/api.proto Adds presence-safe account summary fields on api.Entry for lazy-catalog filtering.
pkg/vm/engine/types.go Introduces TenantCatalogActivator optional engine interface.
pkg/vm/engine/tae/tables/pnode.go Adds persisted appendable-object range scan to support lazy-catalog tail collection.
pkg/vm/engine/tae/tables/base.go Routes flushed appendable objects to persisted range scan when needed.
pkg/vm/engine/tae/logtail/txn_handle_test.go Tests entry account-summary annotation uses row account, not current account.
pkg/vm/engine/tae/logtail/txn_handle.go Annotates lazy-catalog entries with account summary when single-account.
pkg/vm/engine/tae/logtail/service/session_test.go Adds coverage for activation response send path.
pkg/vm/engine/tae/logtail/service/session.go Adds activation response builder and clarifies filter behavior separation.
pkg/vm/engine/tae/logtail/service/server_test.go Adds tests for activation queueing and early row filtering behavior.
pkg/vm/engine/tae/logtail/service/lazy_catalog_session_test.go Tests lazy-catalog per-session filter/activation state bookkeeping.
pkg/vm/engine/tae/logtail/service/lazy_catalog_session.go Implements per-session lazy-catalog active/activating account tracking + publish rewrite hook.
pkg/vm/engine/tae/logtail/service/client_test.go Adds tests for new logtail client subscribe/activate APIs.
pkg/vm/engine/tae/logtail/service/client.go Adds SubscribeCatalogTable and ActivateAccountForCatalog APIs + request cloning helpers.
pkg/vm/engine/tae/logtail/service/catalog_filter_test.go Adds focused unit tests for TN-side catalog row filtering.
pkg/vm/engine/tae/logtail/service/catalog_filter.go Implements TN-side lazy-catalog row-level filtering for subscribe/activation/publish paths.
pkg/vm/engine/tae/logtail/handle.go Fixes appendable-object window logic to avoid missing rows across compaction timing.
pkg/vm/engine/tae/db/test/db_test.go Adds regression test verifying insert collection still works after aobj flush.
pkg/vm/engine/entire_engine_test.go Tests EntireEngine forwards ActivateTenantCatalog.
pkg/vm/engine/entire_engine.go Forwards ActivateTenantCatalog when wrapped engine supports it.
pkg/vm/engine/disttae/txn_table.go Minor refactor for in-memory block collection condition readability.
pkg/vm/engine/disttae/txn_database.go Adds serve-gating by account, removes panics, improves logging, and avoids returning column-less table defs.
pkg/vm/engine/disttae/stats.go Gates stats refresh/update on per-account catalog readiness.
pkg/vm/engine/disttae/mo_table_stats.go Gates cache-based stats reads on per-account readiness and reuses cached handles.
pkg/vm/engine/disttae/logtailreplay/partition_state_test.go Adds regression tests for duplicate DELETE (push vs activation) GC behavior.
pkg/vm/engine/disttae/logtailreplay/partition_state.go Adjusts dead-object protection to still allow GC on repeated DELETEs.
pkg/vm/engine/disttae/logtailreplay/partition.go Exposes checkpoint-consumed state accessor.
pkg/vm/engine/disttae/logtailreplay/debug.go Adds partition-state debug summary helpers.
pkg/vm/engine/disttae/logtail_consumer_test.go Adds test ensuring DCA drains nested delayed closures before replay.
pkg/vm/engine/disttae/logtail.go Adds skipCatalogCache flag and per-account delayed catalog-cache apply routing.
pkg/vm/engine/disttae/lazy_catalog_cn_test.go Adds unit tests for CN lazy-catalog state machine and helpers.
pkg/vm/engine/disttae/lazy_catalog_cn.go Implements CN per-account activation state, DCA buffering, and response delivery plumbing.
pkg/vm/engine/disttae/engine.go Adds per-account serve gating for DB/relation resolution and implements ActivateTenantCatalog.
pkg/vm/engine/disttae/debug_state_test.go Tests debug-state rendering for lazy-catalog activation/cache readiness.
pkg/vm/engine/disttae/debug_state.go Adds debug structs and endpoints backing for catalog/cache/activation/partitions introspection.
pkg/vm/engine/disttae/cache/debug_test.go Tests catalog-cache debug account-contents summaries.
pkg/vm/engine/disttae/cache/debug.go Adds catalog-cache debug summary + per-account contents rendering.
pkg/vm/engine/disttae/cache/catalog.go Fixes cpkey-index delete correctness, adds duplicate-column diagnostics, and makes InsertColumns copy-on-write.
pkg/util/status/server_test.go Adds tests for new debug/status HTTP endpoints.
pkg/util/status/server.go Extends status server routing with catalog/cache/activation/partitions debug endpoints and CN selection.
pkg/tests/issues/lazy_catalog_reconnect_test.go Adds embed-cluster reconnect integration tests for ready-account restoration.
pkg/sql/plan/make.go Changes hidden-column lookup to return nil instead of panic.
pkg/sql/plan/function/ctl/types.go Registers ActivateTenantCatalog ctl method.
pkg/sql/plan/function/ctl/cmd_activate_tenant_catalog.go Implements sys-only CN ctl command to activate tenant catalog on current CN.
pkg/sql/colexec/aggexec/sumavg2.go Fixes YEAR type specialization to use types.MoYear.
pkg/frontend/snapshot_restore_with_ts.go Treats catalog-resolution errors as “no FK deps” under lazy-catalog restore edge cases.
pkg/frontend/snapshot.go Activates relevant tenant catalogs for restore flows and softens FK dependency resolution failures.
pkg/frontend/show_account.go Makes SHOW ACCOUNTS resilient to missing counts via LEFT JOIN + activates target account catalog for special-table stats.
pkg/frontend/session.go Activates tenant catalog during authentication before tenant-context SQL.
pkg/frontend/predefined.go Adjusts predefined schema (level varchar(10)).
pkg/frontend/output.go Fixes YEAR extraction/slicing to use types.MoYear backing storage.
pkg/frontend/compiler_context.go Activates publisher/target tenant catalogs when resolving subscription metadata or snapshots.
pkg/frontend/clone.go Activates target account catalog for clone flows that run internal SQL under tenant context.
pkg/frontend/back_exec.go Guards against nil parent context when creating back sessions.
pkg/frontend/authenticate.go Activates target account catalog before ALTER/DROP ACCOUNT internal SQL runs under tenant context.
pkg/container/types/types.go Extends Ints type set to include MoYear.
pkg/cnservice/distributed_tae.go Registers disttae engine with status server for new debug endpoints.
pkg/catalog/types_test.go Adds unit tests for new account-scoped catalog query builders.
pkg/catalog/types.go Adds helpers to build account-scoped batch queries for the three catalog tables.
pkg/catalog/lazy_catalog.go Adds lazy-catalog helpers for identifying catalog tables and extracting account IDs from entries/batches.
etc/docker-multi-cn-local-disk/start.sh Exports repo-root for docker multi-CN local-disk harness.
etc/docker-multi-cn-local-disk/docker-compose.yml Mounts test/ into containers to support BVT test data paths.
docs/design/lazy_catalog_load_for_cn.plan.md Adds main design doc for lazy catalog load on CN.
docs/design/lazy_catalog_load_for_cn.flow.md Adds flow doc describing startup/activation/reconnect semantics and invariants.

Comment thread pkg/sql/plan/make.go
}
}
panic("Unable to find target column from predefined table columns")
return nil
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetColDefFromTable now returns nil when the column isn't found, but there are existing callers that use the returned pointer without a nil check (e.g. building composite PK/cluster-by defs). This change can turn a previously-fast-fail panic into a later nil-deref or invalid TableDef. Either keep the panic, or change all call sites to handle nil and propagate a structured error.

Copilot uses AI. Check for mistakes.
…eCBs

golangci-lint flagged two issues introduced by the lazy-catalog branch:
- pkg/vm/engine/disttae/logtail_consumer.go: forcedSubscribeTable is
  unused (its single caller was replaced by forcedSubscribeCatalogTable
  during the lazy-catalog refactor); remove it.
- pkg/vm/engine/tae/logtail/service/server.go: pre-allocate the
  allCloseCBs slice with the known table count to satisfy prealloc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add unit tests for the server param-parsing helpers in pkg/util/status
(optional/required uint, optional timestamp, intParam, selectCNInstance)
plus the full handleActivateTenantCatalog branch matrix. These exercise
the previously-uncovered error paths that drag down PR coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@aunjgr aunjgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Per-Account Catalog Memory Isolation (#23777)

Well-designed feature. The architecture is clean: TN filters rows by account before sending, CN tracks per-account readiness via a state machine, and the activation protocol ensures exactly-once catalog materialization per tenant. All CI green.

Architecture Assessment

Correctness of the core protocol:

  • TN-side filtering (catalog_filter.go) correctly handles insert/update (by account_id column) and delete (by decoding cpkey[0]) paths, with proper all-pass zero-copy fast path when selectedRows == bat.RowCount().
  • CN-side state machine (lazy_catalog_cn.go) has clean state transitions: inactive → catching_up → ready_draining → ready. The beginAccountReadyTransition + finishAccountReady split correctly ensures DCA closures are flushed under catalogCacheMu before publishing readyTS.
  • Inflight dedup via sync.Map.LoadOrStore prevents duplicate activations; waiters block on inf.done channel.
  • Reconnect correctly collects wantedAccounts before reset, passes them as initial_active_accounts on re-subscribe, and replays all at once at one snapshot TS.

TN sender serialization:

  • Phase-1 pull happens in the bounded activationReqChan worker pool.
  • Phase-2 completion and response send happen in the single logtailSender goroutine, preventing the activation response from racing with steady-state push. The account is only promoted to active AFTER the response enters the session FIFO — correct ordering guarantee.

Panic removal:

  • loadDatabaseFromStorage and loadTableFromStorage now return errors instead of panicking — good defensive change for multi-tenant environments where dropped accounts may have compacted catalog.

Catalog cache race fix in getTableItem:

  • The nil-Defs fallthrough to storage handles the InsertTable/InsertColumns split-apply window. This is a latent race fix independent of lazy loading — nice catch.

Questions / Minor Items (non-blocking)

  1. delayedCacheApply phase → atomic + mutex double-check: The dcaTryDelay does phase.Load() outside the lock as a fast path, then re-checks inside. This is correct but the dcaConfirmAndApply loop re-acquires the lock on each iteration. With many concurrent push entries during draining, this could cause brief contention. Not a real concern for 3 catalog tables with low push frequency.

  2. lazyCatalogSubscribeFilterMP is a package-level mpool: It's used for row copying during subscribe filtering on the TN side. Since filtering runs in the logtailSender (single goroutine for subscribe) and the activation worker pool (bounded), concurrent allocations from this pool should be fine as mpool is thread-safe. Just noting it's shared.

  3. Activation retry backoff (500ms → 4s, max 4 retries): Total worst-case latency is ~7.5s. For auth-path activation this means a login could block up to 7.5s during rapid reconnect storms. Acceptable given reconnects are rare and this is better than failing the login outright.

  4. buildAccountScopedCatalogBatchQuery string manipulation: The strings.Index(query, " order by ") to split and re-append ORDER BY is fragile if the base query ever uses uppercase or has subqueries with ORDER BY. Currently safe because the base queries are hard-coded constants. A comment noting this assumption would help future readers.

  5. Design docs committed (docs/design/lazy_catalog_load_for_cn.{plan,flow}.md): Good to have the protocol documented alongside the code.

Summary

Solid implementation of a significant scalability improvement. The protocol correctly handles all lifecycle events (startup, activation, push, reconnect, restore). State machine transitions are well-guarded, and the test coverage (unit + integration reconnect test) is adequate. CI all green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants