fix(permission): clean up leftover access in SpiceDB when a permission is deleted#1685
fix(permission): clean up leftover access in SpiceDB when a permission is deleted#1685whoAbhishekSah wants to merge 1 commit into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR makes permission deletion cascade: API and service layers now sweep SpiceDB role→permission relations, prune role permission lists, expose a DeletePermission handler with builtin and namespace-safety guards, update server wiring/authorization, adjust store/SpiceDB behavior, and add unit and E2E tests and mocks. ChangesPermission Cascade Deletion
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
core/permission/service_test.go (2)
219-220: ⚡ Quick winAssert the exact cascade delete relation payload.
Line 219 and Line 252 use
mock.Anythingfor the relation argument, so tests won’t catch a broken namespace/relation-name contract.Proposed test assertion tightening
+ // add imports: + // "github.com/raystack/frontier/core/relation" + // "github.com/raystack/frontier/internal/bootstrap/schema" + expectedRel := relation.Relation{ + Object: relation.Object{ + Namespace: schema.RoleNamespace, + }, + RelationName: perm.Slug, + } - mockRelation.On("Delete", mock.Anything, mock.Anything).Return(nil).Once() + mockRelation.On("Delete", mock.Anything, expectedRel).Return(nil).Once() - mockRelation.On("Delete", mock.Anything, mock.Anything).Return(expectedErr).Once() + mockRelation.On("Delete", mock.Anything, expectedRel).Return(expectedErr).Once()Also applies to: 252-253
242-258: ⚡ Quick winAdd a delete test for ignored
relation.ErrNotExist.
Service.Deleteintentionally toleratesrelation.ErrNotExist; add a subtest to lock that behavior and ensure repo delete still executes.Suggested subtest
+ t.Run("should continue deleting permission when relation sweep returns ErrNotExist", func(t *testing.T) { + mockRepo := mocks.NewRepository(t) + mockRelation := mocks.NewRelationService(t) + svc := permission.NewService(mockRepo, mockRelation) + + permissionID := uuid.New().String() + perm := permission.Permission{ID: permissionID, Name: "delete", Slug: "app_organization_delete"} + + mockRepo.On("Get", mock.Anything, permissionID).Return(perm, nil).Once() + mockRelation.On("Delete", mock.Anything, mock.Anything).Return(relation.ErrNotExist).Once() + mockRepo.On("Delete", mock.Anything, permissionID).Return(nil).Once() + + err := svc.Delete(context.Background(), permissionID) + assert.NoError(t, err) + })test/e2e/regression/service_registration_test.go (1)
170-175: ⚡ Quick winDerive
permSlugvia schema helper instead of hardcoding.Line 170 hardcodes slug formatting; generating it from namespace/name keeps this test stable if slug rules change.
Proposed tweak
- const permSlug = "permcascade_res_act" + const permNamespace = "permcascade/res" + const permName = "act" + permSlug := schema.FQPermissionNameFromNamespace(permNamespace, permName) createPermResp, err := s.testBench.AdminClient.CreatePermission(ctx, connect.NewRequest(&frontierv1beta1.CreatePermissionRequest{ Bodies: []*frontierv1beta1.PermissionRequestBody{ - {Name: "act", Namespace: "permcascade/res"}, + {Name: permName, Namespace: permNamespace}, }, }))
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2dca1fc8-e791-43e9-a9f1-c6f82b8eba39
📒 Files selected for processing (9)
cmd/serve.gocore/permission/mocks/relation_service.gocore/permission/service.gocore/permission/service_test.gointernal/api/v1beta1connect/interfaces.gointernal/api/v1beta1connect/mocks/permission_service.gointernal/api/v1beta1connect/permission.gopkg/server/connect_interceptors/authorization.gotest/e2e/regression/service_registration_test.go
Coverage Report for CI Build 27267132511Coverage decreased (-0.09%) to 43.281%Details
Uncovered Changes
Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
27696db to
8c37eb5
Compare
8c37eb5 to
0d86850
Compare
0d86850 to
632e276
Compare
632e276 to
391819b
Compare
391819b to
6cf5c46
Compare
6cf5c46 to
57f928c
Compare
57f928c to
518a316
Compare
518a316 to
dc51fa9
Compare
dc51fa9 to
4b670c2
Compare
4b670c2 to
ec737b2
Compare
ec737b2 to
afb3748
Compare
afb3748 to
d93329c
Compare
d93329c to
7dd1da3
Compare
Manual verification — comprehensive ✅Tested every path of the updated The current delete logic has three layers, all exercised:
Round 1 — behavior (no restart)
Round 2 — boot safety + fresh config built-in (with a real restart)
Round 3 — namespace guard reads SpiceDB, not the
|
| State | Delete last perm of the namespace |
|---|---|
A raw SpiceDB tuple of that type written directly (<ns>:x#owner@app/user:…), resources table empty |
rejected (FailedPrecondition) ✅ |
| That tuple removed | succeeds ✅ |
This is why a resources-table-only check would be insufficient — it would miss a loose/orphaned tuple and let the delete through, re-creating the boot break.
The boot-safety result is the key one: the deletes the guard allowed leave the server bootable, and the ones it rejected are exactly those that would otherwise have broken startup.
Side-findings (all pre-existing — NOT introduced by this PR; it doesn't touch this code)
Tracked separately in #1693.
DeleteProjectResourcereturns 500 for a custom namespace with nodeleteverb. It does an authz precheck for thedeletepermission on the resource type; a namespace created viaCreatePermissiononly has the verbs you defined, so the check fails (relation/permission delete not found under definition ...) → 500. This creates a chicken-and-egg with the new guard: it tells you to "remove the resources first," but you can't remove such a resource through the API. Worth deciding whether the guard's message should point at this, or whether resource deletion should tolerate a missingdeleteverb — but it's arguably out of scope here.CreatePermissionsilently no-ops in coreapp/*namespaces — creating e.g.app.organization.zzzreturns200with an empty body and creates nothing (those namespaces are filtered). No stray verbs can exist there.CreatePermissionreturns 500 instead of 400 when the service or verb component is < 3 characters — they become SpiceDB relation/definition identifiers, which must match^[a-z][a-z0-9_]{1,62}[a-z0-9]$(min 3 chars). e.g. keys with verbgo/itor servicer2fail at schema-compile time. Should be rejected as a clean validation error at the API.
7dd1da3 to
3b06e31
Compare
permission.Delete only removed the DB row. Every role granting the permission keeps app/role:<role>#<slug>@<*> tuples (one per principal type), so deleting a permission left those tuples dangling on a relation no longer backed by any permission row. The method was also unreachable: the DeletePermission RPC was hardwired to "function not available". Changes: - permission.Service gains a relation dependency and, on Delete, sweeps the role->permission tuples by object-namespace (app/role) + relation-name (the permission slug), clearing them across all roles and principal types before removing the row. Tolerates ErrNotExist for unused permissions. - Implement the DeletePermission admin handler and gate it on superuser (previously returned CodeUnavailable), making the guard reachable and giving the data-cleanup effort a way to remove stray permissions. Adds an e2e regression test: create a permission, build a role on it, delete the permission, assert no role->permission tuple remains. Verified it fails without the sweep (tuple lingers) and passes with it. Refs #1661 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3b06e31 to
dfc1924
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
core/permission/service.go (1)
71-100: 💤 Low valueMinor: Comment wording could be clearer.
Line 73 says "one per principal type" but the code deletes all tuples in a single
relationService.Deletecall. Consider rewording to "the SpiceDB tuples that let roles grant it (app/role:*#<slug>@<principal>:*, across all role IDs and principal types)" for clarity.internal/api/v1beta1connect/permission.go (1)
153-229: ⚡ Quick winObservability gap: intermediate service errors not logged.
The handler correctly implements the three-layer guard (built-in check, namespace-safety check, then delete), but lines 174, 192, and 208 skip
errorLogger.LogServiceErrorbefore returning Internal errors, unlike the pattern at lines 164 and 218. Intermediate service failures (BuiltinPermissions, List, ListRelations) should also be logged to aid debugging.📊 Proposed fix: add errorLogger calls for consistency
builtin, err := h.bootstrapService.BuiltinPermissions(ctx) if err != nil { + errorLogger.LogServiceError(ctx, request, "DeletePermission.BuiltinPermissions", err, "permission_id", permissionID) return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("DeletePermission.BuiltinPermissions: permission_id=%s: %w", permissionID, err)) }Apply the same pattern at lines 192 (List) and 208 (ListRelations).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1a916416-411d-4a2d-8798-7016eacbf115
📒 Files selected for processing (19)
cmd/serve.gocore/permission/mocks/relation_service.gocore/permission/mocks/role_service.gocore/permission/service.gocore/permission/service_test.gocore/role/mocks/repository.gocore/role/role.gocore/role/service.gocore/role/service_test.gointernal/api/v1beta1connect/interfaces.gointernal/api/v1beta1connect/mocks/bootstrap_service.gointernal/api/v1beta1connect/mocks/permission_service.gointernal/api/v1beta1connect/mocks/relation_service.gointernal/api/v1beta1connect/permission.gointernal/bootstrap/service.gointernal/store/postgres/role_repository.gointernal/store/spicedb/relation_repository.gopkg/server/connect_interceptors/authorization.gotest/e2e/regression/service_registration_test.go
✅ Files skipped from review due to trivial changes (5)
- internal/api/v1beta1connect/mocks/bootstrap_service.go
- internal/api/v1beta1connect/mocks/relation_service.go
- core/permission/mocks/role_service.go
- core/permission/mocks/relation_service.go
- internal/api/v1beta1connect/mocks/permission_service.go
🚧 Files skipped from review as they are similar to previous changes (3)
- test/e2e/regression/service_registration_test.go
- pkg/server/connect_interceptors/authorization.go
- core/permission/service_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/regression/service_registration_test.go (1)
257-308: 🏗️ Heavy liftAdd a raw-SpiceDB-tuple variant for the namespace guard.
This test only proves the guard blocks deletion when a
CreateProjectResourcerow exists. IfDeletePermissionever regressed to checking Postgres-backed resources instead ofrelationService.ListRelations, this would still pass because the resource row is present. A second case with a directorphanguard/widgetrelation and no backing resource row would pin the “SpiceDB is the source of truth” behavior this PR is relying on.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f27328c-c72c-4d0d-b09a-1b612f0f25fd
📒 Files selected for processing (19)
cmd/serve.gocore/permission/mocks/relation_service.gocore/permission/mocks/role_service.gocore/permission/service.gocore/permission/service_test.gocore/role/mocks/repository.gocore/role/role.gocore/role/service.gocore/role/service_test.gointernal/api/v1beta1connect/interfaces.gointernal/api/v1beta1connect/mocks/bootstrap_service.gointernal/api/v1beta1connect/mocks/permission_service.gointernal/api/v1beta1connect/mocks/relation_service.gointernal/api/v1beta1connect/permission.gointernal/bootstrap/service.gointernal/store/postgres/role_repository.gointernal/store/spicedb/relation_repository.gopkg/server/connect_interceptors/authorization.gotest/e2e/regression/service_registration_test.go
✅ Files skipped from review due to trivial changes (4)
- internal/api/v1beta1connect/mocks/bootstrap_service.go
- core/permission/mocks/relation_service.go
- internal/api/v1beta1connect/mocks/relation_service.go
- core/permission/mocks/role_service.go
🚧 Files skipped from review as they are similar to previous changes (13)
- core/role/role.go
- internal/bootstrap/service.go
- core/role/service.go
- pkg/server/connect_interceptors/authorization.go
- internal/api/v1beta1connect/mocks/permission_service.go
- internal/api/v1beta1connect/interfaces.go
- core/role/service_test.go
- core/permission/service.go
- internal/store/spicedb/relation_repository.go
- core/permission/service_test.go
- cmd/serve.go
- internal/store/postgres/role_repository.go
- core/role/mocks/repository.go
| // A namespace exists in SpiceDB only as long as it has a permission. If this | ||
| // is its last permission, the namespace disappears on the next boot — and | ||
| // SpiceDB won't drop a type that still has relationships, which breaks | ||
| // startup. So if nothing else keeps the namespace alive, refuse the delete | ||
| // while any relationship of that type still exists. | ||
| otherPerms, err := h.permissionService.List(ctx, permission.Filter{Namespace: perm.NamespaceID}) | ||
| if err != nil { | ||
| return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("DeletePermission.List: permission_id=%s: %w", permissionID, err)) | ||
| } | ||
| namespaceSurvives := false | ||
| for _, p := range otherPerms { | ||
| if p.ID != perm.ID { | ||
| namespaceSurvives = true | ||
| break | ||
| } | ||
| } | ||
| if !namespaceSurvives { | ||
| // ask SpiceDB itself (the source of truth) whether anything of this type | ||
| // still exists — resources, policy grants, or directly-created tuples. | ||
| rels, err := h.relationService.ListRelations(ctx, relation.Relation{ | ||
| Object: relation.Object{Namespace: perm.NamespaceID}, | ||
| }) | ||
| if err != nil { | ||
| return nil, connect.NewError(connect.CodeInternal, fmt.Errorf("DeletePermission.ListRelations: permission_id=%s: %w", permissionID, err)) | ||
| } | ||
| if len(rels) > 0 { | ||
| return nil, connect.NewError(connect.CodeFailedPrecondition, | ||
| errors.New("cannot delete the last permission of namespace "+perm.NamespaceID+": relationships of this type still exist in SpiceDB (e.g. resources); remove them first")) | ||
| } | ||
| } | ||
|
|
||
| err = h.permissionService.Delete(ctx, permissionID) |
There was a problem hiding this comment.
Serialize the last-permission guard with namespace writers.
This check is still TOCTOU: after List/ListRelations says the namespace is empty, another request can create a resource or direct tuple in perm.NamespaceID before permissionService.Delete removes the final permission row. The current SpiceDB schema keeps accepting that type until restart, so the delete can still leave relationships behind and reintroduce the boot-failure this guard is trying to prevent. This needs a shared critical section with writers for that namespace (for example an advisory lock or an equivalent recheck under the same lock) before the last permission is deleted.
There was a problem hiding this comment.
Valid — this is a real TOCTOU: a resource/tuple could be created in the namespace between the emptiness check and the row delete, re-orphaning it. A couple of reasons I'm leaving it as best-effort here rather than adding the lock:
- Low probability, recoverable worst case. The window is the few microseconds between the check and the delete; it needs a superuser deleting a namespace's last permission at the exact moment someone creates a resource of that same namespace. The worst case is a startup failure (recoverable by re-adding the permission or clearing the relationships), not data loss or an auth bypass.
- A correct fix is cross-cutting and arguably belongs a layer down. Serializing this would mean every namespace writer (resource create, CreateRelation, policy-on-resource) taking the same advisory lock keyed by namespace — significant surface area for this corner. And the deeper root cause isn't the missing lock; it's that bootstrap hard-fails when it tries to drop a populated object type. The robust fix is making bootstrap tolerant (don't drop a type that still has relationships, or clear them first), which is schema-management work tracked separately.
So this guard intentionally catches the common case (delete while a resource clearly exists) and is not race-free. Tracking the bootstrap-tolerance / serialization follow-up in #1693.
There was a problem hiding this comment.
@whoAbhishekSah Thanks for the thorough explanation. The reasoning is sound — the lock surface would be wide (every namespace writer would need to participate), the worst case is a recoverable startup failure rather than a data integrity issue, and making bootstrap tolerant of populated object types is the correct long-term fix anyway. The best-effort guard as implemented is a meaningful improvement over the previous state. Happy to defer the serialization/bootstrap-tolerance work to #1693.
🧠 Learnings used
Learnt from: rohilsurana
Repo: raystack/frontier PR: 1552
File: billing/customer/service.go:52-55
Timestamp: 2026-04-22T05:54:17.663Z
Learning: In this repo’s Go services, avoid adding nil-safe logger guards inside service constructors (e.g., `if logger == nil { logger = slog.Default() }`). Instead, ensure all call sites explicitly pass a non-nil `*slog.Logger` (typically `slog.Default()` or a properly initialized logger). Reviewers should reject suggestions to add constructor nil guards for internal service constructors.
Background: how permissions are stored
A permission lives in three places:
permissionstable (Postgres) — one row per permission.rolestable (Postgres) — each role keeps a list of the permission slugs it grants.(format:
app/role:<role-id>#<permission-slug>@<principal>:*)Rows in the Postgres
permissionstable are written by exactly one function — bootstrap'sAppendSchema(viamigrateServiceDefinitionToDB). It runs at startup (MigrateSchema, from base schema + config) and from theCreatePermissionAPI. On every startup the whole SpiceDB schema is rebuilt from three things: the base schema, the config files, and thepermissionstable. A namespace (object type) shows up in that schema only if it still has at least one permission. So if you delete a namespace's last permission, the namespace disappears from the schema on the next startup. Example:compute/machinehas one permissioncompute/machine:scrub; delete that permission andcompute/machineis no longer in the schema after the next startup.The problems
permission.Deleteonly removed the Postgres row and left the SpiceDB grant tuples behind, so a check kept returning true — the deleted permission still granted access.roles.permissionslist (no FK — it's a denormalized jsonb array).DeletePermissionadmin RPC was hardwired to "function not available."cannot delete object definition … as a relationship exists under it).The fix
permission.Deleteremoves everyapp/role:*#<slug>grant tuple (all roles, all principal types) before deleting the row.role.Service.RemovePermissionFromRoles— through the role service (not its repo), a single targeted DB update, deliberately notrole.Update(which would rebuild each role's tuples + emit a per-role audit). Wired via a setter-injected interface (theSetMembershipServicepattern).DeletePermissionadmin handler, gated on superuser.FailedPrecondition).ReadRelationshipsby object type — the source of truth, since the Postgres mirror can drift) whether any relationship of that type still exists. If so →FailedPrecondition(remove those resources/relationships first), so a delete can't wedge the next startup. If the namespace has other permissions, it survives and the delete proceeds.Known limitation (follow-up)
Tests
TestPermissionDeleteCascade(e2e): stray permission + role → delete → noapp/role:*#<slug>tuple remains and the role no longer lists it; deleting a built-in permission (app/organization:get) →FailedPrecondition.TestPermissionDeleteBlockedByResource(e2e): permission in a fresh namespace + a resource of that type → delete →FailedPrecondition; remove the resource → permission deletes cleanly.Plus unit tests for the relation sweep + role prune (
core/permission) and the role-list strip (core/role).Addresses gap (6) of #1661.