feat: local graph caching and pregeneration #33
Description
Problem
Every command that hits the API (dead-code, blast-radius, audit, focus, find) re-uploads the repo zip and waits 1-3 minutes for analysis. The `analyze` command caches the graph locally by zip hash, but the dedicated analysis endpoints (dead-code, impact) don't benefit — they always re-upload. Users should always have an up-to-date graph without thinking about it.
Current state
`internal/cache/cache.go` — JSON files in `~/.supermodel/cache/{hash}.json`. Each entry is `{graph, cached_at}`. Functions: `Get(hash)`, `Put(hash, graph)`, `HashFile(path)`, `Evict(hash)`. Simple, no index, no staleness detection.
`internal/analyze/handler.go` — `GetGraph()` creates zip, hashes it, checks cache, calls API if miss. Every invocation creates and hashes the zip even if nothing changed.
`internal/archive/archive.go` — shared zip creation. `git archive HEAD` for git repos (fast, respects .gitignore), `WalkZip` fallback.
How commands use caching today:
- `analyze`: checks cache by zip hash ✅
- `dead-code`: creates zip, hashes, calls API every time ❌
- `blast-radius`: creates zip, hashes, calls API every time ❌
- `audit`: calls both graph + impact, no caching ❌
- `focus`, `find`: use cached graph from `analyze` ✅ (but only if `analyze` ran first)
Implementation plan
Phase 1: Git-based staleness detection (fast cache lookups)
New file: `internal/cache/fingerprint.go`
Stop hashing the zip on every invocation. Instead, use git to detect changes in milliseconds rather than seconds:
```go
// RepoFingerprint returns a cache key for the current repo state.
// For clean repos (common case): just the commit SHA (~1ms).
// For dirty repos: commit SHA + tree hash (~100ms).
func RepoFingerprint(dir string) (string, error) {
	commitSHA, err := gitOutput(dir, "rev-parse", "HEAD")
	if err != nil {
		return "", fmt.Errorf("not a git repo: %w", err)
	}
	dirty, err := gitOutput(dir, "status", "--porcelain", "--untracked-files=no")
	if err != nil {
		return commitSHA, nil // can't check dirtiness; fall back to commit SHA
	}
	if dirty == "" {
		return commitSHA, nil // clean: commit SHA is sufficient
	}
	// Dirty: include a hash of the working tree changes. `git stash create`
	// builds a stash commit without touching the working tree or stash list.
	treeHash, err := gitOutput(dir, "stash", "create")
	if err != nil || treeHash == "" {
		return commitSHA + ":dirty", nil // fallback
	}
	return commitSHA + ":" + treeHash, nil
}

func gitOutput(dir string, args ...string) (string, error) {
	cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
	out, err := cmd.Output()
	return strings.TrimSpace(string(out)), err
}
```
Performance: `git rev-parse HEAD` is ~1ms. `git status --porcelain -uno` is ~100ms. The current approach (create zip + SHA256 the whole thing) is 1-10 seconds depending on repo size.
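Directories that aren't git repos have no fingerprint to key on, so `GetGraph` still needs the zip-hash path as a fallback. A minimal sketch of the dispatch check, with `isGitRepo` as a hypothetical helper (not part of the plan above):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// isGitRepo reports whether dir is inside a git work tree. GetGraph could
// use this to choose between the fast RepoFingerprint path and the
// existing zip+SHA256 fallback for non-git directories.
func isGitRepo(dir string) bool {
	out, err := exec.Command("git", "-C", dir, "rev-parse", "--is-inside-work-tree").Output()
	return err == nil && strings.TrimSpace(string(out)) == "true"
}

func main() {
	fmt.Println(isGitRepo("."))
}
```

`rev-parse --is-inside-work-tree` errors out for non-git directories (and when git is absent), so the helper degrades safely to the zip path.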
Phase 2: Cache key that includes analysis type
Extend the cache to store different analysis results (graph, dead-code, impact) for the same repo state:
```go
// CacheKey encodes everything that affects analysis output.
type CacheKey struct {
	RepoFingerprint string // from Phase 1
	AnalysisType    string // "graph" | "dead-code" | "impact" | "impact:{target}"
	CLIVersion      string // invalidate on upgrade
}

func (k CacheKey) Hash() string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%s\x00%s", k.RepoFingerprint, k.AnalysisType, k.CLIVersion)
	return hex.EncodeToString(h.Sum(nil))
}
```
Update `cache.go`:
- `Get` and `Put` accept `CacheKey` instead of raw hash
- Dead-code results cached under `{fingerprint}:dead-code:{version}`
- Impact results cached under `{fingerprint}:impact:{target}:{version}`
Phase 3: Cache index for status and fast lookup
New file: `internal/cache/index.go`
```go
// ~/.supermodel/cache/index.json
type CacheIndex struct {
	Version int                   `json:"version"`
	Repos   map[string]*RepoState `json:"repos"` // key: abs path
}

type RepoState struct {
	Fingerprint string            `json:"fingerprint"`
	CachedAt    time.Time         `json:"cached_at"`
	Analyses    map[string]string `json:"analyses"` // type → cache file
	Status      string            `json:"status"`   // "ready" | "refreshing" | "error"
}
```
The index powers:
- `supermodel status` shows: "Graph cached, 2 commits behind" or "Graph is current"
- `supermodel cache list` shows all cached repos with age and staleness
- `supermodel cache clear` removes everything
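Loading the index should tolerate a missing, corrupt, or incompatibly-versioned file: all three degrade to a cache miss rather than an error. A hypothetical `LoadIndex` sketch under that assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type RepoState struct {
	Fingerprint string            `json:"fingerprint"`
	CachedAt    time.Time         `json:"cached_at"`
	Analyses    map[string]string `json:"analyses"`
	Status      string            `json:"status"`
}

type CacheIndex struct {
	Version int                   `json:"version"`
	Repos   map[string]*RepoState `json:"repos"`
}

const indexVersion = 1

// LoadIndex reads index.json, returning a fresh empty index when the file
// is missing, unparseable, or written by an incompatible version.
func LoadIndex(path string) *CacheIndex {
	fresh := &CacheIndex{Version: indexVersion, Repos: map[string]*RepoState{}}
	data, err := os.ReadFile(path)
	if err != nil {
		return fresh
	}
	var idx CacheIndex
	if json.Unmarshal(data, &idx) != nil || idx.Version != indexVersion {
		return fresh
	}
	if idx.Repos == nil {
		idx.Repos = map[string]*RepoState{}
	}
	return &idx
}

func main() {
	idx := LoadIndex("/nonexistent/index.json")
	fmt.Println(idx.Version, len(idx.Repos)) // 1 0
}
```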
Phase 4: Background refresh via git hooks
New file: `internal/cache/hooks.go`
On first `supermodel analyze`, offer to install git hooks:
```go
func InstallHooks(repoDir string) error {
	hookDir := filepath.Join(repoDir, ".git", "hooks")
	for _, name := range []string{"post-commit", "post-checkout", "post-merge"} {
		path := filepath.Join(hookDir, name)
		if err := appendToHook(path, "supermodel cache refresh --background 2>/dev/null &\n"); err != nil {
			return err
		}
	}
	return nil
}
```
New subcommand: `supermodel cache refresh`:
```go
// RefreshBackground forks a detached process that re-analyzes the repo and
// warms the cache. A lock file prevents concurrent refreshes.
func RefreshBackground(cfg *config.Config, dir string) error {
	lockPath := filepath.Join(cache.Dir(), "refresh.lock")
	if isLocked(lockPath) {
		return nil // a refresh is already in flight
	}
	exe, err := os.Executable()
	if err != nil {
		return err
	}
	cmd := exec.Command(exe, "cache", "refresh", "--dir", dir)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // own process group, survives parent exit
	return cmd.Start() // detach, don't wait
}
```
Why git hooks over file watching (fsnotify):
- Zero runtime overhead (no daemon, no file descriptors)
- Fires on meaningful events (commit, checkout, merge) not noisy file writes
- Works on NFS/network mounts where inotify/kqueue don't
- No new dependencies
Phase 5: Pregeneration command
`supermodel analyze --pregen` runs all analysis types in parallel:
```go
func Pregen(ctx context.Context, cfg *config.Config, dir string) error {
	zipPath, err := archive.CreateZip(dir)
	if err != nil {
		return err
	}
	defer os.Remove(zipPath)
	fingerprint, err := cache.RepoFingerprint(dir)
	if err != nil {
		return err
	}
	// key builds a typed cache key for one analysis type.
	// (CLIVersion omitted in this sketch.)
	key := func(analysisType string) cache.CacheKey {
		return cache.CacheKey{RepoFingerprint: fingerprint, AnalysisType: analysisType}
	}
	client := api.New(cfg)
	var g errgroup.Group
	g.Go(func() error {
		// Graph analysis
		result, err := client.Analyze(ctx, zipPath, key("graph"))
		if err == nil {
			cache.Put(key("graph"), result)
		}
		return err
	})
	g.Go(func() error {
		// Dead-code analysis
		result, err := client.DeadCode(ctx, zipPath, key("dead-code"), "", 0)
		if err == nil {
			cache.Put(key("dead-code"), result)
		}
		return err
	})
	g.Go(func() error {
		// Impact analysis (global)
		result, err := client.Impact(ctx, zipPath, key("impact"), "", "")
		if err == nil {
			cache.Put(key("impact"), result)
		}
		return err
	})
	return g.Wait()
}
```
File changes summary
| File | Change |
|---|---|
| `internal/cache/cache.go` | Extend `Get`/`Put` with `CacheKey`, atomic writes |
| `internal/cache/fingerprint.go` | New — `RepoFingerprint()` using git |
| `internal/cache/index.go` | New — `CacheIndex` for status tracking |
| `internal/cache/hooks.go` | New — git hook installation |
| `cmd/cache.go` | New — `supermodel cache [list|refresh|clear]` subcommands |
| `internal/analyze/handler.go` | Use `RepoFingerprint` instead of zip-hash-every-time |
| `internal/deadcode/handler.go` | Check cache before API call |
| `internal/blastradius/handler.go` | Check cache before API call |
| `cmd/analyze.go` | Add `--pregen` flag |
Prior art
Archived Supermodel VS Code extension
Delta Manifest System — `/api-server/java/.../DeltaManifestRegistry.java`: `ConcurrentHashMap<String, DeltaManifest>` tracking upserts (path + contentHash + baseHash), deletes, and renames with versioning (runId, parserVersion, pipelineVersion). The CLI's `CacheKey` struct serves the same purpose — version-gated cache invalidation.
Readiness Sentinel — `/api-server/java/.../RepoReadinessSentinel.java`: writes `.supermodel.ready` after extraction. Workers poll (20 attempts, 500ms). Translates to the CLI as: write `{hash}.json.tmp` then atomic rename to `{hash}.json` — concurrent readers see complete data or nothing.
File Delta Watcher — `/apps/vscode_extension/src/extension/services/fileDeltaWatcher.ts`: watched for changes, built partial ZIPs. In the CLI, git hooks replace file watching for the common case. A future `supermodel watch` command could use fsnotify.
Repo Status Lifecycle — `RepoStatus.NEW → uploaded → processing → complete`. On upload, checks for `delta.json` in ZIP for incremental analysis. The CLI's `CacheIndex.Status` field (`ready|refreshing|error`) mirrors this.
Industry research
Turborepo: content-addressable cache keyed by `SHA256(source files + config + deps + env)`. Global hash + per-task hash. Cache stored in `.turbo/cache/`. Background daemon (being deprecated in v3 — deemed not worth the complexity).
Bazel: two-tier Content-Addressable Store — Action Cache (`SHA256(action + inputs)`) maps to output metadata, CAS maps content hashes to blobs. Cache key deliberately excludes outputs.
Git staleness benchmarks: `git rev-parse HEAD` ~1ms, `git status --porcelain -uno` ~100ms, full `git status` ~42s on WebKit. The `--untracked-files=no` flag is critical.
fsnotify (Go): production-grade (used by Hugo, Air), but has macOS kqueue fd limits, no recursive watching, editor temp file noise. Git hooks are simpler for the "always fresh" use case.
Sequencing
- Phase 1 + 2 (fingerprint + typed cache keys) — biggest impact, removes zip-hash-every-time bottleneck
- Phase 3 (index) — enables `supermodel status` to show cache state
- Phase 4 (hooks) — makes the cache self-refreshing
- Phase 5 (pregen) — warms all caches in one pass
Phases 1-3 are one PR. Phase 4 is a follow-up. Phase 5 is independent.