feat: local graph caching and pregeneration #33

@jonathanpopham

Description

Problem

Every command that hits the API (dead-code, blast-radius, audit, focus, find) re-uploads the repo zip and waits 1-3 minutes for analysis. The `analyze` command caches the graph locally by zip hash, but the dedicated analysis endpoints (dead-code, impact) don't benefit — they always re-upload. Users should always have an up-to-date graph without thinking about it.

Current state

`internal/cache/cache.go` — JSON files in `~/.supermodel/cache/{hash}.json`. Each entry is `{graph, cached_at}`. Functions: `Get(hash)`, `Put(hash, graph)`, `HashFile(path)`, `Evict(hash)`. Simple, no index, no staleness detection.

`internal/analyze/handler.go` — `GetGraph()` creates zip, hashes it, checks cache, calls API if miss. Every invocation creates and hashes the zip even if nothing changed.

`internal/archive/archive.go` — shared zip creation. `git archive HEAD` for git repos (fast, respects .gitignore), `WalkZip` fallback.

How commands use caching today:

  • `analyze`: checks cache by zip hash ✅
  • `dead-code`: creates zip, hashes, calls API every time ❌
  • `blast-radius`: creates zip, hashes, calls API every time ❌
  • `audit`: calls both graph + impact, no caching ❌
  • `focus`, `find`: use cached graph from `analyze` ✅ (but only if `analyze` ran first)

Implementation plan

Phase 1: Git-based staleness detection (fast cache lookups)

New file: `internal/cache/fingerprint.go`

Stop hashing the zip on every invocation. Instead, use git to detect changes: ~1ms for the common clean-repo case, ~100ms when the working tree is dirty:

```go
// RepoFingerprint returns a cache key for the current repo state.
// For clean repos (common case): just the commit SHA (~1ms).
// For dirty repos: commit SHA + tree hash (~100ms).
func RepoFingerprint(dir string) (string, error) {
	commitSHA, err := gitOutput(dir, "rev-parse", "HEAD")
	if err != nil {
		return "", fmt.Errorf("not a git repo: %w", err)
	}

	dirty, err := gitOutput(dir, "status", "--porcelain", "--untracked-files=no")
	if err != nil {
		return commitSHA, nil // can't check dirtiness; fall back to the commit SHA
	}

	if dirty == "" {
		return commitSHA, nil // clean: commit SHA is sufficient
	}

	// Dirty: `git stash create` hashes the working tree without modifying it.
	// It can return an empty string (e.g. nothing stashable), so treat that
	// like an error rather than emitting a degenerate "sha:" key.
	treeHash, err := gitOutput(dir, "stash", "create")
	if err != nil || treeHash == "" {
		return commitSHA + ":dirty", nil // fallback: always re-analyze while dirty
	}
	return commitSHA + ":" + treeHash, nil
}

func gitOutput(dir string, args ...string) (string, error) {
	cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
	out, err := cmd.Output()
	return strings.TrimSpace(string(out)), err
}
```

Performance: `git rev-parse HEAD` is ~1ms. `git status --porcelain -uno` is ~100ms. The current approach (create zip + SHA256 the whole thing) is 1-10 seconds depending on repo size.

Phase 2: Cache key that includes analysis type

Extend the cache to store different analysis results (graph, dead-code, impact) for the same repo state:

```go
// CacheKey encodes everything that affects analysis output.
type CacheKey struct {
	RepoFingerprint string // from Phase 1
	AnalysisType    string // "graph" | "dead-code" | "impact" | "impact:{target}"
	CLIVersion      string // invalidate on upgrade
}

func (k CacheKey) Hash() string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%s\x00%s", k.RepoFingerprint, k.AnalysisType, k.CLIVersion)
	return hex.EncodeToString(h.Sum(nil))
}
```

Update `cache.go`:

  • `Get` and `Put` accept `CacheKey` instead of raw hash
  • Dead-code results cached under `{fingerprint}:dead-code:{version}`
  • Impact results cached under `{fingerprint}:impact:{target}:{version}`

Phase 3: Cache index for status and fast lookup

New file: `internal/cache/index.go`

```go
// ~/.supermodel/cache/index.json
type CacheIndex struct {
	Version int                   `json:"version"`
	Repos   map[string]*RepoState `json:"repos"` // key: absolute repo path
}

type RepoState struct {
	Fingerprint string            `json:"fingerprint"`
	CachedAt    time.Time         `json:"cached_at"`
	Analyses    map[string]string `json:"analyses"` // analysis type → cache file
	Status      string            `json:"status"`   // "ready" | "refreshing" | "error"
}
```

Powers:

  • `supermodel status` shows: "Graph cached, 2 commits behind" or "Graph is current"
  • `supermodel cache list` shows all cached repos with age and staleness
  • `supermodel cache clear` removes everything

Phase 4: Background refresh via git hooks

New file: `internal/cache/hooks.go`

On first `supermodel analyze`, offer to install git hooks:

```go
func InstallHooks(repoDir string) error {
	hookDir := filepath.Join(repoDir, ".git", "hooks")
	for _, name := range []string{"post-commit", "post-checkout", "post-merge"} {
		path := filepath.Join(hookDir, name)
		if err := appendToHook(path, "supermodel cache refresh --background 2>/dev/null &\n"); err != nil {
			return err
		}
	}
	return nil
}
```

New subcommand: `supermodel cache refresh`:

```go
// Forks a detached process that re-analyzes and warms the cache.
// Uses a lock file to prevent concurrent refreshes.
func RefreshBackground(cfg *config.Config, dir string) error {
	lockPath := filepath.Join(cache.Dir(), "refresh.lock")
	if isLocked(lockPath) {
		return nil // a refresh is already running
	}
	exe, err := os.Executable()
	if err != nil {
		return err
	}
	cmd := exec.Command(exe, "cache", "refresh", "--dir", dir)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // new process group: survives parent exit
	return cmd.Start() // detach, don't wait
}
```

Why git hooks over file watching (fsnotify):

  • Zero runtime overhead (no daemon, no file descriptors)
  • Fires on meaningful events (commit, checkout, merge) not noisy file writes
  • Works on NFS/network mounts where inotify/kqueue don't
  • No new dependencies

Phase 5: Pregeneration command

`supermodel analyze --pregen` runs all analysis types in parallel:

```go
func Pregen(ctx context.Context, cfg *config.Config, dir string) error {
	zipPath, err := archive.CreateZip(dir)
	if err != nil {
		return err
	}
	defer os.Remove(zipPath)

	fingerprint, err := cache.RepoFingerprint(dir)
	if err != nil {
		return err
	}
	client := api.New(cfg)

	// One CacheKey per analysis type, all sharing the fingerprint.
	graphKey := key("graph", fingerprint)
	dcKey := key("dead-code", fingerprint)
	impactKey := key("impact", fingerprint)

	var g errgroup.Group
	g.Go(func() error {
		// Graph analysis
		result, err := client.Analyze(ctx, zipPath, graphKey)
		if err == nil {
			cache.Put(graphKey, result)
		}
		return err
	})
	g.Go(func() error {
		// Dead-code analysis
		result, err := client.DeadCode(ctx, zipPath, dcKey, "", 0)
		if err == nil {
			cache.Put(dcKey, result)
		}
		return err
	})
	g.Go(func() error {
		// Impact analysis (global)
		result, err := client.Impact(ctx, zipPath, impactKey, "", "")
		if err == nil {
			cache.Put(impactKey, result)
		}
		return err
	})
	return g.Wait()
}
```

File changes summary

| File | Change |
| --- | --- |
| `internal/cache/cache.go` | Extend `Get`/`Put` with `CacheKey`, atomic writes |
| `internal/cache/fingerprint.go` | New — `RepoFingerprint()` using git |
| `internal/cache/index.go` | New — `CacheIndex` for status tracking |
| `internal/cache/hooks.go` | New — git hook installation |
| `cmd/cache.go` | New — `supermodel cache` subcommands: `list`, `refresh`, `clear` |
| `internal/analyze/handler.go` | Use `RepoFingerprint` instead of zip-hash-every-time |
| `internal/deadcode/handler.go` | Check cache before API call |
| `internal/blastradius/handler.go` | Check cache before API call |
| `cmd/analyze.go` | Add `--pregen` flag |

Prior art

Archived Supermodel VS Code extension

Delta Manifest System — `/api-server/java/.../DeltaManifestRegistry.java`: `ConcurrentHashMap<String, DeltaManifest>` tracking upserts (path + contentHash + baseHash), deletes, and renames with versioning (runId, parserVersion, pipelineVersion). The CLI's `CacheKey` struct serves the same purpose — version-gated cache invalidation.

Readiness Sentinel — `/api-server/java/.../RepoReadinessSentinel.java`: writes `.supermodel.ready` after extraction. Workers poll (20 attempts, 500ms). Translates to the CLI as: write `{hash}.json.tmp` then atomic rename to `{hash}.json` — concurrent readers see complete data or nothing.

File Delta Watcher — `/apps/vscode_extension/src/extension/services/fileDeltaWatcher.ts`: watched for changes, built partial ZIPs. In the CLI, git hooks replace file watching for the common case. A future `supermodel watch` command could use fsnotify.

Repo Status Lifecycle — `RepoStatus.NEW → uploaded → processing → complete`. On upload, checks for `delta.json` in ZIP for incremental analysis. The CLI's `CacheIndex.Status` field (`ready|refreshing|error`) mirrors this.

Industry research

Turborepo: content-addressable cache keyed by `SHA256(source files + config + deps + env)`. Global hash + per-task hash. Cache stored in `.turbo/cache/`. Background daemon (being deprecated in v3 — deemed not worth the complexity).

Bazel: two-tier Content-Addressable Store — Action Cache (`SHA256(action + inputs)`) maps to output metadata, CAS maps content hashes to blobs. Cache key deliberately excludes outputs.

Git staleness benchmarks: `git rev-parse HEAD` ~1ms, `git status --porcelain -uno` ~100ms, full `git status` ~42s on WebKit. The `--untracked-files=no` flag is critical.

fsnotify (Go): production-grade (used by Hugo, Air), but has macOS kqueue fd limits, no recursive watching, editor temp file noise. Git hooks are simpler for the "always fresh" use case.

Sequencing

  1. Phase 1 + 2 (fingerprint + typed cache keys) — biggest impact, removes zip-hash-every-time bottleneck
  2. Phase 3 (index) — enables `supermodel status` to show cache state
  3. Phase 4 (hooks) — makes the cache self-refreshing
  4. Phase 5 (pregen) — warms all caches in one pass

Phases 1-3 are one PR. Phase 4 is a follow-up. Phase 5 is independent.
