feat(migrate): MongoDB tooling for v1.19.2 -> v2.0.0 (URLs + emails + ACL audit)#1543
feat(migrate): MongoDB tooling for v1.19.2 -> v2.0.0 (URLs + emails + ACL audit)#1543re-vlad wants to merge 14 commits into
Conversation
✅ Deploy Preview for endearing-brigadeiros-63f9d0 canceled.
|
|
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1543 +/- ##
=======================================
Coverage 90.65% 90.65%
=======================================
Files 69 69
Lines 5690 5690
Branches 985 985
=======================================
Hits 5158 5158
Misses 514 514
Partials 18 18 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…it where missing.
51384cc to
fb1782a
Compare
- Rename URL migration entrypoints to migrate-urls and backup-urls - Rename analyzer module to lib/analyze-urls - Make createBackup accept file prefix and data payload - Update URL migration guide and examples to new names
|
|
||
| **Goal:** every `repos.url` that v2 will match must include the `.git` suffix where it is missing. | ||
|
|
||
| **Why:** v2 resolves repos by **exact** `url`; v1 often relied on `name`. Clients typically send URLs with `.git`; rows without `.git` stop matching. |
There was a problem hiding this comment.
If the "client" you're referring to is the UI, this is inaccurate: the UI enforces the .git ending when creating new repositories. I'd edit this for accuracy/clarity:
| **Why:** v2 resolves repos by **exact** `url`; v1 often relied on `name`. Clients typically send URLs with `.git`; rows without `.git` stop matching. | |
| **Why:** v2 resolves repos by **exact** `url`; v1 often relied on `name`. Rows without `.git` no longer match when calling `getRepoByUrl`, which makes processors such as `checkRepoInAuthorisedList` behave unexpectedly. |
There was a problem hiding this comment.
Good point, this is definately my miswording - by "clients" I meant incoming git HTTP traffic (which includes .git in the path), not the admin UI, which already enforces .git on create. I've updated the 'Why..' section to describe the actual failure mode: exact getRepoByUrl lookup vs legacy rows without .git, and the impact on processors like checkRepoInAuthorisedList
| counts, | ||
| duplicateGroups, | ||
| issues, | ||
| blockingIssueCount: counts.missing + counts.blank + counts.duplicate, |
There was a problem hiding this comment.
Shouldn't we be counting emails with invalid formatting counts.invalidFormat here as well?
There was a problem hiding this comment.
Good catch: invalid-format is already in issues and counts.invalidFormat, but it was omitted from blockingIssueCount. That meant dry-run could exit 0 while the DB still had malformed emails, which is inconsistent with CSV validation. Now thecounts.invalidFormat included into the blockingIssueCount.
| expect(byId.get('u1').status).toBe('missing'); | ||
| expect(byId.get('u2').status).toBe('blank'); | ||
| expect(byId.get('u3').status).toBe('invalid-format'); | ||
| expect(report.blockingIssueCount).toBe(2); // missing + blank |
There was a problem hiding this comment.
Same as the previous comment: any specific reason why emails with invalid formatting are not blocking? The script would just proceed with the update even if invalid emails were present.
There was a problem hiding this comment.
Sure, changed: missing + blank + invalid-format
|
|
||
| for (const repo of repos) { | ||
| const currentUrl = (repo.url || '').trim(); | ||
| const needsUpdate = !currentUrl.endsWith('.git'); |
There was a problem hiding this comment.
Wouldn't this create invalid URLs if the url field wasn't formatted properly before? Some examples:
url: ""would turn intourl: ".git"url: "https://github.com/finos/git-proxy/would turn intourl: "https://github.com/finos/git-proxy/.git
At least we should report these, so admins can fix them manually.
There was a problem hiding this comment.
Agreed, thank you @jescalada, the initial implementation could produce invalid values (e.g. "" → ".git" or trailing / → "/.git"). This PR now normalizes only safe cases (trim + strip trailing /) and reports blank/non-http(s) URLs as issues (with url-issues-*.csv) for manual admin fixes; it also exits non‑zero while issues remain so they can’t be missed.
|
|
||
| for (const change of report.changes) { | ||
| try { | ||
| const success = await updateRepoUrl( |
There was a problem hiding this comment.
Just a heads up: updateRepoUrl is creating a new MongoClient for each repo to update. We probably want to have a single client or do bulk writes instead.
There was a problem hiding this comment.
Agreed: now we reuse a single Mongo сlient when 'apply' to avoid reconnect per repo
| } | ||
|
|
||
| async function countReposWithoutGit(mongoUri, dbName) { | ||
| const client = new MongoClient(mongoUri); |
There was a problem hiding this comment.
This function might also benefit from having a shared MongoClient
There was a problem hiding this comment.
Absolutely, now the function accept already prepared collection and not opened the connection internally
| const reposCollection = db.collection('repos'); | ||
|
|
||
| return await reposCollection.countDocuments({ | ||
| url: { $not: /\.git$/ }, |
There was a problem hiding this comment.
Wouldn't this match empty strings/null URLs? These would get counted but not actually migrated since we're currently ignoring empty URLs in buildUrlNormalizationReport.
There was a problem hiding this comment.
The criteria: { url: {
That's intentional. remaining is a post-apply DB check (how many repos.url values still don't end with .git), not a count of failed migration updates. Blank/unsupported URLs are skipped by normalizeRepoUrl and go to 'issues', not report.changes, but they still belong in this total because they need manual fixes.
We already report them as URL 'issues' and fail the run when urlIssues > 0. I updated the verify warning to note that remaining includes blank/invalid URLs listed under URL issues.
- avoid generating invalid repo URLs during v1→v2 migration - normalizing safe cases (trim + strip trailing slashes) - reporting blank/non-http(s) values as manual-fix issues.
- refactor to reuse one MongoClient per run - avoid repeated connect/close cycles
Andreybest
left a comment
There was a problem hiding this comment.
Looks ok from my side, please check @jescalada comments.
Only 2 things from my side to make it perfect:
- TypeScript in migration scripts would be appreciated.
- Tests that are run on virtual mongo would show more reliability by using mongodb-memory-server or it's derivatives.
Agreed. I'll track a follow-up to convert scripts/migrate to TS for consistency with the rest of the repo. |
+ reword migration docs regarding 'clients' -> 'incoming git HTTP traffic'
Description
Upgrade to Git-Proxy v2.0.0 requires database preparation for breaking behavior around exact repo URL lookup (need .git on repos.url) and v2 push authorization based on committer email + User.username ACL entries.
This PR adds operator-run migration scripts with dry-run by default, explicit apply, structured reports, and separate backups so upgrades are safer and repeatable (aligned with issue #1535).
Phase 1 — repo URL normalization: append .git to repos.url where missing (idempotent) via migrate-urls.js + backup-urls.js.
Phase 2 — users & ACL normalization:
Related Issue
Resolves #1535
Checklist
General
Documentation
Configuration
config.schema.json) was modified:npm run generate-config-types)npm run gen-schema-doc)Tests
npm test)npm run lintandnpm run format:check)npm run check-types)Note: Migration scripts are one-off operator tooling, not part of the main app bundle. Their unit tests live under scripts/migrate/test/ and are not included in the default npm test run (which targets ./test). Run them separately with: