Add embed information to url-metadata by fennifith · Pull Request #121 · playfulprogramming/hoof

fennifith · 2026-03-03T02:12:05Z

Fixes #82 - copies the platform-detectors implementation from the frontend to run as part of the url-metadata task to store/upload github gists, posts, and video embed info.

Summary by CodeRabbit

New Features
- Embed extraction expanded: GitHub Gists, X posts, YouTube/Vimeo/Twitch videos, plus unified HTML dispatch and dedicated providers; public fetchers added for gist/post/video/html.
Database
- New tables and fields to store gist files and post embed metadata; richer embed fields persisted (type, ids, iframe src/dimensions, image alt).
Tests
- Added unit tests covering gist and post embed flows.
Chores
- S3 object removal utility added; image processing result type exported; package dependency references consolidated.

coderabbitai · 2026-03-03T02:12:25Z

📝 Walkthrough

Walkthrough

Adds a full embed extraction pipeline (gist, X/Twitter posts, video providers, HTML), DB schema and migration for embed entities, S3 upload/remove support, worker fetchers and tests, API response enrichment, GitHub helper, and migrates TypeBox imports/dependencies from @sinclair/typebox to typebox.

Changes

Cohort / File(s)	Summary
Manifests / Catalog `pnpm-workspace.yaml`, `apps/api/package.json`, `apps/worker/package.json`, `packages/bullmq/package.json`, `packages/common/package.json`, `packages/post-images/package.json`	Replaced `@sinclair/typebox` with `typebox` in manifests/catalog entries and dependency lists.
Import path updates `apps/api/src/routes/tasks/post-images.ts`, `packages/bullmq/src/tasks/post-image.ts`, `packages/bullmq/src/tasks/url-metadata.ts`, `packages/post-images/src/fetch-post-data.ts`, `apps/worker/src/tasks/sync-//`	Updated TypeBox import sources from `@sinclair/typebox` → `typebox` (including value variants).
API: URL metadata route `apps/api/src/routes/tasks/url-metadata.ts`, `apps/api/package.json`	Extended response schema to include `embed` (gist/post/video), added mapping helpers to assemble enriched responses (images, embeds), and normalized URL comparisons to include query string.
DB schema & migration `packages/db/src/schema/url-metadata.ts`, `packages/db/drizzle/.../migration.sql`, `packages/db/drizzle/.../snapshot.json`	Added tables `url_metadata_gist`, `url_metadata_gist_file`, `url_metadata_post`; extended `url_metadata` with gist/post FKs and fields `embed_src`, `embed_width`, `embed_height`, `embed_type`.
Worker: embed common & types `apps/worker/src/tasks/url-metadata/common.ts`, `apps/worker/src/tasks/url-metadata/utils/processImage.ts`	Added exported `EmbedData` type, exported `BUCKET` (awaited s3.ensureBucket), and exported `ProcessImageResult`.
Worker: gist support `apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts`, `.../getEmbedDataFromGist.test.ts`	New gist handler: fetch gist metadata/files, upload file contents to S3 (MD5 keys), upsert DB records, delete stale rows and S3 objects; unit test added.
Worker: post (X) support `apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`, `.../getEmbedDataFromPost.test.ts`, `apps/worker/src/tasks/url-metadata/data-providers/x.ts`	New post fetcher: fetch FxTwitter post, process avatar/media images, upsert `url_metadata_post`, return embed id; unit test added.
Worker: video providers & dispatcher `apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`, `.../data-providers/youtube.ts`, `.../vimeo.ts`, `.../twitch.ts`, `.../video.ts`	Added YouTube/Vimeo/Twitch oEmbed providers and dispatcher `getVideoDataFromUrl`; video fetcher parses iframe, uploads thumbnail, returns embed data; exported host lists.
Worker: HTML embed fetcher `apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts`	New HTML embed fetcher: fetch page, extract title, icons and OG banners, process images, return EmbedData with error handling.
Worker: FX embed typings `apps/worker/src/tasks/url-metadata/data-providers/fx-embed/twitter.ts`, `.../types.ts`	Added extensive TypeScript type definitions for FX/Twitter API responses used by post fetcher.
Worker: processor refactor `apps/worker/src/tasks/url-metadata/processor.ts`	Reworked processor to dispatch to embed fetchers (gist/post/video/html), require jobId, persist embedType/IDs/embedSrc/dimensions, and centralize embed error handling.
S3 utils & test setup `packages/s3/src/utils.ts`, `apps/worker/test-utils/setup.ts`	Added `remove(bucket,key)` wrapper using DeleteObjectCommand; expanded test mocks to include `s3.remove`, stateful `db` urlMetadata* mocks and `github-api.getGistById`.
GitHub API helper `packages/github-api/src/getGistById.ts`, `packages/github-api/src/index.ts`	Added `getGistById(params)` and re-exported it from package index.
Misc worker additions `apps/worker/src/tasks/url-metadata/*` (many new modules & tests)	Multiple new modules, helpers, and tests supporting embed extraction, mapping, and persistence (gist, post, video, html).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as API
    participant Worker as Worker
    participant GitHub as GitHub
    participant DB as DB
    participant S3 as S3

    Client->>API: enqueue /tasks/url-metadata (gist URL)
    API->>Worker: job received
    Worker->>GitHub: GET /gists/{id}
    GitHub-->>Worker: gist metadata + files
    Worker->>S3: upload file contents (MD5 keys)
    S3-->>Worker: content keys
    Worker->>DB: upsert gist + file rows, delete stale rows
    Worker->>S3: remove deleted files
    Worker->>DB: update url_metadata (gistId, embedType)
    Worker-->>API: job complete
    API->>DB: query enriched url_metadata
    API-->>Client: enriched response (metadata + embed)

sequenceDiagram
    participant Client as Client
    participant API as API
    participant Worker as Worker
    participant FX as FXTwitter
    participant Images as ImageProcessor
    participant S3 as S3
    participant DB as DB

    Client->>API: enqueue /tasks/url-metadata (tweet URL)
    API->>Worker: job received
    Worker->>FX: GET /{user}/status/{id}
    FX-->>Worker: tweet + media URLs
    Worker->>Images: process avatar -> upload
    Images->>S3: upload avatar
    S3-->>Worker: avatar key
    Worker->>Images: process media -> upload
    Images->>S3: upload media
    S3-->>Worker: media key
    Worker->>DB: upsert url_metadata_post and url_metadata (embedType=post)
    Worker-->>API: job complete
    API->>DB: query enriched url_metadata
    API-->>Client: enriched response (author, content, images)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

update drizzle?? #111: touches DB schema and url_metadata migrations — strongly related to the newly added url_metadata_gist / url_metadata_post tables and migration.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.04% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add embed information to url-metadata' accurately summarizes the main objective of the PR, which is to add OEmbed and embed data support to the url-metadata task.
Linked Issues check	✅ Passed	The PR fulfills all three primary requirements from issue `#82`: (1) implements platform detection for gists, posts, and videos; (2) adds database columns for embed information (embedSrc, embedWidth, embedHeight, embedType, gistId, postId); (3) uploads image fields to S3 and stores resulting keys.
Out of Scope Changes check	✅ Passed	All code changes are directly related to supporting embed metadata processing. The TypeBox package migration and GitHub gist API additions are required dependencies for the embed feature implementation.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch oembed-url-metadata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 15

🧹 Nitpick comments (4)

pnpm-workspace.yaml (1)
6-6: TypeBox migration is complete; drift check would be a good preventive measure.

All source files are using the new typebox package correctly, and no legacy @sinclair/typebox references remain. A CI check to prevent reintroduction of the legacy package name is recommended as a future safeguard, but not required now.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pnpm-workspace.yaml` at line 6, Add a preventive CI drift check to ensure the
legacy package name '@sinclair/typebox' is never reintroduced: create a CI job
(or an npm script referenced by CI) named something like "check-typebox" that
searches the repository for the string '@sinclair/typebox' (excluding
node_modules/build artifacts) and fails the job if any matches are found; this
should be triggered on PRs and main branch updates and references the package
names "typebox" and "@sinclair/typebox" so reviewers can quickly locate the
check and its intent.
apps/worker/src/tasks/url-metadata/processor.ts (1)
10-10: Consider adding an "html" embed type for completeness.

The UrlMetadataEmbedType defines "gist" | "video" | "post", but the HTML fallback path leaves embedType as undefined. For clarity and future filtering/querying, you might want to include "html" in the union:
♻️ Suggested addition
-type UrlMetadataEmbedType = "gist" | "video" | "post";
+type UrlMetadataEmbedType = "gist" | "video" | "post" | "html";
And then set it explicitly in the else branch (line 28-29):
 } else {
+	embedType = "html";
 	embedPromise = getEmbedDataFromHtml(jobId, inputUrl, signal);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/processor.ts` at line 10, The
UrlMetadataEmbedType union should include an "html" variant and the fallback
assignment should set embedType = "html" instead of leaving it undefined; update
the type alias UrlMetadataEmbedType to include "html" and modify the fallback
branch in the URL metadata parsing logic (the code that currently leaves
embedType undefined) to explicitly assign "html" so downstream filtering/queries
can rely on a concrete embedType.
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts (1)
3-25: Consider relaxing literal types for external API fields.

The provider_url: "https://vimeo.com/" is typed as a literal string. If Vimeo's API ever returns a slightly different value (e.g., without trailing slash), it could cause type mismatches at runtime with stricter type checking. Consider using string for fields that aren't critical to the application logic.
♻️ Suggested type relaxation
 export interface VimeoOEmbedResponse {
 	type: "video";
 	version: "1.0";
 	provider_name: "Vimeo";
-	provider_url: "https://vimeo.com/";
+	provider_url: string;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` around lines 3 -
25, The VimeoOEmbedResponse interface uses literal string types for external API
fields (e.g., provider_url: "https://vimeo.com/") which is brittle; update the
interface by changing those literal types to more permissive types (for example
provider_url: string and similarly relax provider_name, version, and type if
they are not critical) so the shape accepts minor API variations while keeping
strongly typed fields like title, width, height, duration, thumbnail_url,
video_id, and uri unchanged.
apps/worker/src/tasks/url-metadata/common.ts (1)
17-17: Consider lazy initialization for the S3 bucket.

The top-level await runs at module import time, which means:

Any import failure from s3.ensureBucket will prevent the module from loading.

The bucket is created even if the importing code never uses it.

Test mocking becomes trickier since this runs before test setup.

For a worker context this may be acceptable, but lazy initialization would be more resilient:
♻️ Suggested lazy initialization pattern
-export const BUCKET = await s3.ensureBucket(env.S3_BUCKET);
+let _bucket: Awaited<ReturnType<typeof s3.ensureBucket>> | undefined;
+
+export async function getBucket() {
+  if (!_bucket) {
+    _bucket = await s3.ensureBucket(env.S3_BUCKET);
+  }
+  return _bucket;
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/common.ts` at line 17, Replace the
top-level awaited export with a lazy initializer: remove the top-level "export
const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and add an async getter
(e.g., getBucket or ensureBucketInitialized) that caches the bucket after
calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers to call this
getter instead of referencing BUCKET directly so the bucket creation runs lazily
and is easier to mock/test and to avoid module-import side effects.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/routes/tasks/url-metadata.ts`:
- Around line 107-109: Replace the use of logical OR with nullish coalescing for
optional numeric fields so valid 0 values are preserved: change occurrences of
"width: width || undefined" and "height: height || undefined" (and the analogous
occurrences at the later lines) to use "width: width ?? undefined" and "height:
height ?? undefined"; keep altText as "altText || undefined" only if empty
string should be treated as missing, otherwise change altText to "altText ??
undefined". Update the assignments where these symbols (width, height, altText)
are used in the URL metadata construction and at the later lines mentioned to
use ?? accordingly.
- Around line 113-115: The mapGist function currently throws when the side-table
row is missing; instead, change the behavior in mapGist (and the analogous
functions that throw for missing side rows—e.g., the mapPost path and the other
URL metadata mappers around the later throw sites) to return undefined when
db.query.urlMetadataGist.findFirst(...) (or the equivalent db query) returns no
row, and call the existing requeue/refresh helper (use the module's
enqueue/refresh function used elsewhere for metadata refresh) to schedule an
async refresh rather than throwing; ensure callers of mapGist handle undefined
as a cache-hit fallback so the request path does not 500.
- Line 237: The code currently force-casts result.embedType using "embedType:
result.embedType as 'post' | 'gist' | 'video' | undefined" which is unsafe;
replace this with a guarded mapper: add a small type‑guard (e.g.,
isValidEmbedType(value): value is 'post'|'gist'|'video') and set embedType to
isValidEmbedType(result.embedType) ? result.embedType : undefined (or handle
invalid values explicitly), updating the assignment site where embedType is set
from result.embedType so the response contract is enforced without unsafe
casting.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts`:
- Line 42: twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but
misses "www.twitch.tv", causing valid Twitch URLs to be ignored; update the
twitchHosts export (the twitchHosts constant in
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include
"www.twitch.tv" in the array alongside the existing entries so URL
matching/embedding recognizes both forms.
- Around line 29-38: The current fallback uses the page URL (embedUrl = url)
which produces invalid Twitch iframe src; instead detect non-clip Twitch URLs
and build a proper player endpoint (use "https://player.twitch.tv/" with either
?channel=<channelName> for channel pages or ?video=v<videoId> for video pages /
collections), preserve/add the parent query param on that player URL (use
embedUrl.searchParams.set("parent", parent)), and update the returned html to
use width="400" (minimum) and height="300"; change references to embedUrl and
url where the iframe src is constructed and handle the TODO for video/collection
paths by mapping path segments to the correct channel or video query parameter
before returning html.

In `@apps/worker/src/tasks/url-metadata/data-providers/video.ts`:
- Line 37: Exported videoHosts currently combines youtubeHosts and vimeoHosts
but omits twitchHosts, preventing Twitch URLs from being routed into the video
embed path despite getVideoDataFromUrl supporting Twitch; update the exported
videoHosts (symbol: videoHosts) to include twitchHosts alongside youtubeHosts
and vimeoHosts so Twitch URLs are recognized, and verify getVideoDataFromUrl
still handles Twitch-specific logic (symbols: getVideoDataFromUrl, twitchHosts).

In `@apps/worker/src/tasks/url-metadata/data-providers/youtube.ts`:
- Around line 27-47: The code builds youtubeUrl using the variable videoId
without ensuring it was set; update the parsing logic around videoId in
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts (the videoId
variable and the youtubeUrl construction) to guard against missing/empty IDs:
after the if/else blocks that set videoId, check that videoId is a non-empty
string and if not, return null (or otherwise handle the invalid case) instead of
proceeding to construct `https://www.youtube.com/watch?v=${videoId}`.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts`:
- Around line 33-35: The test uses a simplified mock chain
((db.delete(urlMetadataGistFile).where(undefined).returning as Mock)) that hides
the real where clause used by the implementation (and(eq(...),
not(inArray(...))); update the mock in getEmbedDataFromGist.test.ts to reflect
or assert the actual where predicate: either mock
db.delete(urlMetadataGistFile).where with a matcher that checks the passed
predicate (e.g., capture the where arg and verify it composes and(eq(...),
not(inArray(...)))) or replace the .where(undefined) with a more explicit
mockImplementation that validates the incoming where function/criteria before
returning Promise.resolve([]); alternatively, if simplifying is intentional add
a brief comment explaining why the simplified mock covers the case and cannot
hide regressions, referencing db.delete and urlMetadataGistFile so reviewers can
locate the change.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts`:
- Around line 25-33: The function currently logs an invalid gist URL but
continues to call github.getGistById; change the invalid-path branch (when
!githubUsername || !gistId) to return immediately (e.g., return undefined) after
logging to avoid the network call and further processing instead of only setting
error = true; ensure this prevents execution of github.getGistById and
downstream code that expects valid gistId/inputUrl.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts`:
- Around line 97-101: The test incorrectly passes a URL string as the jobId to
getEmbedDataFromPost; update the test to pass a distinct mock job ID (e.g.,
"test-job-id" or similar) as the first argument while keeping the existing URL
(new URL(fakeFxTwitterUrl)) and AbortSignal, so the call mirrors production
usage and avoids conflating jobId with the input URL when generating S3 keys.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 33-37: The code in getEmbedDataFromPost uses new URL(avatarUrl)
and new URL(image.url) directly (variables avatarUrl and image.url) which can
throw synchronously; wrap the URL construction in a try/catch or pre-validate
the string before calling processImages so malformed URLs are skipped or
converted to undefined instead of throwing, and ensure any skipped images are
filtered out before calling processImages([ ... ]) and before mapping into new
URL for images so the downstream .catch handlers are not bypassed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`:
- Around line 37-45: In getEmbedDataFromVideo, protect parsing of
videoData.thumbnail_url before calling processImages: the expression new
URL(videoData.thumbnail_url) can throw synchronously and bypass the existing
.catch on the processImages promise; wrap the URL construction/validation in a
try/catch (or guard with a safe URL-parse helper) and only call processImages
when the URL was successfully created, and ensure any thrown error is caught and
logged/handled using the existing jobId/signal flow so banner assignment doesn't
crash the task.

In `@apps/worker/test-utils/setup.ts`:
- Around line 38-41: The mock deleteMockResponse currently sets returning =
vi.fn() which yields undefined when awaited; change returning to be a function
that returns an empty array by default (e.g., make returning an async/mock
function that returns []), so callers that await the .returning(...) can safely
iterate; update the returned object in deleteMockResponse (the where mock that
returns { returning }) accordingly so .where(...).returning(...) resolves to an
empty array by default.

In `@packages/db/src/schema/url-metadata.ts`:
- Line 24: The url_metadata_gist_file schema currently forces language as
notNull (varchar("language", { length: 16 }).notNull()), which fails when GitHub
returns null/omits files[...].language; change the column definition in
packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to
allow nulls (remove .notNull() or use .nullable() depending on the ORM) so
inserts succeed when language is absent, and run migrations/regen types if
applicable.

In `@packages/github-api/src/getGistById.ts`:
- Around line 26-29: The error message in getGistById incorrectly references
getCommits; update the throw in the block that checks if data is undefined or
response.error (in function getGistById) to mention getGistById instead of
getCommits and include the same response.response.status and response.error
details so the runtime message correctly identifies the failing operation.

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/common.ts`:
- Line 17: Replace the top-level awaited export with a lazy initializer: remove
the top-level "export const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and
add an async getter (e.g., getBucket or ensureBucketInitialized) that caches the
bucket after calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers
to call this getter instead of referencing BUCKET directly so the bucket
creation runs lazily and is easier to mock/test and to avoid module-import side
effects.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Around line 3-25: The VimeoOEmbedResponse interface uses literal string types
for external API fields (e.g., provider_url: "https://vimeo.com/") which is
brittle; update the interface by changing those literal types to more permissive
types (for example provider_url: string and similarly relax provider_name,
version, and type if they are not critical) so the shape accepts minor API
variations while keeping strongly typed fields like title, width, height,
duration, thumbnail_url, video_id, and uri unchanged.

In `@apps/worker/src/tasks/url-metadata/processor.ts`:
- Line 10: The UrlMetadataEmbedType union should include an "html" variant and
the fallback assignment should set embedType = "html" instead of leaving it
undefined; update the type alias UrlMetadataEmbedType to include "html" and
modify the fallback branch in the URL metadata parsing logic (the code that
currently leaves embedType undefined) to explicitly assign "html" so downstream
filtering/queries can rely on a concrete embedType.

In `@pnpm-workspace.yaml`:
- Line 6: Add a preventive CI drift check to ensure the legacy package name
'@sinclair/typebox' is never reintroduced: create a CI job (or an npm script
referenced by CI) named something like "check-typebox" that searches the
repository for the string '@sinclair/typebox' (excluding node_modules/build
artifacts) and fails the job if any matches are found; this should be triggered
on PRs and main branch updates and references the package names "typebox" and
"@sinclair/typebox" so reviewers can quickly locate the check and its intent.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f71585 and 389e182.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (39)

apps/api/package.json
apps/api/src/routes/tasks/post-images.ts
apps/api/src/routes/tasks/url-metadata.ts
apps/worker/package.json
apps/worker/src/tasks/sync-author/processor.ts
apps/worker/src/tasks/sync-author/types.ts
apps/worker/src/tasks/sync-collection/processor.ts
apps/worker/src/tasks/sync-collection/types.ts
apps/worker/src/tasks/sync-post/processor.ts
apps/worker/src/tasks/sync-post/types.ts
apps/worker/src/tasks/url-metadata/common.ts
apps/worker/src/tasks/url-metadata/data-providers/fx-embed/twitter.ts
apps/worker/src/tasks/url-metadata/data-providers/fx-embed/types.ts
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts
apps/worker/src/tasks/url-metadata/data-providers/video.ts
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts
apps/worker/src/tasks/url-metadata/data-providers/x.ts
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts
apps/worker/src/tasks/url-metadata/processor.ts
apps/worker/src/tasks/url-metadata/utils/processImage.ts
apps/worker/test-utils/setup.ts
packages/bullmq/package.json
packages/bullmq/src/tasks/post-image.ts
packages/bullmq/src/tasks/url-metadata.ts
packages/common/package.json
packages/common/src/env.ts
packages/db/src/schema/url-metadata.ts
packages/github-api/src/getGistById.ts
packages/github-api/src/index.ts
packages/post-images/package.json
packages/post-images/src/fetch-post-data.ts
packages/s3/src/utils.ts
pnpm-workspace.yaml

coderabbitai · 2026-03-03T02:18:41Z

apps/api/src/routes/tasks/url-metadata.ts

 		width: width || undefined,
 		height: height || undefined,
+		altText: altText || undefined,


⚠️ Potential issue | 🟡 Minor

Use nullish coalescing for optional numeric fields.

Line [107], Line [108], Line [163], and Line [164] use ||, which drops valid 0 values. Use ?? so only null/undefined are omitted.

Suggested fix

- width: width || undefined, - height: height || undefined, + width: width ?? undefined, + height: height ?? undefined, @@ - width: result.embedWidth || undefined, - height: result.embedHeight || undefined, + width: result.embedWidth ?? undefined, + height: result.embedHeight ?? undefined,

Also applies to: 163-164

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 107 - 109, Replace the use of logical OR with nullish coalescing for optional numeric fields so valid 0 values are preserved: change occurrences of "width: width || undefined" and "height: height || undefined" (and the analogous occurrences at the later lines) to use "width: width ?? undefined" and "height: height ?? undefined"; keep altText as "altText || undefined" only if empty string should be treated as missing, otherwise change altText to "altText ?? undefined". Update the assignments where these symbols (width, height, altText) are used in the URL metadata construction and at the later lines mentioned to use ?? accordingly.

apps/api/src/routes/tasks/url-metadata.ts

coderabbitai · 2026-03-03T02:18:41Z

apps/worker/src/tasks/url-metadata/data-providers/twitch.ts

+	if (!embedUrl) {
+		embedUrl = url;
+	}
+
+	// Set the "parent" property for embedding - https://dev.twitch.tv/docs/embed/everything/#usage
+	embedUrl.searchParams.set("parent", parent);
+
+	return {
+		html: `<iframe src="${embedUrl.toString()}" height="300"></iframe>`,
+		title: "Twitch Embed",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n apps/worker/src/tasks/url-metadata/data-providers/twitch.ts

Repository: playfulprogramming/hoof

Length of output: 1678

🌐 Web query:

Twitch embed iframe src requirements player.twitch.tv channel video

💡 Result:

For a non-interactive Twitch player iframe, the src must point to https://player.twitch.tv/ and include one content identifier plus at least one parent domain:

src formats

Live channel
<iframe src="https://player.twitch.tv/?channel=<channel>&parent=<yourdomain.com>"></iframe>

VOD / past broadcast
<iframe src="https://player.twitch.tv/?video=<videoID>&parent=<yourdomain.com>"></iframe>

Key requirements

HTTPS/SSL required for domains using Twitch embeds. [1]

parent is required: add the full domain (include subdomain like www if used). If the embed can appear on multiple domains, include multiple parent= entries. Missing/mismatched parent triggers a playback/config error. [1][2]
src="https://player.twitch.tv/?channel=twitchpresents&parent=example.com&parent=www.example.com"

Minimum size (per Twitch embed requirements): width ≥ 400, height ≥ 300. [3]

Sources: Twitch embed requirements and examples from Twitch Developers docs and Twitch Developer Forums. [1][2][3]

Non-clip Twitch URLs produce invalid embed endpoints.

The code falls back to embedding the page URL directly (line 30: embedUrl = url). This results in iframe sources like https://twitch.tv/somechannel?parent=..., which Twitch does not recognize as valid embed endpoints. Twitch requires embeds to use https://player.twitch.tv/ with channel or video query parameters.

Additionally, the iframe has height="300" but no width attribute. Twitch embeds require a minimum width of 400 pixels.

The TODO comment on line 25 explicitly flags this limitation for video and collection URLs, which also need proper handling.

💡 Suggested fix

if (!embedUrl) { const [first, second] = url.pathname.split("/").filter(Boolean); embedUrl = new URL("https://player.twitch.tv/"); if (first === "videos" && second) { embedUrl.searchParams.set("video", `v${second}`); } else if (first) { embedUrl.searchParams.set("channel", first); } else { return { html: "", title: "Twitch Embed" }; } }

Also add width to the iframe: width="400" minimum.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` around lines 29 - 38, The current fallback uses the page URL (embedUrl = url) which produces invalid Twitch iframe src; instead detect non-clip Twitch URLs and build a proper player endpoint (use "https://player.twitch.tv/" with either ?channel=<channelName> for channel pages or ?video=v<videoId> for video pages / collections), preserve/add the parent query param on that player URL (use embedUrl.searchParams.set("parent", parent)), and update the returned html to use width="400" (minimum) and height="300"; change references to embedUrl and url where the iframe src is constructed and handle the TODO for video/collection paths by mapping path segments to the correct channel or video query parameter before returning html.

coderabbitai · 2026-03-03T02:18:41Z

apps/worker/src/tasks/url-metadata/data-providers/twitch.ts

+	};
+}
+
+export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];


⚠️ Potential issue | 🟡 Minor

Add www.twitch.tv to supported hosts.

Common Twitch URLs use www.twitch.tv; excluding it can skip valid embeds.

🛠️ Proposed fix

-export const twitchHosts = ["clips.twitch.tv", "twitch.tv"]; +export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];

export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` at line 42, twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but misses "www.twitch.tv", causing valid Twitch URLs to be ignored; update the twitchHosts export (the twitchHosts constant in apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include "www.twitch.tv" in the array alongside the existing entries so URL matching/embedding recognizes both forms.

coderabbitai · 2026-03-03T02:18:41Z

apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts

+	const avatarUrl = post.author.avatar_url;
+	let avatarResult: ProcessImageResult | undefined = undefined;
+	if (avatarUrl) {
+		avatarResult = await processImages(
+			[new URL(avatarUrl)],


⚠️ Potential issue | 🟠 Major

Guard provider URL parsing before image processing.

new URL(avatarUrl) (Line 37) and new URL(image.url) (Line 54) can throw before the downstream .catch(...) blocks run. A malformed upstream value can abort the whole task.

💡 Proposed fix

+ const safeParseUrl = (value: string, label: string): URL | undefined => { + try { + return new URL(value); + } catch (e) { + console.error(`Invalid ${label} URL for '${inputUrl}'`, e); + error = true; + return undefined; + } + }; + const avatarUrl = post.author.avatar_url; let avatarResult: ProcessImageResult | undefined = undefined; if (avatarUrl) { + const parsedAvatarUrl = safeParseUrl(avatarUrl, "avatar"); + if (!parsedAvatarUrl) { + avatarResult = undefined; + } else { avatarResult = await processImages( - [new URL(avatarUrl)], + [parsedAvatarUrl], 72, BUCKET, "remote-posts", jobId, signal, ).catch((e) => { console.error(`Error processing post image for '${inputUrl}'`, e); error = true; return undefined; }); + } } ... if (image) { + const parsedImageUrl = safeParseUrl(image.url, "post image"); + if (!parsedImageUrl) { + imageResult = undefined; + } else { imageResult = await processImages( - [new URL(image.url)], + [parsedImageUrl], image.width, BUCKET, "remote-posts", jobId, signal, ).catch((e) => { console.error(`Error processing post image for '${inputUrl}'`, e); error = true; return undefined; }); + } }

Also applies to: 53-55

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` around lines 33 - 37, The code in getEmbedDataFromPost uses new URL(avatarUrl) and new URL(image.url) directly (variables avatarUrl and image.url) which can throw synchronously; wrap the URL construction in a try/catch or pre-validate the string before calling processImages so malformed URLs are skipped or converted to undefined instead of throwing, and ensure any skipped images are filtered out before calling processImages([ ... ]) and before mapping into new URL for images so the downstream .catch handlers are not bypassed.

coderabbitai · 2026-03-03T02:18:42Z

apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts

+	const banner = videoData?.thumbnail_url
+		? await processImages(
+				[new URL(videoData.thumbnail_url)],
+				expectedThumbnailWidth,
+				BUCKET,
+				"remote-video-thumbnail",
+				jobId,
+				signal,
+			).catch((e) => {


⚠️ Potential issue | 🟠 Major

Protect thumbnail URL parsing before calling processImages.

new URL(videoData.thumbnail_url) (Line 39) can throw before the current .catch(...) handler executes. This can fail the task unexpectedly on malformed provider data.

💡 Proposed fix

- const banner = videoData?.thumbnail_url - ? await processImages( - [new URL(videoData.thumbnail_url)], - expectedThumbnailWidth, - BUCKET, - "remote-video-thumbnail", - jobId, - signal, - ).catch((e) => { - console.error(`Unable upload thumbnail image for ${inputUrl}`, e); - error = true; - return undefined; - }) - : undefined; + let banner = undefined; + if (videoData?.thumbnail_url) { + try { + const thumbnailUrl = new URL(videoData.thumbnail_url); + banner = await processImages( + [thumbnailUrl], + expectedThumbnailWidth, + BUCKET, + "remote-video-thumbnail", + jobId, + signal, + ).catch((e) => { + console.error(`Unable upload thumbnail image for ${inputUrl}`, e); + error = true; + return undefined; + }); + } catch (e) { + console.error(`Invalid thumbnail URL for ${inputUrl}`, e); + error = true; + } + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts` around lines 37 - 45, In getEmbedDataFromVideo, protect parsing of videoData.thumbnail_url before calling processImages: the expression new URL(videoData.thumbnail_url) can throw synchronously and bypass the existing .catch on the processImages promise; wrap the URL construction/validation in a try/catch (or guard with a safe URL-parse helper) and only call processImages when the URL was successfully created, and ensure any thrown error is caught and logged/handled using the existing jobId/signal flow so banner assignment doesn't crash the task.

coderabbitai · 2026-03-03T02:18:42Z

apps/worker/test-utils/setup.ts

+	const deleteMockResponse = () => {
+		const returning = vi.fn();
+		return { where: vi.fn(() => ({ returning })) };
+	};


⚠️ Potential issue | 🟠 Major

Make mocked .returning(...) default to an array.

Current default vi.fn() returns undefined. Callers that iterate over await ...returning(...) can crash in tests.

💡 Proposed fix

const deleteMockResponse = () => { - const returning = vi.fn(); + const returning = vi.fn(async () => []); return { where: vi.fn(() => ({ returning })) }; };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const deleteMockResponse = () => {

const returning = vi.fn();

return { where: vi.fn(() => ({ returning })) };

};

const deleteMockResponse = () => {

const returning = vi.fn(async () => []);

return { where: vi.fn(() => ({ returning })) };

};

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/test-utils/setup.ts` around lines 38 - 41, The mock deleteMockResponse currently sets returning = vi.fn() which yields undefined when awaited; change returning to be a function that returns an empty array by default (e.g., make returning an async/mock function that returns []), so callers that await the .returning(...) can safely iterate; update the returned object in deleteMockResponse (the where mock that returns { returning }) accordingly so .where(...).returning(...) resolves to an empty array by default.

coderabbitai · 2026-03-03T02:18:42Z

packages/db/src/schema/url-metadata.ts

+			.references(() => urlMetadataGist.gistId, { onDelete: "cascade" }),
+		filename: varchar("filename", { length: 256 }).notNull(),
+		contentKey: varchar("content_key", { length: 256 }).notNull(),
+		language: varchar("language", { length: 16 }).notNull(),


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

GitHub REST API GET gists gist_id files language field null omitted

💡 Result:

In the REST API “Get a gist” response, each entry under files may include a language property (the docs show it in examples), but it’s not something you can rely on always being a non-empty string. [1]

In practice, files[...].language is best treated as optional/nullable:

Many client SDKs model it as an optional field (e.g., *string with omitempty), meaning it can be missing or null in JSON. [2]

GitHub determines language via Linguist (used to detect languages and ignore/handle binary content), so if Linguist can’t confidently classify a file (unknown/unsupported extension, binary-ish content, etc.), the API may return null or omit it. [3]

If you need a stable value, fall back to your own detection (e.g., by filename/extension or running Linguist yourself) when language is null/absent. [3]

Sources: GitHub REST Gists docs [1], go-github GistFile model [2], GitHub Linguist README [3].

🏁 Script executed:

# Find and inspect the url-metadata.ts schema file find . -name "url-metadata.ts" -type f

Repository: playfulprogramming/hoof

Length of output: 195

🏁 Script executed:

# Inspect the url-metadata.ts schema file to see the actual implementation cat -n ./packages/db/src/schema/url-metadata.ts | head -50

Repository: playfulprogramming/hoof

Length of output: 2244

Make language optional in url_metadata_gist_file schema.

The language field is marked as notNull() but GitHub's REST API can omit or return null for files[...].language when Linguist cannot confidently classify the file. This causes insert failures when ingesting gist data.

Schema adjustment

- language: varchar("language", { length: 16 }).notNull(), + language: varchar("language", { length: 16 }),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

language: varchar("language", { length: 16 }).notNull(),

language: varchar("language", { length: 16 }),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/db/src/schema/url-metadata.ts` at line 24, The url_metadata_gist_file schema currently forces language as notNull (varchar("language", { length: 16 }).notNull()), which fails when GitHub returns null/omits files[...].language; change the column definition in packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to allow nulls (remove .notNull() or use .nullable() depending on the ORM) so inserts succeed when language is absent, and run migrations/regen types if applicable.

coderabbitai · 2026-03-03T02:18:42Z

packages/github-api/src/getGistById.ts

+	if (typeof data === "undefined" || response.error) {
+		throw new Error(
+			`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
+		);


⚠️ Potential issue | 🟡 Minor

Fix incorrect operation name in error message.

Line 28 references getCommits, but this function is getGistById. This makes runtime errors misleading.

🛠️ Proposed fix

- `GitHub API (getCommits) returned ${response.response.status} ${response.error}`, + `GitHub API (getGistById) returned ${response.response.status} ${response.error}`,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (typeof data === "undefined" || response.error) {

throw new Error(

`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,

);

if (typeof data === "undefined" || response.error) {

throw new Error(

`GitHub API (getGistById) returned ${response.response.status} ${response.error}`,

);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/github-api/src/getGistById.ts` around lines 26 - 29, The error message in getGistById incorrectly references getCommits; update the throw in the block that checks if data is undefined or response.error (in function getGistById) to mention getGistById instead of getCommits and include the same response.response.status and response.error details so the runtime message correctly identifies the failing operation.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

apps/api/src/routes/tasks/url-metadata.ts (1)

132-156: ⚠️ Potential issue | 🟠 Major

Requeue cache repairs when embed side-data is missing.

These mappers now fail soft, but the request path never flips shouldSubmitJob when a cached gist/post row is incomplete. That leaves broken cache entries returning partial embed payloads until the normal stale window instead of repairing themselves on the next read.

Suggested direction

 async function mapEmbed(
 	result: UrlMetadataDbResult,
 ): Promise<Static<typeof UrlMetadataResponseSchema>["embed"]> {
 	if (result.embedType === "post") {
+		const post = result.postId ? await mapEmbedPost(result.postId) : undefined;
-		return {
-			type: "post",
-			post: result.postId ? await mapEmbedPost(result.postId) : undefined,
-		};
+		return post ? { type: "post", post } : undefined;
 	}
 	if (result.embedType === "gist") {
+		const gist = result.gistId ? await mapEmbedGist(result.gistId) : undefined;
-		return {
-			type: "gist",
-			gist: result.gistId ? await mapEmbedGist(result.gistId) : undefined,
-		};
+		return gist ? { type: "gist", gist } : undefined;
 	}
 	if (result.embedType === "video") {
 		return mapEmbedVideo(result);
 	}
 	return undefined;
 }
@@
-				const response = {
-					...mapUrlMetadata(result),
-					embed: await mapEmbed(result),
-				};
+				const embed = await mapEmbed(result);
+				if (
+					(result.embedType === "gist" || result.embedType === "post") &&
+					!embed
+				) {
+					shouldSubmitJob = true;
+				}
+
+				const response = {
+					...mapUrlMetadata(result),
+					embed,
+				};

Also applies to: 193-202, 276-279

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 132 - 156, The
mappers mapEmbedGist and mapEmbedPost currently return undefined on missing
side-data (e.g., when urlMetadataGistFile.findMany returns zero rows or
urlMetadataPost rows are incomplete) but do not signal the request flow to
requeue a cache-repair job; update these functions so that whenever they detect
incomplete embed side-data (gistFiles.length === 0 or required post fields
missing after urlMetadataPost.findFirst), they set the request's cache-repair
flag (shouldSubmitJob) or call the existing enqueue/submit job helper before
returning undefined — locate mapEmbedGist/mapEmbedPost, the calls to
urlMetadataGist.findFirst, urlMetadataGistFile.findMany and
urlMetadataPost.findFirst, and add the single-line mutation/invocation to
trigger submission of the repair job when incomplete data is detected.

🧹 Nitpick comments (2)

apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts (1)
75-92: Consider defensive checks for optional API fields.

The postData object accesses post.author.name, post.text, post.url, etc. directly. Per the TweetAPIResponse types, some nested fields may be undefined or null (e.g., avatar_url is string | null). If the FXTwitter API returns incomplete data, this could insert undefined into NOT NULL columns and cause a database error.
💡 Suggested validation
+	// Validate required fields before insert
+	if (!post.author?.name || !post.author?.screen_name || !post.text || !post.url) {
+		console.error(`Incomplete post data for '${inputUrl}'`);
+		return { error: true };
+	}
+
 	const postData = {
 		postId: dbPostId,
 		authorName: post.author.name,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` around lines 75 -
92, The postData construction in getEmbedDataFromPost (the postData object)
reads nested optional fields like post.author.name, post.text, post.url,
post.created_at, avatarResult?.key, imageResult?.key and image?.altText without
guards; add defensive checks and sane defaults (or bail out) before building
postData: validate post.author and required strings (name, screen_name, text,
url, created_at) and coalesce nullable avatar/image fields (avatarResult,
imageResult, image) to safe values or set null/empty strings, and if required
NOT NULL DB columns would be missing, return an error or skip the record so you
never insert undefined into the DB. Ensure checks live near the code that builds
postData in getEmbedDataFromPost and reference the same identifiers (post,
post.author, avatarResult, imageResult, image) when implementing.
apps/worker/src/tasks/url-metadata/processor.ts (1)
59-64: Clarify handling of undefined embed fields in database insert.

When destructuring embedData, fields like gistId, postId, embedSrc, etc. may be undefined depending on the embed type. These are then passed directly to the database insert.

This works because the schema defines these columns as nullable (per packages/db/src/schema/url-metadata.ts), but you may want to explicitly coalesce to null for consistency with other fields in the result object (e.g., title: title ?? null).
♻️ Optional: Explicit null coalescing for consistency
-		gistId,
-		postId,
-		embedSrc,
-		embedWidth,
-		embedHeight,
-		embedType,
+		gistId: gistId ?? null,
+		postId: postId ?? null,
+		embedSrc: embedSrc ?? null,
+		embedWidth: embedWidth ?? null,
+		embedHeight: embedHeight ?? null,
+		embedType: embedType ?? null,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/processor.ts` around lines 59 - 64, The
destructured embed fields (gistId, postId, embedSrc, embedWidth, embedHeight,
embedType) from embedData can be undefined and are being passed straight into
the DB insert; update the code where embedData is mapped for insertion (the
object built for the DB insert in processor.ts) to explicitly coalesce each
optional embed field to null (e.g., gistId: gistId ?? null) so the insert always
uses null instead of undefined, matching how other fields like title are
handled.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/x.ts`:
- Around line 9-16: Accept an AbortSignal in getXPostData's signature and
forward it to fetchAsBot (pass the signal param through the options) so job
cancellation propagates; validate that the returned TweetAPIResponse.tweet is
present before returning (throw or return null/error if missing) to handle the
optional type; also update the caller getEmbedDataFromPost to pass its received
signal into getXPostData so the signal is threaded end-to-end.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 32-35: The abort signal passed into getEmbedDataFromPost is not
forwarded to getXPostData, so cancellation won't reach the underlying request;
update the call to getXPostData within getEmbedDataFromPost to include the
signal (e.g., pass { signal, userId, postId } or the existing options object
augmented with signal), keeping the existing .catch handling and error log
intact so cancellations propagate correctly.
- Line 91: The createdAt assignment in getEmbedDataFromPost.ts uses new
Date(post.created_at) without validating post.created_at; update the code that
builds the object (the createdAt property) to first check that post.created_at
exists and that new Date(post.created_at) yields a valid date (e.g., const d =
new Date(post.created_at); if (!post.created_at || isNaN(d.getTime())) then
either set createdAt to null/default or throw an error), and then assign
createdAt: d (or the fallback) so you avoid inserting "Invalid Date" into the
timestamp column.

---

Duplicate comments:
In `@apps/api/src/routes/tasks/url-metadata.ts`:
- Around line 132-156: The mappers mapEmbedGist and mapEmbedPost currently
return undefined on missing side-data (e.g., when urlMetadataGistFile.findMany
returns zero rows or urlMetadataPost rows are incomplete) but do not signal the
request flow to requeue a cache-repair job; update these functions so that
whenever they detect incomplete embed side-data (gistFiles.length === 0 or
required post fields missing after urlMetadataPost.findFirst), they set the
request's cache-repair flag (shouldSubmitJob) or call the existing
enqueue/submit job helper before returning undefined — locate
mapEmbedGist/mapEmbedPost, the calls to urlMetadataGist.findFirst,
urlMetadataGistFile.findMany and urlMetadataPost.findFirst, and add the
single-line mutation/invocation to trigger submission of the repair job when
incomplete data is detected.

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 75-92: The postData construction in getEmbedDataFromPost (the
postData object) reads nested optional fields like post.author.name, post.text,
post.url, post.created_at, avatarResult?.key, imageResult?.key and
image?.altText without guards; add defensive checks and sane defaults (or bail
out) before building postData: validate post.author and required strings (name,
screen_name, text, url, created_at) and coalesce nullable avatar/image fields
(avatarResult, imageResult, image) to safe values or set null/empty strings, and
if required NOT NULL DB columns would be missing, return an error or skip the
record so you never insert undefined into the DB. Ensure checks live near the
code that builds postData in getEmbedDataFromPost and reference the same
identifiers (post, post.author, avatarResult, imageResult, image) when
implementing.

In `@apps/worker/src/tasks/url-metadata/processor.ts`:
- Around line 59-64: The destructured embed fields (gistId, postId, embedSrc,
embedWidth, embedHeight, embedType) from embedData can be undefined and are
being passed straight into the DB insert; update the code where embedData is
mapped for insertion (the object built for the DB insert in processor.ts) to
explicitly coalesce each optional embed field to null (e.g., gistId: gistId ??
null) so the insert always uses null instead of undefined, matching how other
fields like title are handled.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e328b19-832c-4e07-8d1c-00e54b1ef4e7

📥 Commits

Reviewing files that changed from the base of the PR and between 389e182 and 12a481c.

📒 Files selected for processing (4)

apps/api/src/routes/tasks/url-metadata.ts
apps/worker/src/tasks/url-metadata/data-providers/x.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
apps/worker/src/tasks/url-metadata/processor.ts

apps/worker/src/tasks/url-metadata/data-providers/x.ts

apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts

coderabbitai · 2026-03-07T17:08:34Z

apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts

+		numLikes: post.likes,
+		numReposts: post.reposts,
+		numReplies: post.replies,
+		createdAt: new Date(post.created_at),


⚠️ Potential issue | 🟡 Minor

Validate created_at before parsing as Date.

new Date(post.created_at) will produce an "Invalid Date" if created_at is undefined or malformed. This could cause unexpected behavior when inserting into the createdAt timestamp column.

🛡️ Proposed validation

- createdAt: new Date(post.created_at), + createdAt: post.created_at ? new Date(post.created_at) : new Date(),

Or throw an error if created_at is required:

+ if (!post.created_at || isNaN(new Date(post.created_at).getTime())) { + console.error(`Invalid created_at for post '${inputUrl}'`); + return { error: true }; + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

createdAt: new Date(post.created_at),

createdAt: post.created_at ? new Date(post.created_at) : new Date(),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` at line 91, The createdAt assignment in getEmbedDataFromPost.ts uses new Date(post.created_at) without validating post.created_at; update the code that builds the object (the createdAt property) to first check that post.created_at exists and that new Date(post.created_at) yields a valid date (e.g., const d = new Date(post.created_at); if (!post.created_at || isNaN(d.getTime())) then either set createdAt to null/default or throw an error), and then assign createdAt: d (or the fallback) so you avoid inserting "Invalid Date" into the timestamp column.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts (1)
9-20: Consider stopping traversal after finding the first iframe.

The visit callback continues traversing after finding an iframe, which means if there are multiple iframes, properties will be overwritten with the last one's attributes. This may not be intentional.
♻️ Proposed fix to return early
 function getIFrameAttributes(html: string) {
 	const tree = fromHtml(html);

 	let properties = {} as Record<string, unknown>;
 	visit(tree, { tagName: "iframe" }, (_node) => {
 		if (!_node) return;
 		const node: Element = _node;
 		properties = node.properties;
+		return false; // Stop traversal after first iframe
 	});

 	return properties;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts` around lines 9 -
20, The getIFrameAttributes function currently lets visit continue and so
properties get overwritten by later iframes; update the visit callback in
getIFrameAttributes to stop traversal once the first iframe is found by
returning visit.EXIT (or the equivalent sentinel) from the callback after
setting properties so the first match is returned immediately.
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts (1)
42-42: Include common Vimeo subdomains in vimeoHosts to match hostname patterns used by YouTube and Twitch.

Vimeo embeds use player.vimeo.com as the standard embed URL subdomain. The codebase already includes subdomains in similar arrays (YouTube includes www.youtube.com, Twitch includes clips.twitch.tv), so vimeoHosts should include at least player.vimeo.com for consistency and to handle embeds from that domain.
export const vimeoHosts = ["vimeo.com", "player.vimeo.com"];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` at line 42, The
vimeoHosts array currently only contains "vimeo.com" so embedded Vimeo URLs like
"player.vimeo.com" are missed; update the export vimeoHosts to include common
Vimeo subdomains (e.g., add "player.vimeo.com" and optionally "www.vimeo.com")
so hostname matching in the Vimeo data provider (export const vimeoHosts) will
handle embed and subdomain patterns consistently with YouTube/Twitch host lists.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Around line 32-39: The current fetch in vimeo.ts (inside the function using
fetchAsBot) blindly calls res.body.json() which will fail or return invalid data
for non-OK HTTP responses; update the promise callback to check res.ok (or
inspect res.status) after fetchAsBot returns, and if not OK throw or reject with
a clear error containing the status and statusText (or return a controlled
null/undefined), otherwise parse and return the JSON as VimeoOEmbedResponse;
reference the fetchAsBot call and the VimeoOEmbedResponse type so the check
occurs before calling res.body.json().

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Line 42: The vimeoHosts array currently only contains "vimeo.com" so embedded
Vimeo URLs like "player.vimeo.com" are missed; update the export vimeoHosts to
include common Vimeo subdomains (e.g., add "player.vimeo.com" and optionally
"www.vimeo.com") so hostname matching in the Vimeo data provider (export const
vimeoHosts) will handle embed and subdomain patterns consistently with
YouTube/Twitch host lists.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`:
- Around line 9-20: The getIFrameAttributes function currently lets visit
continue and so properties get overwritten by later iframes; update the visit
callback in getIFrameAttributes to stop traversal once the first iframe is found
by returning visit.EXIT (or the equivalent sentinel) from the callback after
setting properties so the first match is returned immediately.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cab77c96-159b-47bf-bcfd-a282a8886fe4

📥 Commits

Reviewing files that changed from the base of the PR and between 12a481c and dadc810.

📒 Files selected for processing (6)

apps/worker/src/tasks/url-metadata/data-providers/video.ts
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts
apps/worker/src/tasks/url-metadata/data-providers/x.ts
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts

🚧 Files skipped from review as they are similar to previous changes (2)

apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
apps/worker/src/tasks/url-metadata/data-providers/x.ts

coderabbitai · 2026-03-07T17:23:42Z

apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts

+	return await fetchAsBot({
+		url: new URL(
+			`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,
+		),
+		method: "GET",
+		skipRobotsCheck: true,
+		signal,
+	}).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>);


⚠️ Potential issue | 🟡 Minor

Missing error handling for failed HTTP responses.

The function assumes the fetch always succeeds and returns valid JSON. If Vimeo returns an error status code (e.g., 404 for invalid videos), calling res.body.json() on a non-OK response may produce unexpected results or throw.

🛡️ Proposed fix to add response status check

export async function getVimeoOEmbedDataFromUrl( url: URL, signal: AbortSignal, ): Promise<VimeoOEmbedResponse> { - return await fetchAsBot({ + const res = await fetchAsBot({ url: new URL( `https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`, ), method: "GET", skipRobotsCheck: true, signal, - }).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>); + }); + if (res.statusCode < 200 || res.statusCode >= 300) { + throw new Error(`Vimeo oEmbed request failed with status ${res.statusCode}`); + } + return res.body.json() as Promise<VimeoOEmbedResponse>; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

return await fetchAsBot({

url: new URL(

`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,

),

method: "GET",

skipRobotsCheck: true,

signal,

}).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>);

export async function getVimeoOEmbedDataFromUrl(

url: URL,

signal: AbortSignal,

): Promise<VimeoOEmbedResponse> {

const res = await fetchAsBot({

url: new URL(

`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,

),

method: "GET",

skipRobotsCheck: true,

signal,

});

if (res.statusCode < 200 || res.statusCode >= 300) {

throw new Error(`Vimeo oEmbed request failed with status ${res.statusCode}`);

}

return res.body.json() as Promise<VimeoOEmbedResponse>;

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` around lines 32 - 39, The current fetch in vimeo.ts (inside the function using fetchAsBot) blindly calls res.body.json() which will fail or return invalid data for non-OK HTTP responses; update the promise callback to check res.ok (or inspect res.status) after fetchAsBot returns, and if not OK throw or reject with a clear error containing the status and statusText (or return a controlled null/undefined), otherwise parse and return the JSON as VimeoOEmbedResponse; reference the fetchAsBot call and the VimeoOEmbedResponse type so the check occurs before calling res.body.json().

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql`:
- Around line 7-13: The column size for url_metadata_gist_file.language is too
small (varchar(16)) and must be widened to varchar(64) to accept multi-word
GitHub language labels; update the migration that creates the
"url_metadata_gist_file" table to change the "language" column from varchar(16)
to varchar(64) and also update the schema definition in the DB schema for
url-metadata (the exported url_metadata_gist_file/table schema or column
descriptor) to reflect the new max length so runtime inserts from
getEmbedDataFromGist.ts succeed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e4ec19df-27b2-44b0-8ee1-be11c02b38de

📥 Commits

Reviewing files that changed from the base of the PR and between dadc810 and 5b64ec6.

📒 Files selected for processing (2)

packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql
packages/db/drizzle/20260307174859_furry_dark_beast/snapshot.json

coderabbitai · 2026-03-07T18:01:32Z

packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql

+CREATE TABLE "url_metadata_gist_file" (
+	"gist_id" varchar(256),
+	"filename" varchar(256),
+	"content_key" varchar(256) NOT NULL,
+	"language" varchar(16) NOT NULL,
+	CONSTRAINT "url_metadata_gist_file_pkey" PRIMARY KEY("gist_id","filename")
+);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Ingest path:" rg -n -C2 'language:\s*file\?\.language\s*\?\?\s*"text"' \ apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts echo echo "Current schema cap:" rg -n -C1 'varchar$"language", \{ length: 16 \}$|"language" varchar$16$' \ packages/db/src/schema/url-metadata.ts \ packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql

Repository: playfulprogramming/hoof

Length of output: 826

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Full url_metadata_gist_file table definition ===" rg -n -A10 'url_metadata_gist_file' packages/db/src/schema/url-metadata.ts | head -20 echo echo "=== Check for hardcoded length assumptions or truncation logic ===" rg -n 'language.*16|length.*16|truncate.*language' \ apps/worker/src/tasks/url-metadata/ packages/db/src/schema/ || echo "No matches" echo echo "=== Verify GitHub's actual language label examples ===" rg -n 'language|Language' apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts | head -15

Repository: playfulprogramming/hoof

Length of output: 843

🌐 Web query:

GitHub gist API language field examples Jupyter Notebook maximum length

💡 Result:

language field in the GitHub Gist REST API (files.*.language)

In a Gist response, each file object can include a detected language string, e.g.:

Markdown file

"language": "Markdown" (example shown in GitHub’s Gists REST API docs) [1]

Ruby file

"language": "Ruby" (example shown in GitHub’s Gists REST API docs) [2]

Jupyter Notebook file (.ipynb)

The detected language name GitHub Linguist uses for .ipynb is "Jupyter Notebook" [3], so that’s what you should expect in files.*.language when GitHub can detect it.

Notes:

language is detection-derived (GitHub Linguist); it may be null if GitHub can’t confidently identify a language (commonly for binary/unknown content). [4]

“Maximum length”

There’s no published maximum length for the language string itself in the Gist REST API schema/docs (it’s just a short language name like "Markdown" / "Jupyter Notebook").

If you meant maximum file content length returned by the API: the Gist REST API returns up to 1 MB of content per file; beyond that, the file is marked truncated: true, and you should fetch the full file via raw_url. [1]

Sources: [1] [2] [3] [4]

Widen url_metadata_gist_file.language column to accommodate GitHub's multi-word language labels.

The ingest path at apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts:80 writes GitHub's gist-file language value directly into this column without validation or truncation. GitHub's Linguist library returns multi-word labels such as "Jupyter Notebook" (17 characters), which exceeds the current varchar(16) constraint in both the migration and schema files. Inserts will fail when GitHub returns such labels.

Widen the column to varchar(64) in:

packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql (line 11)

packages/db/src/schema/url-metadata.ts (line 24)

Proposed fix

- "language" varchar(16) NOT NULL, + "language" varchar(64) NOT NULL,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql` around lines 7 - 13, The column size for url_metadata_gist_file.language is too small (varchar(16)) and must be widened to varchar(64) to accept multi-word GitHub language labels; update the migration that creates the "url_metadata_gist_file" table to change the "language" column from varchar(16) to varchar(64) and also update the schema definition in the DB schema for url-metadata (the exported url_metadata_gist_file/table schema or column descriptor) to reflect the new max length so runtime inserts from getEmbedDataFromGist.ts succeed.

Adds embed information from playfulprogramming/hoof#121 to provide & cache the video/post/gist details  ## Summary by CodeRabbit * **New Features** * Unified embed handling: richer support for video (YouTube/Vimeo/Twitch), social posts, and Gist previews via metadata. * Gist previews now show truncated code content and filename. * **Bug Fixes** * More reliable embed fallbacks and external-link behavior for video embeds. * **Style** * Improved video/social embed layout: centered, responsive sizing, correct aspect-ratio handling, and better image rendering (cross-origin-safe images and thumbnails).

add embed information to url-metadata

389e182

coderabbitai bot reviewed Mar 3, 2026

View reviewed changes

improve embed response structure

12a481c

coderabbitai bot reviewed Mar 7, 2026

View reviewed changes

pass AbortSignal through data-providers properly

dadc810

coderabbitai bot reviewed Mar 7, 2026

View reviewed changes

add drizzle migration

5b64ec6

coderabbitai bot reviewed Mar 7, 2026

View reviewed changes

fennifith merged commit f37a454 into main Mar 7, 2026
3 checks passed

fennifith deleted the oembed-url-metadata branch March 7, 2026 19:52

fennifith mentioned this pull request Mar 8, 2026

Migrate embed data to hoof (url-metadata) playfulprogramming/playfulprogramming#1590

Merged

	export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
	export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];

	language: varchar("language", { length: 16 }).notNull(),
	language: varchar("language", { length: 16 }),

	createdAt: new Date(post.created_at),
	createdAt: post.created_at ? new Date(post.created_at) : new Date(),

Uh oh!

Conversation

fennifith commented Mar 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

src formats

Key requirements

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 7, 2026

Choose a reason for hiding this comment

language field in the GitHub Gist REST API (files.*.language)

“Maximum length”

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fennifith commented Mar 3, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 3, 2026 •

edited

Loading

`src` formats

`language` field in the GitHub Gist REST API (`files.*.language`)