Skip to content

Add embed information to url-metadata#121

Merged
fennifith merged 4 commits intomainfrom
oembed-url-metadata
Mar 7, 2026
Merged

Add embed information to url-metadata#121
fennifith merged 4 commits intomainfrom
oembed-url-metadata

Conversation

@fennifith
Copy link
Member

@fennifith fennifith commented Mar 3, 2026

Fixes #82 - copies the platform-detectors implementation from the frontend to run as part of the url-metadata task to store/upload github gists, posts, and video embed info.

Summary by CodeRabbit

  • New Features

    • Embed extraction expanded: GitHub Gists, X posts, YouTube/Vimeo/Twitch videos, plus unified HTML dispatch and dedicated providers; public fetchers added for gist/post/video/html.
  • Database

    • New tables and fields to store gist files and post embed metadata; richer embed fields persisted (type, ids, iframe src/dimensions, image alt).
  • Tests

    • Added unit tests covering gist and post embed flows.
  • Chores

    • S3 object removal utility added; image processing result type exported; package dependency references consolidated.

@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

Adds a full embed extraction pipeline (gist, X/Twitter posts, video providers, HTML), DB schema and migration for embed entities, S3 upload/remove support, worker fetchers and tests, API response enrichment, GitHub helper, and migrates TypeBox imports/dependencies from @sinclair/typebox to typebox.

Changes

Cohort / File(s) Summary
Manifests / Catalog
pnpm-workspace.yaml, apps/api/package.json, apps/worker/package.json, packages/bullmq/package.json, packages/common/package.json, packages/post-images/package.json
Replaced @sinclair/typebox with typebox in manifests/catalog entries and dependency lists.
Import path updates
apps/api/src/routes/tasks/post-images.ts, packages/bullmq/src/tasks/post-image.ts, packages/bullmq/src/tasks/url-metadata.ts, packages/post-images/src/fetch-post-data.ts, apps/worker/src/tasks/sync-*/**/*
Updated TypeBox import sources from @sinclair/typeboxtypebox (including value variants).
API: URL metadata route
apps/api/src/routes/tasks/url-metadata.ts, apps/api/package.json
Extended response schema to include embed (gist/post/video), added mapping helpers to assemble enriched responses (images, embeds), and normalized URL comparisons to include query string.
DB schema & migration
packages/db/src/schema/url-metadata.ts, packages/db/drizzle/.../migration.sql, packages/db/drizzle/.../snapshot.json
Added tables url_metadata_gist, url_metadata_gist_file, url_metadata_post; extended url_metadata with gist/post FKs and fields embed_src, embed_width, embed_height, embed_type.
Worker: embed common & types
apps/worker/src/tasks/url-metadata/common.ts, apps/worker/src/tasks/url-metadata/utils/processImage.ts
Added exported EmbedData type, exported BUCKET (awaited s3.ensureBucket), and exported ProcessImageResult.
Worker: gist support
apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts, .../getEmbedDataFromGist.test.ts
New gist handler: fetch gist metadata/files, upload file contents to S3 (MD5 keys), upsert DB records, delete stale rows and S3 objects; unit test added.
Worker: post (X) support
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts, .../getEmbedDataFromPost.test.ts, apps/worker/src/tasks/url-metadata/data-providers/x.ts
New post fetcher: fetch FxTwitter post, process avatar/media images, upsert url_metadata_post, return embed id; unit test added.
Worker: video providers & dispatcher
apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts, .../data-providers/youtube.ts, .../vimeo.ts, .../twitch.ts, .../video.ts
Added YouTube/Vimeo/Twitch oEmbed providers and dispatcher getVideoDataFromUrl; video fetcher parses iframe, uploads thumbnail, returns embed data; exported host lists.
Worker: HTML embed fetcher
apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts
New HTML embed fetcher: fetch page, extract title, icons and OG banners, process images, return EmbedData with error handling.
Worker: FX embed typings
apps/worker/src/tasks/url-metadata/data-providers/fx-embed/twitter.ts, .../types.ts
Added extensive TypeScript type definitions for FX/Twitter API responses used by post fetcher.
Worker: processor refactor
apps/worker/src/tasks/url-metadata/processor.ts
Reworked processor to dispatch to embed fetchers (gist/post/video/html), require jobId, persist embedType/IDs/embedSrc/dimensions, and centralize embed error handling.
S3 utils & test setup
packages/s3/src/utils.ts, apps/worker/test-utils/setup.ts
Added remove(bucket,key) wrapper using DeleteObjectCommand; expanded test mocks to include s3.remove, stateful db urlMetadata* mocks and github-api.getGistById.
GitHub API helper
packages/github-api/src/getGistById.ts, packages/github-api/src/index.ts
Added getGistById(params) and re-exported it from package index.
Misc worker additions
apps/worker/src/tasks/url-metadata/* (many new modules & tests)
Multiple new modules, helpers, and tests supporting embed extraction, mapping, and persistence (gist, post, video, html).

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as API
    participant Worker as Worker
    participant GitHub as GitHub
    participant DB as DB
    participant S3 as S3

    Client->>API: enqueue /tasks/url-metadata (gist URL)
    API->>Worker: job received
    Worker->>GitHub: GET /gists/{id}
    GitHub-->>Worker: gist metadata + files
    Worker->>S3: upload file contents (MD5 keys)
    S3-->>Worker: content keys
    Worker->>DB: upsert gist + file rows, delete stale rows
    Worker->>S3: remove deleted files
    Worker->>DB: update url_metadata (gistId, embedType)
    Worker-->>API: job complete
    API->>DB: query enriched url_metadata
    API-->>Client: enriched response (metadata + embed)
Loading
sequenceDiagram
    participant Client as Client
    participant API as API
    participant Worker as Worker
    participant FX as FXTwitter
    participant Images as ImageProcessor
    participant S3 as S3
    participant DB as DB

    Client->>API: enqueue /tasks/url-metadata (tweet URL)
    API->>Worker: job received
    Worker->>FX: GET /{user}/status/{id}
    FX-->>Worker: tweet + media URLs
    Worker->>Images: process avatar -> upload
    Images->>S3: upload avatar
    S3-->>Worker: avatar key
    Worker->>Images: process media -> upload
    Images->>S3: upload media
    S3-->>Worker: media key
    Worker->>DB: upsert url_metadata_post and url_metadata (embedType=post)
    Worker-->>API: job complete
    API->>DB: query enriched url_metadata
    API-->>Client: enriched response (author, content, images)
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • update drizzle?? #111: touches DB schema and url_metadata migrations — strongly related to the newly added url_metadata_gist / url_metadata_post tables and migration.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.04% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add embed information to url-metadata' accurately summarizes the main objective of the PR, which is to add OEmbed and embed data support to the url-metadata task.
Linked Issues check ✅ Passed The PR fulfills all three primary requirements from issue #82: (1) implements platform detection for gists, posts, and videos; (2) adds database columns for embed information (embedSrc, embedWidth, embedHeight, embedType, gistId, postId); (3) uploads image fields to S3 and stores resulting keys.
Out of Scope Changes check ✅ Passed All code changes are directly related to supporting embed metadata processing. The TypeBox package migration and GitHub gist API additions are required dependencies for the embed feature implementation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch oembed-url-metadata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

🧹 Nitpick comments (4)
pnpm-workspace.yaml (1)

6-6: TypeBox migration is complete; drift check would be a good preventive measure.

All source files are using the new typebox package correctly, and no legacy @sinclair/typebox references remain. A CI check to prevent reintroduction of the legacy package name is recommended as a future safeguard, but not required now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pnpm-workspace.yaml` at line 6, Add a preventive CI drift check to ensure the
legacy package name '@sinclair/typebox' is never reintroduced: create a CI job
(or an npm script referenced by CI) named something like "check-typebox" that
searches the repository for the string '@sinclair/typebox' (excluding
node_modules/build artifacts) and fails the job if any matches are found; this
should be triggered on PRs and main branch updates and references the package
names "typebox" and "@sinclair/typebox" so reviewers can quickly locate the
check and its intent.
apps/worker/src/tasks/url-metadata/processor.ts (1)

10-10: Consider adding an "html" embed type for completeness.

The UrlMetadataEmbedType defines "gist" | "video" | "post", but the HTML fallback path leaves embedType as undefined. For clarity and future filtering/querying, you might want to include "html" in the union:

♻️ Suggested addition
-type UrlMetadataEmbedType = "gist" | "video" | "post";
+type UrlMetadataEmbedType = "gist" | "video" | "post" | "html";

And then set it explicitly in the else branch (line 28-29):

 } else {
+	embedType = "html";
 	embedPromise = getEmbedDataFromHtml(jobId, inputUrl, signal);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/processor.ts` at line 10, The
UrlMetadataEmbedType union should include an "html" variant and the fallback
assignment should set embedType = "html" instead of leaving it undefined; update
the type alias UrlMetadataEmbedType to include "html" and modify the fallback
branch in the URL metadata parsing logic (the code that currently leaves
embedType undefined) to explicitly assign "html" so downstream filtering/queries
can rely on a concrete embedType.
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts (1)

3-25: Consider relaxing literal types for external API fields.

The provider_url: "https://vimeo.com/" is typed as a literal string. If Vimeo's API ever returns a slightly different value (e.g., without trailing slash), it could cause type mismatches at runtime with stricter type checking. Consider using string for fields that aren't critical to the application logic.

♻️ Suggested type relaxation
 export interface VimeoOEmbedResponse {
 	type: "video";
 	version: "1.0";
 	provider_name: "Vimeo";
-	provider_url: "https://vimeo.com/";
+	provider_url: string;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` around lines 3 -
25, The VimeoOEmbedResponse interface uses literal string types for external API
fields (e.g., provider_url: "https://vimeo.com/") which is brittle; update the
interface by changing those literal types to more permissive types (for example
provider_url: string and similarly relax provider_name, version, and type if
they are not critical) so the shape accepts minor API variations while keeping
strongly typed fields like title, width, height, duration, thumbnail_url,
video_id, and uri unchanged.
apps/worker/src/tasks/url-metadata/common.ts (1)

17-17: Consider lazy initialization for the S3 bucket.

The top-level await runs at module import time, which means:

  1. Any import failure from s3.ensureBucket will prevent the module from loading.
  2. The bucket is created even if the importing code never uses it.
  3. Test mocking becomes trickier since this runs before test setup.

For a worker context this may be acceptable, but lazy initialization would be more resilient:

♻️ Suggested lazy initialization pattern
-export const BUCKET = await s3.ensureBucket(env.S3_BUCKET);
+let _bucket: Awaited<ReturnType<typeof s3.ensureBucket>> | undefined;
+
+export async function getBucket() {
+  if (!_bucket) {
+    _bucket = await s3.ensureBucket(env.S3_BUCKET);
+  }
+  return _bucket;
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/common.ts` at line 17, Replace the
top-level awaited export with a lazy initializer: remove the top-level "export
const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and add an async getter
(e.g., getBucket or ensureBucketInitialized) that caches the bucket after
calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers to call this
getter instead of referencing BUCKET directly so the bucket creation runs lazily
and is easier to mock/test and to avoid module-import side effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/routes/tasks/url-metadata.ts`:
- Around line 107-109: Replace the use of logical OR with nullish coalescing for
optional numeric fields so valid 0 values are preserved: change occurrences of
"width: width || undefined" and "height: height || undefined" (and the analogous
occurrences at the later lines) to use "width: width ?? undefined" and "height:
height ?? undefined"; keep altText as "altText || undefined" only if empty
string should be treated as missing, otherwise change altText to "altText ??
undefined". Update the assignments where these symbols (width, height, altText)
are used in the URL metadata construction and at the later lines mentioned to
use ?? accordingly.
- Around line 113-115: The mapGist function currently throws when the side-table
row is missing; instead, change the behavior in mapGist (and the analogous
functions that throw for missing side rows—e.g., the mapPost path and the other
URL metadata mappers around the later throw sites) to return undefined when
db.query.urlMetadataGist.findFirst(...) (or the equivalent db query) returns no
row, and call the existing requeue/refresh helper (use the module's
enqueue/refresh function used elsewhere for metadata refresh) to schedule an
async refresh rather than throwing; ensure callers of mapGist handle undefined
as a cache-hit fallback so the request path does not 500.
- Line 237: The code currently force-casts result.embedType using "embedType:
result.embedType as 'post' | 'gist' | 'video' | undefined" which is unsafe;
replace this with a guarded mapper: add a small type‑guard (e.g.,
isValidEmbedType(value): value is 'post'|'gist'|'video') and set embedType to
isValidEmbedType(result.embedType) ? result.embedType : undefined (or handle
invalid values explicitly), updating the assignment site where embedType is set
from result.embedType so the response contract is enforced without unsafe
casting.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts`:
- Line 42: twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but
misses "www.twitch.tv", causing valid Twitch URLs to be ignored; update the
twitchHosts export (the twitchHosts constant in
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include
"www.twitch.tv" in the array alongside the existing entries so URL
matching/embedding recognizes both forms.
- Around line 29-38: The current fallback uses the page URL (embedUrl = url)
which produces invalid Twitch iframe src; instead detect non-clip Twitch URLs
and build a proper player endpoint (use "https://player.twitch.tv/" with either
?channel=<channelName> for channel pages or ?video=v<videoId> for video pages /
collections), preserve/add the parent query param on that player URL (use
embedUrl.searchParams.set("parent", parent)), and update the returned html to
use width="400" (minimum) and height="300"; change references to embedUrl and
url where the iframe src is constructed and handle the TODO for video/collection
paths by mapping path segments to the correct channel or video query parameter
before returning html.

In `@apps/worker/src/tasks/url-metadata/data-providers/video.ts`:
- Line 37: Exported videoHosts currently combines youtubeHosts and vimeoHosts
but omits twitchHosts, preventing Twitch URLs from being routed into the video
embed path despite getVideoDataFromUrl supporting Twitch; update the exported
videoHosts (symbol: videoHosts) to include twitchHosts alongside youtubeHosts
and vimeoHosts so Twitch URLs are recognized, and verify getVideoDataFromUrl
still handles Twitch-specific logic (symbols: getVideoDataFromUrl, twitchHosts).

In `@apps/worker/src/tasks/url-metadata/data-providers/youtube.ts`:
- Around line 27-47: The code builds youtubeUrl using the variable videoId
without ensuring it was set; update the parsing logic around videoId in
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts (the videoId
variable and the youtubeUrl construction) to guard against missing/empty IDs:
after the if/else blocks that set videoId, check that videoId is a non-empty
string and if not, return null (or otherwise handle the invalid case) instead of
proceeding to construct `https://www.youtube.com/watch?v=${videoId}`.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts`:
- Around line 33-35: The test uses a simplified mock chain
((db.delete(urlMetadataGistFile).where(undefined).returning as Mock)) that hides
the real where clause used by the implementation (and(eq(...),
not(inArray(...))); update the mock in getEmbedDataFromGist.test.ts to reflect
or assert the actual where predicate: either mock
db.delete(urlMetadataGistFile).where with a matcher that checks the passed
predicate (e.g., capture the where arg and verify it composes and(eq(...),
not(inArray(...)))) or replace the .where(undefined) with a more explicit
mockImplementation that validates the incoming where function/criteria before
returning Promise.resolve([]); alternatively, if simplifying is intentional add
a brief comment explaining why the simplified mock covers the case and cannot
hide regressions, referencing db.delete and urlMetadataGistFile so reviewers can
locate the change.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts`:
- Around line 25-33: The function currently logs an invalid gist URL but
continues to call github.getGistById; change the invalid-path branch (when
!githubUsername || !gistId) to return immediately (e.g., return undefined) after
logging to avoid the network call and further processing instead of only setting
error = true; ensure this prevents execution of github.getGistById and
downstream code that expects valid gistId/inputUrl.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts`:
- Around line 97-101: The test incorrectly passes a URL string as the jobId to
getEmbedDataFromPost; update the test to pass a distinct mock job ID (e.g.,
"test-job-id" or similar) as the first argument while keeping the existing URL
(new URL(fakeFxTwitterUrl)) and AbortSignal, so the call mirrors production
usage and avoids conflating jobId with the input URL when generating S3 keys.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 33-37: The code in getEmbedDataFromPost uses new URL(avatarUrl)
and new URL(image.url) directly (variables avatarUrl and image.url) which can
throw synchronously; wrap the URL construction in a try/catch or pre-validate
the string before calling processImages so malformed URLs are skipped or
converted to undefined instead of throwing, and ensure any skipped images are
filtered out before calling processImages([ ... ]) and before mapping into new
URL for images so the downstream .catch handlers are not bypassed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`:
- Around line 37-45: In getEmbedDataFromVideo, protect parsing of
videoData.thumbnail_url before calling processImages: the expression new
URL(videoData.thumbnail_url) can throw synchronously and bypass the existing
.catch on the processImages promise; wrap the URL construction/validation in a
try/catch (or guard with a safe URL-parse helper) and only call processImages
when the URL was successfully created, and ensure any thrown error is caught and
logged/handled using the existing jobId/signal flow so banner assignment doesn't
crash the task.

In `@apps/worker/test-utils/setup.ts`:
- Around line 38-41: The mock deleteMockResponse currently sets returning =
vi.fn() which yields undefined when awaited; change returning to be a function
that returns an empty array by default (e.g., make returning an async/mock
function that returns []), so callers that await the .returning(...) can safely
iterate; update the returned object in deleteMockResponse (the where mock that
returns { returning }) accordingly so .where(...).returning(...) resolves to an
empty array by default.

In `@packages/db/src/schema/url-metadata.ts`:
- Line 24: The url_metadata_gist_file schema currently forces language as
notNull (varchar("language", { length: 16 }).notNull()), which fails when GitHub
returns null/omits files[...].language; change the column definition in
packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to
allow nulls (remove .notNull() or use .nullable() depending on the ORM) so
inserts succeed when language is absent, and run migrations/regen types if
applicable.

In `@packages/github-api/src/getGistById.ts`:
- Around line 26-29: The error message in getGistById incorrectly references
getCommits; update the throw in the block that checks if data is undefined or
response.error (in function getGistById) to mention getGistById instead of
getCommits and include the same response.response.status and response.error
details so the runtime message correctly identifies the failing operation.

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/common.ts`:
- Line 17: Replace the top-level awaited export with a lazy initializer: remove
the top-level "export const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and
add an async getter (e.g., getBucket or ensureBucketInitialized) that caches the
bucket after calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers
to call this getter instead of referencing BUCKET directly so the bucket
creation runs lazily and is easier to mock/test and to avoid module-import side
effects.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Around line 3-25: The VimeoOEmbedResponse interface uses literal string types
for external API fields (e.g., provider_url: "https://vimeo.com/") which is
brittle; update the interface by changing those literal types to more permissive
types (for example provider_url: string and similarly relax provider_name,
version, and type if they are not critical) so the shape accepts minor API
variations while keeping strongly typed fields like title, width, height,
duration, thumbnail_url, video_id, and uri unchanged.

In `@apps/worker/src/tasks/url-metadata/processor.ts`:
- Line 10: The UrlMetadataEmbedType union should include an "html" variant and
the fallback assignment should set embedType = "html" instead of leaving it
undefined; update the type alias UrlMetadataEmbedType to include "html" and
modify the fallback branch in the URL metadata parsing logic (the code that
currently leaves embedType undefined) to explicitly assign "html" so downstream
filtering/queries can rely on a concrete embedType.

In `@pnpm-workspace.yaml`:
- Line 6: Add a preventive CI drift check to ensure the legacy package name
'@sinclair/typebox' is never reintroduced: create a CI job (or an npm script
referenced by CI) named something like "check-typebox" that searches the
repository for the string '@sinclair/typebox' (excluding node_modules/build
artifacts) and fails the job if any matches are found; this should be triggered
on PRs and main branch updates and references the package names "typebox" and
"@sinclair/typebox" so reviewers can quickly locate the check and its intent.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f71585 and 389e182.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (39)
  • apps/api/package.json
  • apps/api/src/routes/tasks/post-images.ts
  • apps/api/src/routes/tasks/url-metadata.ts
  • apps/worker/package.json
  • apps/worker/src/tasks/sync-author/processor.ts
  • apps/worker/src/tasks/sync-author/types.ts
  • apps/worker/src/tasks/sync-collection/processor.ts
  • apps/worker/src/tasks/sync-collection/types.ts
  • apps/worker/src/tasks/sync-post/processor.ts
  • apps/worker/src/tasks/sync-post/types.ts
  • apps/worker/src/tasks/url-metadata/common.ts
  • apps/worker/src/tasks/url-metadata/data-providers/fx-embed/twitter.ts
  • apps/worker/src/tasks/url-metadata/data-providers/fx-embed/types.ts
  • apps/worker/src/tasks/url-metadata/data-providers/twitch.ts
  • apps/worker/src/tasks/url-metadata/data-providers/video.ts
  • apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts
  • apps/worker/src/tasks/url-metadata/data-providers/x.ts
  • apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts
  • apps/worker/src/tasks/url-metadata/processor.ts
  • apps/worker/src/tasks/url-metadata/utils/processImage.ts
  • apps/worker/test-utils/setup.ts
  • packages/bullmq/package.json
  • packages/bullmq/src/tasks/post-image.ts
  • packages/bullmq/src/tasks/url-metadata.ts
  • packages/common/package.json
  • packages/common/src/env.ts
  • packages/db/src/schema/url-metadata.ts
  • packages/github-api/src/getGistById.ts
  • packages/github-api/src/index.ts
  • packages/post-images/package.json
  • packages/post-images/src/fetch-post-data.ts
  • packages/s3/src/utils.ts
  • pnpm-workspace.yaml

Comment on lines 107 to +109
width: width || undefined,
height: height || undefined,
altText: altText || undefined,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use nullish coalescing for optional numeric fields.

Line [107], Line [108], Line [163], and Line [164] use ||, which drops valid 0 values. Use ?? so only null/undefined are omitted.

Suggested fix
-    width: width || undefined,
-    height: height || undefined,
+    width: width ?? undefined,
+    height: height ?? undefined,
@@
-    width: result.embedWidth || undefined,
-    height: result.embedHeight || undefined,
+    width: result.embedWidth ?? undefined,
+    height: result.embedHeight ?? undefined,

Also applies to: 163-164

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 107 - 109, Replace
the use of logical OR with nullish coalescing for optional numeric fields so
valid 0 values are preserved: change occurrences of "width: width || undefined"
and "height: height || undefined" (and the analogous occurrences at the later
lines) to use "width: width ?? undefined" and "height: height ?? undefined";
keep altText as "altText || undefined" only if empty string should be treated as
missing, otherwise change altText to "altText ?? undefined". Update the
assignments where these symbols (width, height, altText) are used in the URL
metadata construction and at the later lines mentioned to use ?? accordingly.

Comment on lines +29 to +38
if (!embedUrl) {
embedUrl = url;
}

// Set the "parent" property for embedding - https://dev.twitch.tv/docs/embed/everything/#usage
embedUrl.searchParams.set("parent", parent);

return {
html: `<iframe src="${embedUrl.toString()}" height="300"></iframe>`,
title: "Twitch Embed",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n apps/worker/src/tasks/url-metadata/data-providers/twitch.ts

Repository: playfulprogramming/hoof

Length of output: 1678


🌐 Web query:

Twitch embed iframe src requirements player.twitch.tv channel video

💡 Result:

For a non-interactive Twitch player iframe, the src must point to https://player.twitch.tv/ and include one content identifier plus at least one parent domain:

src formats

  • Live channel
    <iframe src="https://player.twitch.tv/?channel=<channel>&parent=<yourdomain.com>"></iframe>
  • VOD / past broadcast
    <iframe src="https://player.twitch.tv/?video=<videoID>&parent=<yourdomain.com>"></iframe>

Key requirements

  • HTTPS/SSL required for domains using Twitch embeds. [1]
  • parent is required: add the full domain (include subdomain like www if used). If the embed can appear on multiple domains, include multiple parent= entries. Missing/mismatched parent triggers a playback/config error. [1][2]
    src="https://player.twitch.tv/?channel=twitchpresents&parent=example.com&parent=www.example.com"
  • Minimum size (per Twitch embed requirements): width ≥ 400, height ≥ 300. [3]

Sources: Twitch embed requirements and examples from Twitch Developers docs and Twitch Developer Forums. [1][2][3]


Non-clip Twitch URLs produce invalid embed endpoints.

The code falls back to embedding the page URL directly (line 30: embedUrl = url). This results in iframe sources like https://twitch.tv/somechannel?parent=..., which Twitch does not recognize as valid embed endpoints. Twitch requires embeds to use https://player.twitch.tv/ with channel or video query parameters.

Additionally, the iframe has height="300" but no width attribute. Twitch embeds require a minimum width of 400 pixels.

The TODO comment on line 25 explicitly flags this limitation for video and collection URLs, which also need proper handling.

💡 Suggested fix
if (!embedUrl) {
	const [first, second] = url.pathname.split("/").filter(Boolean);
	embedUrl = new URL("https://player.twitch.tv/");
	if (first === "videos" && second) {
		embedUrl.searchParams.set("video", `v${second}`);
	} else if (first) {
		embedUrl.searchParams.set("channel", first);
	} else {
		return { html: "", title: "Twitch Embed" };
	}
}

Also add width to the iframe: width="400" minimum.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` around lines 29
- 38, The current fallback uses the page URL (embedUrl = url) which produces
invalid Twitch iframe src; instead detect non-clip Twitch URLs and build a
proper player endpoint (use "https://player.twitch.tv/" with either
?channel=<channelName> for channel pages or ?video=v<videoId> for video pages /
collections), preserve/add the parent query param on that player URL (use
embedUrl.searchParams.set("parent", parent)), and update the returned html to
use width="400" (minimum) and height="300"; change references to embedUrl and
url where the iframe src is constructed and handle the TODO for video/collection
paths by mapping path segments to the correct channel or video query parameter
before returning html.

};
}

export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add www.twitch.tv to supported hosts.

Common Twitch URLs use www.twitch.tv; excluding it can skip valid embeds.

🛠️ Proposed fix
-export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
+export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` at line 42,
twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but misses
"www.twitch.tv", causing valid Twitch URLs to be ignored; update the twitchHosts
export (the twitchHosts constant in
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include
"www.twitch.tv" in the array alongside the existing entries so URL
matching/embedding recognizes both forms.

Comment on lines +33 to +37
const avatarUrl = post.author.avatar_url;
let avatarResult: ProcessImageResult | undefined = undefined;
if (avatarUrl) {
avatarResult = await processImages(
[new URL(avatarUrl)],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard provider URL parsing before image processing.

new URL(avatarUrl) (Line 37) and new URL(image.url) (Line 54) can throw before the downstream .catch(...) blocks run. A malformed upstream value can abort the whole task.

💡 Proposed fix
+	const safeParseUrl = (value: string, label: string): URL | undefined => {
+		try {
+			return new URL(value);
+		} catch (e) {
+			console.error(`Invalid ${label} URL for '${inputUrl}'`, e);
+			error = true;
+			return undefined;
+		}
+	};
+
 	const avatarUrl = post.author.avatar_url;
 	let avatarResult: ProcessImageResult | undefined = undefined;
 	if (avatarUrl) {
+		const parsedAvatarUrl = safeParseUrl(avatarUrl, "avatar");
+		if (!parsedAvatarUrl) {
+			avatarResult = undefined;
+		} else {
 		avatarResult = await processImages(
-			[new URL(avatarUrl)],
+			[parsedAvatarUrl],
 			72,
 			BUCKET,
 			"remote-posts",
 			jobId,
 			signal,
 		).catch((e) => {
 			console.error(`Error processing post image for '${inputUrl}'`, e);
 			error = true;
 			return undefined;
 		});
+		}
 	}
 ...
 	if (image) {
+		const parsedImageUrl = safeParseUrl(image.url, "post image");
+		if (!parsedImageUrl) {
+			imageResult = undefined;
+		} else {
 		imageResult = await processImages(
-			[new URL(image.url)],
+			[parsedImageUrl],
 			image.width,
 			BUCKET,
 			"remote-posts",
 			jobId,
 			signal,
 		).catch((e) => {
 			console.error(`Error processing post image for '${inputUrl}'`, e);
 			error = true;
 			return undefined;
 		});
+		}
 	}

Also applies to: 53-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` around lines 33 -
37, The code in getEmbedDataFromPost uses new URL(avatarUrl) and new
URL(image.url) directly (variables avatarUrl and image.url) which can throw
synchronously; wrap the URL construction in a try/catch or pre-validate the
string before calling processImages so malformed URLs are skipped or converted
to undefined instead of throwing, and ensure any skipped images are filtered out
before calling processImages([ ... ]) and before mapping into new URL for images
so the downstream .catch handlers are not bypassed.

Comment on lines +37 to +45
const banner = videoData?.thumbnail_url
? await processImages(
[new URL(videoData.thumbnail_url)],
expectedThumbnailWidth,
BUCKET,
"remote-video-thumbnail",
jobId,
signal,
).catch((e) => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Protect thumbnail URL parsing before calling processImages.

new URL(videoData.thumbnail_url) (Line 39) can throw before the current .catch(...) handler executes. This can fail the task unexpectedly on malformed provider data.

💡 Proposed fix
-	const banner = videoData?.thumbnail_url
-		? await processImages(
-				[new URL(videoData.thumbnail_url)],
-				expectedThumbnailWidth,
-				BUCKET,
-				"remote-video-thumbnail",
-				jobId,
-				signal,
-			).catch((e) => {
-				console.error(`Unable upload thumbnail image for ${inputUrl}`, e);
-				error = true;
-				return undefined;
-			})
-		: undefined;
+	let banner = undefined;
+	if (videoData?.thumbnail_url) {
+		try {
+			const thumbnailUrl = new URL(videoData.thumbnail_url);
+			banner = await processImages(
+				[thumbnailUrl],
+				expectedThumbnailWidth,
+				BUCKET,
+				"remote-video-thumbnail",
+				jobId,
+				signal,
+			).catch((e) => {
+				console.error(`Unable upload thumbnail image for ${inputUrl}`, e);
+				error = true;
+				return undefined;
+			});
+		} catch (e) {
+			console.error(`Invalid thumbnail URL for ${inputUrl}`, e);
+			error = true;
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts` around lines 37
- 45, In getEmbedDataFromVideo, protect parsing of videoData.thumbnail_url
before calling processImages: the expression new URL(videoData.thumbnail_url)
can throw synchronously and bypass the existing .catch on the processImages
promise; wrap the URL construction/validation in a try/catch (or guard with a
safe URL-parse helper) and only call processImages when the URL was successfully
created, and ensure any thrown error is caught and logged/handled using the
existing jobId/signal flow so banner assignment doesn't crash the task.

Comment on lines +38 to +41
const deleteMockResponse = () => {
const returning = vi.fn();
return { where: vi.fn(() => ({ returning })) };
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make mocked .returning(...) default to an array.

Current default vi.fn() returns undefined. Callers that iterate over await ...returning(...) can crash in tests.

💡 Proposed fix
 	const deleteMockResponse = () => {
-		const returning = vi.fn();
+		const returning = vi.fn(async () => []);
 		return { where: vi.fn(() => ({ returning })) };
 	};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const deleteMockResponse = () => {
const returning = vi.fn();
return { where: vi.fn(() => ({ returning })) };
};
const deleteMockResponse = () => {
const returning = vi.fn(async () => []);
return { where: vi.fn(() => ({ returning })) };
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/test-utils/setup.ts` around lines 38 - 41, The mock
deleteMockResponse currently sets returning = vi.fn() which yields undefined
when awaited; change returning to be a function that returns an empty array by
default (e.g., make returning an async/mock function that returns []), so
callers that await the .returning(...) can safely iterate; update the returned
object in deleteMockResponse (the where mock that returns { returning })
accordingly so .where(...).returning(...) resolves to an empty array by default.

.references(() => urlMetadataGist.gistId, { onDelete: "cascade" }),
filename: varchar("filename", { length: 256 }).notNull(),
contentKey: varchar("content_key", { length: 256 }).notNull(),
language: varchar("language", { length: 16 }).notNull(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

GitHub REST API GET gists gist_id files language field null omitted

💡 Result:

In the REST API “Get a gist” response, each entry under files may include a language property (the docs show it in examples), but it’s not something you can rely on always being a non-empty string. [1]

In practice, files[...].language is best treated as optional/nullable:

  • Many client SDKs model it as an optional field (e.g., *string with omitempty), meaning it can be missing or null in JSON. [2]
  • GitHub determines language via Linguist (used to detect languages and ignore/handle binary content), so if Linguist can’t confidently classify a file (unknown/unsupported extension, binary-ish content, etc.), the API may return null or omit it. [3]

If you need a stable value, fall back to your own detection (e.g., by filename/extension or running Linguist yourself) when language is null/absent. [3]

Sources: GitHub REST Gists docs [1], go-github GistFile model [2], GitHub Linguist README [3].


🏁 Script executed:

# Find and inspect the url-metadata.ts schema file
find . -name "url-metadata.ts" -type f

Repository: playfulprogramming/hoof

Length of output: 195


🏁 Script executed:

# Inspect the url-metadata.ts schema file to see the actual implementation
cat -n ./packages/db/src/schema/url-metadata.ts | head -50

Repository: playfulprogramming/hoof

Length of output: 2244


Make language optional in url_metadata_gist_file schema.

The language field is marked as notNull() but GitHub's REST API can omit or return null for files[...].language when Linguist cannot confidently classify the file. This causes insert failures when ingesting gist data.

Schema adjustment
-		language: varchar("language", { length: 16 }).notNull(),
+		language: varchar("language", { length: 16 }),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
language: varchar("language", { length: 16 }).notNull(),
language: varchar("language", { length: 16 }),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/db/src/schema/url-metadata.ts` at line 24, The
url_metadata_gist_file schema currently forces language as notNull
(varchar("language", { length: 16 }).notNull()), which fails when GitHub returns
null/omits files[...].language; change the column definition in
packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to
allow nulls (remove .notNull() or use .nullable() depending on the ORM) so
inserts succeed when language is absent, and run migrations/regen types if
applicable.

Comment on lines +26 to +29
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix incorrect operation name in error message.

Line 28 references getCommits, but this function is getGistById. This makes runtime errors misleading.

🛠️ Proposed fix
-			`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
+			`GitHub API (getGistById) returned ${response.response.status} ${response.error}`,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
);
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getGistById) returned ${response.response.status} ${response.error}`,
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/github-api/src/getGistById.ts` around lines 26 - 29, The error
message in getGistById incorrectly references getCommits; update the throw in
the block that checks if data is undefined or response.error (in function
getGistById) to mention getGistById instead of getCommits and include the same
response.response.status and response.error details so the runtime message
correctly identifies the failing operation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
apps/api/src/routes/tasks/url-metadata.ts (1)

132-156: ⚠️ Potential issue | 🟠 Major

Requeue cache repairs when embed side-data is missing.

These mappers now fail soft, but the request path never flips shouldSubmitJob when a cached gist/post row is incomplete. That leaves broken cache entries returning partial embed payloads until the normal stale window instead of repairing themselves on the next read.

Suggested direction
 async function mapEmbed(
 	result: UrlMetadataDbResult,
 ): Promise<Static<typeof UrlMetadataResponseSchema>["embed"]> {
 	if (result.embedType === "post") {
+		const post = result.postId ? await mapEmbedPost(result.postId) : undefined;
-		return {
-			type: "post",
-			post: result.postId ? await mapEmbedPost(result.postId) : undefined,
-		};
+		return post ? { type: "post", post } : undefined;
 	}
 	if (result.embedType === "gist") {
+		const gist = result.gistId ? await mapEmbedGist(result.gistId) : undefined;
-		return {
-			type: "gist",
-			gist: result.gistId ? await mapEmbedGist(result.gistId) : undefined,
-		};
+		return gist ? { type: "gist", gist } : undefined;
 	}
 	if (result.embedType === "video") {
 		return mapEmbedVideo(result);
 	}
 	return undefined;
 }
@@
-				const response = {
-					...mapUrlMetadata(result),
-					embed: await mapEmbed(result),
-				};
+				const embed = await mapEmbed(result);
+				if (
+					(result.embedType === "gist" || result.embedType === "post") &&
+					!embed
+				) {
+					shouldSubmitJob = true;
+				}
+
+				const response = {
+					...mapUrlMetadata(result),
+					embed,
+				};

Also applies to: 193-202, 276-279

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 132 - 156, The
mappers mapEmbedGist and mapEmbedPost currently return undefined on missing
side-data (e.g., when urlMetadataGistFile.findMany returns zero rows or
urlMetadataPost rows are incomplete) but do not signal the request flow to
requeue a cache-repair job; update these functions so that whenever they detect
incomplete embed side-data (gistFiles.length === 0 or required post fields
missing after urlMetadataPost.findFirst), they set the request's cache-repair
flag (shouldSubmitJob) or call the existing enqueue/submit job helper before
returning undefined — locate mapEmbedGist/mapEmbedPost, the calls to
urlMetadataGist.findFirst, urlMetadataGistFile.findMany and
urlMetadataPost.findFirst, and add the single-line mutation/invocation to
trigger submission of the repair job when incomplete data is detected.
🧹 Nitpick comments (2)
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts (1)

75-92: Consider defensive checks for optional API fields.

The postData object accesses post.author.name, post.text, post.url, etc. directly. Per the TweetAPIResponse types, some nested fields may be undefined or null (e.g., avatar_url is string | null). If the FXTwitter API returns incomplete data, this could insert undefined into NOT NULL columns and cause a database error.

💡 Suggested validation
+	// Validate required fields before insert
+	if (!post.author?.name || !post.author?.screen_name || !post.text || !post.url) {
+		console.error(`Incomplete post data for '${inputUrl}'`);
+		return { error: true };
+	}
+
 	const postData = {
 		postId: dbPostId,
 		authorName: post.author.name,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` around lines 75 -
92, The postData construction in getEmbedDataFromPost (the postData object)
reads nested optional fields like post.author.name, post.text, post.url,
post.created_at, avatarResult?.key, imageResult?.key and image?.altText without
guards; add defensive checks and sane defaults (or bail out) before building
postData: validate post.author and required strings (name, screen_name, text,
url, created_at) and coalesce nullable avatar/image fields (avatarResult,
imageResult, image) to safe values or set null/empty strings, and if required
NOT NULL DB columns would be missing, return an error or skip the record so you
never insert undefined into the DB. Ensure checks live near the code that builds
postData in getEmbedDataFromPost and reference the same identifiers (post,
post.author, avatarResult, imageResult, image) when implementing.
apps/worker/src/tasks/url-metadata/processor.ts (1)

59-64: Clarify handling of undefined embed fields in database insert.

When destructuring embedData, fields like gistId, postId, embedSrc, etc. may be undefined depending on the embed type. These are then passed directly to the database insert.

This works because the schema defines these columns as nullable (per packages/db/src/schema/url-metadata.ts), but you may want to explicitly coalesce to null for consistency with other fields in the result object (e.g., title: title ?? null).

♻️ Optional: Explicit null coalescing for consistency
-		gistId,
-		postId,
-		embedSrc,
-		embedWidth,
-		embedHeight,
-		embedType,
+		gistId: gistId ?? null,
+		postId: postId ?? null,
+		embedSrc: embedSrc ?? null,
+		embedWidth: embedWidth ?? null,
+		embedHeight: embedHeight ?? null,
+		embedType: embedType ?? null,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/processor.ts` around lines 59 - 64, The
destructured embed fields (gistId, postId, embedSrc, embedWidth, embedHeight,
embedType) from embedData can be undefined and are being passed straight into
the DB insert; update the code where embedData is mapped for insertion (the
object built for the DB insert in processor.ts) to explicitly coalesce each
optional embed field to null (e.g., gistId: gistId ?? null) so the insert always
uses null instead of undefined, matching how other fields like title are
handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/x.ts`:
- Around line 9-16: Accept an AbortSignal in getXPostData's signature and
forward it to fetchAsBot (pass the signal param through the options) so job
cancellation propagates; validate that the returned TweetAPIResponse.tweet is
present before returning (throw or return null/error if missing) to handle the
optional type; also update the caller getEmbedDataFromPost to pass its received
signal into getXPostData so the signal is threaded end-to-end.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 32-35: The abort signal passed into getEmbedDataFromPost is not
forwarded to getXPostData, so cancellation won't reach the underlying request;
update the call to getXPostData within getEmbedDataFromPost to include the
signal (e.g., pass { signal, userId, postId } or the existing options object
augmented with signal), keeping the existing .catch handling and error log
intact so cancellations propagate correctly.
- Line 91: The createdAt assignment in getEmbedDataFromPost.ts uses new
Date(post.created_at) without validating post.created_at; update the code that
builds the object (the createdAt property) to first check that post.created_at
exists and that new Date(post.created_at) yields a valid date (e.g., const d =
new Date(post.created_at); if (!post.created_at || isNaN(d.getTime())) then
either set createdAt to null/default or throw an error), and then assign
createdAt: d (or the fallback) so you avoid inserting "Invalid Date" into the
timestamp column.

---

Duplicate comments:
In `@apps/api/src/routes/tasks/url-metadata.ts`:
- Around line 132-156: The mappers mapEmbedGist and mapEmbedPost currently
return undefined on missing side-data (e.g., when urlMetadataGistFile.findMany
returns zero rows or urlMetadataPost rows are incomplete) but do not signal the
request flow to requeue a cache-repair job; update these functions so that
whenever they detect incomplete embed side-data (gistFiles.length === 0 or
required post fields missing after urlMetadataPost.findFirst), they set the
request's cache-repair flag (shouldSubmitJob) or call the existing
enqueue/submit job helper before returning undefined — locate
mapEmbedGist/mapEmbedPost, the calls to urlMetadataGist.findFirst,
urlMetadataGistFile.findMany and urlMetadataPost.findFirst, and add the
single-line mutation/invocation to trigger submission of the repair job when
incomplete data is detected.

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 75-92: The postData construction in getEmbedDataFromPost (the
postData object) reads nested optional fields like post.author.name, post.text,
post.url, post.created_at, avatarResult?.key, imageResult?.key and
image?.altText without guards; add defensive checks and sane defaults (or bail
out) before building postData: validate post.author and required strings (name,
screen_name, text, url, created_at) and coalesce nullable avatar/image fields
(avatarResult, imageResult, image) to safe values or set null/empty strings, and
if required NOT NULL DB columns would be missing, return an error or skip the
record so you never insert undefined into the DB. Ensure checks live near the
code that builds postData in getEmbedDataFromPost and reference the same
identifiers (post, post.author, avatarResult, imageResult, image) when
implementing.

In `@apps/worker/src/tasks/url-metadata/processor.ts`:
- Around line 59-64: The destructured embed fields (gistId, postId, embedSrc,
embedWidth, embedHeight, embedType) from embedData can be undefined and are
being passed straight into the DB insert; update the code where embedData is
mapped for insertion (the object built for the DB insert in processor.ts) to
explicitly coalesce each optional embed field to null (e.g., gistId: gistId ??
null) so the insert always uses null instead of undefined, matching how other
fields like title are handled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e328b19-832c-4e07-8d1c-00e54b1ef4e7

📥 Commits

Reviewing files that changed from the base of the PR and between 389e182 and 12a481c.

📒 Files selected for processing (4)
  • apps/api/src/routes/tasks/url-metadata.ts
  • apps/worker/src/tasks/url-metadata/data-providers/x.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
  • apps/worker/src/tasks/url-metadata/processor.ts

numLikes: post.likes,
numReposts: post.reposts,
numReplies: post.replies,
createdAt: new Date(post.created_at),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate created_at before parsing as Date.

new Date(post.created_at) will produce an "Invalid Date" if created_at is undefined or malformed. This could cause unexpected behavior when inserting into the createdAt timestamp column.

🛡️ Proposed validation
-		createdAt: new Date(post.created_at),
+		createdAt: post.created_at ? new Date(post.created_at) : new Date(),

Or throw an error if created_at is required:

+	if (!post.created_at || isNaN(new Date(post.created_at).getTime())) {
+		console.error(`Invalid created_at for post '${inputUrl}'`);
+		return { error: true };
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
createdAt: new Date(post.created_at),
createdAt: post.created_at ? new Date(post.created_at) : new Date(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` at line 91, The
createdAt assignment in getEmbedDataFromPost.ts uses new Date(post.created_at)
without validating post.created_at; update the code that builds the object (the
createdAt property) to first check that post.created_at exists and that new
Date(post.created_at) yields a valid date (e.g., const d = new
Date(post.created_at); if (!post.created_at || isNaN(d.getTime())) then either
set createdAt to null/default or throw an error), and then assign createdAt: d
(or the fallback) so you avoid inserting "Invalid Date" into the timestamp
column.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts (1)

9-20: Consider stopping traversal after finding the first iframe.

The visit callback continues traversing after finding an iframe, which means if there are multiple iframes, properties will be overwritten with the last one's attributes. This may not be intentional.

♻️ Proposed fix to return early
 function getIFrameAttributes(html: string) {
 	const tree = fromHtml(html);

 	let properties = {} as Record<string, unknown>;
 	visit(tree, { tagName: "iframe" }, (_node) => {
 		if (!_node) return;
 		const node: Element = _node;
 		properties = node.properties;
+		return false; // Stop traversal after first iframe
 	});

 	return properties;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts` around lines 9 -
20, The getIFrameAttributes function currently lets visit continue and so
properties get overwritten by later iframes; update the visit callback in
getIFrameAttributes to stop traversal once the first iframe is found by
returning visit.EXIT (or the equivalent sentinel) from the callback after
setting properties so the first match is returned immediately.
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts (1)

42-42: Include common Vimeo subdomains in vimeoHosts to match hostname patterns used by YouTube and Twitch.

Vimeo embeds use player.vimeo.com as the standard embed URL subdomain. The codebase already includes subdomains in similar arrays (YouTube includes www.youtube.com, Twitch includes clips.twitch.tv), so vimeoHosts should include at least player.vimeo.com for consistency and to handle embeds from that domain.

export const vimeoHosts = ["vimeo.com", "player.vimeo.com"];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` at line 42, The
vimeoHosts array currently only contains "vimeo.com" so embedded Vimeo URLs like
"player.vimeo.com" are missed; update the export vimeoHosts to include common
Vimeo subdomains (e.g., add "player.vimeo.com" and optionally "www.vimeo.com")
so hostname matching in the Vimeo data provider (export const vimeoHosts) will
handle embed and subdomain patterns consistently with YouTube/Twitch host lists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Around line 32-39: The current fetch in vimeo.ts (inside the function using
fetchAsBot) blindly calls res.body.json() which will fail or return invalid data
for non-OK HTTP responses; update the promise callback to check res.ok (or
inspect res.status) after fetchAsBot returns, and if not OK throw or reject with
a clear error containing the status and statusText (or return a controlled
null/undefined), otherwise parse and return the JSON as VimeoOEmbedResponse;
reference the fetchAsBot call and the VimeoOEmbedResponse type so the check
occurs before calling res.body.json().

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Line 42: The vimeoHosts array currently only contains "vimeo.com" so embedded
Vimeo URLs like "player.vimeo.com" are missed; update the export vimeoHosts to
include common Vimeo subdomains (e.g., add "player.vimeo.com" and optionally
"www.vimeo.com") so hostname matching in the Vimeo data provider (export const
vimeoHosts) will handle embed and subdomain patterns consistently with
YouTube/Twitch host lists.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`:
- Around line 9-20: The getIFrameAttributes function currently lets visit
continue and so properties get overwritten by later iframes; update the visit
callback in getIFrameAttributes to stop traversal once the first iframe is found
by returning visit.EXIT (or the equivalent sentinel) from the callback after
setting properties so the first match is returned immediately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cab77c96-159b-47bf-bcfd-a282a8886fe4

📥 Commits

Reviewing files that changed from the base of the PR and between 12a481c and dadc810.

📒 Files selected for processing (6)
  • apps/worker/src/tasks/url-metadata/data-providers/video.ts
  • apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts
  • apps/worker/src/tasks/url-metadata/data-providers/x.ts
  • apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
  • apps/worker/src/tasks/url-metadata/data-providers/x.ts

Comment on lines +32 to +39
return await fetchAsBot({
url: new URL(
`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,
),
method: "GET",
skipRobotsCheck: true,
signal,
}).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing error handling for failed HTTP responses.

The function assumes the fetch always succeeds and returns valid JSON. If Vimeo returns an error status code (e.g., 404 for invalid videos), calling res.body.json() on a non-OK response may produce unexpected results or throw.

🛡️ Proposed fix to add response status check
 export async function getVimeoOEmbedDataFromUrl(
 	url: URL,
 	signal: AbortSignal,
 ): Promise<VimeoOEmbedResponse> {
-	return await fetchAsBot({
+	const res = await fetchAsBot({
 		url: new URL(
 			`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,
 		),
 		method: "GET",
 		skipRobotsCheck: true,
 		signal,
-	}).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>);
+	});
+	if (res.statusCode < 200 || res.statusCode >= 300) {
+		throw new Error(`Vimeo oEmbed request failed with status ${res.statusCode}`);
+	}
+	return res.body.json() as Promise<VimeoOEmbedResponse>;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return await fetchAsBot({
url: new URL(
`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,
),
method: "GET",
skipRobotsCheck: true,
signal,
}).then((res) => res.body.json() as Promise<VimeoOEmbedResponse>);
export async function getVimeoOEmbedDataFromUrl(
url: URL,
signal: AbortSignal,
): Promise<VimeoOEmbedResponse> {
const res = await fetchAsBot({
url: new URL(
`https://vimeo.com/api/oembed.json?url=${encodeURIComponent(url.toString())}`,
),
method: "GET",
skipRobotsCheck: true,
signal,
});
if (res.statusCode < 200 || res.statusCode >= 300) {
throw new Error(`Vimeo oEmbed request failed with status ${res.statusCode}`);
}
return res.body.json() as Promise<VimeoOEmbedResponse>;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` around lines 32 -
39, The current fetch in vimeo.ts (inside the function using fetchAsBot) blindly
calls res.body.json() which will fail or return invalid data for non-OK HTTP
responses; update the promise callback to check res.ok (or inspect res.status)
after fetchAsBot returns, and if not OK throw or reject with a clear error
containing the status and statusText (or return a controlled null/undefined),
otherwise parse and return the JSON as VimeoOEmbedResponse; reference the
fetchAsBot call and the VimeoOEmbedResponse type so the check occurs before
calling res.body.json().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql`:
- Around line 7-13: The column size for url_metadata_gist_file.language is too
small (varchar(16)) and must be widened to varchar(64) to accept multi-word
GitHub language labels; update the migration that creates the
"url_metadata_gist_file" table to change the "language" column from varchar(16)
to varchar(64) and also update the schema definition in the DB schema for
url-metadata (the exported url_metadata_gist_file/table schema or column
descriptor) to reflect the new max length so runtime inserts from
getEmbedDataFromGist.ts succeed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e4ec19df-27b2-44b0-8ee1-be11c02b38de

📥 Commits

Reviewing files that changed from the base of the PR and between dadc810 and 5b64ec6.

📒 Files selected for processing (2)
  • packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql
  • packages/db/drizzle/20260307174859_furry_dark_beast/snapshot.json

Comment on lines +7 to +13
CREATE TABLE "url_metadata_gist_file" (
"gist_id" varchar(256),
"filename" varchar(256),
"content_key" varchar(256) NOT NULL,
"language" varchar(16) NOT NULL,
CONSTRAINT "url_metadata_gist_file_pkey" PRIMARY KEY("gist_id","filename")
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Ingest path:"
rg -n -C2 'language:\s*file\?\.language\s*\?\?\s*"text"' \
  apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts

echo
echo "Current schema cap:"
rg -n -C1 'varchar\("language", \{ length: 16 \}\)|"language" varchar\(16\)' \
  packages/db/src/schema/url-metadata.ts \
  packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql

Repository: playfulprogramming/hoof

Length of output: 826


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full url_metadata_gist_file table definition ==="
rg -n -A10 'url_metadata_gist_file' packages/db/src/schema/url-metadata.ts | head -20

echo
echo "=== Check for hardcoded length assumptions or truncation logic ==="
rg -n 'language.*16|length.*16|truncate.*language' \
  apps/worker/src/tasks/url-metadata/ packages/db/src/schema/ || echo "No matches"

echo
echo "=== Verify GitHub's actual language label examples ==="
rg -n 'language|Language' apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts | head -15

Repository: playfulprogramming/hoof

Length of output: 843


🌐 Web query:

GitHub gist API language field examples Jupyter Notebook maximum length

💡 Result:

language field in the GitHub Gist REST API (files.*.language)

In a Gist response, each file object can include a detected language string, e.g.:

  • Markdown file
    • "language": "Markdown" (example shown in GitHub’s Gists REST API docs) [1]
  • Ruby file
    • "language": "Ruby" (example shown in GitHub’s Gists REST API docs) [2]
  • Jupyter Notebook file (.ipynb)
    • The detected language name GitHub Linguist uses for .ipynb is "Jupyter Notebook" [3], so that’s what you should expect in files.*.language when GitHub can detect it.

Notes:

  • language is detection-derived (GitHub Linguist); it may be null if GitHub can’t confidently identify a language (commonly for binary/unknown content). [4]

“Maximum length”

  • There’s no published maximum length for the language string itself in the Gist REST API schema/docs (it’s just a short language name like "Markdown" / "Jupyter Notebook").
  • If you meant maximum file content length returned by the API: the Gist REST API returns up to 1 MB of content per file; beyond that, the file is marked truncated: true, and you should fetch the full file via raw_url. [1]

Sources: [1] [2] [3] [4]


Widen url_metadata_gist_file.language column to accommodate GitHub's multi-word language labels.

The ingest path at apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts:80 writes GitHub's gist-file language value directly into this column without validation or truncation. GitHub's Linguist library returns multi-word labels such as "Jupyter Notebook" (17 characters), which exceeds the current varchar(16) constraint in both the migration and schema files. Inserts will fail when GitHub returns such labels.

Widen the column to varchar(64) in:

  • packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql (line 11)
  • packages/db/src/schema/url-metadata.ts (line 24)
Proposed fix
-	"language" varchar(16) NOT NULL,
+	"language" varchar(64) NOT NULL,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/db/drizzle/20260307174859_furry_dark_beast/migration.sql` around
lines 7 - 13, The column size for url_metadata_gist_file.language is too small
(varchar(16)) and must be widened to varchar(64) to accept multi-word GitHub
language labels; update the migration that creates the "url_metadata_gist_file"
table to change the "language" column from varchar(16) to varchar(64) and also
update the schema definition in the DB schema for url-metadata (the exported
url_metadata_gist_file/table schema or column descriptor) to reflect the new max
length so runtime inserts from getEmbedDataFromGist.ts succeed.

@fennifith fennifith merged commit f37a454 into main Mar 7, 2026
3 checks passed
@fennifith fennifith deleted the oembed-url-metadata branch March 7, 2026 19:52
github-merge-queue bot pushed a commit to playfulprogramming/playfulprogramming that referenced this pull request Mar 21, 2026
Adds embed information from playfulprogramming/hoof#121 to provide &
cache the video/post/gist details

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Unified embed handling: richer support for video
(YouTube/Vimeo/Twitch), social posts, and Gist previews via metadata.
  * Gist previews now show truncated code content and filename.

* **Bug Fixes**
* More reliable embed fallbacks and external-link behavior for video
embeds.

* **Style**
* Improved video/social embed layout: centered, responsive sizing,
correct aspect-ratio handling, and better image rendering
(cross-origin-safe images and thumbnails).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Process OEmbed data with url-metadata

1 participant