fix(spark): S3/GCS PyArrow filesystem resolution for staging paths#6442

Open
abhijeet-dhumal wants to merge 2 commits into feast-dev:master from abhijeet-dhumal:fix/spark-staging-filesystem

@abhijeet-dhumal abhijeet-dhumal commented May 27, 2026

What this PR does / why we need it

Staging parquet reads fail on S3-compatible stores (MinIO, Ceph, custom AWS endpoints) because raw s3:// / s3a:// URIs are passed directly to pyarrow.dataset:

FileNotFoundError: s3://my-bucket/...

This PR replaces _normalize_staging_paths with _resolve_staging_filesystem, which builds the correct pyarrow.fs.S3FileSystem or pyarrow.fs.GcsFileSystem from the URI scheme. It picks up AWS_ENDPOINT_URL_S3 (for MinIO/LocalStack) and AWS_DEFAULT_REGION from the environment. Local and file:// paths pass through unchanged.
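For readers following along, here is a minimal sketch of how the environment lookup described above might be turned into S3FileSystem keyword arguments. The helper name s3_kwargs_from_env and its env parameter are hypothetical and are not taken from the PR. The split of the endpoint URL into scheme and endpoint_override is an assumption based on the diff fragment quoted in the review below. It needs Python 3.9+ for str.removeprefix.

```python
import os
from typing import Dict, Mapping, Optional


def s3_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for pyarrow.fs.S3FileSystem from the environment.

    Hypothetical helper: the name and the `env` parameter are illustrative only.
    """
    env = os.environ if env is None else env
    kwargs: Dict[str, str] = {}

    # Only pass a region when one is configured.
    region = env.get("AWS_DEFAULT_REGION")
    if region:
        kwargs["region"] = region

    # Custom endpoints (MinIO, LocalStack, Ceph): split the URL into a
    # scheme and a host[:port] endpoint override.
    endpoint = env.get("AWS_ENDPOINT_URL_S3")
    if endpoint:
        kwargs["scheme"] = "https" if endpoint.startswith("https") else "http"
        kwargs["endpoint_override"] = (
            endpoint.removeprefix("https://").removeprefix("http://")
        )
    return kwargs
```

With pyarrow installed, the result could be passed on as pyarrow.fs.S3FileSystem(**s3_kwargs_from_env()); region, scheme, and endpoint_override are all accepted keyword arguments of that constructor.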

Which issue(s) this PR fixes

Enables staging path reads for on-prem / private cloud Spark deployments (MinIO, Ceph, LocalStack).

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests — 12 cases covering S3 endpoint override, GCS, local, file://, MinIO, region fallback
  • Manual tests — staging reads from MinIO endpoint

@abhijeet-dhumal abhijeet-dhumal changed the title Fix/spark staging filesystem fix(spark): S3/GCS PyArrow filesystem for staging and offline-only feature view validation May 27, 2026
@abhijeet-dhumal abhijeet-dhumal force-pushed the fix/spark-staging-filesystem branch from 611bc8c to 2bf6136 Compare May 27, 2026 12:13
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the fix/spark-staging-filesystem branch from 7c3fe0f to 394f421 Compare May 27, 2026 12:25
@abhijeet-dhumal abhijeet-dhumal changed the title fix(spark): S3/GCS PyArrow filesystem for staging and offline-only feature view validation fix(spark): S3/GCS PyArrow filesystem resolution for staging paths May 27, 2026
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review May 28, 2026 06:23
@abhijeet-dhumal abhijeet-dhumal requested a review from a team as a code owner May 28, 2026 06:23
@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

)
kwargs["scheme"] = "https" if endpoint.startswith("https") else "http"
fs = pafs.S3FileSystem(**kwargs)
stripped = [p.replace("s3a://", "").replace("s3://", "") for p in paths]
Member

Better to use p.removeprefix("s3a://").removeprefix("s3://") here. The same applies to the https/http handling and to the other path schemes.
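To illustrate the difference behind this suggestion: str.replace removes every occurrence of the scheme text, while str.removeprefix (Python 3.9+) only touches the start of the string. The object key below is contrived for illustration and is not taken from the PR.

```python
# A contrived key that happens to contain "s3://" in the middle.
path = "s3://bucket/mirror/s3://legacy/part-0.parquet"

# replace() strips the scheme text wherever it appears, corrupting the key.
replaced = path.replace("s3a://", "").replace("s3://", "")

# removeprefix() strips the scheme only at the start of the string.
trimmed = path.removeprefix("s3a://").removeprefix("s3://")

print(replaced)  # bucket/mirror/legacy/part-0.parquet
print(trimmed)   # bucket/mirror/s3://legacy/part-0.parquet
```

For ordinary keys both forms give the same result; the prefix-only version simply avoids the surprise when the scheme text recurs later in the path.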

return fs, stripped

if sample.startswith("gs://"):
import pyarrow.fs as pafs
Member

The same import is repeated multiple times.
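One way to address this, sketched under assumptions, is to return early for local paths and then import pyarrow.fs once for all remote branches. The function name pick_filesystem is hypothetical, and returning None for local and file:// paths is an assumption about how "pass through unchanged" might be represented. The sketch also omits the endpoint and region handling.

```python
from typing import Any, Optional


def pick_filesystem(sample: str) -> Optional[Any]:
    """Return a PyArrow filesystem for remote URIs, or None for local paths.

    Hypothetical helper illustrating a single deferred import.
    """
    if not sample.startswith(("s3://", "s3a://", "gs://")):
        # Local and file:// paths: no explicit filesystem, no import needed.
        return None

    # Single deferred import shared by every remote branch below.
    import pyarrow.fs as pafs

    if sample.startswith("gs://"):
        return pafs.GcsFileSystem()
    return pafs.S3FileSystem()
```

Keeping the import inside the function still avoids loading pyarrow.fs for purely local staging paths, which may be why the original deferred it. A module-level import would be the simpler alternative if that cost is not a concern.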

3 participants