feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView by abhijeet-dhumal · Pull Request #6440 · feast-dev/feast

abhijeet-dhumal · 2026-05-27T11:59:35Z

What this PR does / why we need it

get_historical_features() on a BatchFeatureView re-runs the full UDF on raw data on every call. For embedding pipelines, that is 20–40 min of compute per training run, even though the features already exist from the last materialize() run.

Fix: Route get_historical_features() to read pre-computed parquet from batch_source.path instead of re-executing the UDF.

To support this, SparkSource now accepts query + path together:

query — raw data read during materialize()
path — write-back target and pre-computed read source for get_historical_features()

SparkSource(
    query="SELECT id, text, event_timestamp FROM bronze.documents",
    path="s3://my-bucket/feast/features/document_embeddings/",
)
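
The relaxed constraint can be sketched as a standalone check. This is a minimal illustration, not the PR's actual code: `validate_source_options` is a hypothetical helper, and it assumes that query + path is the only newly permitted combination (other pairings, such as table + path, are assumed to remain invalid).

```python
from typing import Optional


def validate_source_options(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> None:
    """Hypothetical check mirroring the constraint described above."""
    provided = {
        name
        for name, value in (("table", table), ("query", query), ("path", path))
        if value is not None
    }
    # Previously allowed: exactly one of table, query, or path.
    if len(provided) == 1:
        return
    # Newly allowed by this PR: query and path together.
    if provided == {"query", "path"}:
        return
    raise ValueError(
        "Specify exactly one of table, query, or path, "
        f"or query together with path; got: {sorted(provided) or 'none'}"
    )
```

With this sketch, `validate_source_options(query="SELECT ...", path="s3://...")` passes, while supplying none of the options or all three raises `ValueError`.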

This PR also allows an offline-only BatchFeatureView (online=False, offline=True) to skip the online validation check in get_historical_features(), so it can be used purely for training data without configuring an online store.

If path does not exist yet (for example, on the first run before any materialization), get_historical_features() falls back to the live query.
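
The routing and fallback described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `read_historical`, `read_precomputed`, and `run_live_query` are hypothetical names, and a local filesystem check stands in for whatever readability check the Spark offline store performs against object storage.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional

Rows = List[Dict[str, object]]


def read_historical(
    path: Optional[str],
    read_precomputed: Callable[[str], Rows],
    run_live_query: Callable[[], Rows],
) -> Rows:
    """Prefer pre-computed features at `path`; otherwise run the live query."""
    if path is not None and Path(path).exists():
        try:
            # Pre-computed features exist: skip the UDF entirely.
            return read_precomputed(path)
        except OSError:
            # Path exists but is not readable: fall through to the live query.
            pass
    # No usable pre-computed data (e.g. first run before materialization).
    return run_live_query()
```

The two readers are passed in as callables so the routing decision can be tested without a Spark session; in a real store they would wrap a parquet read and the query + UDF execution respectively.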

Which issue(s) this PR fixes

N/A. Enables efficient training data retrieval for BatchFeatureView embedding pipelines without re-running UDFs.

Checks

I've made sure the tests are passing.
My commits are signed off (git commit -s)
My PR title follows conventional commits format

Testing Strategy

Unit tests — offline path routing, SparkSource constraint, graceful fallback
Manual tests — get_historical_features() reads from parquet, not UDF, after materialization

devin-ai-integration

Devin Review found 1 potential issue.


SparkSource previously required exactly one of table/query/path. This relaxes the constraint to allow query + path together:
- query: used for reading raw data during materialization
- path: used for offline write-back (offline=True) and as pre-computed read source in get_historical_features

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

… get_historical_features Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

…orical_features

The function and its call were removed in this PR but the replacement (_apply_bfv_transformations_for_historical) lives in a separate PR (feast-dev#6440). Removing it here would silently return raw untransformed features for any BatchFeatureView with a Python UDF via the standard get_historical_features() API path (FeatureStore → passthrough_provider → SparkOfflineStore). Restoring the function and its call until feast-dev#6440 lands.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal changed the title ~~Feat/spark bfv offline historical features~~ feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView May 27, 2026

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from a98e23b to 57b2489 May 27, 2026 14:50

abhijeet-dhumal marked this pull request as ready for review May 28, 2026 06:23

abhijeet-dhumal requested a review from a team as a code owner May 28, 2026 06:23

devin-ai-integration Bot reviewed May 28, 2026


Comment thread sdk/python/feast/feature_store.py

abhijeet-dhumal and others added 5 commits May 29, 2026 14:26

feat: read from offline path in get_historical_features for BFVs

04071f7

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: graceful fallback when offline path is not readable

0852e5c

Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

style: ruff format spark.py and spark_source.py

09c470e

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

fix: allow offline-only BatchFeatureView to skip online validation in…

e30e146

… get_historical_features Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>

abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 57b2489 to e30e146 May 29, 2026 08:57

abhijeet-dhumal wants to merge 5 commits into feast-dev:master from abhijeet-dhumal:feat/spark-bfv-offline-historical-features
