feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView #6440
Open
abhijeet-dhumal wants to merge 5 commits into
a98e23b to 57b2489
SparkSource previously required exactly one of table/query/path. This relaxes the constraint to allow query + path together:
- query: used for reading raw data during materialization
- path: used for offline write-back (offline=True) and as pre-computed read source in get_historical_features

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
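The relaxed constraint described in this commit message can be sketched as a small standalone check. This is an illustrative sketch only, not the code in this PR: `validate_source_options` is a hypothetical helper, and the assumption that every other multi-option combination is still rejected is inferred from the message, which only mentions query + path.

```python
from typing import Optional


def validate_source_options(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> None:
    """Hypothetical helper illustrating the relaxed rule; not Feast's actual code."""
    options = (("table", table), ("query", query), ("path", path))
    provided = {name for name, value in options if value}

    if not provided:
        raise ValueError("Specify at least one of table, query, or path.")

    # Old rule: exactly one option. Relaxed rule: query + path is also accepted.
    # Assumption: all other combinations remain invalid.
    if len(provided) > 1 and provided != {"query", "path"}:
        raise ValueError(
            "Only query + path may be combined; got: " + ", ".join(sorted(provided))
        )


# A single option and the query + path pair both pass without raising.
validate_source_options(table="raw_events")
validate_source_options(query="SELECT * FROM raw_events", path="/data/features")
```

Under these assumptions, a call such as `validate_source_options(table="t", query="SELECT 1")` raises `ValueError`.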
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
… get_historical_features Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
57b2489 to e30e146
abhijeet-dhumal added a commit to abhijeet-dhumal/feast that referenced this pull request May 29, 2026
…orical_features

The function and its call were removed in this PR but the replacement (_apply_bfv_transformations_for_historical) lives in a separate PR (feast-dev#6440). Removing it here would silently return raw untransformed features for any BatchFeatureView with a Python UDF via the standard get_historical_features() API path (FeatureStore → passthrough_provider → SparkOfflineStore).

Restoring the function and its call until feast-dev#6440 lands.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
What this PR does / why we need it
get_historical_features() on a BatchFeatureView re-runs the full UDF on raw data every call. For embedding pipelines, that's 20–40 min of compute per training run even though features already exist from the last materialize.

Fix: Route get_historical_features() to read pre-computed parquet from batch_source.path instead of re-executing the UDF.

To support this, SparkSource now accepts query + path together:
- query: raw data read during materialize()
- path: write-back target and pre-computed read source for get_historical_features()

Also allows BatchFeatureView with online=False, offline=True (offline-only) to skip the online validation check in get_historical_features(), so it can be used purely for training data without configuring an online store.

Falls back to live query if path doesn't exist yet (first run before any materialization).

Which issue(s) this PR fixes
N/A. Enables efficient training data retrieval for BatchFeatureView embedding pipelines without re-running UDFs.

Checks
git commit -s)

Testing Strategy
get_historical_features() reads from parquet, not UDF, after materialization
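
The read routing and fallback described under "What this PR does" can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `choose_read_source` is a hypothetical name, and the existence check defaults to `os.path.exists`, which only covers local paths (a real Spark deployment would need a filesystem-aware check for object stores).

```python
import os
from typing import Callable, Optional, Tuple


def choose_read_source(
    query: Optional[str],
    path: Optional[str],
    path_exists: Callable[[str], bool] = os.path.exists,
) -> Tuple[str, str]:
    """Hypothetical helper: pick where a historical read should come from.

    Returns ("path", path) when pre-computed data is present,
    otherwise ("query", query) as the live fallback.
    """
    # Prefer the pre-computed parquet written by a previous materialize run.
    if path and path_exists(path):
        return ("path", path)

    # First run before any materialization: fall back to the live query.
    if query:
        return ("query", query)

    raise ValueError("No readable source: path is missing and no query was given.")


# Simulate both states with an injected existence check.
after_materialize = choose_read_source(
    "SELECT * FROM raw_events", "/data/features", path_exists=lambda p: True
)
before_materialize = choose_read_source(
    "SELECT * FROM raw_events", "/data/features", path_exists=lambda p: False
)
```

Injecting `path_exists` keeps the sketch testable without touching a filesystem; here `after_materialize` selects the path and `before_materialize` selects the query.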