Skip to content

Extract shared ParquetReadPlan for leaf column resolution#20913

Open
friendlymatthew wants to merge 2 commits intoapache:mainfrom
pydantic:pr-20854-followup
Open

Extract shared ParquetReadPlan for leaf column resolution#20913
friendlymatthew wants to merge 2 commits intoapache:mainfrom
pydantic:pr-20854-followup

Conversation

@friendlymatthew
Copy link
Contributor

Note: please review from the second commit. This PR is currently based off of #20854

Rationale for this change

This PR extracts a reusable build_parquet_read_plan function and ParquetReadPlan struct from the row filter's FilterCandidateBuilder::build

Today, struct-aware leaf projection only exists in the row filter path. The opener still uses ProjectionMask::roots which reads all leaves of a struct even when only one field is needed. By pulling this logic into a shared abstraction, the opener can reduce unnecessary I/O for struct-heavy schemas

@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 12, 2026
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it the idea is that this can be re-used for the projection side of things because it was written in a reusable manner, even thought at the moment it is only used for the row filter?

I generally like this change :)

|| self.is_nested_type_supported(return_type)
{
// try to resolve all field name arguments to strinrg literals
// if any argument is not a string literal, we can not determine the exact
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would get_field accept non-string literal arguments? Would this be a column e.g. get_field(struct_col, fields_col)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, get_field will always receive string literals. This is enforced at different levels:

The None fallback is purely defensive. It's not likely that we'll hit this in practice, but in the case we do, we'll just read out the entire struct. Though, we can also err here like we did before

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to either:

  • Do something slower but correct (read out entire struct)
  • Error explicitly

Since we are doing one of these I think it's good for now 👍🏻

@friendlymatthew
Copy link
Contributor Author

I take it the idea is that this can be re-used for the projection side of things because it was written in a reusable manner, even thought at the moment it is only used for the row filter?

Yes exactly. Here's a PR that does exactly just that! #20925

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants