Add apache-datafusion provider skeleton #64998
Thanks for adding this new provider, @gopidesupavan! Direct link: I suggest adding this to the relevant reference.
apache.datafusion
Srabasti
left a comment
The link "https://airflow.apache.org/docs/apache-airflow-providers-apache-datafusion/0.1.0" returns a 404 error.
I would be happy to update the link if you permit! I was curious to learn more about this new provider, so I was checking out the links.
These will be published as part of the release, so this is not required for now.
Yes, see https://datafusion.apache.org/python/ for the functionality it provides.
+---------------------+-----------------------------------------------------+------------------------------------------------+
| apache-beam         | ``pip install 'apache-airflow[apache-beam]'``       | Apache Beam operators & hooks                  |
+---------------------+-----------------------------------------------------+------------------------------------------------+
| apache-datafusion   | ``pip install 'apache-airflow[apache-datafusion]'`` | Apache DataFusion provider package             |
Alphabetical order: apache-cassandra should come before apache-datafusion here. The entry was inserted between apache-beam and apache-cassandra, which breaks the alphabetical sort the rest of the table follows. Move it down two rows so the order reads beam, cassandra, datafusion, drill.
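A small check like the following could guard the ordering in both the docs table and the labeler file. It is only a sketch: the helper name `first_out_of_order` is hypothetical, and the two lists contain just the four names mentioned in this comment, not the real table.

```python
from __future__ import annotations


def first_out_of_order(names: list[str]) -> str | None:
    """Return the first name that sorts before its predecessor, or None if sorted."""
    for previous, current in zip(names, names[1:]):
        if current < previous:
            return current
    return None


# Order as inserted by this PR (illustrative subset, not the full table).
as_submitted = ["apache-beam", "apache-datafusion", "apache-cassandra", "apache-drill"]
# Order requested in the review.
as_requested = ["apache-beam", "apache-cassandra", "apache-datafusion", "apache-drill"]

offender = first_out_of_order(as_submitted)  # the entry that breaks the sort
clean = first_out_of_order(as_requested)  # None when the list is sorted
```

The function reports the entry that follows the misplaced one, which is usually enough to locate the block that needs to move.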
provider:apache-beam:
  - providers/apache/beam/**

provider:apache-datafusion:
Same alphabetical-order issue as the docs table. Current order is apache-beam -> apache-datafusion -> apache-cassandra -> apache-drill. The labeler entries are alphabetical elsewhere in this file. Move the apache-datafusion block below apache-cassandra.
.. note::

    This provider is currently not ready and only contains the initial package skeleton.
This file is marked "AUTOMATICALLY GENERATED" at the top, and the "currently not ready / package skeleton" sentence isn't in PROVIDER_README_TEMPLATE.rst.jinja2. It will be clobbered the next time the README is regenerated (at release time). If you want a durable disclaimer, the right place is either the template or the description: field in provider.yaml.
description: |
    `Apache DataFusion <https://datafusion.apache.org/>`__

state: not-ready
Other in-tree skeleton providers (vespa, akeyless, common-ai, informatica) use state: ready with lifecycle: incubation even when the package only contains the skeleton. state: not-ready here will exclude the provider from regular builds and releases (see valid_states handling in dev/breeze/src/airflow_breeze/utils/packages.py). Is the intent to defer the first release until hooks/operators land? If so, fine. If you wanted "release at 0.1.0 as an incubating provider," switch to state: ready to match the others.
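To illustrate the concern, here is a rough sketch of how a state filter can drop a provider from a build. This is not the breeze implementation: `PROVIDERS`, `select_providers`, and the metadata values are hypothetical, and only the `state` and `lifecycle` field names come from this comment.

```python
from __future__ import annotations

# Hypothetical metadata; real values live in each provider's provider.yaml.
PROVIDERS = {
    "apache.cassandra": {"state": "ready"},
    "vespa": {"state": "ready", "lifecycle": "incubation"},
    "apache.datafusion": {"state": "not-ready"},
}


def select_providers(providers: dict[str, dict], valid_states: set[str]) -> list[str]:
    """Return the sorted ids of providers whose state is in valid_states."""
    return sorted(
        provider_id
        for provider_id, meta in providers.items()
        if meta.get("state") in valid_states
    )


# A build that only accepts "ready" silently skips the not-ready provider.
regular_build = select_providers(PROVIDERS, {"ready"})
```

Under these assumptions an incubating provider with `state: ready` is still selected, while `state: not-ready` is filtered out regardless of lifecycle.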
provider_info = "airflow.providers.apache.datafusion.get_provider_info:get_provider_info"

[tool.flit.module]
name = "airflow.providers.apache.datafusion"
Other in-tree flit providers (cassandra, vespa, akeyless, ...) carry an explicit [tool.flit.sdist] block directly after [tool.flit.module], with the comment "Explicit sdist contents so the build does not rely on VCS information (flit 4.0 makes --no-use-vcs the default -- see pypa/flit#782)." This file is missing that block, so the sdist contents will depend on VCS state. Looks like the pyproject was generated from an older template -- regenerating (or copying the block from a recent provider like providers/apache/cassandra/pyproject.toml) should add it.
def test_example():
    assert True
assert True will pass even if the provider metadata is wrong, the entry point is broken, or the package fails to import. Since this PR adds a new provider package, it would be worth asserting something real -- e.g. that airflow.providers.apache.datafusion.get_provider_info.get_provider_info() returns the expected package-name and name. That catches both the import path and the provider registration.
Quickest fix:
git fetch upstream main && git rebase upstream/main
rm uv.lock && uv lock
git add uv.lock && git rebase --continue
git push --force-with-lease
Automated nudge: ignore if you're not ready to rebase. This comment is updated in place on future
