Add ClickHouse Provider#67080
Conversation
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
|
@koletzilla @joe-clickhouse would you mind reviewing that as well? |
|
thanks for adding Clickhouse provider @BentsiLeviav |
joe-clickhouse
left a comment
There was a problem hiding this comment.
Hi @BentsiLeviav. Looks pretty good! From a clickhouse-connect perspective I have few comments. In short, the scheme name needs updating and I think the bulk insert should be changed to use an insert context or just the regular client insert method which will automatically stream.
joe-clickhouse
left a comment
There was a problem hiding this comment.
Hi @BentsiLeviav. The changes you made look good! I did notice one last thing that i've left a comment about in the code related to passing arbitrary kwargs to the client from the Connection extra level.
| extra: dict[str, Any] = conn.extra_dejson | ||
|
|
||
| # Merge client_kwargs: extra values are the base, constructor values override. | ||
| raw_client_kwargs = extra.get("client_kwargs") |
There was a problem hiding this comment.
I'm not an Airflow expert, but I think this might expose low-level clickhouse-connect client options at the Connection extra level, which may be too broad for an Airflow provider.
From a clickhouse-connect perspective, arbitrary client kwargs are useful when the caller owns the Python code. So I think ClickHouseHook(client_kwargs=...) is reasonable at the Dag author level. But for Connection extras, the provider should probably only expose a finite set of reviewed and documented fields like host, port, username, password, database, secure, verify, timeouts, compression, etc.
It looks like _HOOK_MANAGED_KWARGS prevents overriding hook-owned fields, which is good, but it still allows any other clickhouse_connect.get_client() kwarg through. That means a Connection configuration user can configure low-level transport and security behavior on behalf of any Dag that uses the connection.
Long story short, I think we should keep arbitrary client_kwargs as a hook constructor argument only, and promote individual kwargs to Connection extras when the provider intentionally supports and documents them. This seems more consistent with Airflow's guidance to allowlist Connection extras rather than forwarding arbitrary kwargs into underlying libraries, but I'll defer to the Airflow maintainers on the provider policy.
Reference on Connection configuration users:
There was a problem hiding this comment.
I saw postgres passes arbitrary extras through to the underlying library as well, which Airflow's security model explicitly accepts for trusted Connection configuration users.
As a previous Airflow user, such capability would have helped a lot (where you could update critical settings via the connections and they would be applied to all relevant Dags, rather than going one by one and updating the constructor. The same goes for common settings that could be shared in a single place, instead of each Dag separately.)
@eladkal could you or any other Airflow maintainer provide guidance on that? I'm ok with reverting this behaviour if needed, though I think it might be very useful. In addition, we can start with having this and revert the logic in the future if we get any user feedback about it.
There was a problem hiding this comment.
We normally use hook_params you can find many examples for that in the database related hooks across the source code.
There was a problem hiding this comment.
|
@BentsiLeviav A few things need addressing before review — see our Pull Request quality criteria.
Adding a new provider is a substantial change; please get a clean CI run before requesting maintainer review. No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting |
|
Quickest fix: git fetch upstream main && git rebase upstream/main
rm uv.lock && uv lock
git add uv.lock && git rebase --continue
git push --force-with-leaseAutomated nudge — ignore if you're not ready to rebase. This comment is updated in place on future |
|
Needs rebase and resolve conflicts |
…ning The clickhouse-connect library only registers the clickhousedb:// SQLAlchemy dialect; clickhousedbs:// was never a valid scheme and would fail at engine creation. TLS is now enabled via ?secure=true, and tuning params (connect_timeout, send_receive_timeout, compress, verify) are forwarded as query-string arguments so SQLAlchemy-path users get the same settings as DB-API-path users. Tests updated accordingly.
…viders.clickhousedb
…ync README and docs
2f5c62e to
63c9d34
Compare
Description
Adds a new apache-airflow-providers-clickhouse provider that integrates Airflow with ClickHouse via the HTTP interface using the
clickhouse-connectlibrary.Scope of this implementation
ClickHouseHook- the core integration, extendingDbApiHookso all standardSQLExecuteQueryOperatorfeatures work out of the box (templating, handler, split_statements, etc.)bulk_insert_rows()for more performant inserts using clickhouse-connect's native insert pathget_uri()for SQLAlchemy-compatible connection strings (clickhousedb:///clickhousedbs://)Implementation decisions
DB-API 2.0adapter (ClickHouseConnection): clickhouse-connect doesn't expose a DB-API connection natively - we wrap its Client in a thin adapter soDbApiHook.run()works unmodified.commit()and
rollback()are intentional no-ops since ClickHouse has no transactions.session_settingsandclient_kwargscan be set at the connection level (via the extra JSON field) and overridden at the task level (via hook constructor arguments), with the constructor taking precedence on conflicts.apache-airflow/<version> apache-airflow-providers-clickhouse/<version>in the HTTP User-Agent (system.query_log), making queries traceable back to their Airflow source. Users can append a custom label via the client_name extra field.SQLExecuteQueryOperatorfromcommon.sqlcovers all standard SQL use cases.File structure (generated with Claude)
provider.yamlconn-fieldsschema used to generate the connection formpyproject.tomlclickhouse-connect >=0.7.0,common-sql >=1.32.0) — auto-generated from the Breeze templatesrc/.../hooks/clickhouse.pyClickHouseHook(extendsDbApiHook) andClickHouseConnection(thin DB-API 2.0 adapter wrapping theclickhouse-connectclient)src/.../get_provider_info.pyprovider.yamlby the Breeze release tooling — do not edit manuallysrc/airflow/__init__.py,src/airflow/providers/__init__.pyairflow.providersimplicit namespacesrc/.../clickhouse/__init__.py__version__ = "1.0.0") with minimum Airflow version guard — auto-generateddocs/connections/clickhouse.rstdocs/operators/clickhouse.rstSQLExecuteQueryOperatorandClickHouseHookdirectly, includingsession_settingsandbulk_insert_rowsexamplesdocs/index.rst,docs/conf.py,docs/changelog.rst,docs/security.rstdocs/integration-logos/ClickHouse.pngtests/unit/clickhouse/hooks/test_clickhouse.pytests/system/clickhouse/example_clickhouse.py.github/boring-cyborg.ymlprovider:clickhouselabel rule for automatic PR labellingscripts/ci/docker-compose/remove-sources.yml,tests-sources.yml