Add ClickHouse Provider #67080

Open
BentsiLeviav wants to merge 35 commits into apache:main from BentsiLeviav:add-clickhouse-provider

Conversation

@BentsiLeviav BentsiLeviav commented May 18, 2026

Description

Adds a new apache-airflow-providers-clickhouse provider that integrates Airflow with ClickHouse via the HTTP interface using the clickhouse-connect library.

Scope of this implementation

  • ClickHouseHook - the core integration, extending DbApiHook so all standard SQLExecuteQueryOperator features work out of the box (templating, handler, split_statements, etc.)
  • Connection form UI with dedicated fields for TLS, timeouts, compression, session settings, and client kwargs
  • bulk_insert_rows() for more performant inserts using clickhouse-connect's native insert path
  • get_uri() for SQLAlchemy-compatible connection strings (clickhousedb:// / clickhousedbs://)
  • Connection type docs, operator how-to guide, and integration logo
  • 95 unit tests

Implementation decisions

  • DB-API 2.0 adapter (ClickHouseConnection): clickhouse-connect doesn't expose a DB-API connection natively - we wrap its Client in a thin adapter so DbApiHook.run() works unmodified. commit() and rollback() are intentional no-ops since ClickHouse has no transactions.
  • Two-level settings merge: both session_settings and client_kwargs can be set at the connection level (via the extra JSON field) and overridden at the task level (via hook constructor arguments), with the constructor taking precedence on conflicts.
  • Hook-managed kwargs protection: keys that the hook owns (host, port, username, password, database, secure, verify, client_name, settings) are stripped from any user-supplied client_kwargs so hook-managed values always win.
  • Client name: every query is tagged with apache-airflow/<version> apache-airflow-providers-clickhouse/<version> in the HTTP User-Agent (system.query_log), making queries traceable back to their Airflow source. Users can append a custom label via the client_name extra field.
  • No dedicated operators are added - SQLExecuteQueryOperator from common.sql covers all standard SQL use cases.
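The two-level merge and the hook-managed kwargs protection described above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code: the function names `merge_levels` and `build_client_kwargs` are hypothetical, and the example option `query_limit` is only a placeholder for any client kwarg a user might pass. The set of protected keys is taken from the list in the description.

```python
from __future__ import annotations

from typing import Any

# Keys the hook owns, as listed in the PR description.
HOOK_MANAGED_KWARGS = frozenset(
    {
        "host", "port", "username", "password", "database",
        "secure", "verify", "client_name", "settings",
    }
)


def merge_levels(
    connection_level: dict[str, Any] | None,
    task_level: dict[str, Any] | None,
) -> dict[str, Any]:
    """Hypothetical helper: connection-level values are the base,
    task-level (constructor) values win on conflicts."""
    merged = dict(connection_level or {})
    merged.update(task_level or {})
    return merged


def build_client_kwargs(
    extra: dict[str, Any],
    constructor_kwargs: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: merge both levels, then drop every key the
    hook manages itself so user input cannot override it."""
    merged = merge_levels(extra.get("client_kwargs"), constructor_kwargs)
    return {k: v for k, v in merged.items() if k not in HOOK_MANAGED_KWARGS}
```

With this sketch, a `host` entry smuggled into `client_kwargs` is silently dropped, while a non-managed key set at both levels resolves to the constructor value.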

File structure (generated with Claude)

| File(s) | Purpose |
| --- | --- |
| provider.yaml | Provider metadata: name, version, integrations, connection types, UI field behaviour, and conn-fields schema used to generate the connection form |
| pyproject.toml | Package build config and dependencies (clickhouse-connect >=0.7.0, common-sql >=1.32.0); auto-generated from the Breeze template |
| src/.../hooks/clickhouse.py | Core implementation: ClickHouseHook (extends DbApiHook) and ClickHouseConnection (thin DB-API 2.0 adapter wrapping the clickhouse-connect client) |
| src/.../get_provider_info.py | Auto-generated from provider.yaml by the Breeze release tooling; do not edit manually |
| src/airflow/__init__.py, src/airflow/providers/__init__.py | Namespace package declarations required for the airflow.providers implicit namespace |
| src/.../clickhouse/__init__.py | Version file (__version__ = "1.0.0") with minimum Airflow version guard; auto-generated |
| docs/connections/clickhouse.rst | Connection configuration reference: all fields, their types, defaults, and JSON/URI examples |
| docs/operators/clickhouse.rst | How-to guide: using SQLExecuteQueryOperator and ClickHouseHook directly, including session_settings and bulk_insert_rows examples |
| docs/index.rst, docs/conf.py, docs/changelog.rst, docs/security.rst | Standard provider docs scaffold; mostly auto-generated |
| docs/integration-logos/ClickHouse.png | Official ClickHouse logo used by the Apache Airflow website |
| tests/unit/clickhouse/hooks/test_clickhouse.py | 95 unit tests covering connection building, settings/kwargs merge logic, database override, URI generation, bulk insert, UI widgets, and autocommit semantics |
| tests/system/clickhouse/example_clickhouse.py | System test / example DAG: create table → bulk insert → read rows → drop table |
| .github/boring-cyborg.yml | Adds provider:clickhouse label rule for automatic PR labelling |
| scripts/ci/docker-compose/remove-sources.yml, tests-sources.yml | Auto-updated by prek to mount the clickhouse provider sources/tests into the CI Docker environment |

boring-cyborg Bot commented May 18, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • For a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@BentsiLeviav
Author

@koletzilla @joe-clickhouse would you mind reviewing that as well?

Contributor

eladkal commented May 18, 2026

Thanks for adding the ClickHouse provider, @BentsiLeviav.
I am happy to sponsor the provider. Please follow the procedure listed in https://github.com/apache/airflow/blob/main/providers/ACCEPTING_PROVIDERS.rst#discussion-thread-template. We simplified the process of accepting new providers, and it now requires just one mailing list thread. This should be done in parallel with the code review of the PR.

@joe-clickhouse joe-clickhouse left a comment

Hi @BentsiLeviav. Looks pretty good! From a clickhouse-connect perspective, I have a few comments. In short, the scheme name needs updating, and I think the bulk insert should be changed to use an insert context or just the regular client insert method, which will automatically stream.

Comment thread providers/clickhouse/src/airflow/providers/clickhouse/hooks/clickhouse.py Outdated
Comment thread providers/clickhouse/src/airflow/providers/clickhouse/hooks/clickhouse.py Outdated
Comment thread providers/clickhouse/tests/unit/clickhouse/hooks/test_clickhouse.py Outdated

@joe-clickhouse joe-clickhouse left a comment

Hi @BentsiLeviav. The changes you made look good! I did notice one last thing, which I've left a comment about in the code: passing arbitrary kwargs to the client from the Connection extra level.

extra: dict[str, Any] = conn.extra_dejson

# Merge client_kwargs: extra values are the base, constructor values override.
raw_client_kwargs = extra.get("client_kwargs")

I'm not an Airflow expert, but I think this might expose low-level clickhouse-connect client options at the Connection extra level, which may be too broad for an Airflow provider.

From a clickhouse-connect perspective, arbitrary client kwargs are useful when the caller owns the Python code. So I think ClickHouseHook(client_kwargs=...) is reasonable at the Dag author level. But for Connection extras, the provider should probably only expose a finite set of reviewed and documented fields like host, port, username, password, database, secure, verify, timeouts, compression, etc.

It looks like _HOOK_MANAGED_KWARGS prevents overriding hook-owned fields, which is good, but it still allows any other clickhouse_connect.get_client() kwarg through. That means a Connection configuration user can configure low-level transport and security behavior on behalf of any Dag that uses the connection.

Long story short, I think we should keep arbitrary client_kwargs as a hook constructor argument only, and promote individual kwargs to Connection extras when the provider intentionally supports and documents them. This seems more consistent with Airflow's guidance to allowlist Connection extras rather than forwarding arbitrary kwargs into underlying libraries, but I'll defer to the Airflow maintainers on the provider policy.

Reference on Connection configuration users:

Connection configuration users
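The allowlisting approach suggested above could look roughly like the sketch below. It is an illustration under stated assumptions, not the provider's code: the helper `split_extras` and the constant `ALLOWED_EXTRA_FIELDS` are hypothetical, and the field names in the allowlist are only examples drawn from this PR's discussion (the connection-level fields such as host or password live outside the extra JSON and are not shown).

```python
from __future__ import annotations

from typing import Any

# Hypothetical allowlist of reviewed and documented Connection extras.
ALLOWED_EXTRA_FIELDS = frozenset(
    {
        "secure", "verify", "connect_timeout", "send_receive_timeout",
        "compress", "session_settings", "client_name",
    }
)


def split_extras(extra: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return (allowed extras, sorted names of rejected keys).

    Only allowlisted keys are forwarded; everything else, including a
    free-form "client_kwargs" entry, is reported back so the caller can
    log a warning or raise.
    """
    allowed = {k: v for k, v in extra.items() if k in ALLOWED_EXTRA_FIELDS}
    rejected = sorted(k for k in extra if k not in ALLOWED_EXTRA_FIELDS)
    return allowed, rejected
```

Under this design, arbitrary client kwargs remain available to Dag authors through the hook constructor, while Connection extras stay limited to the fields the provider documents.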

Author

I saw that the Postgres provider passes arbitrary extras through to the underlying library as well, which Airflow's security model explicitly accepts for trusted Connection configuration users.

As a former Airflow user, I would have found this capability very helpful: you could update critical settings on the connection and have them applied to all relevant Dags, rather than updating the constructor in each Dag one by one. The same goes for common settings, which could be shared in a single place instead of being repeated in each Dag.

@eladkal could you or any other Airflow maintainer provide guidance on that? I'm ok with reverting this behaviour if needed, though I think it might be very useful. In addition, we can start with having this and revert the logic in the future if we get any user feedback about it.

Contributor

We normally use hook_params; you can find many examples of that in the database-related hooks across the source code.

Member

potiuk commented May 24, 2026

@BentsiLeviav A few things need addressing before review — see our Pull Request quality criteria.

  • Static checks, tests, and docs build all fail — 22 failing checks across Basic tests / Breeze unit tests, CI image checks / Static checks, CI image checks / Build documentation (--docs-only), and provider distribution test matrices. See the Checks tab for the full list and individual job logs.

Adding a new provider is a substantial change; please get a clean CI run before requesting maintainer review. No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@BentsiLeviav
Author

@potiuk @eladkal can one of you re-run the tests?

I fixed most of them; the remaining failures are in the Amazon provider and don't seem to be related to our changes at all.

Comment thread providers/clickhouse/src/airflow/providers/clickhouse/hooks/clickhouse.py Outdated
Comment thread providers/clickhouse/src/airflow/providers/clickhouse/hooks/clickhouse.py Outdated
@BentsiLeviav BentsiLeviav requested a review from koletzilla May 27, 2026 22:58
Contributor

github-actions Bot commented May 28, 2026

uv.lock on main just moved via #66632 ("Bump google-cloud-aiplatform to force upgrade of litellm"), commit 78039ba and this PR currently conflicts.

Quickest fix:

git fetch upstream main && git rebase upstream/main
rm uv.lock && uv lock
git add uv.lock && git rebase --continue
git push --force-with-lease

Automated nudge — ignore if you're not ready to rebase. This comment is updated in place on future uv.lock bumps.

@koletzilla koletzilla left a comment

LGTM :D

Contributor

eladkal commented Jun 1, 2026

Needs a rebase to resolve conflicts.

…ning

The clickhouse-connect library only registers the clickhousedb:// SQLAlchemy
dialect; clickhousedbs:// was never a valid scheme and would fail at engine
creation. TLS is now enabled via ?secure=true, and tuning params
(connect_timeout, send_receive_timeout, compress, verify) are forwarded as
query-string arguments so SQLAlchemy-path users get the same settings as
DB-API-path users. Tests updated accordingly.
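
The URI shape this commit message describes can be sketched with the standard library alone. This is a minimal illustration, not the hook's get_uri() implementation: the function `build_uri` and its signature are hypothetical, and it assumes (as the message states) that the clickhousedb:// dialect accepts secure, connect_timeout, send_receive_timeout, compress, and verify as query-string arguments.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

# Tuning parameters named in the commit message.
TUNING_KEYS = ("connect_timeout", "send_receive_timeout", "compress", "verify")


def build_uri(
    host: str,
    port: int,
    username: str = "",
    password: str = "",
    database: str = "",
    secure: bool = False,
    **tuning: object,
) -> str:
    """Hypothetical helper: always use the clickhousedb:// scheme and
    express TLS and tuning options as query-string arguments."""
    auth = ""
    if username:
        # Percent-encode credentials so characters like "@" stay unambiguous.
        auth = quote(username, safe="")
        if password:
            auth += ":" + quote(password, safe="")
        auth += "@"

    query: dict[str, str] = {}
    if secure:
        query["secure"] = "true"
    for key in TUNING_KEYS:
        value = tuning.get(key)
        if value is None:
            continue
        # Render booleans in lowercase; everything else via str().
        query[key] = str(value).lower() if isinstance(value, bool) else str(value)

    uri = f"clickhousedb://{auth}{host}:{port}/{database}"
    return f"{uri}?{urlencode(query)}" if query else uri
```

For example, `build_uri("ch.example.com", 8443, "airflow", "p@ss", "analytics", secure=True, connect_timeout=10)` yields `clickhousedb://airflow:p%40ss@ch.example.com:8443/analytics?secure=true&connect_timeout=10`.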
@BentsiLeviav BentsiLeviav force-pushed the add-clickhouse-provider branch from 2f5c62e to 63c9d34 Compare June 4, 2026 00:55

6 participants