
Fix Weaviate tenant-aware ingestion #67298

Open
iwannagotobed wants to merge 1 commit into apache:main from iwannagotobed:fix-weaviate-tenant-ingestion

@iwannagotobed commented May 21, 2026

Summary

Fixes tenant-aware ingestion for Weaviate operators.

WeaviateIngestOperator and WeaviateDocumentIngestOperator already accepted a
tenant argument, but the value was not consistently applied to the underlying
hook operations.

As a result, users could configure tenant="..." on the operator while some
create, query, replace, or delete operations still ran against the base
collection instead of the tenant-scoped collection.

Why this matters

Weaviate multi-tenancy isolates objects by tenant within a collection. For a
multi-tenant collection, object operations must be performed through:

collection.with_tenant("<tenant>")

If the Airflow operator accepts a tenant parameter but does not apply it to the
actual Weaviate collection operation, the provider does not honor the user's
multi-tenancy boundary.

This can lead to confusing and risky behavior:

  • A Dag author sets tenant on the operator and expects data to be written into that tenant.
  • The ingest task appears successful, but tenant-scoped verification cannot find the object.
  • Document replacement logic may query or delete objects outside the intended
    tenant scope.
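
The scoping rule described above can be sketched as a single helper that resolves the handle once, so that every later operation goes through it. This is only an illustration, not the provider's code: `scoped_collection` is a hypothetical name, the only Weaviate call assumed is `collection.with_tenant(...)`, and a `MagicMock` stands in for a real collection so the snippet runs without a Weaviate server.

```python
from __future__ import annotations

from typing import Any
from unittest.mock import MagicMock


def scoped_collection(collection: Any, tenant: str | None = None) -> Any:
    """Return the tenant-scoped handle when a tenant is given, else the base collection.

    Hypothetical helper for illustration only; `with_tenant` is the
    collection method named in this PR description.
    """
    if tenant is None:
        return collection
    return collection.with_tenant(tenant)


# A MagicMock stands in for a real Weaviate collection object.
base = MagicMock(name="collection")

# No tenant configured: the base collection is returned unchanged.
untenanted = scoped_collection(base)

# Tenant configured: every operation goes through the with_tenant() handle.
tenanted = scoped_collection(base, "tenant-a")
tenanted.query.fetch_objects(limit=10)
```

Resolving the handle in one place makes it harder for an individual create, query, or delete call to fall back to the base collection by accident.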

Reproduction

I reproduced the issue with a small Dag run from the Airflow UI.

The collection is multi-tenant, the ingest operator receives
tenant="tenant-a", and the verification task reads from the tenant-scoped collection.

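from airflow.sdk import task  # on Airflow 2.x: from airflow.decorators import task
from airflow.providers.weaviate.hooks.weaviate import WeaviateHook
from airflow.providers.weaviate.operators.weaviate import WeaviateIngestOperator
from weaviate.classes.config import Configure

COLLECTION_NAME = "AirflowTenantRepro"
TENANT_NAME = "tenant-a"

# Assumed setup, not shown in the original snippet: a hook on the same connection
# as the operator. SAMPLE_DATA (the rows to ingest) is defined elsewhere in the Dag.
hook = WeaviateHook(conn_id="weaviate_default")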


@task
def create_collection():
    hook.create_collection(
        COLLECTION_NAME,
        vectorizer_config=None,
        multi_tenancy_config=Configure.multi_tenancy(
            enabled=True,
            auto_tenant_creation=True,
        ),
    )


ingest_with_tenant = WeaviateIngestOperator(
    task_id="ingest_with_tenant",
    conn_id="weaviate_default",
    collection_name=COLLECTION_NAME,
    input_data=SAMPLE_DATA,
    tenant=TENANT_NAME,
)


@task
def verify_tenant_data():
    collection = hook.get_collection(COLLECTION_NAME).with_tenant(TENANT_NAME)
    response = collection.query.fetch_objects(limit=10)

    if not response.objects:
        raise RuntimeError("Expected object was not found in tenant")

Before this fix, the ingest task completed successfully, but the tenant-scoped verification task failed because the expected object was not found in tenant-a.

[Screenshot, 2026-05-22 1:01:33 AM]

Changes

This change makes the configured tenant flow through the provider consistently:

  • Passes tenant from WeaviateIngestOperator to WeaviateHook.batch_data().
  • Passes tenant from WeaviateDocumentIngestOperator to WeaviateHook.create_or_replace_document_objects().
  • Applies collection.with_tenant(tenant) inside WeaviateHook.batch_data() before batch insertion.
  • Applies tenant scoping to document ingestion paths, including existing-document lookup, replace/delete, final batch insert, rollback delete, and verbose aggregate query.
  • Adds optional tenant support to delete_object() and _delete_objects() so cleanup and rollback operations stay within the same tenant scope.
  • Adds unit test coverage for operator-to-hook tenant handoff and hook-level tenant collection usage.
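
As a rough illustration of the hook-side behavior listed above, the sketch below applies the tenant before inserting and before a rollback delete. Everything in it is an assumption made for the example: `FakeCollection` is an in-memory stand-in (unlike a real multi-tenant collection, it accepts writes with no tenant), and `batch_data_sketch` / `delete_objects_sketch` are hypothetical functions that only loosely mirror the hook methods named in this PR.

```python
from __future__ import annotations

from collections import defaultdict


class FakeCollection:
    """In-memory stand-in for a multi-tenant collection (not the Weaviate client)."""

    def __init__(self, store=None, tenant: str | None = None):
        # One shared store keyed by tenant name; None means the base collection.
        self._store = store if store is not None else defaultdict(dict)
        self.tenant = tenant

    def with_tenant(self, tenant: str) -> "FakeCollection":
        # Same underlying store, but reads and writes are limited to one tenant.
        return FakeCollection(self._store, tenant)

    def insert(self, uuid: str, properties: dict) -> None:
        self._store[self.tenant][uuid] = properties

    def delete(self, uuid: str) -> None:
        self._store[self.tenant].pop(uuid, None)

    def uuids(self) -> list[str]:
        return sorted(self._store[self.tenant])


def batch_data_sketch(collection, data, tenant=None):
    """Insert rows through the tenant-scoped handle when a tenant is set."""
    target = collection.with_tenant(tenant) if tenant else collection
    for row in data:
        target.insert(row["id"], row)


def delete_objects_sketch(collection, uuids, tenant=None):
    """Delete within the same tenant scope that was used for the insert."""
    target = collection.with_tenant(tenant) if tenant else collection
    for uuid in uuids:
        target.delete(uuid)


collection = FakeCollection()
# Another tenant already holds an object with the same id.
collection.with_tenant("tenant-b").insert("1", {"id": "1", "text": "other tenant"})

rows = [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]
batch_data_sketch(collection, rows, tenant="tenant-a")

# A rollback of object "1" must not touch tenant-b's object with the same id.
delete_objects_sketch(collection, ["1"], tenant="tenant-a")
```

In this toy model, the delete removes object "1" only from tenant-a, which is the property the tenant-aware `delete_object()` and `_delete_objects()` changes aim to preserve.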

Result

After the fix, the same reproduction Dag succeeds end to end in the Airflow UI:

  • create_collection: success
  • ingest_with_tenant: success
  • verify_tenant_data: success
  • cleanup_collection: success
[Screenshot, 2026-05-22 2:05:54 AM]

Tests

I ran the relevant Weaviate provider tests with Breeze:

breeze run pytest providers/weaviate/tests/unit/weaviate/operators/test_weaviate.py providers/weaviate/tests/unit/weaviate/hooks/test_weaviate.py -xvs

Result:

54 passed
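
For readers who want to see the general shape of an operator-to-hook handoff test, here is a self-contained sketch. It does not reproduce the tests in this PR: `IngestOperatorSketch` is a hypothetical stand-in for the operator, the hook is a `MagicMock`, and the keyword names passed to `batch_data` are assumptions chosen for the example.

```python
from unittest.mock import MagicMock


class IngestOperatorSketch:
    """Hypothetical stand-in for an ingest operator; not the provider class."""

    def __init__(self, collection_name, input_data, tenant=None, hook=None):
        self.collection_name = collection_name
        self.input_data = input_data
        self.tenant = tenant
        self.hook = hook

    def execute(self):
        # The point under test: the configured tenant is forwarded to the hook.
        return self.hook.batch_data(
            collection_name=self.collection_name,
            data=self.input_data,
            tenant=self.tenant,
        )


def test_tenant_is_forwarded_to_hook():
    hook = MagicMock(name="hook")
    operator = IngestOperatorSketch(
        collection_name="AirflowTenantRepro",
        input_data=[{"title": "x"}],
        tenant="tenant-a",
        hook=hook,
    )
    operator.execute()
    # Fails if the operator drops the tenant on the way to the hook.
    hook.batch_data.assert_called_once_with(
        collection_name="AirflowTenantRepro",
        data=[{"title": "x"}],
        tenant="tenant-a",
    )
    return hook


checked_hook = test_tenant_is_forwarded_to_hook()
```

A test of this shape would have failed before the fix if the operator had called the hook without the `tenant` argument.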

Was generative AI tooling used to co-author this PR?
  • Yes - Codex (GPT-5)

@potiuk added the "ready for maintainer review" label May 24, 2026
@kevinhongzl (Contributor) commented

LGTM
