feat: add VectorChord benchmark support #745

Open
R3gardless wants to merge 14 commits into zilliztech:main from R3gardless:feat/add-vectorchord-benchmark

Conversation

@R3gardless
Contributor

@R3gardless R3gardless commented Apr 2, 2026

Summary

  • Add VectorChord (vchord) PostgreSQL extension as a new benchmark target
  • Support two index types: vchordrq (IVF + RaBitQ) and vchordg (DiskANN-style graph)
  • Follow existing pgvector implementation patterns

Supported Features

Index Types

  • vchordrq: lists, probes, epsilon, residual_quantization, spherical_centroids, build_threads, degree_of_parallelism, rerank_in_table, max_scan_tuples
  • vchordg: m, ef_construction, bits, ef_search, beam_search, max_scan_tuples

Quantization: vector, halfvec, rabitq8, rabitq4

Metrics: L2, IP, COSINE (operator class auto-mapped per quantization type)

PostgreSQL Tuning: max_parallel_workers, max_parallel_maintenance_workers

CLI

vectordbbench vectorchordrq --help
vectordbbench vectorchordgraph --help

References

- Introduced VectorChord as a new database type in the DB enum.
- Added VCHORDRQ as a new index type in the IndexType enum.
- Introduced VectorChord client with support for embedding operations.
- Added configuration classes for VectorChord settings and parameters.
Integrate VectorChordRQ into the CLI for enhanced functionality. This addition allows users to utilize VectorChord in their benchmarks seamlessly. 🚀
Updated installation instructions and added command line usage for VectorChord (vchordrq) to enhance user experience. 🚀
- Introduced VectorChordGraph command to CLI for enhanced functionality.
- Added quantization and reranking options to VectorChord configurations.
@R3gardless
Contributor Author

/assign @XuanYang-cn

@XuanYang-cn
Collaborator

@R3gardless Thanks so much for this contribution — really appreciate the work you put in! One thing to flag: we upgraded to Pydantic v2 recently, and since your PR still uses v1 syntax, merging it would break the client. Could you sync with the latest main and update the Pydantic code to v2? Feel free to ping me if anything's unclear.

Enhanced VectorChordGraph with max_scan_tuples option for better control over tuple scanning. Updated configuration and CLI to support this new feature! 🚀
Updated the user_name attribute in VectorChordConfig to directly assign the string "postgres" instead of wrapping it in SecretStr. This simplifies the configuration and ensures proper handling of the default username. 🚀
…aphConfig ✨🔧

Updated the comment for max_scan_tuples to ensure clarity on its default value and range. This enhances code readability and maintainability. 🚀
@R3gardless
Contributor Author

Hi @XuanYang-cn, I've addressed the PR review comments.

Thanks!

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: R3gardless, XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Collaborator

@XuanYang-cn XuanYang-cn left a comment

Thanks for adding VectorChord support! The overall structure follows the existing client pattern well. A few issues to address before merge:



class VectorChordConfig(DBConfig):
    user_name: SecretStr = "postgres"

Collaborator

Critical: With pydantic v2, assigning a plain string default to a SecretStr-typed field stores it as str, not SecretStr. Then to_dict() calls self.user_name.get_secret_value() unconditionally, which will raise AttributeError: 'str' object has no attribute 'get_secret_value' whenever the default is used.

Fix: either change the default to SecretStr("postgres"), or add an isinstance guard in to_dict() like the pgvector client does:

user_str = self.user_name.get_secret_value() if isinstance(self.user_name, SecretStr) else self.user_name
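
A minimal sketch of the guard suggested above, not the actual client code. The helper name `secret_or_plain` is hypothetical, and the fallback class is only a stand-in so the snippet also runs where pydantic is not installed.

```python
try:
    from pydantic import SecretStr
except ImportError:  # stand-in so the sketch runs without pydantic
    class SecretStr:
        """Tiny stand-in exposing only get_secret_value()."""

        def __init__(self, value: str) -> None:
            self._value = value

        def get_secret_value(self) -> str:
            return self._value


def secret_or_plain(value) -> str:
    """Return the underlying string for a SecretStr or a plain str."""
    if isinstance(value, SecretStr):
        return value.get_secret_value()
    return value


# Both shapes of the field resolve to the same plain string.
wrapped = secret_or_plain(SecretStr("postgres"))
plain = secret_or_plain("postgres")
```

The guard tolerates an unvalidated default, while switching the default to SecretStr("postgres") removes the need for it.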

Contributor Author

@R3gardless R3gardless Apr 10, 2026

Thanks! I'll apply SecretStr("postgres") as the default.

class VectorChord(VectorDB):
    """Use psycopg instructions"""

    conn: psycopg.Connection[Any] | None = None

Collaborator

Important — missing thread_safe = False.

psycopg connections are not thread-safe, but VectorChord inherits the default thread_safe = True from VectorDB. The pgvector client explicitly sets thread_safe = False. Without this, MPRunner may share connections across threads, risking data corruption under concurrent benchmarks.

Suggested fix — add before the conn declaration:

thread_safe = False
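
As a rough illustration of how the override behaves, assuming the flag is an ordinary class attribute read by the runner. Both class names below are hypothetical stand-ins, not the framework's real classes.

```python
class VectorDBBase:
    """Hypothetical stand-in for the framework base class."""

    thread_safe: bool = True  # default inherited by every client


class VectorChordSketch(VectorDBBase):
    """Hypothetical client that opts out of cross-thread sharing."""

    thread_safe: bool = False  # declared before the connection attribute
    conn = None


client = VectorChordSketch()
```

Because the attribute lives on the class, it can be checked without creating a connection.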

Contributor Author

Thanks!

log.warning(f"Failed to insert data into vectorchord table ({self.table_name}), error: {e}")
return 0, e

def search_embedding(

Collaborator

Note: The filters parameter here is never passed by the benchmark framework. The framework calls prepare_filter() before the search loop to let the client pre-configure its query, but VectorChord doesn't override prepare_filter(), so _filtered_search is effectively dead code. Also supported_filter_types is not declared, defaulting to [NonFilter] only.

If filtered benchmarks are intended, prepare_filter() needs to be implemented (see pgvector's pattern). Otherwise, a code comment noting the intentional deferral would be helpful.
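
One possible shape for the pattern described above: fix the WHERE clause once, before the search loop, then reuse the assembled query. Every name here (`FilteredSearchSketch`, `min_id`, `build_query`, the `items` table) is a hypothetical placeholder and does not mirror the framework's or pgvector client's actual signatures.

```python
from typing import Optional


class FilteredSearchSketch:
    """Hypothetical client fragment showing a precomputed filter."""

    def __init__(self, table: str = "items") -> None:
        self.table = table
        self.where_clause = ""  # empty means unfiltered search

    def prepare_filter(self, min_id: Optional[int] = None) -> None:
        """Called once before the search loop to fix the WHERE clause."""
        if min_id is None:
            self.where_clause = ""
        else:
            self.where_clause = f"WHERE id >= {int(min_id)}"

    def build_query(self) -> str:
        """Assemble the query text reused for every search call."""
        parts = [
            f"SELECT id FROM {self.table}",
            self.where_clause,
            "ORDER BY embedding <-> %s LIMIT %s",
        ]
        return " ".join(p for p in parts if p)


sketch = FilteredSearchSketch()
unfiltered = sketch.build_query()
sketch.prepare_filter(min_id=100)
filtered = sketch.build_query()
```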

Contributor Author

Thanks! I implemented prepare_filter and the _generate_search_query function to support filtered benchmarks as well.

for setting_name, setting_val in session_options.items():
    command = sql.SQL("SET {setting_name} " + "= {setting_val};").format(
        setting_name=sql.Identifier(setting_name),
        setting_val=sql.Identifier(str(setting_val)),

Collaborator

Nit: sql.Identifier() wraps the value in double quotes, producing SET "vchordrq.probes" = "10". For PostgreSQL GUC values, sql.Literal() (single-quoted string literal) would be more semantically correct, though double-quoted identifiers happen to work for numeric values in practice.

(This pattern is inherited from the pgvector client, so not a blocker.)
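
To make the quoting difference concrete, here is a simplified stand-in that mimics the two quoting styles with plain string functions. These helpers are illustrative assumptions, not psycopg's implementation, and they ignore edge cases such as backslash handling.

```python
def quote_identifier(name: str) -> str:
    """Double-quote a name the way SQL identifiers are quoted."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: object) -> str:
    """Single-quote a value the way SQL string literals are quoted."""
    return "'" + str(value).replace("'", "''") + "'"


name = quote_identifier("vchordrq.probes")

# Value quoted as an identifier (the inherited pattern).
as_identifier = f"SET {name} = {quote_identifier('10')};"

# Value quoted as a literal (the suggested alternative).
as_literal = f"SET {name} = {quote_literal('10')};"
```

In real code the quoting should stay with sql.Identifier and sql.Literal rather than hand-written helpers.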

Contributor Author

Using sql.Literal instead, thanks!

metric_type: MetricType | None = None
create_index_before_load: bool = False
create_index_after_load: bool = True
quantization_type: str = "vector" # vector, halfvec, rabitq8, rabitq4

Collaborator

Suggestion: This field accepts any string but only "vector", "halfvec", "rabitq8", "rabitq4" are valid. The value flows into sql.SQL(col_type) in _create_table which bypasses escaping. Consider constraining with:

quantization_type: Literal["vector", "halfvec", "rabitq8", "rabitq4"] = "vector"
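
A standard-library sketch of the same constraint, useful where a value must be checked by hand before it reaches raw SQL text. The alias `QuantizationType` and the function `validate_quantization_type` are hypothetical names introduced for illustration.

```python
from typing import Literal, get_args

QuantizationType = Literal["vector", "halfvec", "rabitq8", "rabitq4"]


def validate_quantization_type(value: str) -> str:
    """Return value unchanged if it is one of the allowed type names."""
    allowed = get_args(QuantizationType)
    if value not in allowed:
        raise ValueError(f"unsupported quantization_type: {value!r}")
    return value


checked = validate_quantization_type("halfvec")
```

With the field typed as a Literal on a pydantic model, the model itself rejects other strings, so a manual check like this is only needed outside the model.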

Contributor Author

@R3gardless R3gardless Apr 10, 2026

I constrained quantization_type using Literal.

else:
    with_clause = sql.SQL(";")

full_sql = (index_create_sql + with_clause).join(" ")

Collaborator

Bug: sql.Composed.join(" ") interposes a space between every internal part of the composed SQL (not just between index_create_sql and with_clause). This produces extra whitespace like ON public. "table_name". PostgreSQL tolerates extra whitespace so it likely works, but the intent seems to be joining the two halves. Consider:

full_sql = index_create_sql + sql.SQL(" ") + with_clause
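
The whitespace effect can be reproduced with plain lists of strings standing in for the composed SQL parts. This is only an analogy under that assumption; the fragment values are illustrative and psycopg itself is not used.

```python
# Stand-ins for the internal parts of the two composed halves.
index_parts = ["ON ", "public", ".", '"table_name"']
with_parts = [";"]

# Interposing a space between every part, as the join call does.
spaced_everywhere = " ".join(index_parts + with_parts)

# Joining only the two halves with a single space.
two_halves = "".join(index_parts) + " " + "".join(with_parts)
```

The first form yields `ON  public . "table_name" ;`, while the second keeps the qualified name intact as `ON public."table_name" ;`.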

Contributor Author

Thanks for the suggestion! I applied it.

@staticmethod
def _create_connection(**kwargs) -> tuple[Connection, Cursor]:
    conn = psycopg.connect(**kwargs)
    conn.cursor().execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")

Collaborator

Minor: This creates a temporary cursor that is never explicitly closed. Consider reusing the main cursor or closing it:

cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")
cur.close()
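
An alternative sketch that relies on the cursor's context-manager support, which psycopg 3 cursors provide, so the cursor is closed even if execute raises. The function name `ensure_extension` is hypothetical, and `conn` is assumed to be an open psycopg connection.

```python
def ensure_extension(conn) -> None:
    """Create the vchord extension with a cursor that is closed on exit."""
    # Leaving the with-block closes the temporary cursor.
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")
```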

Contributor Author

I referred to the pgvector.py source code, which returns the main cursor. It will be closed in the main logic. Thanks.

…onality 🎉✨

- Updated quantization_type to use Literal for better type validation.
- Refactored search methods to streamline query generation and filtering.
- Added support for dynamic where clauses in search queries. 🔍
Updated the user_name initialization to use SecretStr for improved security. This change ensures that sensitive information is handled properly. 🚀
This update ensures that the vectorchord extension is created if it doesn't already exist when establishing a connection. This enhances the setup process for the VectorChord class. 🚀
Cleaned up the code by removing extra whitespace for better readability. Ensured that the connection and cursor assertions remain clear and concise.

4 participants