feat: add VectorChord benchmark support #745
R3gardless wants to merge 14 commits into zilliztech:main
Conversation
- Introduced VectorChord as a new database type in the DB enum. - Added VCHORDRQ as a new index type in the IndexType enum.
- Introduced VectorChord client with support for embedding operations. - Added configuration classes for VectorChord settings and parameters.
Integrate VectorChordRQ into the CLI for enhanced functionality. This addition allows users to utilize VectorChord in their benchmarks seamlessly. 🚀
Updated installation instructions and added command line usage for VectorChord (vchordrq) to enhance user experience. 🚀
- Introduced VectorChordGraph command to CLI for enhanced functionality. - Added quantization and reranking options to VectorChord configurations.
/assign @XuanYang-cn
@R3gardless Thanks so much for this contribution — really appreciate the work you put in! One thing to flag: we upgraded to Pydantic v2 recently, and since your PR still uses v1 syntax, merging it would break the client. Could you sync with the latest main and update the Pydantic code to v2? Feel free to ping me if anything's unclear.
Enhanced VectorChordGraph with max_scan_tuples option for better control over tuple scanning. Updated configuration and CLI to support this new feature! 🚀
Updated the user_name attribute in VectorChordConfig to directly assign the string "postgres" instead of wrapping it in SecretStr. This simplifies the configuration and ensures proper handling of the default username. 🚀
…aphConfig ✨🔧 Updated the comment for max_scan_tuples to ensure clarity on its default value and range. This enhances code readability and maintainability. 🚀
Hi @XuanYang-cn, I've addressed the PR review comments. Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: R3gardless, XuanYang-cn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
XuanYang-cn left a comment:
Thanks for adding VectorChord support! The overall structure follows the existing client pattern well. A few issues to address before merge:
```python
class VectorChordConfig(DBConfig):
    user_name: SecretStr = "postgres"
```
Critical: With pydantic v2, assigning a plain string default to a SecretStr-typed field stores it as str, not SecretStr. Then to_dict() calls self.user_name.get_secret_value() unconditionally, which will raise AttributeError: 'str' object has no attribute 'get_secret_value' whenever the default is used.
Fix: either change the default to SecretStr("postgres"), or add an isinstance guard in to_dict() like the pgvector client does:
```python
user_str = self.user_name.get_secret_value() if isinstance(self.user_name, SecretStr) else self.user_name
```
Thanks! I'll apply SecretStr("postgres") as the default
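A minimal sketch of the failure mode discussed above (requires pydantic; the model names `BrokenConfig` and `FixedConfig` are hypothetical, not the PR's classes). Pydantic does not validate field defaults unless `validate_default` is enabled, so a plain-string default stays a `str`:

```python
from pydantic import BaseModel, SecretStr


class BrokenConfig(BaseModel):
    # Default is NOT validated/coerced, so this is stored as a plain str,
    # and str has no .get_secret_value() method.
    user_name: SecretStr = "postgres"


class FixedConfig(BaseModel):
    # Wrapping the default makes the attribute a real SecretStr.
    user_name: SecretStr = SecretStr("postgres")


print(type(BrokenConfig().user_name).__name__)        # str
print(FixedConfig().user_name.get_secret_value())     # postgres
```

With `BrokenConfig`, any code that calls `get_secret_value()` unconditionally raises `AttributeError` whenever the default is used, which is exactly the `to_dict()` bug flagged above.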
```python
class VectorChord(VectorDB):
    """Use psycopg instructions"""

    conn: psycopg.Connection[Any] | None = None
```
Important — missing thread_safe = False.
psycopg connections are not thread-safe, but VectorChord inherits the default thread_safe = True from VectorDB. The pgvector client explicitly sets thread_safe = False. Without this, MPRunner may share connections across threads, risking data corruption under concurrent benchmarks.
Suggested fix — add before the conn declaration:
```python
thread_safe = False
```

```python
        log.warning(f"Failed to insert data into vectorchord table ({self.table_name}), error: {e}")
        return 0, e

    def search_embedding(
```
Note: The filters parameter here is never passed by the benchmark framework. The framework calls prepare_filter() before the search loop to let the client pre-configure its query, but VectorChord doesn't override prepare_filter(), so _filtered_search is effectively dead code. Also supported_filter_types is not declared, defaulting to [NonFilter] only.
If filtered benchmarks are intended, prepare_filter() needs to be implemented (see pgvector's pattern). Otherwise, a code comment noting the intentional deferral would be helpful.
Thanks! I implemented prepare_filter and a _generate_search_query function to support the filtering benchmark as well.
```python
for setting_name, setting_val in session_options.items():
    command = sql.SQL("SET {setting_name} " + "= {setting_val};").format(
        setting_name=sql.Identifier(setting_name),
        setting_val=sql.Identifier(str(setting_val)),
```
Nit: sql.Identifier() wraps the value in double quotes, producing SET "vchordrq.probes" = "10". For PostgreSQL GUC values, sql.Literal() (single-quoted string literal) would be more semantically correct, though double-quoted identifiers happen to work for numeric values in practice.
(This pattern is inherited from the pgvector client, so not a blocker.)
Using sql.Literal instead, thanks!
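To make the distinction concrete without needing psycopg installed, these two helpers mimic what `sql.Identifier` and `sql.Literal` render for a string value:

```python
def quote_identifier(s: str) -> str:
    # PostgreSQL identifier quoting (double quotes), as sql.Identifier emits.
    return '"' + s.replace('"', '""') + '"'


def quote_literal(s: str) -> str:
    # PostgreSQL string-literal quoting (single quotes), as sql.Literal emits
    # for a str value.
    return "'" + s.replace("'", "''") + "'"


# Identifier-quoting the value produces: SET "vchordrq.probes" = "10";
print(f"SET {quote_identifier('vchordrq.probes')} = {quote_identifier('10')};")
# Literal-quoting the value produces the semantically correct form:
print(f"SET {quote_identifier('vchordrq.probes')} = {quote_literal('10')};")
```

As the review notes, the double-quoted form happens to be accepted for numeric GUC values, but a single-quoted literal matches what `SET` actually expects on the right-hand side.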
```python
metric_type: MetricType | None = None
create_index_before_load: bool = False
create_index_after_load: bool = True
quantization_type: str = "vector"  # vector, halfvec, rabitq8, rabitq4
```
Suggestion: This field accepts any string but only "vector", "halfvec", "rabitq8", "rabitq4" are valid. The value flows into sql.SQL(col_type) in _create_table which bypasses escaping. Consider constraining with:
```python
quantization_type: Literal["vector", "halfvec", "rabitq8", "rabitq4"] = "vector"
```
I constrained quantization type using Literal
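A small sketch of what the `Literal` constraint buys (requires pydantic; the model name `IndexParamSketch` is hypothetical). An out-of-range string is rejected at validation time, before it could ever reach `sql.SQL(col_type)`:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class IndexParamSketch(BaseModel):
    quantization_type: Literal["vector", "halfvec", "rabitq8", "rabitq4"] = "vector"


print(IndexParamSketch(quantization_type="halfvec").quantization_type)  # halfvec

try:
    # An arbitrary string is rejected up front instead of flowing into SQL.
    IndexParamSketch(quantization_type="vector; DROP TABLE items")
except ValidationError:
    print("rejected")
```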
```python
        else:
            with_clause = sql.SQL(";")

        full_sql = (index_create_sql + with_clause).join(" ")
```
Bug: sql.Composed.join(" ") interposes a space between every internal part of the composed SQL (not just between index_create_sql and with_clause). This produces extra whitespace like ON public. "table_name". PostgreSQL tolerates extra whitespace so it likely works, but the intent seems to be joining the two halves. Consider:
```python
full_sql = index_create_sql + sql.SQL(" ") + with_clause
```
Thanks for suggestion! I applied it.
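A plain-Python analogy of the bug, with no psycopg needed: a `Composed` behaves roughly like a list of SQL fragments, and `.join(sep)` interposes `sep` between *every* fragment, not just between the two composed halves. The fragment contents below are illustrative:

```python
# Fragments as a .format()-built Composed might hold them internally.
index_create_parts = ['CREATE INDEX ON ', 'public', '.', '"tbl"', ' USING vchordrq (embedding)']
with_clause_parts = ['WITH (options = $$residual_quantization = true$$);']

# Analog of (index_create_sql + with_clause).join(" "):
# a space lands between every fragment, including around the schema dot.
joined = " ".join(index_create_parts + with_clause_parts)

# Analog of index_create_sql + sql.SQL(" ") + with_clause:
# only the intended boundary between the two halves gets a space.
concatenated = "".join(index_create_parts) + " " + "".join(with_clause_parts)

print(joined)        # ...ON  public . "tbl"  USING... (stray whitespace)
print(concatenated)  # ...ON public."tbl" USING...
```

As the review says, PostgreSQL tolerates the stray whitespace, so the original code happens to work; the concatenated form just matches the author's intent.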
```python
    @staticmethod
    def _create_connection(**kwargs) -> tuple[Connection, Cursor]:
        conn = psycopg.connect(**kwargs)
        conn.cursor().execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")
```
Minor: This creates a temporary cursor that is never explicitly closed. Consider reusing the main cursor or closing it:
```python
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vchord CASCADE")
cur.close()
```
I referred to the pgvector.py source code, which returns the main cursor; it is closed in the main logic. Thanks.
…onality 🎉✨ - Updated quantization_type to use Literal for better type validation. - Refactored search methods to streamline query generation and filtering. - Added support for dynamic where clauses in search queries. 🔍
Updated the user_name initialization to use SecretStr for improved security. This change ensures that sensitive information is handled properly. 🚀
This update ensures that the vectorchord extension is created if it doesn't already exist when establishing a connection. This enhances the setup process for the VectorChord class. 🚀
Cleaned up the code by removing extra whitespace for better readability. Ensured that the connection and cursor assertions remain clear and concise.
Summary
- Adds the VectorChord (vchord) PostgreSQL extension as a new benchmark target
- Index types: vchordrq (IVF + RaBitQ) and vchordg (DiskANN-style graph)

Supported Features
Index Types
- vchordrq: lists, probes, epsilon, residual_quantization, spherical_centroids, build_threads, degree_of_parallelism, rerank_in_table, max_scan_tuples
- vchordg: m, ef_construction, bits, ef_search, beam_search, max_scan_tuples

Quantization: vector, halfvec, rabitq8, rabitq4
Metrics: L2, IP, COSINE (operator class auto-mapped per quantization type)
PostgreSQL Tuning: max_parallel_workers, max_parallel_maintenance_workers
CLI
References