Automatically update BigQuery table schema when unknown fields are seen. #38058
reuvenlax wants to merge 1 commit into apache:master
Summary of Changes
This pull request introduces a significant enhancement to the BigQuery Storage Write API connector, allowing it to automatically adapt to evolving data schemas. When data with new fields or relaxed field constraints arrives, the system will now detect these changes, buffer the affected records, and dynamically update the BigQuery table's schema. This capability greatly simplifies data ingestion pipelines by removing the need for manual schema management in response to minor data evolution, ensuring data flow continues uninterrupted.
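The merge step described above (add unknown fields, relax constraints) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's UpgradeTableSchema code: the class name SchemaMergeSketch, the Mode enum, and the flat name-to-mode map are all hypothetical, and nested or repeated fields are ignored. It relies only on the fact that BigQuery permits adding columns and relaxing REQUIRED to NULLABLE.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a flat schema is modeled as field name -> mode.
// Field names are assumed to be lowercased already.
final class SchemaMergeSketch {
  enum Mode { REQUIRED, NULLABLE }

  // Returns the table schema widened just enough to accept the incoming schema.
  static Map<String, Mode> merge(Map<String, Mode> table, Map<String, Mode> incoming) {
    Map<String, Mode> merged = new LinkedHashMap<>(table);
    for (Map.Entry<String, Mode> entry : incoming.entrySet()) {
      Mode existing = merged.get(entry.getKey());
      if (existing == null) {
        // Unknown field: add it as NULLABLE so rows already in the table stay valid.
        merged.put(entry.getKey(), Mode.NULLABLE);
      } else if (existing == Mode.REQUIRED && entry.getValue() == Mode.NULLABLE) {
        // Relaxed constraint: REQUIRED may become NULLABLE, never the reverse.
        merged.put(entry.getKey(), Mode.NULLABLE);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Mode> table = new LinkedHashMap<>();
    table.put("id", Mode.REQUIRED);
    table.put("name", Mode.REQUIRED);

    Map<String, Mode> incoming = new LinkedHashMap<>();
    incoming.put("id", Mode.REQUIRED);
    incoming.put("name", Mode.NULLABLE);
    incoming.put("email", Mode.REQUIRED);

    // "name" is relaxed and "email" is added as NULLABLE; "id" is unchanged.
    System.out.println(merge(table, incoming));
  }
}
```

Note that the merge never tightens a field or removes one, so records written against the old schema remain valid after the update.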
Code Review
This pull request implements automatic schema evolution for the BigQuery Storage Write API. It introduces a mechanism to detect schema mismatches using hashing and handles missing field errors by automatically patching the BigQuery table schema when permitted. Key changes include the UpgradeTableSchema utility for schema merging and a stateful buffering mechanism in StorageApiConvertMessages to hold records during schema updates. The review feedback recommends replacing System.err calls with proper logging, optimizing object allocations in loops, removing redundant semicolons, and using a fixed charset for hashing to ensure consistency across environments.
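One way to picture the hash-based mismatch detection mentioned in the review is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the Field class, SchemaFingerprintSketch, and the use of the JDK's CRC32 over sorted names are all hypothetical stand-ins for whatever hash function and combination strategy the actual change uses. The dotted, lowercased field names mirror the naming visible in the reviewed snippet.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.zip.CRC32;

// Hypothetical sketch of an order-independent schema fingerprint.
final class SchemaFingerprintSketch {
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, Field... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }
  }

  // Collects dotted, lowercased names such as "address.zip" for every field.
  static SortedSet<String> fieldNames(List<Field> fields) {
    SortedSet<String> names = new TreeSet<>();
    collect("", fields, names);
    return names;
  }

  private static void collect(String prefix, List<Field> fields, SortedSet<String> out) {
    for (Field field : fields) {
      String name = prefix.isEmpty() ? field.name : prefix + "." + field.name;
      out.add(name.toLowerCase(Locale.ROOT));
      collect(name, field.children, out);
    }
  }

  // Hashes the sorted names, so field order and letter case do not affect the result.
  static long fingerprint(List<Field> fields) {
    CRC32 crc = new CRC32();
    for (String name : fieldNames(fields)) {
      crc.update(name.getBytes(StandardCharsets.UTF_8));
      crc.update(0); // separator so that ["ab", "c"] and ["a", "bc"] hash differently
    }
    return crc.getValue();
  }

  public static void main(String[] args) {
    List<Field> table = Arrays.asList(new Field("id"), new Field("address", new Field("zip")));
    List<Field> reordered = Arrays.asList(new Field("Address", new Field("ZIP")), new Field("ID"));
    List<Field> evolved =
        Arrays.asList(new Field("id"), new Field("address", new Field("zip")), new Field("email"));

    System.out.println("reordered matches: " + (fingerprint(table) == fingerprint(reordered)));
    System.out.println("evolved field names: " + fieldNames(evolved));
  }
}
```

A fingerprint mismatch would only signal that the schemas may differ; deciding which fields to add would still require comparing the schemas themselves.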
ElementT value = element.getValue();
boolean needsNewTimer = false;
if (value != null) {
System.err.println("BUFFERING ELEMENT");

boolean schemaOutOfDate = false;
do {
try {
System.err.println("TRYING TO PATCH TO " + updatedSchema);

}
return;
} catch (IOException e) {
ApiErrorExtractor errorExtractor = new ApiErrorExtractor();

collectors.clear();
;

this.isStruct = isStruct;
;

prefix.isEmpty()
? tableFieldSchema.getName()
: String.join(".", prefix, tableFieldSchema.getName());
hashCodes.add(SCHEMA_HASH_FUNCTION.hashString(name.toLowerCase(), Charset.defaultCharset()));
Using Charset.defaultCharset() can lead to inconsistent results across different environments. It is safer to use a fixed charset like StandardCharsets.UTF_8 for hashing.
Suggested change:
- hashCodes.add(SCHEMA_HASH_FUNCTION.hashString(name.toLowerCase(), Charset.defaultCharset()));
+ hashCodes.add(SCHEMA_HASH_FUNCTION.hashString(name.toLowerCase(), StandardCharsets.UTF_8));
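To illustrate why the charset matters here, the sketch below hashes the same field name under two encodings. It is a hypothetical stand-in, not the PR's code: FieldNameHashSketch and hashName are invented names, and SHA-256 from the JDK replaces the hash function used in the change so the example runs without extra dependencies. It also passes Locale.ROOT to toLowerCase, since the no-argument form depends on the JVM's default locale; that is an additional observation, not part of the review comment.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for the field-name hashing discussed above.
final class FieldNameHashSketch {
  static byte[] hashName(String name, Charset charset) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      // The charset decides which bytes are fed to the hash function.
      return digest.digest(name.toLowerCase(Locale.ROOT).getBytes(charset));
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform, so this is not expected.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String ascii = "user.id";
    String nonAscii = "caf\u00e9.prix"; // "café.prix"

    // ASCII names encode identically in both charsets, so the hashes agree.
    System.out.println(Arrays.equals(
        hashName(ascii, StandardCharsets.UTF_8),
        hashName(ascii, StandardCharsets.ISO_8859_1))); // prints true

    // A non-ASCII name encodes to different bytes, so the hashes diverge.
    System.out.println(Arrays.equals(
        hashName(nonAscii, StandardCharsets.UTF_8),
        hashName(nonAscii, StandardCharsets.ISO_8859_1))); // prints false
  }
}
```

Two workers whose default charsets differ would hit the second case for any non-ASCII field name, which is the inconsistency a fixed charset avoids.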
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
No description provided.