feat: add Azure Blob Storage support #5803
Conversation
Add Azure Blob Storage as a fourth storage backend alongside local, S3, and GCS. This enables users to store files (uploads, document loaders, chat attachments) in Azure Blob Storage.

Changes:
- Add Azure branches to all 11 storage functions in storageUtils.ts
- Add Azure Blob Storage credential (connection string or account/key)
- Add Azure Blob File document loader node
- Add Azure multer storage engine for file uploads
- Add @azure/storage-blob dependency
- Update .env.example with Azure configuration variables
- Add tests for Azure Blob Storage configuration and credential
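The backend selection described above can be illustrated with a small sketch. This is hypothetical code, not the PR's actual implementation: the helper name `resolveStorageType` and the fallback to local storage are assumptions.

```typescript
// Hedged sketch of backend selection from the STORAGE_TYPE env var.
// Falling back to 'local' for unset or unknown values is an assumption.
type StorageType = 'local' | 's3' | 'gcs' | 'azure'

function resolveStorageType(value: string | undefined): StorageType {
    switch ((value ?? '').toLowerCase()) {
        case 's3':
            return 's3'
        case 'gcs':
            return 'gcs'
        case 'azure':
            return 'azure'
        default:
            return 'local'
    }
}
```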
Summary of Changes

Hello @hztBUAA, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the platform's storage capabilities by integrating Azure Blob Storage as a new backend option. This allows users to leverage Azure's scalable object storage service for managing files, providing greater flexibility in deployment environments. The changes encompass new credential and document loader nodes, along with updates to the existing storage utility functions to support Azure-specific operations.
Code Review
The addition of Azure Blob Storage support is a valuable enhancement to the platform's storage capabilities, and the implementation follows the established patterns of the S3 and GCS backends. However, a few issues should be addressed before merging: a syntax error in the server utilities that breaks local storage, use of the lower-level upload method where uploadData/uploadStream would be simpler and safer, and inefficient memory usage during file uploads in the multer storage engine. Addressing these will ensure the stability and performance of the new storage provider.
```typescript
    } else {
        return multer({ dest: getUploadPath() })
    }
}
```

```typescript
const chunks: Buffer[] = []
file.stream.on('data', (chunk: Buffer) => chunks.push(chunk))
file.stream.on('end', async () => {
    try {
        const buffer = Buffer.concat(chunks)
        await blockBlobClient.upload(buffer, buffer.length, {
            blobHTTPHeaders: { blobContentType: file.mimetype }
        })
        cb(null, { path: blobName, size: buffer.length })
    } catch (err) {
        cb(err)
    }
})
file.stream.on('error', (err: Error) => cb(err))
```
Manually buffering the entire file into memory via file.stream.on('data', ...) is inefficient and can lead to out-of-memory (OOM) errors for large file uploads. It is recommended to use blockBlobClient.uploadStream() to pipe the data directly to Azure Blob Storage; this also avoids accumulating the whole file in memory just to compute its length.
```typescript
blockBlobClient.uploadStream(file.stream, undefined, undefined, {
    blobHTTPHeaders: { blobContentType: file.mimetype }
}).then(() => {
    cb(null, { path: blobName })
}).catch(cb)
```

```typescript
await blockBlobClient.upload(bf, bf.length, {
    blobHTTPHeaders: { blobContentType: mime, blobContentEncoding: 'base64' }
})
```
The lower-level upload method in @azure/storage-blob v12 requires the content length as a second argument, which is easy to get wrong. For Buffers, it is recommended to use uploadData, which infers the length from the data and takes the options object directly.
```diff
-await blockBlobClient.upload(bf, bf.length, {
-    blobHTTPHeaders: { blobContentType: mime, blobContentEncoding: 'base64' }
-})
+await blockBlobClient.uploadData(bf, {
+    blobHTTPHeaders: { blobContentType: mime, blobContentEncoding: 'base64' }
+})
```
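For context, the `bf`/`mime` pair in this hunk typically originates from a base64 data URL. A standalone sketch of that decoding step follows; `parseDataUrl` is a hypothetical helper for illustration, not part of the PR:

```typescript
// Hypothetical helper: split a data URL such as "data:image/png;base64,AAAA"
// into its MIME type and decoded Buffer (the bf/mime pair used above).
function parseDataUrl(dataUrl: string): { mime: string; bf: Buffer } {
    const match = dataUrl.match(/^data:([^;,]+);base64,(.*)$/)
    if (!match) throw new Error('not a base64 data URL')
    return { mime: match[1], bf: Buffer.from(match[2], 'base64') }
}
```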
```typescript
const downloadResponse = await blockBlobClient.download(0)
const chunks: Buffer[] = []
if (downloadResponse.readableStreamBody) {
    for await (const chunk of downloadResponse.readableStreamBody) {
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
    }
}
return Buffer.concat(chunks)
```
Instead of manually downloading the stream and concatenating chunks into a Buffer, you can use the built-in downloadToBuffer() method. This is more concise and efficient.
```diff
-const downloadResponse = await blockBlobClient.download(0)
-const chunks: Buffer[] = []
-if (downloadResponse.readableStreamBody) {
-    for await (const chunk of downloadResponse.readableStreamBody) {
-        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
-    }
-}
-return Buffer.concat(chunks)
+const blockBlobClient = containerClient.getBlockBlobClient(filePath)
+return await blockBlobClient.downloadToBuffer()
```
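The manual pattern that downloadToBuffer() replaces is essentially a generic stream-to-buffer helper. A standalone sketch, independent of the Azure SDK, shows what the loop above does:

```typescript
import { Readable } from 'node:stream'

// Collect every chunk of a Readable into one Buffer -- the same logic
// as the manual loop above, but reusable for any Node stream.
async function streamToBuffer(stream: Readable): Promise<Buffer> {
    const chunks: Buffer[] = []
    for await (const chunk of stream) {
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
    }
    return Buffer.concat(chunks)
}
```

Note that downloadToBuffer() also holds the whole blob in memory, so for very large blobs a streaming approach remains preferable.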
```typescript
const downloadResponse = await blockBlobClient.download(0)
const chunks: Buffer[] = []
if (downloadResponse.readableStreamBody) {
    for await (const chunk of downloadResponse.readableStreamBody) {
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
    }
}
const objectData = Buffer.concat(chunks)
```
The manual stream-to-buffer conversion can be simplified by using the downloadToBuffer() method provided by the Azure SDK.
```diff
-const downloadResponse = await blockBlobClient.download(0)
-const chunks: Buffer[] = []
-if (downloadResponse.readableStreamBody) {
-    for await (const chunk of downloadResponse.readableStreamBody) {
-        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
-    }
-}
-const objectData = Buffer.concat(chunks)
+const objectData = await blockBlobClient.downloadToBuffer()
```
Thanks for the review and feedback. I am following up on this PR now and will either push the requested changes or reply point-by-point shortly.
Quick follow-up: I am reviewing the feedback and will update this PR shortly. |
A similar PR is already in progress: #5604
Summary
Closes #5411
- Update `.env.example` with Azure Blob Storage configuration variables
- Add `@azure/storage-blob` (^12.26.0) as a dependency

New Environment Variables

- `STORAGE_TYPE=azure`
- `AZURE_BLOB_STORAGE_CONNECTION_STRING`
- `AZURE_BLOB_STORAGE_ACCOUNT_NAME`
- `AZURE_BLOB_STORAGE_ACCESS_KEY`
- `AZURE_BLOB_STORAGE_CONTAINER_NAME`
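One plausible way these variables could be resolved into a credential configuration is sketched below. This is a hedged illustration, not the PR's actual implementation: the function name `getAzureBlobConfigSketch`, the precedence of connection string over account/key, and the default container name are all assumptions.

```typescript
// Hedged sketch: resolve Azure Blob credentials from env-style input.
// The precedence (connection string before account/key), the 'flowise'
// default container, and the thrown error are assumptions.
interface AzureBlobConfig {
    mode: 'connectionString' | 'accountKey'
    containerName: string
}

function getAzureBlobConfigSketch(env: Record<string, string | undefined>): AzureBlobConfig {
    const containerName = env.AZURE_BLOB_STORAGE_CONTAINER_NAME ?? 'flowise'
    if (env.AZURE_BLOB_STORAGE_CONNECTION_STRING) {
        return { mode: 'connectionString', containerName }
    }
    if (env.AZURE_BLOB_STORAGE_ACCOUNT_NAME && env.AZURE_BLOB_STORAGE_ACCESS_KEY) {
        return { mode: 'accountKey', containerName }
    }
    throw new Error('Azure Blob Storage credentials are not configured')
}
```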
Test plan

- Test `getStorageType()` with all storage types
- Test `getAzureBlobConfig()` (connection string, account/key, missing credentials)

🤖 Generated with Claude Code