Add data ingestion quickstart for processing custom data#50558
Co-authored-by: gewarren <24882762+gewarren@users.noreply.github.com>
Pull request overview
This pull request adds a new quickstart for the Microsoft.Extensions.DataIngestion library that demonstrates how to build an ETL pipeline for RAG scenarios. The quickstart shows how to read Markdown documents, enrich them with AI, chunk them semantically, and store them in a vector database for semantic search. The PR also cleans up other quickstart files by removing model names from user secrets and hardcoding them as inline string values instead.
Key Changes
- New quickstart documentation showing end-to-end data ingestion pipeline for AI applications
- Sample code demonstrating pipeline composition with readers, enrichers, chunkers, and vector storage
- Code cleanup across existing quickstarts (text-to-image, structured-output) to simplify configuration
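The read, enrich, chunk, store, and search stages listed above can be illustrated with a small library-neutral sketch. This is not the Microsoft.Extensions.DataIngestion API and not the quickstart's C# code: every name here (`Chunk`, `chunk_by_heading`, `enrich`, `InMemoryVectorStore`, `ingest`) is hypothetical, the word-count "embedding" is a toy stand-in for a real embedding model, heading-based splitting stands in for semantic chunking, and the keyword step stands in for AI enrichment.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    heading: str
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_heading(doc_id, markdown):
    """Split a Markdown document into one chunk per heading section."""
    chunks, heading, lines = [], "", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if any(part.strip() for part in lines):
                chunks.append(Chunk(doc_id, heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if any(part.strip() for part in lines):
        chunks.append(Chunk(doc_id, heading, "\n".join(lines).strip()))
    return chunks


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def enrich(chunk):
    """Placeholder for AI enrichment: record the most frequent longer words."""
    words = [w for w in tokenize(chunk.text) if len(w) > 3]
    chunk.metadata["keywords"] = [w for w, _ in Counter(words).most_common(3)]
    return chunk


def embed(text):
    """Toy embedding: a sparse word-count vector (assumption, not a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._rows = []

    def upsert(self, chunk):
        self._rows.append((embed(chunk.heading + " " + chunk.text), chunk))

    def search(self, query, top=1):
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top]]


def ingest(docs, store):
    """Run every document through chunk -> enrich -> store; return the chunk count."""
    count = 0
    for doc_id, markdown in docs.items():
        for chunk in chunk_by_heading(doc_id, markdown):
            store.upsert(enrich(chunk))
            count += 1
    return count


# In-memory sample input so the sketch runs without any files on disk.
SAMPLE_DOCS = {
    "sample.md": (
        "# Setup\nInstall the SDK and create a console project.\n\n"
        "# Search\nVector search finds chunks similar to a query.\n"
    ),
}

store = InMemoryVectorStore()
ingested = ingest(SAMPLE_DOCS, store)
best = store.search("how does vector search work", top=1)[0]
print(ingested, best.heading)
```

Keeping each stage as a separate function mirrors the composition idea in the list above: a stage can be swapped (for example, a real embedding model in place of `embed`) without touching the others.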
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/ai/toc.yml | Adds new quickstart entry under "Chat with your data (RAG)" section |
| docs/ai/quickstarts/process-data.md | New quickstart documentation for data ingestion pipeline |
| docs/ai/quickstarts/snippets/process-data/Program.cs | Complete C# example implementing data ingestion with Azure OpenAI |
| docs/ai/quickstarts/snippets/process-data/ProcessData.csproj | Project file with required NuGet packages |
| docs/ai/quickstarts/snippets/process-data/data/sample.md | Sample Markdown document for testing the pipeline |
| docs/ai/quickstarts/text-to-image.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/structured-output.md | Removed unnecessary model name from user secrets configuration |
| docs/ai/quickstarts/snippets/text-to-image/azure-openai/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/quickstarts/snippets/structured-output/Program.cs | Hardcoded model name instead of reading from user secrets |
| docs/ai/how-to/snippets/access-data/ArgumentsExample.cs | Hardcoded model name instead of reading from user secrets |
BillWagner left a comment
I had one comment on the code, and then this is ready to merge.
luisquintanilla left a comment
Looks good. Added a few comments
Summary
Adds quickstart documentation for the Microsoft.Extensions.DataIngestion library, demonstrating a complete ETL pipeline for RAG scenarios.
Contributes to #50534
Changes
Documentation
docs/ai/quickstarts/process-data.md
Code Snippets
Based on sample from https://github.com/luisquintanilla/DataIngestion and blog announcement at https://devblogs.microsoft.com/dotnet/introducing-data-ingestion-building-blocks-preview/
Warning
Firewall rules blocked me from connecting to one or more addresses.
I tried to connect to the following addresses, but was blocked by firewall rules:
- devblogs.microsoft.com
  - /usr/bin/curl curl -s REDACTED (dns block)
  - /usr/bin/wget wget -q -O /tmp/blog.html REDACTED (dns block)
If you need me to access, download, or install something from one of these locations, you can either: