future-agi · Anmol270 · Jun 24, 2026
diff --git a/src/pages/docs/release-notes.mdx b/src/pages/docs/release-notes.mdx
@@ -3,27 +3,65 @@
 description: "Latest Future AGI release notes covering new features, improvements, and bug fixes across datasets, evaluations, simulation, and observability products."
 ---
 
+## Week of 2026-06-18
+
+<div class="mb-12 pb-8 border-b border-[var(--color-border-subtle)] last:border-b-0">
+
+<div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>
+
+- **Custom Attribute Filter Dropdown Now Populates:** In some cases, the custom attribute dropdown in the dashboard was empty for projects with many unique span attributes. It now lists all available attributes.
+
+- **Saved View Column Selections Now Persist:** In some cases, a column deselected in a saved Observe view immediately snapped back to its saved state. Column visibility changes now stick for the session.
+
+- **API Key Expiry Enforced Across All Gateway Components:** Expired API keys are now rejected consistently across all gateway entry points, including components that previously accepted synced keys past their expiration date.
+
+</div>
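
One way to picture the consistent API key expiry enforcement noted above is a single shared check that every entry point calls before serving a request. The sketch below is illustrative only: the `key_record` shape, the `active` flag, and the function names are assumptions, not the gateway's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_key_expired(expires_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Return True once the expiry timestamp has been reached.

    Assumption: a key with no expiry (None) never expires.
    """
    if expires_at is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now >= expires_at


def authorize(key_record: dict, now: Optional[datetime] = None) -> bool:
    """Hypothetical shared check used by every entry point, including synced-key paths."""
    return bool(key_record.get("active", False)) and not is_key_expired(
        key_record.get("expires_at"), now
    )


now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
fresh = {"active": True, "expires_at": now + timedelta(days=1)}
stale = {"active": True, "expires_at": now - timedelta(seconds=1)}

print(authorize(fresh, now))  # True
print(authorize(stale, now))  # False
```

Routing every component through one function like this is what keeps a synced copy of a key from outliving its expiration date.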
+
+## Week of 2026-06-11
+
+<div class="mb-12 pb-8 border-b border-[var(--color-border-subtle)] last:border-b-0">
+
+<div class="mt-6 mb-3 text-lg font-semibold">Features</div>
+
+- **Few-Shot Examples for LLM Judge:** When configuring a custom LLM evaluator, you can now attach a dataset of input/output examples. The judge uses these as few-shot references during scoring, producing more consistent and calibrated results across your eval runs.
+
+<div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>
+
+- **Trace List Loads Reliably for Large Accounts:** In some cases, the trace list failed to load for accounts with a high number of distinct users. This has been resolved.
+
+- **Annotation Filters in Eval Tasks Now Work for Voice Call Projects:** In some cases, eval tasks using annotation filters on voice call rows returned no results. Annotation filters now correctly match annotations across all project and row types.
+
+- **Trace View No Longer Crashes on Large Images:** In some cases, opening a trace containing a span with an embedded image larger than 50 MB caused the page to fail to load. This no longer occurs.
+
+- **Eval Task Filter Conditions Show Readable Column Names:** In some cases, eval task filter conditions displayed internal identifiers instead of the column's display name. Filters now show human-readable names.
+
+- **App No Longer Crashes With Browser Translation Enabled:** In some cases, using a browser's built-in translation feature (such as Chrome Translate or Edge Translate) caused a page crash. This no longer occurs.
+
+- **Eval Template Deletion Cleans Up Dataset Columns:** Deleting an eval template now removes the associated eval columns and cells from your datasets automatically.
+
+</div>
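
As a rough illustration of how few-shot examples can steer an LLM judge, the sketch below assembles a judge prompt from a small set of input/output examples. The prompt layout, the `build_judge_prompt` name, and the dictionary keys are assumptions made for this example; the platform's real prompt format is not described in these notes.

```python
from typing import Dict, List


def build_judge_prompt(instructions: str, examples: List[Dict[str, str]], item_input: str) -> str:
    """Prepend few-shot input/output pairs to the item the judge should score."""
    parts = [instructions.strip(), ""]
    for index, example in enumerate(examples, start=1):
        parts.append(f"Example {index}")
        parts.append(f"Input: {example['input']}")
        parts.append(f"Output: {example['output']}")
        parts.append("")
    parts.append("Now score the following.")
    parts.append(f"Input: {item_input}")
    parts.append("Output:")
    return "\n".join(parts)


few_shot = [
    {"input": "Refund issued within policy.", "output": "Pass"},
    {"input": "Agent ignored the refund request.", "output": "Fail"},
]
prompt = build_judge_prompt("Judge whether the agent resolved the request.", few_shot, "Agent escalated and closed the ticket.")
print(prompt)
```

Showing the judge the same reference pairs on every run is what tends to make its verdicts more consistent from one eval run to the next.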
+
 ## Week of 2026-06-04
 
 <div class="mb-12 pb-8 border-b border-[var(--color-border-subtle)] last:border-b-0">

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Revamped Tracing Filters:** Filters across the Trace and Span views have been rebuilt with a more consistent and reliable foundation. Text-based filters now handle case differences correctly, and the filter picker accurately resolves metric names across all namespaces.

 - **Call Recording on Error Feed Overview:** For simulation projects, the Error Feed cluster overview now shows the call recording player instead of the agent flow section. You can listen to the call directly while reviewing the error cluster without switching views.

 - **Customer Agent Task Completion Evaluator:** A new built-in system evaluator is now available: customer_agent_task_completion. It checks whether your agent fully completed the assigned task in a customer interaction, returning a Pass or Fail result. It takes your agent's prompt and the full conversation as inputs. This is especially useful in Simulation; we recommend adding it to your simulation eval runs to automatically verify task completion across scenarios.

 - **Pass/Fail Now Shown Correctly in Trace Eval Drawer:** Pass/Fail evaluations like PII were displayed as a percentage score in the trace eval drawer, which read as a confidence level rather than a verdict. They now render as Pass or Fail.

 - **Eval Save and Test Require Valid Template Variables:** The Save and Test buttons in an eval's instructions editor are now disabled until the instructions contain at least one valid template variable. A tooltip explains why the buttons are inactive, and the check applies to both the create and edit flows.

 - **Required Eval Field Mappings No Longer Dropped:** In some cases, creating a system eval failed because required field mappings were silently removed from the payload during the setup flow. Required mappings are now preserved and validated before submission.

 - **Composite Eval Test Run on Tasks Fixed:** Running a test on a composite eval from the Eval Task view was failing, preventing you from verifying evals on a single test row before running them across all entries. This has been resolved.

 - **Custom Code Eval Parameters Now Apply:** Parameters passed via the SDK when running custom code evals were ignored in some cases. They now apply correctly.

 - **Legacy Observe Tabs and Charts UI Removed:** The old Charts UI and legacy tab bar were still appearing on Tracing tabs after the charts revamp. The outdated interface is now fully removed so only the updated UI is shown.

@@ -33,7 +71,7 @@

 - **Save View Button Visible and Tab Names Truncate Cleanly:** The Save View button was nearly invisible in dark theme due to low contrast. Long view names and tab labels also overflowed. Both issues are fixed: the button is clearly visible and long names truncate at the boundary.

 - **Agent Graph No Longer Shows a Blank Screen for Voice Bots:** Opening the Agent Graph for a voice bot trace showed a blank screen with no explanation. Voice projects now default to the appropriate graph view, and unsupported tabs show a tooltip explaining why they are unavailable.

 - **Tracing Graph Full Screen Now Works:** The full-screen button on the trace agent graph and path views was not functioning. Both views now open in browser full screen correctly.

@@ -43,17 +81,17 @@

 - **Task Status Updates Without a Page Refresh:** In some cases, task status in the list stayed stale until you refreshed the page. The task list now polls automatically while rows are in progress, so statuses update on their own.

 - **Full Variable Names Visible on Hover in Mapping:** Variable names in the task screen's variable mapping column were truncated with no way to read the full name. Hovering over a column key now shows the full variable name in a tooltip.

 - **Empty Dataset Cells No Longer Show as Objects in Eval Mapping:** In some cases, empty cells from a dataset appeared as a raw object in the eval variable mapping step instead of showing as blank. Empty cells are now displayed correctly.

 - **Evals Skip Instead of Failing When Required Attributes Are Missing:** When a span was missing a required mapped attribute, the eval was incorrectly marked as Failed. Evals are now skipped for those spans, keeping your pass and fail metrics accurate.

 - **Removed Member No Longer Sees Indefinite Loading on Login:** In some cases, an account that had been removed from an organization saw a loading state persist indefinitely after attempting to log in, requiring a page refresh to see the correct message. The page now resolves correctly without a refresh.

 - **Show More in Error Details Now Works:** In some cases, the Show More button in the error details section was not functioning. It now expands correctly. The error localizer also no longer runs for evals that already passed.

 - **Output Type Locked After Eval Creation:** Once an evaluation is created, the output type can no longer be changed. A tooltip now explains this directly in the interface so the restriction is clear.

 </div>
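
The skip-instead-of-fail behavior for spans with missing required attributes can be sketched as follows. This is a minimal illustration under assumed names (`run_eval`, `pass_rate`, and the status strings are hypothetical); it only shows why excluding skipped spans keeps pass and fail metrics accurate.

```python
from typing import Callable, Iterable, List


def run_eval(span: dict, required: List[str], judge: Callable[[dict], bool]) -> str:
    """Return 'skipped' when a required attribute is absent, else the judge's verdict."""
    missing = [name for name in required if span.get(name) is None]
    if missing:
        return "skipped"
    return "passed" if judge(span) else "failed"


def pass_rate(results: Iterable[str]) -> float:
    """Compute the pass rate over scored spans only; skipped spans are excluded."""
    scored = [result for result in results if result != "skipped"]
    if not scored:
        return 0.0
    return sum(result == "passed" for result in scored) / len(scored)


spans = [
    {"input": "hi", "output": "hello"},
    {"input": "hi", "output": ""},
    {"input": "hi"},  # the required "output" attribute is missing
]
results = [run_eval(span, ["input", "output"], lambda s: bool(s["output"])) for span in spans]
print(results)             # ['passed', 'failed', 'skipped']
print(pass_rate(results))  # 0.5
```

Had the third span been counted as failed, the pass rate would have dropped to one third even though the eval never actually ran on it.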

@@ -63,23 +101,23 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Perplexity Sonar Models Now Available for Evaluations:** You can now use Perplexity's full Sonar model family (sonar, sonar-pro, sonar-reasoning, sonar-reasoning-pro, and sonar-deep-research), including the Agent API for agentic workflows with built-in web search, when running evaluations. Contributed by the Perplexity team. [PR #650](https://github.com/future-agi/future-agi/pull/650).

 - **Use System Prompt as Context in Evals:** You can now leverage your agent's system prompt as context when running system and custom evals, giving you a more complete view of how your prompts affect model outputs.

 - **New API: Access Eval Task Data Across Two Axes:** Two new API endpoints are now live. You can access eval task results via API in two ways: a per-evaluator summary (pass rates, average scores, and choice distributions across the full task) and a per-span breakdown (each evaluator's result for every individual span). Both support optional date range filtering.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Eval Results in Observe Now Display Correctly:** In some cases, eval results in Trace Observe were not displaying correctly. This has been resolved and results now appear as expected.

 - **Eval Type Now Saved Correctly When Creating New Evaluations:** When creating an LLM-as-judge or Code evaluation, the selected type was previously always saved as Agent. The correct eval type is now preserved in all cases.

 - **Eval Experience Improvements Across the Platform:** A set of improvements to the eval creation and review experience: linking directly to a specific eval version now opens on that version; long task error messages collapse to a one-line summary with a Show more toggle; results no longer show stale data when switching between dataset, tracing, and simulation panels; variable highlighting in the prompt editor reflects which variables are actually mapped; clicking an execution row now opens that specific run rather than always opening the latest; and the ground-truth embedding status now updates in real time, with no page refresh needed.

 - **Sessions View from the Users Tab No Longer Times Out:** When navigating to the Sessions view from the Users tab, the page could get stuck on a loading screen or time out. Sessions now loads reliably from that entry point.

 - **Composite Evals No Longer Accept Other Composites as Children:** When building a composite evaluation, the child picker now only lists individual, non-composite evaluators. Previously, composite evals could be selected as children, which produced unexpected results.

 - **Usage and Billing Page Display Accuracy Improved:** Several display issues on the Usage and Billing page have been fixed: AI credits were showing incorrect units, time period labels on usage cards were inaccurate, chart axis labels showed duplicates or mixed formatting, and the pricing tier table was missing column headers and correct unit labels.

@@ -93,13 +131,13 @@

 - **Nested Variable References Now Work in LLM-as-Judge Templates:** Variables that reference nested properties using dot notation were not rendering correctly in custom prompt evaluator templates. They now resolve and render as expected.

 - **Fix with Falcon Now Only Appears on Failing Evals:** The Fix with Falcon option previously appeared on both passing and failing eval rows. It now only appears on evals that are failing, not on every row.

 - **Resuming a Completed Eval Task Now Shows a Clear Message:** In some cases, clicking Resume on a task that had already finished showed a raw error. It now shows a clear message indicating the task may have already completed.

 - **Instruction Validation Errors Now Visible When Creating Evals from Sessions:** When adding an LLM-as-judge evaluation from the Sessions view, validation errors on the instructions field were not shown, causing saves to silently fail. Error messages now appear inline as expected.

 - **Eval and Task List Chips Now Have Hover Feedback and a Stable Popover:** Chips in the Tasks and Evals list had no hover state, and the popover showing additional items closed as soon as the cursor moved toward it. Chips now darken on hover and the popover stays open as the cursor moves into it.

 - **Task Usage Table Columns No Longer Get Cut Off:** The Task Usage table was clipping right-side columns. The table now scrolls horizontally so every column stays visible.

@@ -111,11 +149,11 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Composite Evals Now Work Across Trace and Sessions:** You can now run composite evaluations against traces and sessions, not just individual spans. This lets you measure performance across multi-step conversations and grouped interactions in a single evaluation run.

 - **Custom Evals Run With Partial Inputs:** Custom evaluations now proceed even when some input fields are missing from your trace data, returning results with a warning indicator instead of failing entirely. System evaluators continue to require all inputs before running.

 - **Custom Variable Paths in Eval Task Mappings:** When configuring an eval task, you can now type any custom attribute path from your trace data rather than choosing from a fixed list. This gives you full flexibility when mapping trace fields to eval inputs, including deeply nested attributes.

 - **Dynamic API Columns Support Nested Field Paths:** When configuring a dynamic API column, you can now use dot notation to reference nested fields in the API response (for example, result.score.value). This lets you pull specific values from deeply structured API responses without needing to flatten them first.

@@ -131,15 +169,15 @@

 - **Voice Recordings Now Load Correctly from Error Feed:** Voice traces opened directly from the Error Feed were showing no recording even when one existed. Recordings now load correctly when viewing a voice trace from the Error Feed.

 - **Errored Evals Now Show a Clear Error Indicator:** Evaluations that encountered an error were previously shown as a blank dash in the trace and voice drawers, making it hard to tell that something had gone wrong. They now show a clear Error indicator so the status is immediately visible.

 - **Filter Picker Returns Correct Results for Matching Metric Names:** In some cases, when two metrics shared the same name across different namespaces, the filter property picker returned incorrect or missing results. The picker now returns the correct metric in all cases.

 - **Eval Results Now Load for All Trace Evaluations:** In some cases, evaluation results for trace-level evaluations were not appearing in the details panel even after the eval completed successfully. These results now display correctly.

 - **Eval Versions Save and Restore Correctly:** In some cases, saving a new version of an evaluation template or restoring an older one could result in incorrect configuration being applied. Versioning now works reliably, and context settings persist correctly across edits.

 - **Eval Creation Saves Correctly When Adding Multiple Evals:** In some cases, the Save & Add button in the eval picker stayed disabled even after completing all required steps, affecting the simulation, Observe, and dataset flows. The button now enables correctly once all steps are done, so you can save and continue adding evaluations without getting stuck.

 - **Large Traces Are Now Evaluated Completely:** Previously, evaluation prompts for very large traces were cut off well before the end of the trace content, causing evaluations to run on incomplete context. The limit has been significantly increased so that large traces are fully considered during evaluation.

@@ -157,7 +195,7 @@

 - **Tag Input in Trace Detail Now Responds to Clicks on the Enter Icon:** In the trace detail view, clicking the Enter icon in the tag input field now correctly adds the tag. Previously, only pressing the keyboard Enter key would work.

 - **Adding Evals with Number Inputs Now Works Reliably:** In some cases, adding an evaluation that included a number input field was causing unexpected errors. These errors no longer occur.

 </div>
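
The dot-notation field paths mentioned in these notes (for example, result.score.value) amount to walking a nested structure one key at a time. The helper below is a sketch under stated assumptions: `resolve_path` is a hypothetical name, and treating numeric segments as list indexes is an illustrative choice rather than documented platform behavior.

```python
from typing import Any


def resolve_path(data: Any, path: str, default: Any = None) -> Any:
    """Follow a dot-separated path through nested dicts (and, by assumption, lists)."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # Any missing segment short-circuits to the default value.
            return default
    return current


response = {"result": {"score": {"value": 0.87}, "tags": ["a", "b"]}}
print(resolve_path(response, "result.score.value"))    # 0.87
print(resolve_path(response, "result.tags.1"))         # b
print(resolve_path(response, "result.missing.value"))  # None
```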

@@ -169,41 +207,41 @@

 - **Self-Hosted Install:** Setting up Future AGI on your own machine is now straightforward. Clone the repo, cd into the folder, and run bin/install on macOS or Linux. You need Docker, Docker Compose, and at least 8 GB of RAM. That's it.

 - **Expanded Context Injection for Evals:** When configuring an eval, you can now choose exactly which context to inject as separate options: span metadata, trace IDs, session data, or call transcripts and recordings. If you already use variables in your eval, you can map context to them as before. If you do not, you can skip that step entirely. When running evals on sessions, the injected context includes depth into the underlying traces and spans, so you can see exactly where gaps occurred. When building an eval, the right context type is pre-selected automatically based on what you are evaluating, so there is less manual setup.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Task Page Filters Apply to Eval Variable Mapping:** Filters you set on the task page now carry through when mapping eval variables. The right traces, spans, and sessions are already scoped for you, so there is no need to search for them manually.

 - **Image Evals Now Accept URLs:** Image-based evals now accept public HTTP/HTTPS URLs and signed S3 links as inputs. Pass the URL as a string directly in the input field. No file upload or base64 encoding needed. The platform fetches and processes the image server-side before running the eval.

 - **Code Evals More Reliable:** Built-in code evals now run in a consistent execution environment. Eval descriptions have also been updated to accurately reflect current behavior.

 - **Built-In Validators Improved:** Ten built-in validators have been updated for better accuracy. Email, HTML, SQL, URL, and XML validators now handle a wider range of inputs correctly. Scoring metrics including diff, kappa, word-level error rate, and METEOR score all produce more precise results.

 - **Eval Scores Are Consistent Regardless of Input Formatting:** Eval scores no longer vary based on incidental whitespace in inputs. All inputs are normalized before scoring, and comparing two identical empty values now returns a perfect match.

 - **Optional Eval Fields Now Have Sensible Defaults:** Code evals with optional numeric configuration fields now run with their default behavior when those fields are left blank. No configuration is needed unless you want to override the defaults.

 - **Structured Output Compatibility Improved:** Evals that use LLM as a judge were returning empty results for some nested schema shapes, particularly with models that do not fully support structured output. Both cases are now handled gracefully.

 - **Continuous Evals Now Run Reliably at Scale:** Always-on evals with sampling configured now process incoming data consistently over time, regardless of total volume seen so far.

 - **Task Submission Error Handling Improved:** If an eval configuration fails to save inside the Tasks wizard, you now see a clear error message immediately and can fix it before submitting. The wizard keeps your inputs intact.

 - **Saved Eval Settings Preserved on Re-edit:** Opening the edit view on a staged eval in Tasks was resetting the model selection and error localizer toggle back to defaults. Both settings are now correctly restored when you reopen an eval for editing.

 - **Session List Loads Faster:** The session list now loads more quickly, so you spend less time waiting.

 - **Playground Handles URL Inputs Reliably:** The Playground now processes URL inputs correctly without becoming unresponsive. In some cases, entering a URL as an input would cause the interface to stop responding until the page was refreshed.

 - **Observe Task Evals Now Validate Recording URLs:** Task evals in Observe now verify that recording URLs from your provider's webhook are reachable before running. If a URL is inaccessible, you get a clear error message so you can fix it quickly and get accurate results.

 - **Dot-Notation Now Supports All Nesting Patterns:** You can now use any variable notation style in eval prompts, including dot notation and deeply nested references.

 - **Only Published Evals Appear in the Eval Drawer:** Draft eval templates created during building or testing no longer show up in the eval selection drawer. Only published evals are visible there.

 - **Error Localizer Only Runs When Needed:** The error localizer now skips evals that already passed. It only runs when there is actually something to investigate.

 - **Dataset Column Deletion Is Faster:** Deleting columns from a dataset is now more efficient, especially for larger datasets.

@@ -215,17 +253,17 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Improved Reliability for Voice Observability Evals:** Traces, replays, and evals for voice calls now stay fully accessible long after a call ends. Vapi and Retell recording URLs rotate and expire on their own schedules, which causes playback to silently break on older calls. Future AGI now stores a durable copy of every external recording at ingestion time, so your observability data and eval runs are no longer dependent on provider URL availability.

 - **Error Feed Now Works for Voice Simulation:** Eval-source clusters on Vapi and simulations were not rendering correctly. The Pattern Summary, Trends KPIs, and trace drawer all needed updates to support these project types. All three are now fixed, and clicking a voice trace now opens the voice call drawer as expected.

 - **Datasets: Select-All State Resets When Switching Datasets:** Switching datasets or tabs was preserving the previous selection state, causing incorrect behavior in delete, duplicate, and copy actions. Selection now resets cleanly on every dataset switch.

 - **Trace Attribute Drawer: Long Values Are Expandable and Rows Are Easier to Scan:** Long string values in the span attributes drawer were clipped with no way to see the full content. Values are now click-to-expand, and dividers between rows make it easier to tell where one attribute ends and the next begins.

 - **Eval List Shows Correct Default Version:** The evals list now correctly shows the current default version for each template instead of always showing V1.

 - **Zero Eval Scores Now Render:** Eval score rendering was treating a score of 0 as empty. Dataset grids, eval logs, and datapoint drawers now correctly display zero scores.

 - **j/k Navigation Shortcuts No Longer Swallow Text Input:** The j and k row navigation shortcuts were intercepting keystrokes globally, blocking you from typing those letters into comment fields and text inputs in the detail panel. These shortcuts now correctly yield to focused text inputs.

@@ -233,7 +271,7 @@

 - **Workspace Invite Fixed for Existing Users:** In a few cases, existing org members invited to a new workspace were not receiving the invitation email and could not see the new workspace in their list. The invite flow now correctly sends the email and grants access uniformly.

 - **Eval "Created By" Now Shows Organization Name for Legacy Evals:** Evals without creator metadata were showing "User" in the Created By column. They now fall back to the organization display name, and filtering by creator also matches on organization name.

 </div>
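
The zero-score rendering fix is an instance of a common pitfall: a truthiness check treats 0 the same as a missing value. The sketch below contrasts the two checks; the function names and the "-" placeholder are assumptions for illustration, not the platform's rendering code.

```python
from typing import Optional


def render_score_buggy(score: Optional[float]) -> str:
    # Truthiness check: 0 and 0.0 are falsy, so a real zero renders as empty.
    return f"{score:.2f}" if score else "-"


def render_score(score: Optional[float]) -> str:
    # Only a missing value (None) is treated as empty.
    return "-" if score is None else f"{score:.2f}"


print(render_score_buggy(0))  # -
print(render_score(0))        # 0.00
print(render_score(None))     # -
print(render_score(0.5))      # 0.50
```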

@@ -245,7 +283,7 @@

 - **Jinja2 Template Support in Prompt Editor and Agent Playground:** You can now write prompts using Jinja2 syntax (`{% if %}`, `{% for %}`, filters, and other Jinja2 constructs) alongside the existing Mustache `{{ }}` format. A new Template Format dropdown lets you switch between Mustache and Jinja2 in the prompt workbench, run prompt view, and agent playground node forms. The backend renders Jinja2 templates safely, and variable extraction for the inputs panel works correctly in both modes.

 - **Annotation and Eval Metrics as Columns When Adding to Dataset:** When adding traces to a dataset from the trace drawer, you can now include annotation scores and eval metric results as dataset columns. Select the metrics you want during the "Add to Dataset" flow and they'll be carried over as column values on each row, letting you capture quality signals directly in the dataset without a separate export step.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

@@ -253,7 +291,7 @@

 - **Voice Call Talk-Time Split Fixed:** Talk-time percentages in the voice call drawer were showing a blank for certain call types. Per-role totals now calculate correctly across all transcript formats.

 - **Prompt Workbench Eval Delete Fixed:** Deleting an evaluation run in the Prompt Workbench now succeeds for all types of prompts. Previously, in some cases it showed a failure toast and left the eval in the list.

 - **Annotation Queue CSV Export Now Works:** You can now export annotation queue data as CSV directly from the analytics tab. The Export CSV button is fully functional and ready to use.

@@ -275,19 +313,19 @@

 - **Error Feed:** A new feed that puts all your AI errors, hallucinations, and pipeline issues in one place. Issues are scanned and scored automatically as new traces come in, and you can run a Deep Analysis on any issue to get a ranked list of likely causes and suggested fixes. For each issue, you get a quick summary of what went wrong, examples of traces that failed compared to ones that worked, an agent flow diagram, and a trend chart. You can triage each issue by setting its status, severity, and assignee, create a Linear ticket in one click, and if a fixed issue comes back, it is automatically flagged as a regression. The trend chart also marks when each release went out, so you can spot the release that likely caused the problem.

 - **Observe Revamp:** The Tracing experience has been rebuilt around saved views, stronger search, and a more connected layout. Filter the list using natural language with Ask AI, or build queries with Basic and Query modes. Save custom views (filters, columns, sort, density) and switch between them, with compacted layouts for quicker scanning. Search across traces, spans, and agent flow, and view the full agent / graph flow to understand how your agent is moving between steps. You can run evals or add traces to a dataset right from the list. Navigation is now consistent across Trace, Sessions, and Users: prev/next, group by, view trace, view sessions, and replay all route to the right place, and filters carry over between tabs.

 - **Experiments Revamp:** The Experiments flow has been rebuilt from inside a dataset. You can now name your experiment, pick its type (LLM, TTS, STT, or Image), and add the prompts and agents you want to compare (each with version selection and the option to stack multiple models). Running experiments on agents is fully supported. Add evaluations from the same picker as the Evals page, and optionally pick a column from your dataset to compare results against. The Experiments list shows status, model count, and eval count per run, so you can monitor progress, edit experiments, rerun, or add more evaluations anytime.

 - **Evaluations Revamp:** We rebuilt the Evals page. Variable mapping is much easier now, with the relevant data points shown right next to the field you are mapping. There is a new test area on the same page where you can try your eval on custom inputs, a dataset, a real trace, or a simulation before saving. You can also bundle multiple evals together (composite evals), and write evals in Python or JavaScript. Evals can now connect to external tools and the internet to enhance their judgements, and you can configure how detailed the explanation should be. The list has filter chips for use cases like RAG, Hallucination, Conversation, Code, PDF, Text, and Safety, plus trend charts and error rates per eval.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **AWS Bedrock added to Agent Command Center Gateway:** AWS Bedrock is now available as a provider in the Agent Command Center Gateway, alongside OpenAI, Anthropic, Google (Gemini), Azure OpenAI, Cohere, Groq, Together AI, Fireworks AI, Mistral AI, and Custom/Self-hosted.

 - **Tighter guardrails:** Guardrail configuration shows the full keyword blocklist setup, and guardrails reliably trigger when a score crosses the limit you set.

 - **Command Center is more reliable across providers:** A set of provider-specific behaviours has been tightened up: GPT-5 routing, multimodal Gemini handling, full-length session IDs, cleaner auth errors for invalid keys, default cost tracking, cache TTL retention, and immediate webhook delivery logs.

 </div>

@@ -337,15 +375,15 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Dashboards:** Create custom dashboards to track agent performance across eval scores, system metrics, cost, and experiment progress in one central place. All the data that was previously scattered across datasets, eval logs, Observe, and experiments is now queryable together. Add widgets, filter by agent or time range, and compare performance over time.

 - **Falcon AI:** A context-aware AI assistant embedded in the platform. It picks up the context of whatever page you are on, so you can ask questions or trigger actions directly against your current data. It supports trace debugging, simulation, eval creation, and dataset building.

 - **MCP Server:** Future AGI now ships an MCP server that you can connect to your IDE. Supported clients include Cursor, Claude Code, VS Code, Claude Desktop, and Windsurf. Once connected, your coding assistant has access to your evaluations, datasets, experiments, traces, and prompts. You can also configure which tool groups are available to connected clients from the settings page.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Eval not added when using a different column for mapping in run optimization:** Fixed a bug where selecting a different column for eval mapping during run optimization would silently drop the eval instead of adding it.

 - **Annotation queue status not updating correctly:** Fixed the queue status state flow where an already active queue still showed the "Activate" option, allowing it to be activated multiple times.

@@ -373,17 +411,17 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Skip reasons now shown for evals and CSAT in voice simulate:** Evals and CSAT are now automatically skipped for calls where there was no meaningful conversation or the audio was under 5 seconds. A skip reason is shown directly in the cell so you always know why a particular eval was not scored.

 - **Audio and image output types in Prompt Workbench:** Prompt Workbench now supports audio and image as output types when running prompts. This makes it straightforward to test prompts that generate voice or visual outputs directly from the Workbench.

 - **Custom eval names now work correctly across workspaces:** Custom eval lookups via the SDK are now scoped to the correct workspace, fixing an error that some users were seeing when the same eval name existed in multiple workspaces.

 - **Full eval explanation now visible in test view:** The explanation output box in the eval test view now grows automatically to show the complete reasoning. Previously the text was getting cut off.

 - **Dataset name now fills in automatically when uploading a file:** When you upload a CSV or JSON file to create a dataset, the name field is now pre-filled using the filename with special characters removed. You can still edit it freely afterward.

 - **Clearer errors when a model does not support your input type:** When running evals with a model that has modality restrictions, you now see a specific message explaining exactly what is not supported instead of a generic error.
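
The dataset-name pre-fill described above can be sketched roughly as follows. This is an illustration only: the helper name `suggest_dataset_name` and the exact set of characters treated as "special" are assumptions, since the release note does not spell out the platform's rules.

```python
import re
from pathlib import Path


def suggest_dataset_name(filename: str) -> str:
    """Derive a default dataset name from an uploaded file's name.

    Assumption: "special characters" means anything other than ASCII
    letters, digits, spaces, underscores, and hyphens.
    """
    stem = Path(filename).stem  # drop the final extension, e.g. ".csv"
    cleaned = re.sub(r"[^A-Za-z0-9 _-]", "", stem)
    return cleaned.strip()


print(suggest_dataset_name("sales (Q3)!.csv"))
```

Under these assumptions, `sales (Q3)!.csv` would yield `sales Q3`, which the user could then edit.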

 </div>

@@ -436,7 +474,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Observability Traces Show Evaluation Data:**
  Observability now lets users filter traces by evaluation data. With the **Show Traces with Evals** toggle, users can view only interactions that have been evaluated without manually scrolling through all traces.

 - **Workspace Admins Can Access Keys:**
  Workspace administrators can now view keys, eliminating dependency on workspace owners for key retrieval and enabling better key management.
@@ -445,7 +483,7 @@
  When clicking **Create Scenarios** from an agent definition, the scenario generation form now pre-fills with agent details, reducing manual data entry and speeding up scenario creation.

 - **Reasoning Model Support:**
  Added support for reasoning models with nullable **Run Prompt** parameters, enabling advanced reasoning workflows and more flexible parameter configurations.

 - **Better Scenario Naming:**
  Improved naming conventions for auto-generated scenarios, making it easier to identify and distinguish between different test scenarios in large test suites.
@@ -472,7 +510,7 @@
  Made the session history header sticky when there is only one item to view, removing unnecessary tab navigation and simplifying the single-session review experience.

 - **Improved Error Handling for API Keys and Prompt Execution:**
  Enhanced platform-wide error handling. When API keys are missing or misconfigured, users now see clear, actionable guidance directing them to update their settings. Prompt execution errors also provide specific, helpful details instead of generic messages, making issues easier to understand and resolve quickly.

 - **Voice Observability Project Deletion Fixed:**
  Resolved error handling during voice observability project deletion attempts, ensuring projects delete cleanly or provide clear feedback on deletion status.
@@ -489,7 +527,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Image Output Support in Datasets and Prompt Workbench:**  
  Users can now generate and view image outputs directly in Dataset Run Prompt and Prompt Workbench when working with image models. This enables complete multimodal workflows for testing and experimenting with models that generate visual content.

 - **Multiple Image Upload Support in Datasets:**  
  Users can now upload multiple images to a single dataset column using comma-separated values in JSON or CSV files. This enables more flexible data handling for image-based evaluations and experiments, with full support for accessing and leveraging images in prompt sections across run prompt and experiment workflows.
@@ -506,7 +544,7 @@
  Synthetic data generation performance has been optimized, significantly reducing the time required to create and populate dataset rows. This streamlines dataset creation workflows and enables faster iteration during testing and development.

 - **Enhanced Dataset Upload Handling:**  
  Improved column type detection and validation during JSON and CSV uploads. The system now better handles JSON objects, arrays, empty lists, numeric and boolean values, and datetime formats, resulting in more accurate data inference and fewer upload errors.

 - **More Natural Chat Simulation Personas:**  
  Chat simulation personas now generate more natural, human-like conversations. Personas avoid overly formal patterns (such as repeated **thank you** responses) and produce more realistic conversational flows that better reflect real user interactions.
@@ -533,7 +571,7 @@
  Enhanced HTML parsing and rendering to ensure prompt outputs display with correct formatting and spacing.

 - **Better Error Messaging in Error Localizer:**  
  Error Localizer now provides more actionable and accurate error messages when evaluation failures occur, helping teams diagnose and resolve issues more quickly.

 - **Clearer Optimization Parameters Display:**  
  Optimization parameters configured before running Fix My Agent are now visible on the results page, providing full transparency into the settings used for each optimization run.
@@ -642,7 +680,7 @@
  Trace tables are now significantly faster with smoother scrolling and improved alignment, enabling quick and comfortable analysis of large volumes of trace data at scale.

 - **PDF & Document Preview Across the Platform:**
  Uploaded PDFs and documents can now be previewed directly across datasets and experiments, allowing instant verification of file contents without downloading and reducing errors and rework.

 - **Enhanced Audio Player Experience:**
  The audio player now loads audio only when the play button is clicked. This reduces table load time, removes lag in audio-heavy views, and makes reviewing voice conversations faster and smoother.
@@ -662,10 +700,10 @@
  You can now add custom instructions while creating scenarios. These inputs influence scenario generation, giving you better control over how scenarios are created.

 - **Observe Table Performance Improvements:**  
  Observe tables are now more stable and performant for large datasets. Simplified table cells improve scrolling, rendering speed, and overall readability.

 - **Enhanced Eval Mapping with Prompt and Knowledge Base Inputs:**  
  Eval mapping now supports both prompt-related columns and Knowledge Bases as selectable inputs. This makes evaluation setup clearer, reduces configuration confusion, and enables more accurate, context-aware evaluations across the platform.

 - **Fetch Agent Definition from Providers:**  
  Agent definitions, including prompts and descriptions, can now be fetched directly from providers such as Vapi or Retell using an API key and assistant ID. This reduces manual configuration and keeps agent setups in sync.
@@ -685,8 +723,8 @@
 <div class="mb-12 pb-8 border-b border-[var(--color-border-subtle)] last:border-b-0">

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Filters for Evals in Dataset Summary:**  
  You can now filter Dataset Summary by specific evaluations. This helps you focus only on relevant evals, and summary charts update automatically based on the selected filters.

 - **Default Prompt Tokens Update Based on Model Selection:**  
  In Prompt Workbench, default token limits now update automatically when you change the model. This avoids token mismatch issues and removes the need for manual corrections.
@@ -741,7 +779,7 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Smarter Debugging with Actionable Simulation Insights (Fix My Agent):**
  Simulation results now deliver intelligent, context-aware suggestions to resolve both agent-level and infrastructure issues. Developers can quickly identify problems across prompts, model configurations, and runtime setups, with targeted recommendations for faster resolution. Users can also filter simulation calls to view only those with valid suggestions, enabling more focused debugging and faster optimization.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>
@@ -757,7 +795,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Documentation Links Added Across Observe:**
  Introduced direct documentation links across LLM Tracing, Sessions, Evals & Tasks, Alerts, and Users. Added a tooltip for Scheduled Runs in Evals & Tasks to improve clarity and onboarding.

 </div>

 ## Week of 2025-12-01
@@ -773,7 +811,7 @@
  Improved the primary graph dropdown for easier metric switching and refined error handling in observation evaluations to deliver clearer and more accurate failure reporting.

 - **Prompt Workbench Improvements:**
  Prompt Workbench now provides a smoother experience with live WebSocket streaming in Improve Prompt and fixes for Groq model execution. Additional UI refinements include smoother tab interactions, restored metadata visibility, and resolved overflow issues.

 - **Fixed Processing of Audio Type:**
  Resolved inconsistent parsing of audio URLs that caused errors during audio rendering and experiment execution. Audio inputs now load and process reliably across all workflows.
@@ -803,7 +841,7 @@
  Experiments now load significantly faster during re-runs, reducing wait times and improving responsiveness across iterations.

 - **Enhancements in Observe:**  
  Observe received multiple usability, stability, and backend improvements to deliver a more consistent experience across traces, sessions, and analytics. Updates include sticky filters, clearer pagination, improved table layouts, refined metadata visibility, streamlined pricing logic, improved JSON and payload handling, corrected evaluation log counts, more accurate session ordering, and several data consistency fixes. LLM tracing also now includes clearer copy and tooltips for improved understanding of model transitions and reasoning.

 - **Filters Freezing UI in Observe:**  
  Fixed an issue where applying filters caused the Observe interface to freeze.
@@ -826,7 +864,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Support for Custom Voices in Run Prompt and Experiments:**  
  Developers can now use custom voices from ElevenLabs and Cartesia, enabling fine-grained control over voice style, brand identity, and experiment fidelity.

 </div>

@@ -859,7 +897,7 @@
 - **Dataset Audio Evaluations Not Working:**  
  Fixed an issue where dataset audio evaluations would time out for large audio files. Evaluation throughput is now stable across large datasets.

 - **Fix Redundant Eval Mapping Issue in Run Test:**  
  Corrected redundant or inconsistent evaluation mappings to ensure inputs and outputs in Run Test match the expected configuration.

 </div>
@@ -872,11 +910,11 @@

 - **Show Reasoning Column in Simulate:** A reasoning column has been added to simulation results, allowing users to view the logic behind evaluation outcomes. This helps teams better interpret model decisions and debug unexpected behaviors.

 - **TraceAI LiveKit SDK Release:** Added support for tracing LiveKit-based agents, enabling visibility into audio events and voice interactions for improved debugging and analysis.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Workbench UI: Hover Tooltip Additions:**
  Hover-based tooltips have been added across the Workbench interface, providing contextual guidance and reducing confusion while navigating or editing prompts.

 - **General Bug Fixes in Simulate and Observe:**
  Resolved several platform stability issues: validation errors that blocked evaluation configurations from being saved, inconsistent filter behavior in prototype and project views caused by incorrect parameter formatting, and pagination problems on the User Dashboard. The result is more consistent and reliable performance across the platform.
@@ -894,7 +932,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **New TTS Model Integrations for Run Prompt and Experiments:**
  Added support for Cartesia, Hume, Neuphonic, and LMNT TTS models, expanding the range of available voices and synthesis characteristics.

 - **Enhanced Simulation Behaviors and Realism:**
  Simulation output now features more natural persona logic, frustration modeling, improved background noise handling, and smoother conversational transitions for more realistic interactions.
@@ -928,7 +966,7 @@
  The Usage Summary experience has been fully redesigned to provide detailed visibility into workspace-level activity. All API call logs across Traces, Observe, Simulation, and Error Analysis now include workspace attribution. A new cumulative usage API provides long-term consumption insights with improved cost and count tracking for financial clarity.

 - **New Agent Definition UX with Multi-Step Flow:**
  The Agent Definition workflow has been rebuilt into a guided three-step setup: Basic Information, Configuration, and Behaviour. The updated layout improves discoverability, adds a contextual resource panel, and introduces row-level table actions.

 - **Prompt Workbench Revamp:**
  The Workbench UI has been redesigned to simplify prompt version management and improve collaboration. Prompt versions now follow a commit-based history model, making it easier to review, compare, and maintain consistency across experiments.
@@ -1022,7 +1060,7 @@
 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Add Rows in Simulate Scenarios:** Scenario tables can now be expanded with maximum flexibility. Rows can be added manually for precision control, generated intelligently using AI for rapid test case creation, or imported directly from existing datasets to leverage historical data. This enhancement streamlines scenario building and dramatically reduces setup time for complex simulations.
 - **Run Evaluations for Completed Test Runs:** New evaluations can now be executed on already completed test runs without rerunning entire simulations, delivering significant time and cost savings. Users can select desired test runs via checkboxes, click Run Evals, and choose specific evaluations to execute. This targeted approach enables efficient resource utilization, faster iteration on evaluation metrics, and flexible experimentation with different criteria.
 - **Agent Definition Version Selection:** Specific Agent Definition Versions can now be selected when creating new test runs and directly from the test run details page. This enhancement provides greater control over testing workflows and ensures reproducibility across experiments, making version comparison seamless and reliable.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>
@@ -1037,15 +1075,15 @@
 <div class="mt-6 mb-3 text-lg font-semibold">What's New</div>
 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Voice Observability Through Vapi Integration:** Voice interactions are now fully observable within the platform. Assistant call logs from Vapi, including voice simulations, are automatically captured and displayed in your Observe project alongside other project data, enabling comprehensive monitoring and analysis of voice-based interactions.

 - **Eval Groups in Experiment and Optimization:** Evaluation groups can now be configured, created, and applied directly within Experiment and Optimization workflows. This integrated approach reduces workflow friction and accelerates the evaluation setup process.

 <div class="mt-6 mb-3 text-lg font-semibold">Bugs/Improvements</div>

 - **Media Visualization in Eval Playground:** Media columns now render actual image and audio content instead of raw URL strings, providing complete context and improved clarity in evaluation results.

 - **Accelerated Learning & Improved Accessibility:** Implemented a View Docs button across all major modules to streamline access to relevant documentation. Additionally, specific documentation links have been added directly to individual Evals, enabling quicker understanding and more efficient usage.

 - **Contextual Flow Analysis Display:** The interface has been streamlined by removing flow analysis views from dataset-based scenarios where they are not applicable, resulting in a cleaner and more intuitive user experience.

@@ -1061,7 +1099,7 @@

 <div class="mt-6 mb-3 text-lg font-semibold">Features</div>

 - **Simulate via SDK:** You can now simulate realistic, ultra-low-latency customer calls against your deployed LiveKit agents directly through the SDK. This update enables fully local testing without external dependencies, automatically records high-fidelity WAVs and transcripts over the WebRTC stream, and integrates with AI Evaluation for end-to-end performance evaluation. Developers gain full ownership and flexibility, with self-hosted control and customizable ASR, TTS, and model configurations, while cutting simulation costs by roughly 60–70%.

 - **Selective Test Rerun in Simulate:** Users now have precise control over simulation testing with the ability to rerun individual calls. You can choose to rerun the complete call with evaluations or re-execute evaluations independently, enabling targeted debugging and validation without requiring full test restarts.

@@ -1077,7 +1115,7 @@

 - **Evaluation Group Management:** Users can now configure and create evaluation groups directly from datasets and simulate, streamlining evaluation setup and saving time.

 - **Default Eval Groups:** Access preconfigured evaluation groups for use cases such as RAG and computer vision, saving time in evaluation setup.

 - **Advanced Simulation Management:** Test executions now auto-refresh with real-time data, giving users instant visibility into ongoing runs. Users can stop simulations at any point to prevent unnecessary calls and costs. Enhanced features include Visual Workflow Tracing to pinpoint agent deviations, Real-Time Test Control to efficiently manage test execution, and Comprehensive Performance Metrics (latency, interruption response time, etc.) for precise agent evaluation and optimization.

@@ -1107,7 +1145,7 @@

 - **Large Dataset Upload Stability:** Improved dataset upload experience by resolving loading issues for large CSV/JSON files, enhancing stability and user visibility.

 - **Custom Evaluation Editing Fixes:** Resolved bugs in the Evals Playground to ensure smoother and more reliable editing of custom evaluations.
 - **Group Evaluation UI/UX Improvements:** Refined the user interface and experience when editing group evaluations, making the process more intuitive and consistent.

 </div>
@@ -1119,14 +1157,14 @@
 <div class="mt-6 mb-3 text-lg font-semibold">What's New</div>

 **Features**
 - **Advanced Evaluation Group Management:** Streamline your evaluation workflows with comprehensive CRUD operations for evaluation groups. Create, view, edit, and delete evaluation groups seamlessly, then apply them directly to tasks and prompts for consistent scoring across your AI applications. Enhanced with intelligent popovers that display eval input details, LLM/Knowledge Base dependencies, and linked evaluations during the grouping process.
 - **Enhanced Call Management & Audio Controls:** Manage your voice AI testing with the completely revamped Call Details Drawer that displays associated scenarios for each test run. Features a sophisticated multi-channel audio player for separate visualization and playback of assistant and customer audio streams.
 - **Flexible Call Recording Downloads:** Export call recordings in multiple formats (Caller Audio, Agent Audio, Mono Audio, Stereo Audio) to match your analysis workflow requirements. Coupled with granular audio field selection in evaluations for precise control over which conversation segments to score and analyze.

 **Bugs/Improvements**
 - **Enhanced Collaboration Features:** Boost team productivity with collaborator support in prompts, allowing you to add and view team members working on specific prompts. Track prompt ownership with visible Created By fields and organize your work more efficiently with sorting capabilities for sample folders, prompts, and prompt templates.
 - **Annotation & Prompt Import Fixes in Dataset:** Enhanced annotation workflows by preventing empty label view selections and resolving prompt overflow issues in Run Experiment interfaces.
 - **Filter Issues for Evals Selection:** Fixed eval type filters in the evaluations drawer across the platform.

 </div>

@@ -1142,10 +1180,10 @@
 - **Enhanced Plans & Pricing Experience:** Navigate pricing options effortlessly with our completely redesigned pricing page featuring interactive plan comparison cards, a dynamic price calculator, and detailed plan breakdowns. The new design provides clear visibility into feature tiers and helps you make informed decisions about your subscription.

 **Bugs/Improvements**
 - **Enhanced Observability & Dashboard Accuracy:** Resolved filtering issues for User ID across User Details Dashboard and Observe sections. Improved project selector clarity in Observe Eval Task Drawer and fixed workspace-level OTEL trace creation issues for more reliable monitoring.
 - **UI/UX Enhancements:** Streamlined simulation flow interfaces for better user experience and standardized decimal precision across the platform (displaying 2 decimal places for all numeric values).
 - **Enhanced Data Visibility in Dataset Summary:** See exactly how many data points contributed to your summary results and evaluation metrics, for complete transparency.
 - **Code Snippet for Running Evals via SDK:** Copy ready-to-paste terminal commands from the code snippet on the evals playground to run any evaluation without manual configuration.
 - **Unified Design System:** Experience consistent interactions across the platform with our custom DatePicker component, ensuring a polished and cohesive user experience throughout your workflow.

 </div>
@@ -1157,11 +1195,11 @@
 <div class="mt-6 mb-3 text-lg font-semibold">What's New</div>

 **Features**
 - **Comprehensive Annotation Quality Dashboard:** Monitor annotation quality at scale with our centralized analytics dashboard. Track key metrics including annotator agreement rates, completion times, and advanced quality scores (cosine similarity, Pearson correlation, Fleiss' kappa) to ensure your training data meets the highest standards.
 - **Enterprise-Grade Multi-Workspace Security:** Deploy with confidence using our complete RBAC framework. Create isolated workspaces, manage team members with full CRUD capabilities (edit, deactivate, resend invitations), and implement role-based access controls that scale with your organization's security requirements.
 - **Advanced Observability with Feed Insights:** Gain unprecedented visibility into agent performance with the new Feed Insights tab in the Observe section. Identify failed stages, affected spans, view error cluster events, track user counts, and analyze trend data over time for rapid issue diagnosis and agent optimization.
 - **Intelligent Onboarding Navigation:** Experience streamlined onboarding with our redesigned sidebar that prominently highlights the 'Get Started' section until all 7 onboarding steps are completed. This ensures new users follow a structured path to success before transitioning to the regular navigation experience.
 - **No Config Evals – Agent Compass for AI Teams:** AI agent developers often struggle to identify performance bottlenecks and system failures across complex execution flows. Traditional evaluation methods and system metrics offer only fragmented, span-level visibility, leaving teams blind to the bigger picture. As a result, diagnosing latency spikes, inefficient prompts, or tool-call failures becomes a time-consuming, manual process. Without actionable, trace-level insights, performance optimization turns reactive, error-prone, and expensive.
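
For readers unfamiliar with the quality scores named in the Annotation Quality Dashboard item above, here is a minimal Python sketch of the textbook definitions of cosine similarity, Pearson correlation, and Fleiss' kappa. The release notes do not say how the dashboard computes these, so the function names and input shapes below are assumptions for illustration, not the platform's implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson's r, computed as the cosine similarity of mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = annotators who gave item i label j.

    Every row must sum to the same number of annotators (at least 2).
    Undefined (division by zero) when all annotations use a single label.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_labels = len(counts[0])
    # Observed agreement: mean fraction of agreeing annotator pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each label.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    p_expected = sum(p * p for p in shares)
    return (p_observed - p_expected) / (1 - p_expected)
```

As a quick check, two items labeled unanimously but differently by three annotators, `fleiss_kappa([[3, 0], [0, 3]])`, give a kappa of 1.0 (perfect agreement beyond chance).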

 **Bugs/Improvements**
 - **Improved Observability Reliability:** Enhanced backend resilience for incomplete span creation scenarios and fixed issues when OpenTelemetry exports fail partially, ensuring complete trace visibility.
@@ -1174,7 +1212,7 @@
  #### What's New

  **Features**
  - **Add Rows in Evals Tab of Prompt Workbench:** Instantly add new rows with variable values in the evaluations screen, allowing you to generate outputs and evaluate without returning to the Prompt Workbench homepage.
  - **Trace Linked to Prompt Workbench:** View comprehensive performance metrics (latency, cost, tokens, evaluation metrics) for each prompt version linked to traces (and spans) across development, staging, and production environments via the Metrics section in Prompt Workbench.
  - **Critical Issue Detection & Mitigation Advice on Datasets:** Get actionable, AI-powered insights with recommendations to improve your agent's performance and accelerate your path to production.
  - **Access FAGI from AWS Marketplace:** Sign up or sign in to the FAGI platform via AWS Marketplace and leverage AWS contracts and billing to work with FAGI.
@@ -1190,7 +1228,7 @@
  - **Support for Video in Trace Detail Screen:** Added support for viewing videos in the Trace Details screen.
  - **Fixed Scroll Issue in Agent Description Box (Simulation):** Enabled scroll functionality via mouse in the agent description box within the simulation module.
  - **Error Handling on Simulation Page:** Improved error handling for low credit balances on the simulation homepage to enhance user experience.
  - **Credit Utilization for Error Localizer:** Added visibility of credit utilization for the error localizer in the usage summary screen.
 </div>

 ## Week of 2025-08-19
@@ -1200,17 +1238,17 @@

  **Features**
  - **Comparison Summary:** Compare evaluations and prompt summaries of two datasets, now with detailed graphs and scores.
  - **Function Evals:** Enable adding and editing function-type custom evals from the list of evals supported by Future AGI.
  - **Edit Synthetic Dataset:** Edit existing synthetic datasets directly or create a new version from changes.
  - **Document Column Support in Dataset:** New document column type to upload/store files in cells (TXT, DOC, DOCX, PDF).
  - **User Tab in Dashboard and Observe:** Searchable, filterable user list and detailed user view with metrics, interactive charts, synced time filters, and traces/sessions tabs.
  - **Displaying the Timestamp Column in Trace/Spans:** Added Start Time and End Time columns in Observe → LLM Tracing and Prototype → All Runs → Run Details.
  - **Configure Labels:** Configure system and custom labels per prompt version in Prompt Management.
  - **Async Evals via SDK:** Run evaluation asynchronously for long-running evaluations or larger datasets.

  **Bugs/Improvements**
  - SDK code snippets: Updated the SDK code snippets for columns and rows on the create dataset, add rows, and dataset landing pages.
  - Fixed an editing issue in the custom evals form: an incorrect config was displayed on the evals page for function evals.
  - Fixed the bottom section of the trace detail drawer disappearing: dragging the bottom section caused the entire bottom area to disappear.
  - UI screen optimization for different screen sizes.
  - Bug fixes for updates summary screen - color, text, and font alignment.
@@ -1241,12 +1279,12 @@

  **Features**
  - **Voice Simulation:** New testing infrastructure that deploys AI agents to conduct real conversations with your voice systems, analyzing actual audio, not just transcripts.
  - **Edit Evals Config:** Edit the config (prompt/criteria) of your custom evals in the evals playground. Adding new variables is not supported.

  **Bugs/Improvements**
  - Bug fix for dynamic column creation via Weaviate.
  - Reduced dependencies for TraceAI packages (HTTPS & gRPC).
  - Automated eval refinement: Retune your evals in evals playground by providing feedback.
  - Markdown now available as a default option for improved readability.
  - Support for video (traces and spans) in Observe project.
 </div>
@@ -1257,9 +1295,9 @@
 #### What's New

  **Features**
  - **Edit, Duplicate, and Delete Custom Evals:** Duplicate, edit, or delete evaluations when they are no longer in use or their logic is outdated.
  - **Bulk Annotation/User Feedback:** Bulk annotate your Observe traces with user feedback directly using the API or SDK.
  - **JSON View for Evals Log:** Access evals log data in JSON format in evals playground.

  **Bugs/Improvements**
  - Span name visibility in traces for Observe and Prototype.
@@ -1285,14 +1323,14 @@
  - Fixed audio icon.
  - Error handling for network connection issues.
  - Bug fixes for prompt workbench versioning issues.
  - Changed the color mapping for deterministic type evals.
  - Updated loaders for evals playground.
  - Pagination fix in Observe.
  - Added clear functionality in add to dataset column mapping fields in Observe.
  - Clear graph property when Observe changes; fixed thumbs down icon not rendering.
  - Generate variable bug fix in prompt workbench.
  - Experiment page break on content tab switch.
  - Fixed the created_at 30-day filter on evals log section.
 </div>

 ## Week of 2025-07-14
@@ -1301,7 +1339,7 @@
  #### What's New

  **Bugs/Improvements**
  - Prevented overscroll in X direction for entire platform.
  - Glitch after refreshing while generating sample data.
  - Error message update for doc uploads and save button status for doc upload.
  - Variable auto-population issue in compare prompt for multiple versions.
@@ -1338,10 +1376,10 @@
  **Bugs/Improvements**
  - Improved error handling in prompt versioning.
  - Removed unnecessary keys from evaluation outputs.
  - Better mapping of required keys to column names in `add_evaluation` for datasets.
  - Removed TraceAI code from FutureAGI SDK - experiment rerun fix.
  - SSO login issues.
  - Eval ranking fixes.
  - Fixed sizing and view issue in dataset when row size is adjusted.
  - Fixed sidebar item not showing active style when child page is active globally.
  - Edit integer type has red background in edit field.
@@ -1361,25 +1399,25 @@
  - Prototype / Configure project.
  - Trace details view for Observe/Prototype.
  - Allow search in dataset.
  - Run insights view - evals (deployed without the error modal part).
  - Improved user flow for synthetic data creation with "best practices" for each input.
  - Add to dataset flow from Prototype.
  - API for Gmail account signup.
  - Enabling search within data.
  - First-time user experience walkthrough for newly onboarded users.
  - Quick filters for annotations view in Prototype and Observe.
  - Compare runs in Prototype.
  - Diff view for compare dataset.
  - Enhancement of Observe and Prototype.
  - Addition of new evals for audio - conversational and completeness evals.

  **Bugs/Improvements**
  - New choice for Tone Eval if none of the choices are suitable.
  - Bug on experiment view.
  - UI/UX bugs - knowledge base and audio support for evals.
  - Required input field column detail not showing on Audio Quality evals.
  - UX changes for loader of plan screen.
  - Changed the color and the percentage of the eval chips in experiment.
 </div>

 ## Week of 2025-04-21
@@ -1396,11 +1434,11 @@
  **Bugs/Improvements**
  - Better UI when a rate limit is hit.
  - Audio and knowledge base bug fixes.
  - Improved wrong evals view.
  - Fixes in compare dataset.
  - Changed the logo URL.
  - Filter issue fixed in Prototype.
  - Rate limit error message now prompts users to upgrade the plan.
  - Optimized experiments under datasets to run faster.
  - Hugging Face error handling for different datasets.
 </div>