
GML-2076 router improvement #37

Open

chengbiao-jin wants to merge 26 commits into release_1.4.0 from GML-2076-Router-Improvement

Conversation

chengbiao-jin (Collaborator) commented May 6, 2026

PR Type

Enhancement, Documentation, Tests


Description

  • Add auto retriever selection flow

    • Rule + LLM selector fallback
    • Default chat mode becomes auto
    • Track selection telemetry metadata
  • Add trace logging and trace UI

    • Persist per-message backend traces
    • Expose authenticated trace endpoint
    • Show node timing and token usage
  • Improve answer safety and ingestion

    • Short-circuit hallucinations when retrieval returns nothing (see the sketch after this list)
    • Support Excel ingestion and robust CSV decoding
    • Warn about unsupported ingest files
  • Update docs, version, and CI

    • Add tuning guideline and changelog
    • Bump release to 1.4.0
    • Skip builds for doc-only pushes
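
The "no info" guard mentioned above is simple in shape: if the chosen retriever returns nothing, skip generation and answer honestly rather than letting the model improvise. A minimal Python sketch of that short-circuit (all names here are illustrative; the real logic lives in agent_graph.py):

# Sketch of the empty-retrieval guard; every name here is illustrative.
from typing import Callable

NO_INFO_ANSWER = "I could not find any information about that in the indexed documents."

def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict:
    chunks = retrieve(question)
    if not chunks:
        # Never call the LLM with an empty context, so it cannot
        # hallucinate an answer from nothing.
        return {"natural_language_response": NO_INFO_ANSWER, "answered_question": False}
    return {"natural_language_response": generate(question, chunks), "answered_question": True}

# Usage with stub callables:
print(answer_question("unknown topic", lambda q: [], lambda q, c: "grounded answer"))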

Diagram Walkthrough

flowchart LR
  UI["Chat UI defaults to Auto"] 
  Router["UI router saves trace logs"] 
  Selector["RetrieverSelector chooses method"] 
  Agent["Agent records steps and token usage"] 
  Response["Response includes retriever metadata"] 
  TracePage["Trace Logs page renders trace details"] 
  Metrics["Prometheus selection counter"] 
  Docs["Docs and CI updates"]

  UI -- "sends auto retriever" --> Selector
  Selector -- "sets chosen method" --> Agent
  Agent -- "embeds metadata and usage" --> Response
  Response -- "saved per message" --> Router
  Router -- "served via trace API" --> TracePage
  Selector -- "increments" --> Metrics
  Response -- "documented in" --> Docs

File Walkthrough

Relevant files

Enhancement (14 files)

| File | Description | Changes |
|------|-------------|---------|
| ui.py | Save and serve per-message trace logs | +65/-4 |
| agent_graph.py | Add auto retriever selection and OOC guard | +113/-11 |
| method_selector.py | Implement rule-based and LLM retriever chooser | +251/-0 |
| agent.py | Capture per-node traces and token totals | +117/-5 |
| base_llm.py | Add context-local LLM usage collection helpers | +61/-0 |
| text_extractors.py | Improve CSV and Excel text extraction | +42/-3 |
| prometheus_metrics.py | Add retriever method selection metric | +5/-0 |
| TraceLogs.tsx | Add full trace logs inspection page | +947/-0 |
| CustomChatMessage.tsx | Show retriever badge and open traces | +52/-0 |
| ActionProvider.tsx | Preserve last user query for traces | +7/-1 |
| Interact.tsx | Add role-gated View Trace action | +23/-8 |
| Bot.tsx | Default chat retrieval mode to Auto | +5/-4 |
| IngestGraph.tsx | Warn about skipped and spreadsheet uploads | +26/-0 |
| main.tsx | Register authenticated trace page route | +6/-0 |

Tests (1 file)

| File | Description | Changes |
|------|-------------|---------|
| test_method_selector.py | Cover selector rules, LLM path, fallback | +292/-0 |

Documentation (2 files)

| File | Description | Changes |
|------|-------------|---------|
| README.md | Add tuning guideline and release updates | +117/-2 |
| CHANGELOG.md | Document `1.4.0` retrieval improvements | +10/-0 |

Dependencies (2 files)

| File | Description | Changes |
|------|-------------|---------|
| package.json | Add UI packages for trace page support | +3/-3 |
| requirements.txt | Add Excel parsing runtime dependencies | +2/-0 |

Configuration changes (7 files)

| File | Description | Changes |
|------|-------------|---------|
| nginx.conf | Proxy trace UI routes through nginx | +8/-0 |
| docker-compose.yml | Mount persistent trace logs volume | +1/-0 |
| cloud-build-nightly.yaml | Skip nightly cloud builds on docs changes | +9/-0 |
| cloud-build-deploy-ci.yaml | Skip cloud CI builds on docs changes | +9/-0 |
| onprem-build-nightly.yaml | Skip nightly on-prem builds on docs changes | +9/-0 |
| onprem-build.yaml | Skip on-prem builds on docs changes | +9/-0 |
| VERSION | Bump project version to `1.4.0` | +1/-1 |

chengbiao-jin and others added 23 commits May 6, 2026 11:56
Adds an "Auto" mode that picks among the four search methods per
question using deterministic rules first, falling back to an LLM call.
Manual method selection still works as a transitional override.
Surfaces the auto-selection in the UI, adds Prometheus telemetry for
the selector's distribution, and short-circuits to an honest "no info"
answer when the chosen retriever returns nothing.
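
As a rough illustration of that rule-first, LLM-fallback flow (the rule patterns, method names, and llm_select helper below are assumptions for illustration, not the actual contents of method_selector.py):

# Illustrative rule-first selector with an LLM fallback; names are assumed.
import re

METHODS = ("vector_search", "db_search", "hybrid_search", "graph_search")

RULES = [
    (re.compile(r"\b(count|how many|average|sum)\b", re.I), "db_search"),
    (re.compile(r"\b(related to|connected to|path between)\b", re.I), "graph_search"),
]

def select_method(question: str, llm_select=None) -> tuple[str, str]:
    """Return (method, selected_by) so telemetry can track who decided."""
    for pattern, method in RULES:
        if pattern.search(question):
            return method, "rule"
    if llm_select is not None:
        try:
            choice = llm_select(question)  # one cheap LLM call
            if choice in METHODS:
                return choice, "llm"
        except Exception:
            pass  # fall through to the default on any LLM failure
    return "hybrid_search", "default"

The selector's distribution can then feed a labeled prometheus_client counter; a hypothetical shape of the metric added in prometheus_metrics.py:

# Hypothetical shape of the selection metric; the real name may differ.
from prometheus_client import Counter

RETRIEVER_SELECTIONS = Counter(
    "retriever_method_selections_total",
    "Distribution of auto-selected retriever methods",
    ["method", "selected_by"],
)
RETRIEVER_SELECTIONS.labels(*select_method("How many orders shipped?")).inc()
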
tg-pr-agent Bot commented May 6, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns

Path traversal:
The /ui/trace/{message_id} handler concatenates message_id into a file path and reads it without validation. If message_id can contain ../ or path separators, an authenticated caller may be able to access unintended files.

Sensitive information exposure: Trace files now store user_query, natural_language_response, query_sources, and detailed agent_steps, including serialized node inputs/outputs and token usage. Depending on what the conversation context, schema_mapping, retrieval results, or prompts contain, this may expose internal data or user content through both disk persistence and the trace API. Review retention, access scope, redaction, and whether message IDs are guessable.
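
For example, a redaction pass over the trace payload before it is written could look like this sketch (the key list and placeholder are assumptions; the PR currently persists these fields unredacted):

# Hypothetical redaction pass applied before json.dump in _save_trace_log.
SENSITIVE_KEYS = {"user_query", "natural_language_response", "schema_mapping"}

def redact(obj, placeholder="[REDACTED]"):
    if isinstance(obj, dict):
        return {k: placeholder if k in SENSITIVE_KEYS else redact(v, placeholder)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v, placeholder) for v in obj]
    return obj

print(redact({"user_query": "who is on call?", "query_sources": {"schema_mapping": {"t": "cols"}}}))
# {'user_query': '[REDACTED]', 'query_sources': {'schema_mapping': '[REDACTED]'}}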

⚡ Recommended focus areas for review

Path Traversal

The trace endpoint builds a filesystem path directly from message_id and does not validate or sanitize it before opening the file. A crafted identifier containing path separators could allow reading arbitrary JSON files outside the trace log directory.

@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)
Sensitive Traces

Trace logs persist full user queries, model responses, and serialized agent node inputs/outputs to disk and expose them through an API. This should be reviewed carefully for privacy, retention, and whether internal prompts, schema details, or conversation context can be unintentionally surfaced to privileged users.

def _save_trace_log(message_id: str, conversation_id: str, user_query: str, resp: GraphRAGResponse, elapsed: float):
    try:
        os.makedirs(TRACE_LOGS_DIR, exist_ok=True)
        _cleanup_old_traces()

        # Strip chunk text from query_sources to keep trace files small.
        # final_retrieval contains the full text of every retrieved chunk.
        query_sources = dict(resp.query_sources) if resp.query_sources else {}
        result = query_sources.get("result")
        if isinstance(result, dict) and "final_retrieval" in result:
            result = {k: v for k, v in result.items() if k != "final_retrieval"}
            query_sources = {**query_sources, "result": result}

        trace_data = {
            "message_id": message_id,
            "conversation_id": conversation_id,
            "user_query": user_query,
            "response_time": elapsed,
            "response_type": resp.response_type,
            "answered_question": resp.answered_question,
            "query_sources": query_sources,
            "natural_language_response": resp.natural_language_response,
            "timestamp": time.time(),
        }
        filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
        with open(filepath, "w") as f:
            json.dump(trace_data, f, default=str)
State Leakage

LLM usage collection is started per request and reset per node, but there is no visible cleanup at the end of execution or on exceptions. If the same execution context is reused, collected usage could bleed across requests or leave stale state behind, which would distort telemetry and trace data.

# Start collecting LLM usage so we can attribute tokens/cost per node.
start_usage_collection()

for output in self.agent.stream({"question": input_data["input"], "conversation": input_data["conversation"]}):

    for key, value in output.items():
        step_end = time.time()
        step_duration = round(step_end - step_start, 3)

        # Grab usage accumulated during this node and reset for next node.
        node_usage = get_collected_usage() or []
        input_tokens = sum(int(u.get("input_tokens", 0) or 0) for u in node_usage)
        output_tokens = sum(int(u.get("output_tokens", 0) or 0) for u in node_usage)
        total_tokens = sum(int(u.get("total_tokens", 0) or 0) for u in node_usage)
        cost = sum(float(u.get("cost", 0) or 0) for u in node_usage)

        agent_steps.append({
            "node": key,
            "duration_s": step_duration,
            "input": _safe(prev_state),
            "output": _node_output(key, value),
            "usage": {
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "total_tokens": total_tokens,
                "cost": cost,
                "calls": [
                    {
                        "caller_name": u.get("caller_name"),
                        "input_tokens": u.get("input_tokens", 0),
                        "output_tokens": u.get("output_tokens", 0),
                        "total_tokens": u.get("total_tokens", 0),
                        "cost": u.get("cost", 0),
                    }
                    for u in node_usage
                ],
            },
        })
        # Reset the collector for the next node.
        start_usage_collection()

        prev_state = value
        LogWriter.info(
            f"request_id={req_id_cv.get()} executed node {key} ({step_duration}s)"
        )
        step_start = step_end

# Backfill entry with routing decision
if len(agent_steps) >= 2 and agent_steps[0]["node"] == "entry":
    next_node = agent_steps[1]["node"]
    _ROUTE_LABELS = {"supportai": "vector_search", "map_question_to_schema": "db_search", "lookup_history": "history_lookup"}
    agent_steps[0]["output"] = _safe({"routing_decision": _ROUTE_LABELS.get(next_node, next_node)})

if value["answer"].query_sources is None:
    value["answer"].query_sources = {}
value["answer"].query_sources["agent_steps"] = agent_steps

# Aggregate total LLM usage across all nodes for the Token Overview UI.
total_usage = {
    "input_tokens": sum(int(s.get("usage", {}).get("input_tokens", 0) or 0) for s in agent_steps),
    "output_tokens": sum(int(s.get("usage", {}).get("output_tokens", 0) or 0) for s in agent_steps),
    "total_tokens": sum(int(s.get("usage", {}).get("total_tokens", 0) or 0) for s in agent_steps),
    "cost": sum(float(s.get("usage", {}).get("cost", 0) or 0) for s in agent_steps),
}
value["answer"].query_sources["token_usage"] = total_usage

Comment on lines +387 to +396
@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

Suggestion: The message_id is interpolated directly into a filesystem path, so a crafted value like ../../../tmp/x can escape TRACE_LOGS_DIR and read arbitrary JSON files. Restrict message_id to a safe filename format and verify the resolved path stays under TRACE_LOGS_DIR before opening it. [security, importance: 9]

Suggested change
@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    if not re.fullmatch(r"[A-Za-z0-9_-]+", message_id):
        raise HTTPException(status_code=400, detail="Invalid message_id")
    base_dir = os.path.abspath(TRACE_LOGS_DIR)
    filepath = os.path.abspath(os.path.join(base_dir, f"{message_id}.json"))
    if os.path.commonpath([base_dir, filepath]) != base_dir:
        raise HTTPException(status_code=400, detail="Invalid message_id")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

Comment on lines +739 to +749
const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;

const trace = useMemo(
  () => buildTraceFromMessage(message, userQuery),
  [message, userQuery]

Suggestion: When the fetch fails or no trace exists, message becomes null but buildTraceFromMessage still runs and renders a fake "completed" trace with "N/A" content. Guard against missing data and show an explicit not-found/error state instead, so users do not see misleading trace details. [possible issue, importance: 7]

Suggested change
const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;
const trace = useMemo(
  () => buildTraceFromMessage(message, userQuery),
  [message, userQuery]

const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;
const trace = useMemo(
  () => (message ? buildTraceFromMessage(message, userQuery) : null),
  [message, userQuery]
);
if (!loading && !trace) {
  return (
    <div className="min-h-screen bg-background flex items-center justify-center">
      <p className="text-muted-foreground">Trace data not found.</p>
    </div>
  );
}

Comment on lines +640 to +665
elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        # Detect header row: first row is all non-empty strings with
        # no purely numeric values → treat as column names.
        first_row = df.iloc[0]
        if all(isinstance(v, str) and v.strip() for v in first_row):
            df.columns = first_row.tolist()
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"

Suggestion: The header detection treats any non-empty string first row as headers, so ordinary string-valued first records are dropped from the extracted text. Use a stricter heuristic, such as requiring unique column names and at least one subsequent data row, before removing the first row. [possible issue, importance: 6]

Suggested change
elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        # Detect header row: first row is all non-empty strings with
        # no purely numeric values → treat as column names.
        first_row = df.iloc[0]
        if all(isinstance(v, str) and v.strip() for v in first_row):
            df.columns = first_row.tolist()
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"

elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        first_row = df.iloc[0]
        first_row_values = [str(v).strip() for v in first_row]
        looks_like_header = (
            len(df) > 1
            and all(first_row_values)
            and len(set(first_row_values)) == len(first_row_values)
            and not any(v.isdigit() for v in first_row_values)
        )
        if looks_like_header:
            df.columns = first_row_values
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"
