
GML-2076 router improvement #37

Open

chengbiao-jin wants to merge 26 commits into release_1.4.0 from GML-2076-Router-Improvement

Conversation

chengbiao-jin (Collaborator) commented May 6, 2026

PR Type

Enhancement, Documentation, Tests


Description

  • Add auto retriever selection flow

    • Rule + LLM selector fallback
    • Default chat mode becomes auto
    • Track selection telemetry metadata
  • Add trace logging and trace UI

    • Persist per-message backend traces
    • Expose authenticated trace endpoint
    • Show node timing and token usage
  • Improve answer safety and ingestion

    • Short-circuit hallucinations when retrieval returns nothing (see the sketch after this list)
    • Support Excel ingestion and robust CSV decoding
    • Warn about unsupported ingest files
  • Update docs, version, and CI

    • Add tuning guideline and changelog
    • Bump release to 1.4.0
    • Skip builds for doc-only pushes
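
The "no info" guard mentioned above is simple in shape: if the chosen retriever returns nothing, skip generation and answer honestly rather than letting the model improvise. A minimal Python sketch of that short-circuit (all names here are illustrative; the real logic lives in agent_graph.py):

# Sketch of the empty-retrieval guard; every name here is illustrative.
from typing import Callable

NO_INFO_ANSWER = "I could not find any information about that in the indexed documents."

def answer_question(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict:
    chunks = retrieve(question)
    if not chunks:
        # Never call the LLM with an empty context, so it cannot
        # hallucinate an answer from nothing.
        return {"natural_language_response": NO_INFO_ANSWER, "answered_question": False}
    return {"natural_language_response": generate(question, chunks), "answered_question": True}

# Usage with stub callables:
print(answer_question("unknown topic", lambda q: [], lambda q, c: "grounded answer"))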

Diagram Walkthrough

flowchart LR
  UI["Chat UI defaults to Auto"] 
  Router["UI router saves trace logs"] 
  Selector["RetrieverSelector chooses method"] 
  Agent["Agent records steps and token usage"] 
  Response["Response includes retriever metadata"] 
  TracePage["Trace Logs page renders trace details"] 
  Metrics["Prometheus selection counter"] 
  Docs["Docs and CI updates"]

  UI -- "sends auto retriever" --> Selector
  Selector -- "sets chosen method" --> Agent
  Agent -- "embeds metadata and usage" --> Response
  Response -- "saved per message" --> Router
  Router -- "served via trace API" --> TracePage
  Selector -- "increments" --> Metrics
  Response -- "documented in" --> Docs

File Walkthrough

Relevant files

Enhancement (14 files)

| File | Description | Changes |
|------|-------------|---------|
| ui.py | Save and serve per-message trace logs | +65/-4 |
| agent_graph.py | Add auto retriever selection and OOC guard | +113/-11 |
| method_selector.py | Implement rule-based and LLM retriever chooser | +251/-0 |
| agent.py | Capture per-node traces and token totals | +117/-5 |
| base_llm.py | Add context-local LLM usage collection helpers | +61/-0 |
| text_extractors.py | Improve CSV and Excel text extraction | +42/-3 |
| prometheus_metrics.py | Add retriever method selection metric | +5/-0 |
| TraceLogs.tsx | Add full trace logs inspection page | +947/-0 |
| CustomChatMessage.tsx | Show retriever badge and open traces | +52/-0 |
| ActionProvider.tsx | Preserve last user query for traces | +7/-1 |
| Interact.tsx | Add role-gated View Trace action | +23/-8 |
| Bot.tsx | Default chat retrieval mode to Auto | +5/-4 |
| IngestGraph.tsx | Warn about skipped and spreadsheet uploads | +26/-0 |
| main.tsx | Register authenticated trace page route | +6/-0 |

Tests (1 file)

| File | Description | Changes |
|------|-------------|---------|
| test_method_selector.py | Cover selector rules, LLM path, fallback | +292/-0 |

Documentation (2 files)

| File | Description | Changes |
|------|-------------|---------|
| README.md | Add tuning guideline and release updates | +117/-2 |
| CHANGELOG.md | Document `1.4.0` retrieval improvements | +10/-0 |

Dependencies (2 files)

| File | Description | Changes |
|------|-------------|---------|
| package.json | Add UI packages for trace page support | +3/-3 |
| requirements.txt | Add Excel parsing runtime dependencies | +2/-0 |

Configuration changes (7 files)

| File | Description | Changes |
|------|-------------|---------|
| nginx.conf | Proxy trace UI routes through nginx | +8/-0 |
| docker-compose.yml | Mount persistent trace logs volume | +1/-0 |
| cloud-build-nightly.yaml | Skip nightly cloud builds on docs changes | +9/-0 |
| cloud-build-deploy-ci.yaml | Skip cloud CI builds on docs changes | +9/-0 |
| onprem-build-nightly.yaml | Skip nightly on-prem builds on docs changes | +9/-0 |
| onprem-build.yaml | Skip on-prem builds on docs changes | +9/-0 |
| VERSION | Bump project version to `1.4.0` | +1/-1 |

chengbiao-jin and others added 23 commits May 6, 2026 11:56
Adds an "Auto" mode that picks among the four search methods per
question using deterministic rules first, falling back to an LLM call.
Manual method selection still works as a transitional override.
Surfaces the auto-selection in the UI, adds Prometheus telemetry for
the selector's distribution, and short-circuits to an honest "no info"
answer when the chosen retriever returns nothing.
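
As a rough illustration of that rule-first, LLM-fallback flow (the rule patterns, method names, and llm_select helper below are assumptions for illustration, not the actual contents of method_selector.py):

# Illustrative rule-first selector with an LLM fallback; names are assumed.
import re

METHODS = ("vector_search", "db_search", "hybrid_search", "graph_search")

RULES = [
    (re.compile(r"\b(count|how many|average|sum)\b", re.I), "db_search"),
    (re.compile(r"\b(related to|connected to|path between)\b", re.I), "graph_search"),
]

def select_method(question: str, llm_select=None) -> tuple[str, str]:
    """Return (method, selected_by) so telemetry can track who decided."""
    for pattern, method in RULES:
        if pattern.search(question):
            return method, "rule"
    if llm_select is not None:
        try:
            choice = llm_select(question)  # one cheap LLM call
            if choice in METHODS:
                return choice, "llm"
        except Exception:
            pass  # fall through to the default on any LLM failure
    return "hybrid_search", "default"

The selector's distribution can then feed a labeled prometheus_client counter; a hypothetical shape of the metric added in prometheus_metrics.py:

# Hypothetical shape of the selection metric; the real name may differ.
from prometheus_client import Counter

RETRIEVER_SELECTIONS = Counter(
    "retriever_method_selections_total",
    "Distribution of auto-selected retriever methods",
    ["method", "selected_by"],
)
RETRIEVER_SELECTIONS.labels(*select_method("How many orders shipped?")).inc()
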
tg-pr-agent Bot commented May 6, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns

Path traversal:
The /ui/trace/{message_id} handler concatenates message_id into a file path and reads it without validation. If message_id can contain ../ or path separators, an authenticated caller may be able to access unintended files.

Sensitive information exposure: Trace files now store user_query, natural_language_response, query_sources, and detailed agent_steps, including serialized node inputs/outputs and token usage. Depending on what the conversation context, schema_mapping, retrieval results, or prompts contain, this may expose internal data or user content through both disk persistence and the trace API. Review retention, access scope, redaction, and whether message IDs are guessable.
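
For example, a redaction pass over the trace payload before it is written could look like this sketch (the key list and placeholder are assumptions; the PR currently persists these fields unredacted):

# Hypothetical redaction pass applied before json.dump in _save_trace_log.
SENSITIVE_KEYS = {"user_query", "natural_language_response", "schema_mapping"}

def redact(obj, placeholder="[REDACTED]"):
    if isinstance(obj, dict):
        return {k: placeholder if k in SENSITIVE_KEYS else redact(v, placeholder)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v, placeholder) for v in obj]
    return obj

print(redact({"user_query": "who is on call?", "query_sources": {"schema_mapping": {"t": "cols"}}}))
# {'user_query': '[REDACTED]', 'query_sources': {'schema_mapping': '[REDACTED]'}}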

⚡ Recommended focus areas for review

Path Traversal

The trace endpoint builds a filesystem path directly from message_id and does not validate or sanitize it before opening the file. A crafted identifier containing path separators could allow reading arbitrary JSON files outside the trace log directory.

@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)
Sensitive Traces

Trace logs persist full user queries, model responses, and serialized agent node inputs/outputs to disk and expose them through an API. This should be reviewed carefully for privacy, retention, and whether internal prompts, schema details, or conversation context can be unintentionally surfaced to privileged users.

def _save_trace_log(message_id: str, conversation_id: str, user_query: str, resp: GraphRAGResponse, elapsed: float):
    try:
        os.makedirs(TRACE_LOGS_DIR, exist_ok=True)
        _cleanup_old_traces()

        # Strip chunk text from query_sources to keep trace files small.
        # final_retrieval contains the full text of every retrieved chunk.
        query_sources = dict(resp.query_sources) if resp.query_sources else {}
        result = query_sources.get("result")
        if isinstance(result, dict) and "final_retrieval" in result:
            result = {k: v for k, v in result.items() if k != "final_retrieval"}
            query_sources = {**query_sources, "result": result}

        trace_data = {
            "message_id": message_id,
            "conversation_id": conversation_id,
            "user_query": user_query,
            "response_time": elapsed,
            "response_type": resp.response_type,
            "answered_question": resp.answered_question,
            "query_sources": query_sources,
            "natural_language_response": resp.natural_language_response,
            "timestamp": time.time(),
        }
        filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
        with open(filepath, "w") as f:
            json.dump(trace_data, f, default=str)
State Leakage

LLM usage collection is started per request and reset per node, but there is no visible cleanup at the end of execution or on exceptions. If the same execution context is reused, collected usage could bleed across requests or leave stale state behind, which would distort telemetry and trace data.

# Start collecting LLM usage so we can attribute tokens/cost per node.
start_usage_collection()

for output in self.agent.stream({"question": input_data["input"], "conversation": input_data["conversation"]}):

    for key, value in output.items():
        step_end = time.time()
        step_duration = round(step_end - step_start, 3)

        # Grab usage accumulated during this node and reset for next node.
        node_usage = get_collected_usage() or []
        input_tokens = sum(int(u.get("input_tokens", 0) or 0) for u in node_usage)
        output_tokens = sum(int(u.get("output_tokens", 0) or 0) for u in node_usage)
        total_tokens = sum(int(u.get("total_tokens", 0) or 0) for u in node_usage)
        cost = sum(float(u.get("cost", 0) or 0) for u in node_usage)

        agent_steps.append({
            "node": key,
            "duration_s": step_duration,
            "input": _safe(prev_state),
            "output": _node_output(key, value),
            "usage": {
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "total_tokens": total_tokens,
                "cost": cost,
                "calls": [
                    {
                        "caller_name": u.get("caller_name"),
                        "input_tokens": u.get("input_tokens", 0),
                        "output_tokens": u.get("output_tokens", 0),
                        "total_tokens": u.get("total_tokens", 0),
                        "cost": u.get("cost", 0),
                    }
                    for u in node_usage
                ],
            },
        })
        # Reset the collector for the next node.
        start_usage_collection()

        prev_state = value
        LogWriter.info(
            f"request_id={req_id_cv.get()} executed node {key} ({step_duration}s)"
        )
        step_start = step_end

# Backfill entry with routing decision
if len(agent_steps) >= 2 and agent_steps[0]["node"] == "entry":
    next_node = agent_steps[1]["node"]
    _ROUTE_LABELS = {"supportai": "vector_search", "map_question_to_schema": "db_search", "lookup_history": "history_lookup"}
    agent_steps[0]["output"] = _safe({"routing_decision": _ROUTE_LABELS.get(next_node, next_node)})

if value["answer"].query_sources is None:
    value["answer"].query_sources = {}
value["answer"].query_sources["agent_steps"] = agent_steps

# Aggregate total LLM usage across all nodes for the Token Overview UI.
total_usage = {
    "input_tokens": sum(int(s.get("usage", {}).get("input_tokens", 0) or 0) for s in agent_steps),
    "output_tokens": sum(int(s.get("usage", {}).get("output_tokens", 0) or 0) for s in agent_steps),
    "total_tokens": sum(int(s.get("usage", {}).get("total_tokens", 0) or 0) for s in agent_steps),
    "cost": sum(float(s.get("usage", {}).get("cost", 0) or 0) for s in agent_steps),
}
value["answer"].query_sources["token_usage"] = total_usage

Comment on lines +387 to +396
@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

Suggestion: The message_id is interpolated directly into a filesystem path, so a crafted value like ../../../tmp/x can escape TRACE_LOGS_DIR and read arbitrary JSON files. Restrict message_id to a safe filename format and verify the resolved path stays under TRACE_LOGS_DIR before opening it. [security, importance: 9]

Suggested change
@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    filepath = os.path.join(TRACE_LOGS_DIR, f"{message_id}.json")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

@router.get(route_prefix + "/trace/{message_id}")
def get_trace_log(
    message_id: str,
    creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)],
):
    if not re.fullmatch(r"[A-Za-z0-9_-]+", message_id):
        raise HTTPException(status_code=400, detail="Invalid message_id")
    base_dir = os.path.abspath(TRACE_LOGS_DIR)
    filepath = os.path.abspath(os.path.join(base_dir, f"{message_id}.json"))
    if os.path.commonpath([base_dir, filepath]) != base_dir:
        raise HTTPException(status_code=400, detail="Invalid message_id")
    if not os.path.exists(filepath):
        raise HTTPException(status_code=404, detail="Trace log not found")
    with open(filepath, "r") as f:
        return json.load(f)

Comment on lines +739 to +749
const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;

const trace = useMemo(
  () => buildTraceFromMessage(message, userQuery),
  [message, userQuery]

Suggestion: When the fetch fails or no trace exists, message becomes null but buildTraceFromMessage still runs and renders a fake "completed" trace with "N/A" content. Guard against missing data and show an explicit not-found/error state instead, so users do not see misleading trace details. [possible issue, importance: 7]

Suggested change
const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;
const trace = useMemo(
  () => buildTraceFromMessage(message, userQuery),
  [message, userQuery]

const message = resolvedMessage || (apiData ? {
  content: apiData.natural_language_response,
  response_time: apiData.response_time,
  response_type: apiData.response_type,
  query_sources: apiData.query_sources,
} : null);
const userQuery = stateUserQuery || sessionMessage?.userQuery || apiData?.user_query;
const trace = useMemo(
  () => (message ? buildTraceFromMessage(message, userQuery) : null),
  [message, userQuery]
);
if (!loading && !trace) {
  return (
    <div className="min-h-screen bg-background flex items-center justify-center">
      <p className="text-muted-foreground">Trace data not found.</p>
    </div>
  );
}

Comment on lines +640 to +665
elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        # Detect header row: first row is all non-empty strings with
        # no purely numeric values → treat as column names.
        first_row = df.iloc[0]
        if all(isinstance(v, str) and v.strip() for v in first_row):
            df.columns = first_row.tolist()
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"

Suggestion: The header detection treats any non-empty string first row as headers, so ordinary string-valued first records are dropped from the extracted text. Use a stricter heuristic, such as requiring unique column names and at least one subsequent data row, before removing the first row. [possible issue, importance: 6]

Suggested change
elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        # Detect header row: first row is all non-empty strings with
        # no purely numeric values → treat as column names.
        first_row = df.iloc[0]
        if all(isinstance(v, str) and v.strip() for v in first_row):
            df.columns = first_row.tolist()
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"

elif extension in ['.xlsx', '.xls']:
    import pandas as pd
    engine = 'openpyxl' if extension == '.xlsx' else 'xlrd'
    try:
        xl = pd.ExcelFile(file_path, engine=engine)
    except Exception:
        xl = pd.ExcelFile(file_path)
    sheet_texts = []
    for sheet_name in xl.sheet_names:
        # Always read with header=None so no data row is silently
        # consumed as column names for headerless spreadsheets.
        df = xl.parse(sheet_name, header=None)
        if df.empty:
            continue
        df = df.fillna('')
        first_row = df.iloc[0]
        first_row_values = [str(v).strip() for v in first_row]
        looks_like_header = (
            len(df) > 1
            and all(first_row_values)
            and len(set(first_row_values)) == len(first_row_values)
            and not any(v.isdigit() for v in first_row_values)
        )
        if looks_like_header:
            df.columns = first_row_values
            df = df.iloc[1:].reset_index(drop=True)
        else:
            df.columns = [f"Column {i + 1}" for i in range(len(df.columns))]
        sheet_md = df.to_markdown(index=False)
        sheet_texts.append(f"## Sheet: {sheet_name}\n\n{sheet_md}")
    return "\n\n".join(sheet_texts) if sheet_texts else "[Excel file is empty or contains no data]"
