
DRAFT: basic Profiler support and WIP hierarchical model level to op level profiling support #1247

Open
WuSiYu wants to merge 2 commits into ModelTC:main from WuSiYu:perf2

Conversation

@WuSiYu
Collaborator

@WuSiYu WuSiYu commented Apr 1, 2026

1. feat(misc): Profiler support

Use --enable_profiling=MODE to enable it. The torch_profile and nvtx (for use with NVIDIA Nsight Systems) modes are currently supported.

2. WIP feat: hierarchical model level to op level profiling support

To use:

  1. Set GlobalPerfContext.STATIC_DISABLED = False in lightllm/utils/profiler.py
  2. (Optional) Adjust the sampling rate in lightllm/common/basemodel/basemodel.py using GlobalPerfContext.begin_with_sample_rate(xxx)
  3. Start lightllm and send requests; the results (log and jsonl) will be saved to the working directory
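The gating described in steps 1 and 2 can be sketched as a minimal, self-contained analogue. The names GlobalPerfContext, STATIC_DISABLED, and begin_with_sample_rate mirror those in this PR, but the body below is an illustrative reimplementation, not the actual lightllm code:

```python
import random

class GlobalPerfContext:
    """Sketch of a globally toggled, sampled profiling context."""
    STATIC_DISABLED = True   # step 1: flip to False to enable profiling
    _sample_rate = 1.0
    recorded = []

    @classmethod
    def begin_with_sample_rate(cls, rate):
        # step 2: only a `rate` fraction of iterations gets profiled
        cls._sample_rate = rate

    @classmethod
    def record(cls, name, value):
        if cls.STATIC_DISABLED:
            return               # hard-disabled: near-zero overhead
        if random.random() >= cls._sample_rate:
            return               # not sampled this iteration
        cls.recorded.append((name, value))

GlobalPerfContext.STATIC_DISABLED = False
GlobalPerfContext.begin_with_sample_rate(1.0)   # sample everything
GlobalPerfContext.record("decode_step_ms", 12.3)
print(GlobalPerfContext.recorded)   # [('decode_step_ms', 12.3)]
```

With a rate below 1.0, only a random subset of calls is kept, which is how a sampling knob keeps per-step overhead bounded.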

WuSiYu added 2 commits April 1, 2026 04:00
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a comprehensive profiling framework featuring kernel-level timing and process-level tracing (Torch Profiler/NVTX), integrated across model layers, Triton kernels, and server APIs. Significant feedback was provided regarding a performance-degrading GPU-CPU synchronization in grouped_matmul, thread-safety vulnerabilities in the global context and decorators, and a bug in wrap_func preventing multiple calls. Further improvements were suggested to optimize memory usage via __slots__, avoid blocking synchronizations in the main thread, and remove redundant JSON validation logic.

Comment on lines +722 to +723

```python
m_total = int(expert_to_token_num.sum().item())
grouped_matmul.record_shape(m=m_total, k=k, n=n)
```


high

The expert_to_token_num.sum().item() call is a synchronous operation that forces a GPU-to-CPU synchronization. This will significantly degrade performance, especially since it runs unconditionally even when profiling is disabled (the record_shape call becomes a no-op but the sum and item transfer still occur). It should be guarded by a check to is_perf_counter_active().

Suggested change

```diff
-m_total = int(expert_to_token_num.sum().item())
-grouped_matmul.record_shape(m=m_total, k=k, n=n)
+if grouped_matmul.is_perf_counter_active():
+    m_total = int(expert_to_token_num.sum().item())
+    grouped_matmul.record_shape(m=m_total, k=k, n=n)
```

Comment on lines +266 to +269

```python
stacks = cls.threadlocal_active_counter_stack
cls.threadlocal_active_counter_stack = DefaultDict(list)
counter_list = cls.recorded_counters
cls.recorded_counters = []
```


high

The GlobalPerfContext and the PerfCounter decorator are not thread-safe for concurrent inference.

  1. finalize_async clears the threadlocal_active_counter_stack and recorded_counters for ALL threads in the process. If multiple threads are performing inference simultaneously (e.g., in overlap mode or multi-threaded decoding), one thread calling finalize_async will clear or corrupt the profiling data for other threads.
  2. The decorator assigns attributes like record_shape directly to the wrapped_func object (lines 401-403). Concurrent calls to the same decorated function will overwrite these attributes, leading to race conditions where one thread records data for another thread's counter.
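One conventional fix for point 1 (a sketch, not this PR's code) is to keep the stack and records in threading.local storage, so each inference thread owns its data and finalize drains only the calling thread's records. The class and method names below are hypothetical:

```python
import threading

class PerfContext:
    # each thread sees its own `.stack` / `.recorded`; no cross-thread clearing
    _tls = threading.local()

    @classmethod
    def _state(cls):
        if not hasattr(cls._tls, "recorded"):
            cls._tls.stack, cls._tls.recorded = [], []
        return cls._tls

    @classmethod
    def record(cls, item):
        cls._state().recorded.append(item)

    @classmethod
    def finalize(cls):
        # drains ONLY the calling thread's counters
        state = cls._state()
        done, state.recorded = state.recorded, []
        return done

results = {}

def worker(name):
    PerfContext.record(f"{name}-counter")
    results[name] = PerfContext.finalize()

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # each thread sees only its own counter
```

Point 2 needs a different fix (per-call counter instances rather than attributes on the shared wrapper object), since thread-local storage does not help when the mutable state lives on the function object itself.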

Comment on lines +426 to +428

```python
def wrapped_func(*args, **kwds):
    with self:
        return func(*args, **kwds)
```


high

The wrap_func implementation reuses the same PerfCounter instance (self). Since PerfCounter prevents restarts (line 353), a function wrapped this way can only be called once. Subsequent calls will raise a RuntimeError. The wrapper should create a new PerfCounter instance per call, similar to the logic in the __call__ decorator implementation.
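A per-call instance fixes this. The sketch below is an illustrative analogue (PerfCounter and wrap_func here are simplified stand-ins for the PR's classes): the decorator constructs a fresh PerfCounter on every invocation, so the no-restart guard never fires on the second call:

```python
import functools

class PerfCounter:
    """Sketch: a counter that, like the PR's class, refuses to restart."""
    def __init__(self, name):
        self.name, self._used = name, False

    def __enter__(self):
        if self._used:
            raise RuntimeError("PerfCounter cannot be restarted")
        self._used = True
        return self

    def __exit__(self, *exc):
        return False

def wrap_func(name):
    # fix: build a NEW PerfCounter per invocation instead of reusing one
    def deco(func):
        @functools.wraps(func)
        def wrapped(*args, **kwds):
            with PerfCounter(name):
                return func(*args, **kwds)
        return wrapped
    return deco

@wrap_func("forward")
def forward(x):
    return x * 2

print(forward(1), forward(2))   # both calls succeed: 2 4
```

Reusing one instance would make the second `with PerfCounter` entry raise, which is exactly the failure mode described above.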

```python
]

class PerfCounter:
    __slot__ = ("name", "type", "shapes", "depth",
```


medium

Typo in attribute name: __slot__ should be __slots__ to correctly restrict attribute creation and optimize memory usage.

Suggested change

```diff
-__slot__ = ("name", "type", "shapes", "depth",
+__slots__ = ("name", "type", "shapes", "depth",
```
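The difference is observable: `__slot__` is just an ordinary class attribute with no special meaning, while `__slots__` suppresses the per-instance `__dict__` and rejects undeclared attributes. A small demonstration (hypothetical class names):

```python
class WithTypo:
    __slot__ = ("name",)   # ignored by the interpreter: just a class attribute

class WithSlots:
    __slots__ = ("name",)  # actually restricts instances and saves memory

a = WithTypo()
a.anything = 1             # silently allowed: no restriction at all

b = WithSlots()
b.name = "ok"
try:
    b.anything = 1
except AttributeError:
    print("undeclared attribute rejected")  # __slots__ enforces the whitelist
print(hasattr(b, "__dict__"))               # False: no per-instance dict
```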

```python
if torch.cuda.is_current_stream_capturing():
    return

torch.cuda.synchronize()
```


medium

torch.cuda.synchronize() is called in the main thread during finalize_async. This will block the inference loop until all GPU operations are finished, which can significantly impact throughput. Consider recording a completion event and having the worker thread wait for it asynchronously to avoid blocking the main inference loop.
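The suggested pattern (record an event, wait in the worker) can be sketched CUDA-free with a threading.Event standing in for a recorded torch.cuda.Event. This is an analogy for the control flow only, not the PR's implementation:

```python
import threading
import time

records = []
done_event = threading.Event()   # stands in for a recorded torch.cuda.Event

def finalizer():
    # worker thread waits for completion; the main thread never blocks
    done_event.wait()
    records.append("flushed after GPU work finished")

worker = threading.Thread(target=finalizer)
worker.start()

# main inference loop keeps running; it only *signals*, it never waits
time.sleep(0.01)   # pretend to enqueue more GPU work
done_event.set()   # analogue of event.record() on the main thread
worker.join()
print(records)     # ['flushed after GPU work finished']
```

In the CUDA version, the main thread would call `event.record()` and the worker would call `event.synchronize()`, keeping the expensive wait off the inference path.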

Comment on lines +222 to +224

```python
json_str = json.dumps({key: getattr(counter, key) for key in keys} | {"marker": marker})
_ = json.loads(json_str)  # validate
f.write(json_str + '\n')
```


medium

The validation step json.loads(json_str) is redundant and adds unnecessary overhead. json.dumps is guaranteed to produce a valid JSON string if it completes successfully.

Suggested change

```diff
 json_str = json.dumps({key: getattr(counter, key) for key in keys} | {"marker": marker})
-_ = json.loads(json_str)  # validate
 f.write(json_str + '\n')
```
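The redundancy is easy to confirm: json.dumps either returns a parseable string or raises before any validation could run, so the round-trip check can never catch an error that dumps itself would not. A short illustration (payload contents are made up):

```python
import json

payload = {"name": "grouped_matmul", "shapes": [[64, 128]], "marker": "decode"}
json_str = json.dumps(payload)
# dumps output is always parseable; this re-parse can never fail here
assert json.loads(json_str) == payload

# unserializable data fails inside dumps itself, before any "validate" step
try:
    json.dumps({"bad": object()})
except TypeError:
    print("dumps already rejects unserializable input")
```

(The one caveat is non-finite floats: `json.dumps(float("nan"))` emits non-standard `NaN`, but Python's own `json.loads` accepts that back, so the round-trip still passes silently.)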

Comment on lines +194 to +196

```python
logger.error(f"Error in finalizing counters: {e}")
raise e
return
```


medium

The return statement is unreachable because it follows a raise e statement.

Suggested change

```diff
 except Exception as e:
     logger.error(f"Error in finalizing counters: {e}")
     raise e
-    return
```
