DRAFT: basic Profiler support and WIP hierarchical model level to op level profiling support #1247

WuSiYu wants to merge 2 commits into ModelTC:main
Conversation
Use `--enable_profiling=MODE` to enable; currently supported modes are `torch_profile` and `nvtx` (for use with NVIDIA Nsight Systems).

To use:
1. Set `GlobalPerfContext.STATIC_DISABLED = False` in `lightllm/utils/profiler.py`.
2. (Optional) Adjust the sampling rate in `lightllm/common/basemodel/basemodel.py` using `GlobalPerfContext.begin_with_sample_rate(xxx)`.
3. Start lightllm and send requests; the results (log and jsonl) will be saved to the working directory.
Code Review
This pull request implements a comprehensive profiling framework featuring kernel-level timing and process-level tracing (Torch Profiler/NVTX), integrated across model layers, Triton kernels, and server APIs. Significant feedback was provided regarding a performance-degrading GPU-CPU synchronization in grouped_matmul, thread-safety vulnerabilities in the global context and decorators, and a bug in wrap_func preventing multiple calls. Further improvements were suggested to optimize memory usage via __slots__, avoid blocking synchronizations in the main thread, and remove redundant JSON validation logic.
```python
m_total = int(expert_to_token_num.sum().item())
grouped_matmul.record_shape(m=m_total, k=k, n=n)
```
The `expert_to_token_num.sum().item()` call is a synchronous operation that forces a GPU-to-CPU synchronization. This will significantly degrade performance, especially since it runs unconditionally even when profiling is disabled (the `record_shape` call becomes a no-op, but the sum and the device-to-host transfer still occur). It should be guarded by an `is_perf_counter_active()` check.
Suggested change:

```diff
-m_total = int(expert_to_token_num.sum().item())
-grouped_matmul.record_shape(m=m_total, k=k, n=n)
+if grouped_matmul.is_perf_counter_active():
+    m_total = int(expert_to_token_num.sum().item())
+    grouped_matmul.record_shape(m=m_total, k=k, n=n)
```
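The effect of guarding the synchronization behind the profiling check can be illustrated with a pure-Python stand-in (`FakeKernel` and the module-level counters are hypothetical; the real code checks `grouped_matmul.is_perf_counter_active()` on the decorated Triton kernel):

```python
class FakeKernel:
    """Hypothetical stand-in for the decorated Triton kernel."""
    profiling_active = False
    recorded = []
    syncs = 0

    def is_perf_counter_active(self):
        return FakeKernel.profiling_active

    def record_shape(self, **dims):
        FakeKernel.recorded.append(dims)

grouped_matmul = FakeKernel()

def launch(expert_to_token_num, k, n):
    # when profiling is off, neither the reduction nor the transfer happens
    if grouped_matmul.is_perf_counter_active():
        FakeKernel.syncs += 1                 # stands in for the .sum().item() sync
        m_total = int(sum(expert_to_token_num))
        grouped_matmul.record_shape(m=m_total, k=k, n=n)

launch([4, 2, 2], k=64, n=128)                # profiling off: no sync at all
FakeKernel.profiling_active = True
launch([4, 2, 2], k=64, n=128)                # profiling on: one sync, one record
```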
```python
stacks = cls.threadlocal_active_counter_stack
cls.threadlocal_active_counter_stack = DefaultDict(list)
counter_list = cls.recorded_counters
cls.recorded_counters = []
```
The `GlobalPerfContext` and the `PerfCounter` decorator are not thread-safe for concurrent inference.
- `finalize_async` clears the `threadlocal_active_counter_stack` and `recorded_counters` for ALL threads in the process. If multiple threads are performing inference simultaneously (e.g., in overlap mode or multi-threaded decoding), one thread calling `finalize_async` will clear or corrupt the profiling data of other threads.
- The decorator assigns attributes like `record_shape` directly to the `wrapped_func` object (lines 401-403). Concurrent calls to the same decorated function will overwrite these attributes, leading to race conditions where one thread records data for another thread's counter.
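A minimal sketch of the per-thread fix (hypothetical `PerfContext` names, not the actual lightllm classes): keeping each thread's counter state in a `threading.local` means finalizing one thread cannot touch another thread's data.

```python
import threading
from collections import defaultdict

class PerfContext:
    """Hypothetical sketch: per-thread profiling state via threading.local."""
    _tls = threading.local()

    @classmethod
    def _state(cls):
        # lazily create this thread's private stack and recorded list
        if not hasattr(cls._tls, "stacks"):
            cls._tls.stacks = defaultdict(list)
            cls._tls.recorded = []
        return cls._tls

    @classmethod
    def record(cls, name, value):
        cls._state().recorded.append((name, value))

    @classmethod
    def finalize(cls):
        # swap out *only this thread's* data; other threads are untouched
        state = cls._state()
        recorded = state.recorded
        state.stacks = defaultdict(list)
        state.recorded = []
        return recorded

# two threads record independently; finalizing one leaves the other intact
results = {}

def worker(tag):
    PerfContext.record(tag, 1)
    results[tag] = PerfContext.finalize()

t1 = threading.Thread(target=worker, args=("a",))
t2 = threading.Thread(target=worker, args=("b",))
t1.start(); t2.start(); t1.join(); t2.join()
```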
```python
def wrapped_func(*args, **kwds):
    with self:
        return func(*args, **kwds)
```
The `wrap_func` implementation reuses the same `PerfCounter` instance (`self`). Since `PerfCounter` prevents restarts (line 353), a function wrapped this way can only be called once; subsequent calls will raise a `RuntimeError`. The wrapper should create a new `PerfCounter` instance per call, similar to the logic in the `__call__` decorator implementation.
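The fix can be sketched as follows (a hypothetical minimal `PerfCounter` standing in for the real one; only the restart guard and the wrapper are modeled): the wrapper constructs a fresh counter on every invocation instead of entering `self` repeatedly.

```python
import functools

class PerfCounter:
    """Hypothetical minimal stand-in: a counter that refuses to be restarted."""
    def __init__(self, name):
        self.name = name
        self._used = False

    def __enter__(self):
        if self._used:
            raise RuntimeError(f"PerfCounter {self.name!r} cannot be restarted")
        self._used = True
        return self

    def __exit__(self, *exc):
        return False

    def wrap_func(self, func):
        @functools.wraps(func)
        def wrapped_func(*args, **kwds):
            # a fresh counter per invocation, instead of reusing `self`
            with PerfCounter(self.name):
                return func(*args, **kwds)
        return wrapped_func

def double(x):
    return x * 2

double = PerfCounter("matmul").wrap_func(double)
```

With the per-call instance, `double` can be invoked any number of times; reusing `self` would make the second call raise.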
```python
]


class PerfCounter:
    __slot__ = ("name", "type", "shapes", "depth",
```

```python
if torch.cuda.is_current_stream_capturing():
    return

torch.cuda.synchronize()
```
`torch.cuda.synchronize()` is called in the main thread during `finalize_async`. This will block the inference loop until all GPU operations are finished, which can significantly impact throughput. Consider recording a completion event and having the worker thread wait for it asynchronously, to avoid blocking the main inference loop.
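The suggested pattern can be sketched with `threading.Event` standing in for a CUDA completion event (all names here are hypothetical; real code would record a `torch.cuda.Event()` on the stream and call `event.synchronize()` from the worker thread):

```python
import threading

def finalize_async(results):
    """Hypothetical sketch: the main thread records a completion marker and
    returns immediately; a worker thread blocks on it instead of the
    inference loop."""
    done = threading.Event()  # stands in for a CUDA event recorded on the stream

    def worker():
        done.wait()                       # the worker, not the main thread, blocks
        results.append("counters flushed")

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return done, t

results = []
done, t = finalize_async(results)
assert results == []    # main thread was not blocked waiting for the "GPU"
done.set()              # GPU work "completes"
t.join(timeout=5)
```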
```python
json_str = json.dumps({key: getattr(counter, key) for key in keys} | {"marker": marker})
_ = json.loads(json_str)  # validate
f.write(json_str + '\n')
```
The validation step `json.loads(json_str)` is redundant and adds unnecessary overhead. `json.dumps` is guaranteed to produce a valid JSON string if it completes successfully.
Suggested change:

```diff
 json_str = json.dumps({key: getattr(counter, key) for key in keys} | {"marker": marker})
-_ = json.loads(json_str)  # validate
 f.write(json_str + '\n')
```
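To illustrate the reviewer's point with a stand-in record (the field names here are invented for the example): `json.dumps` either returns valid JSON or raises before any string exists (e.g. `TypeError` for unserializable values), so a `json.loads` round trip can never catch a new failure.

```python
import json

record = {"name": "grouped_matmul", "shapes": [[128, 256]], "marker": "decode"}
json_str = json.dumps(record)
assert json.loads(json_str) == record   # round trip succeeds by construction

# the only failure mode happens inside dumps itself, before any string exists
unserializable = False
try:
    json.dumps({"bad": {1, 2, 3}})      # sets are not JSON-serializable
except TypeError:
    unserializable = True
```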
```python
logger.error(f"Error in finalizing counters: {e}")
raise e
return
```
Commits:
1. feat(misc): Profiler support — use `--enable_profiling=MODE` to enable; currently supports `torch_profile` and `nvtx` (use with NVIDIA Nsight Systems) modes
2. WIP feat: hierarchical model level to op level profiling support — To use: