Skip to content

group by v2#680

Open
Honglei-Qiu wants to merge 11 commits intoPaddlePaddle:developfrom
Honglei-Qiu:group_v2
Open

group by v2#680
Honglei-Qiu wants to merge 11 commits intoPaddlePaddle:developfrom
Honglei-Qiu:group_v2

Conversation

@Honglei-Qiu
Copy link
Copy Markdown
Contributor

PR Category

Feature Enhancement

Description

新增group分组规则

@paddle-bot
Copy link
Copy Markdown

paddle-bot bot commented Mar 23, 2026

Thanks for your contribution!

Copy link
Copy Markdown
Collaborator

@Xreki Xreki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

统计下每种分组方法,都能产生多少个group

WHERE s.deleted = 0
AND s.sample_type != 'full_graph'
) sub
WHERE sub.rn = 1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个判断的作用是什么?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去重吧,防止一个sample被重复选取


def get_v2_group_members(candidates: list[CandidateGraph], num_dtypes: int):
# Index candidates by op_seq
by_op_seq = defaultdict(list)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

优化下所有的变量命名

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

b.input_shapes_bucket_id,
b.input_dtypes_bucket_id,
s.graph_hash,
ROW_NUMBER() OVER (
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graph_hash不需要了吧?ROW_NUMBER在这里的作用是什么?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

在每个 (op_seq, shapes, dtypes) 分区内,按创建时间排序编号,然后只取 rn = 1(最早的那条)。作用是桶内去重:同一个桶里可能有多个样本,只保留一个代表。
不过现在代码改了很多

"""

# Index candidates by op_seq
by_op_seq = defaultdict(list)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_op_seq这样的变量名太抽象了

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

candidates_by_op_seq,润色一下

for c in candidates:
by_op_seq[c.op_seq_bucket_id].append(c)

rule3_selected_uids = set()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要以relux_这样的方式命名

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the SQLite grouping insertion script by adding “group by v2” rules to generate additional graph_net_sample_groups beyond the existing v1 bucket-based grouping.

Changes:

  • Refactors graph_net_sample_groups_insert.py into clearer query/generation/insert phases with per-rule stats output.
  • Adds v2 grouping logic (Rule 4 dtype coverage + Rule 3 sparse sampling) controlled by --num_dtypes.
  • Updates DB access to use a read-only query() path via sqlite3 alongside SQLAlchemy inserts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +76 to +92
SELECT
sub.sample_uid,
sub.op_seq_bucket_id,
sub.input_shapes_bucket_id,
sub.sample_type,
group_concat(sub.sample_uid, ',') AS all_uids
FROM (
SELECT
s.uuid AS sample_uid,
s.sample_type,
b.op_seq_bucket_id,
b.input_shapes_bucket_id
FROM graph_sample s
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
ORDER BY s.create_at ASC, s.uuid ASC
) sub
GROUP BY sub.sample_type, sub.op_seq_bucket_id, sub.input_shapes_bucket_id;
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_bucket_groups selects sub.sample_uid as the bucket "head" without aggregating it or using a deterministic window function. In SQLite, selecting a non-GROUP BY column in an aggregate query can return an arbitrary row, so head_uid (and thus Rule 2 heads) may be nondeterministic. Consider selecting the head via MIN(...)/MAX(...) on a stable key, or using a window function to pick the first row by (create_at, uuid) and aggregating the rest separately.

Suggested change
SELECT
sub.sample_uid,
sub.op_seq_bucket_id,
sub.input_shapes_bucket_id,
sub.sample_type,
group_concat(sub.sample_uid, ',') AS all_uids
FROM (
SELECT
s.uuid AS sample_uid,
s.sample_type,
b.op_seq_bucket_id,
b.input_shapes_bucket_id
FROM graph_sample s
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
ORDER BY s.create_at ASC, s.uuid ASC
) sub
GROUP BY sub.sample_type, sub.op_seq_bucket_id, sub.input_shapes_bucket_id;
WITH buckets AS (
SELECT
s.uuid AS sample_uid,
s.sample_type,
b.op_seq_bucket_id,
b.input_shapes_bucket_id,
FIRST_VALUE(s.uuid) OVER (
PARTITION BY s.sample_type, b.op_seq_bucket_id, b.input_shapes_bucket_id
ORDER BY s.create_at ASC, s.uuid ASC
) AS head_uid
FROM graph_sample s
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
)
SELECT
MIN(head_uid) AS head_uid,
op_seq_bucket_id,
input_shapes_bucket_id,
sample_type,
group_concat(sample_uid, ',') AS all_uids
FROM buckets
GROUP BY sample_type, op_seq_bucket_id, input_shapes_bucket_id;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

b.op_seq_bucket_id,
b.input_shapes_bucket_id
FROM graph_sample s
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_bucket_groups doesn’t filter out deleted samples or full_graph samples (unlike query_v2_candidates). If graph_net_sample_buckets contains rows for deleted/full_graph samples (e.g., from older runs), this script will generate groups for data that is supposed to be excluded. Add WHERE s.deleted = 0 AND s.sample_type != 'full_graph' (or equivalent) to keep v1/v2 selection consistent.

Suggested change
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
JOIN graph_net_sample_buckets b ON s.uuid = b.sample_uid
WHERE s.deleted = 0
AND s.sample_type != 'full_graph'

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants