[2026 Spring][T1-1-1] HyosungSink by HyosungSink · Pull Request #79 · InfiniTensor/ntops

HyosungSink · 2026-05-18T12:46:42Z

Summary

Implement T1-1-1 operators in ntops: rad2deg, copysign, lcm, lgamma, nextafter.
Add kernels, torch wrappers, registration, correctness tests, and per-operator performance coverage.
Merge the Iluvatar adaptation into this NVIDIA PR branch, with NVIDIA paths preserved behind device/architecture gates.
Correctness tests are split by operator: 30 cases per operator, 150 total.
Performance tests are split by operator: 20 cases per operator, 100 total.
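For orientation, the scalar semantics of the five operators can be sketched with Python's standard `math` module. This is only an illustrative reference, assuming Python 3.9+ (needed for `math.lcm` and `math.nextafter`). It is not the ntops kernel code, and the helper name `reference_semantics` is hypothetical.

```python
import math

def reference_semantics(x: float, y: float, a: int, b: int) -> dict:
    """Scalar stand-ins for the five T1-1-1 operators (illustration only)."""
    return {
        "rad2deg": math.degrees(x),         # radians -> degrees
        "copysign": math.copysign(x, y),    # magnitude of x, sign of y
        "lcm": math.lcm(a, b),              # least common multiple of integers
        "lgamma": math.lgamma(x),           # log of |Gamma(x)|
        "nextafter": math.nextafter(x, y),  # next float after x towards y
    }

result = reference_semantics(math.pi, -1.0, 4, 6)
print(result)
```

The real operators work elementwise on tensors across the dtypes listed in the performance tables. The scalar version above only pins down what each name means.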

Validation

NVIDIA L4/sm_89 correctness passed: nextafter targeted 30/30 and full T1-1-1 suite 150/150.
NVIDIA L4/sm_89 performance passed: 100/100 cases, every case ratio >= 0.9, minimum ratio 0.913x.
NVIDIA L4/sm_89 InfiniCore integration passed for all five operators through the use_ntops path.
Iluvatar MR-V100 passed: correctness 150/150, performance 100/100, and InfiniCore --iluvatar integration for all five operators.
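The performance gate can be sketched as follows. The ratio columns in the tables below are consistent with torch time divided by ntops time, so a ratio above 1.0 means ntops is faster. That reading is inferred from the numbers, not stated by the test harness. The names `speed_ratio` and `evaluate` are hypothetical, and the sample rows are copied from the NVIDIA columns of the lcm table.

```python
# (case, ntops_time, torch_time) rows copied from the NVIDIA lcm table in this PR.
SAMPLE_CASES = [
    ("i32_large_1d", 0.8278, 0.8608),
    ("i16_mid_1d", 0.4895, 0.4489),
    ("u8_mid_1d", 0.4326, 0.3951),
]
THRESHOLD = 0.9  # pass criterion quoted in the Validation section

def speed_ratio(ntops_time: float, torch_time: float) -> float:
    """Assumed definition: reference time divided by ntops time."""
    return torch_time / ntops_time

def evaluate(cases, threshold=THRESHOLD):
    """Return per-case ratios and the names of cases below the threshold."""
    ratios = {name: speed_ratio(n, t) for name, n, t in cases}
    failures = [name for name, r in ratios.items() if r < threshold]
    return ratios, failures

ratios, failures = evaluate(SAMPLE_CASES)
worst_case = min(ratios, key=ratios.get)
print(worst_case, round(ratios[worst_case], 3), failures)
```

On these three rows the slowest case is `u8_mid_1d`, which matches the 0.913x minimum reported above.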

Performance Summary (NVIDIA and Iluvatar)

Operator	Cases	NVIDIA min	Iluvatar min
rad2deg	20	0.969x	0.998x
copysign	20	0.964x	0.920x
lcm	20	0.913x	1.247x
lgamma	20	0.946x	0.932x
nextafter	20	0.969x	1.032x
Total	100	0.913x	0.920x

rad2deg Performance Cases

Case	NVIDIA ntops	NVIDIA torch	NVIDIA ratio	Iluvatar ntops	Iluvatar torch	Iluvatar ratio
f16_large_1d	0.2678	0.2685	1.003x	0.1198	0.1258	1.050x
f32_large_1d	0.5925	0.5744	0.969x	0.2333	0.2328	0.998x
f64_large_1d	1.1498	1.1471	0.998x	0.2000	0.2034	1.017x
f32_large_2d	0.5811	0.5708	0.982x	0.2332	0.2335	1.001x
f16_large_3d	0.2684	0.2669	0.995x	0.1195	0.1257	1.052x
f32_large_3d	0.5861	0.5684	0.970x	0.2334	0.2335	1.001x
f64_large_3d	1.1540	1.1449	0.992x	0.2003	0.2034	1.015x
f32_large_out_1d	0.5805	0.5839	1.006x	0.2335	0.2341	1.002x
f64_large_out_2d	1.1535	1.1435	0.991x	0.2004	0.2035	1.015x
f16_large_out_3d	0.2684	0.2690	1.002x	0.1190	0.1242	1.043x
f32_mid_1d	0.5827	0.5692	0.977x	0.2335	0.2330	0.998x
f16_mid_1d	0.2652	0.2680	1.011x	0.1196	0.1258	1.052x
f64_mid_1d	1.1500	1.1464	0.997x	0.1994	0.2039	1.023x
f32_small_1d	0.5873	0.5697	0.970x	0.2333	0.2329	0.999x
f16_noncontig_4096	0.2698	0.2661	0.986x	0.1194	0.1256	1.051x
f32_noncontig_4096	0.5859	0.5702	0.973x	0.2332	0.2331	0.999x
f64_noncontig_2048	0.2646	0.2643	0.999x	0.0527	0.0537	1.020x
f32_noncontig_out_4096	0.5839	0.5855	1.003x	0.2331	0.2332	1.001x
f32_permute3d_256x256x128	0.2664	0.2667	1.001x	0.1187	0.1187	1.000x
f32_permute3d_out_256x256x128	0.2680	0.2664	0.994x	0.1182	0.1186	1.003x

copysign Performance Cases

Case	NVIDIA ntops	NVIDIA torch	NVIDIA ratio	Iluvatar ntops	Iluvatar torch	Iluvatar ratio
f16_large_1d	0.4321	0.4330	1.002x	0.1718	0.1855	1.080x
f32_large_1d	0.8671	0.8630	0.995x	0.3435	0.3456	1.006x
f64_large_1d	1.7419	1.7314	0.994x	0.2061	0.2027	0.984x
f32_large_2d	0.8701	0.8588	0.987x	0.3435	0.3461	1.008x
f16_large_3d	0.4402	0.4314	0.980x	0.1719	0.1856	1.080x
f32_large_3d	0.8701	0.8637	0.993x	0.3429	0.3461	1.009x
f64_large_3d	1.7351	1.7006	0.980x	0.2071	0.2037	0.984x
f32_large_out_1d	0.8683	0.8544	0.984x	0.3438	0.3459	1.006x
f64_large_out_2d	1.7376	1.7061	0.982x	0.2064	0.2030	0.983x
f16_large_out_3d	0.4390	0.4256	0.969x	0.1717	0.1853	1.079x
f32_mid_1d	0.8653	0.8627	0.997x	0.3431	0.3458	1.008x
f16_mid_1d	0.4342	0.4333	0.998x	0.1717	0.1851	1.078x
f64_mid_1d	1.7364	1.6907	0.974x	0.2063	0.2028	0.983x
f32_small_1d	0.8669	0.8550	0.986x	0.3429	0.3457	1.008x
f32_broadcast_rect_2048x8192	0.2730	0.2636	0.966x	0.2138	0.1969	0.921x
f32_broadcast_4096	0.2729	0.2631	0.964x	0.2137	0.1966	0.920x
f16_noncontig_4096	0.4405	0.4304	0.977x	0.1714	0.1852	1.080x
f32_noncontig_4096	0.8733	0.8581	0.983x	0.3434	0.3454	1.006x
f64_noncontig_2048	0.4366	0.4244	0.972x	0.0545	0.0532	0.977x
f32_permute3d_out_256x256x128	0.4406	0.4314	0.979x	0.1731	0.1757	1.015x

lcm Performance Cases

Case	NVIDIA ntops	NVIDIA torch	NVIDIA ratio	Iluvatar ntops	Iluvatar torch	Iluvatar ratio
i32_large_1d	0.8278	0.8608	1.040x	0.6197	1.1983	1.934x
i32_large_positive_1d	0.8312	0.8620	1.037x	0.4445	0.9836	2.213x
i32_large_2d	0.8319	0.8606	1.034x	0.6211	1.1986	1.930x
i32_large_positive_2d	0.8330	0.8596	1.032x	0.4467	0.9649	2.160x
i32_large_3d	0.8346	0.8636	1.035x	0.6019	1.1599	1.927x
i32_large_positive_3d	0.8369	0.8645	1.033x	0.4310	0.9513	2.207x
i32_large_out_1d	0.8248	0.8610	1.044x	0.6026	1.1603	1.925x
i32_large_out_2d	0.8298	0.8596	1.036x	0.6017	1.1239	1.868x
i32_broadcast_8192	2.2853	2.1833	0.955x	1.6410	4.9773	3.033x
i32_large_low_1d	0.8297	0.8630	1.040x	0.4136	0.7197	1.740x
i16_mid_1d	0.4895	0.4489	0.917x	0.5302	0.7479	1.411x
i16_large_1d	0.4893	0.5124	1.047x	0.5322	0.7482	1.406x
i64_mid_1d	1.6408	1.7197	1.048x	1.2196	4.3822	3.593x
i64_large_1d	1.6399	1.7363	1.059x	1.2213	4.3816	3.588x
u8_mid_1d	0.4326	0.3951	0.913x	0.4418	0.5511	1.247x
i8_mid_1d	0.4016	0.3784	0.942x	0.3831	0.7199	1.879x
i32_noncontig_4096	0.8372	0.8614	1.029x	0.5854	1.1240	1.920x
i32_noncontig_out_4096	0.8325	0.8581	1.031x	0.5845	1.1235	1.922x
i16_noncontig_6144	1.3412	1.2987	0.968x	1.1766	1.6669	1.417x
i32_permute3d_out_256x256x128	0.4243	0.4366	1.029x	0.2960	0.5721	1.933x

lgamma Performance Cases

Case	NVIDIA ntops	NVIDIA torch	NVIDIA ratio	Iluvatar ntops	Iluvatar torch	Iluvatar ratio
f16_large_1d	0.2782	0.2721	0.978x	0.3862	0.3690	0.955x
f32_large_1d	0.5859	0.5873	1.002x	0.3871	0.3608	0.932x
f64_large_1d	11.3853	11.3149	0.994x	4.3499	9.9664	2.291x
f32_large_2d	0.5937	0.5878	0.990x	0.3631	0.3386	0.932x
f16_large_3d	0.2796	0.2733	0.977x	0.3623	0.3462	0.955x
f32_large_3d	0.5952	0.5867	0.986x	0.3630	0.3386	0.933x
f64_large_3d	11.4016	11.3285	0.994x	4.2140	9.9007	2.350x
f32_large_out_1d	0.5876	0.5786	0.985x	0.3630	0.3389	0.934x
f64_large_out_2d	11.3951	11.2864	0.990x	4.2122	9.9390	2.360x
f16_large_out_3d	0.2807	0.2656	0.946x	0.3622	0.3464	0.956x
f32_mid_1d	0.5897	0.5873	0.996x	0.3631	0.3386	0.933x
f16_mid_1d	0.2756	0.2727	0.989x	0.3625	0.3461	0.955x
f64_mid_1d	11.3817	11.3126	0.994x	4.2144	9.8350	2.334x
f32_small_1d	0.5857	0.5841	0.997x	0.3633	0.3385	0.932x
f16_noncontig_4096	0.2819	0.2726	0.967x	0.3626	0.3462	0.955x
f32_noncontig_4096	0.5897	0.5854	0.993x	0.3630	0.3385	0.933x
f64_noncontig_2048	2.9903	2.8783	0.963x	1.0565	2.2345	2.115x
f32_noncontig_out_4096	0.5915	0.5799	0.980x	0.3632	0.3389	0.933x
f32_permute3d_256x256x128	0.2793	0.2718	0.973x	0.1849	0.1747	0.945x
f32_permute3d_out_256x256x128	0.2806	0.2672	0.952x	0.1848	0.1747	0.945x

nextafter Performance Cases

Case	NVIDIA ntops	NVIDIA torch	NVIDIA ratio	Iluvatar ntops	Iluvatar torch	Iluvatar ratio
f16_large_1d	0.4247	0.4273	1.006x	0.1704	0.1884	1.106x
f32_large_1d	0.8311	0.8528	1.026x	0.3336	0.3444	1.032x
f64_large_1d	1.7096	1.7009	0.995x	0.7306	0.9712	1.329x
f32_large_2d	0.8369	0.8550	1.022x	0.3331	0.3446	1.034x
f16_large_3d	0.4327	0.4307	0.995x	0.1704	0.1873	1.099x
f32_large_3d	0.8436	0.8550	1.013x	0.3335	0.3448	1.034x
f64_large_3d	1.7191	1.7335	1.008x	0.7299	0.9530	1.306x
f32_large_out_1d	0.8450	0.8567	1.014x	0.3332	0.3444	1.034x
f64_large_out_2d	1.7280	1.6992	0.983x	0.7305	0.9507	1.301x
f16_large_out_3d	0.4332	0.4319	0.997x	0.1697	0.1875	1.105x
f32_mid_1d	0.8608	0.8561	0.995x	0.3336	0.3451	1.034x
f16_mid_1d	0.4207	0.4314	1.025x	0.1702	0.1880	1.104x
f64_mid_1d	1.7149	1.7144	1.000x	0.7304	0.9537	1.306x
f32_small_1d	0.8297	0.8550	1.030x	0.3335	0.3441	1.032x
f32_broadcast_rect_2048x8192	0.2689	0.2631	0.978x	0.2075	0.2593	1.250x
f32_broadcast_4096	0.2680	0.2631	0.982x	0.2073	0.2595	1.252x
f16_noncontig_4096	0.4316	0.4328	1.003x	0.1700	0.1871	1.101x
f32_noncontig_4096	0.8586	0.8672	1.010x	0.3336	0.3448	1.034x
f64_noncontig_2048	0.4391	0.4253	0.969x	0.1866	0.2451	1.314x
f32_permute3d_out_256x256x128	0.4394	0.4268	0.971x	0.1699	0.1760	1.035x

Notes

PyTorch is used only as the test reference, not as the runtime implementation.
The latest Iluvatar nextafter float16 fix is gated on both the half dtype and the Iluvatar device; NVIDIA paths remain on the existing ntops/NineToothed kernels.
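A gate of this kind could look roughly like the sketch below. The function name, the vendor and dtype strings, and the path labels are all hypothetical; the actual conditions used in ntops are not shown in this description.

```python
def select_nextafter_path(vendor: str, dtype: str) -> str:
    """Return which (hypothetical) code path handles a nextafter call."""
    # Only the float16-on-Iluvatar combination takes the special path;
    # every other combination, including all NVIDIA cases, is unchanged.
    if vendor == "iluvatar" and dtype == "float16":
        return "iluvatar_float16_fix"
    return "default_ninetoothed_kernel"

print(select_nextafter_path("iluvatar", "float16"))
print(select_nextafter_path("nvidia", "float16"))
```

Keeping both conditions in one predicate makes it easy to check that the fix cannot leak into NVIDIA runs or into other dtypes on Iluvatar.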

HyosungSink added 6 commits May 16, 2026 05:52

Fix SDPA fully masked rows

363b0f4

Add ntops rad2deg operator

814300d

Add ntops copysign operator

6cebe21

Add ntops lcm operator

739da9e

Add ntops lgamma operator

8d0c779

Add ntops nextafter operator

c207dbb

HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from e7ccc3b to 1053c1c on May 18, 2026 12:58

Register T1-1-1 ntops operators

2824162

HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from 1053c1c to 2824162 on May 18, 2026 13:08

Adapt T1-1-1 ntops operators for Iluvatar

0ccbed2

HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from ec69e6f to 0ccbed2 on May 19, 2026 16:44

Fix Iluvatar nextafter float16 broadcast

66adbec

HyosungSink wants to merge 9 commits into InfiniTensor:master from HyosungSink:2026-spring-HyosungSink-T1-1-1
