[2026 Spring][T1-1-1] HyosungSink #79

Draft
HyosungSink wants to merge 9 commits into InfiniTensor:master from HyosungSink:2026-spring-HyosungSink-T1-1-1

HyosungSink commented May 18, 2026

Summary

  • Implement T1-1-1 operators in ntops: rad2deg, copysign, lcm, lgamma, nextafter.
  • Add kernels, torch wrappers, registration, correctness tests, and per-operator performance coverage.
  • Merge the Iluvatar adaptation into this NVIDIA PR branch, with NVIDIA paths preserved behind device/architecture gates.
  • Correctness tests are split by operator: 30 cases per operator, 150 total.
  • Performance tests are split by operator: 20 cases per operator, 100 total.
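
As a point of reference for the five operators listed above, their scalar semantics can be sketched with Python's standard `math` module. This is only an illustration, assuming each operator follows the usual definition of the same-named function. It is not the ntops kernel code, and the `REFERENCE` table and `reference` helper are hypothetical names introduced here.

```python
import math

# Hypothetical scalar reference table (not part of ntops).
# math.lcm and math.nextafter require Python 3.9 or newer.
REFERENCE = {
    "rad2deg": math.degrees,      # radians -> degrees
    "copysign": math.copysign,    # magnitude of x, sign of y
    "lcm": math.lcm,              # least common multiple of integers
    "lgamma": math.lgamma,        # log of |Gamma(x)|
    "nextafter": math.nextafter,  # next float after x in the direction of y
}


def reference(op, *args):
    """Evaluate one operator on scalar inputs using the table above."""
    return REFERENCE[op](*args)


if __name__ == "__main__":
    print(reference("rad2deg", math.pi))
    print(reference("copysign", 3.0, -0.0))
    print(reference("lcm", 4, 6))
    print(reference("lgamma", 5.0))
    print(reference("nextafter", 1.0, 2.0))
```

A scalar table like this is mainly useful for spot-checking edge cases such as signed zero in `copysign` or the one-ulp step in `nextafter`.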

Validation

  • NVIDIA L4/sm_89 correctness passed: nextafter targeted 30/30 and full T1-1-1 suite 150/150.
  • NVIDIA L4/sm_89 performance passed: 100/100 cases, every case ratio >= 0.9, minimum ratio 0.913x.
  • NVIDIA L4/sm_89 InfiniCore integration passed for all five operators through the use_ntops path.
  • Iluvatar MR-V100 passed: correctness 150/150, performance 100/100, and InfiniCore --iluvatar integration for all five operators.
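
The pass criterion above can be sketched as follows. The PR does not state the ratio formula, but the per-case tables are consistent with ratio = torch time / ntops time (for example, 0.3951 / 0.4326 ≈ 0.913 for lcm `u8_mid_1d` on NVIDIA), so that is assumed here. The helper names `speedup_ratio` and `failing_cases` are hypothetical.

```python
def speedup_ratio(ntops_time, torch_time):
    """Assumed definition: torch time / ntops time (above 1.0 means ntops is faster)."""
    return torch_time / ntops_time


def failing_cases(cases, threshold=0.9):
    """Return the names of cases whose ratio falls below the threshold."""
    return [
        name
        for name, (ntops_time, torch_time) in cases.items()
        if speedup_ratio(ntops_time, torch_time) < threshold
    ]


# A few (ntops, torch) NVIDIA timings copied from the tables in this PR.
cases = {
    "rad2deg/f16_large_1d": (0.2678, 0.2685),
    "lcm/u8_mid_1d": (0.4326, 0.3951),
    "lcm/i16_mid_1d": (0.4895, 0.4489),
}

if __name__ == "__main__":
    for name, (ntops_time, torch_time) in cases.items():
        print(name, f"{speedup_ratio(ntops_time, torch_time):.3f}x")
    print("below 0.9:", failing_cases(cases))
```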

Performance Summary

| Operator | Cases | NVIDIA min | Iluvatar min |
| --- | --- | --- | --- |
| rad2deg | 20 | 0.969x | 0.998x |
| copysign | 20 | 0.964x | 0.920x |
| lcm | 20 | 0.913x | 1.247x |
| lgamma | 20 | 0.946x | 0.932x |
| nextafter | 20 | 0.969x | 1.032x |
| Total | 100 | 0.913x | 0.920x |
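
The Total row is the minimum over the per-operator minimums. A minimal sketch of that aggregation, using the values from the summary table (the `parse_ratio` helper and `summary` dictionary are hypothetical names for illustration):

```python
def parse_ratio(text):
    """Convert a ratio string such as '0.913x' to a float."""
    return float(text.rstrip("x"))


# (NVIDIA min, Iluvatar min) per operator, copied from the summary table.
summary = {
    "rad2deg": ("0.969x", "0.998x"),
    "copysign": ("0.964x", "0.920x"),
    "lcm": ("0.913x", "1.247x"),
    "lgamma": ("0.946x", "0.932x"),
    "nextafter": ("0.969x", "1.032x"),
}

nvidia = {op: parse_ratio(pair[0]) for op, pair in summary.items()}
iluvatar = {op: parse_ratio(pair[1]) for op, pair in summary.items()}

# The operator with the lowest ratio determines the overall minimum.
nvidia_worst = min(nvidia, key=nvidia.get)
iluvatar_worst = min(iluvatar, key=iluvatar.get)

if __name__ == "__main__":
    print("NVIDIA:", nvidia_worst, nvidia[nvidia_worst])
    print("Iluvatar:", iluvatar_worst, iluvatar[iluvatar_worst])
```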

rad2deg Performance Cases

| Case | NVIDIA ntops | NVIDIA torch | NVIDIA ratio | Iluvatar ntops | Iluvatar torch | Iluvatar ratio |
| --- | --- | --- | --- | --- | --- | --- |
| f16_large_1d | 0.2678 | 0.2685 | 1.003x | 0.1198 | 0.1258 | 1.050x |
| f32_large_1d | 0.5925 | 0.5744 | 0.969x | 0.2333 | 0.2328 | 0.998x |
| f64_large_1d | 1.1498 | 1.1471 | 0.998x | 0.2000 | 0.2034 | 1.017x |
| f32_large_2d | 0.5811 | 0.5708 | 0.982x | 0.2332 | 0.2335 | 1.001x |
| f16_large_3d | 0.2684 | 0.2669 | 0.995x | 0.1195 | 0.1257 | 1.052x |
| f32_large_3d | 0.5861 | 0.5684 | 0.970x | 0.2334 | 0.2335 | 1.001x |
| f64_large_3d | 1.1540 | 1.1449 | 0.992x | 0.2003 | 0.2034 | 1.015x |
| f32_large_out_1d | 0.5805 | 0.5839 | 1.006x | 0.2335 | 0.2341 | 1.002x |
| f64_large_out_2d | 1.1535 | 1.1435 | 0.991x | 0.2004 | 0.2035 | 1.015x |
| f16_large_out_3d | 0.2684 | 0.2690 | 1.002x | 0.1190 | 0.1242 | 1.043x |
| f32_mid_1d | 0.5827 | 0.5692 | 0.977x | 0.2335 | 0.2330 | 0.998x |
| f16_mid_1d | 0.2652 | 0.2680 | 1.011x | 0.1196 | 0.1258 | 1.052x |
| f64_mid_1d | 1.1500 | 1.1464 | 0.997x | 0.1994 | 0.2039 | 1.023x |
| f32_small_1d | 0.5873 | 0.5697 | 0.970x | 0.2333 | 0.2329 | 0.999x |
| f16_noncontig_4096 | 0.2698 | 0.2661 | 0.986x | 0.1194 | 0.1256 | 1.051x |
| f32_noncontig_4096 | 0.5859 | 0.5702 | 0.973x | 0.2332 | 0.2331 | 0.999x |
| f64_noncontig_2048 | 0.2646 | 0.2643 | 0.999x | 0.0527 | 0.0537 | 1.020x |
| f32_noncontig_out_4096 | 0.5839 | 0.5855 | 1.003x | 0.2331 | 0.2332 | 1.001x |
| f32_permute3d_256x256x128 | 0.2664 | 0.2667 | 1.001x | 0.1187 | 0.1187 | 1.000x |
| f32_permute3d_out_256x256x128 | 0.2680 | 0.2664 | 0.994x | 0.1182 | 0.1186 | 1.003x |

copysign Performance Cases

| Case | NVIDIA ntops | NVIDIA torch | NVIDIA ratio | Iluvatar ntops | Iluvatar torch | Iluvatar ratio |
| --- | --- | --- | --- | --- | --- | --- |
| f16_large_1d | 0.4321 | 0.4330 | 1.002x | 0.1718 | 0.1855 | 1.080x |
| f32_large_1d | 0.8671 | 0.8630 | 0.995x | 0.3435 | 0.3456 | 1.006x |
| f64_large_1d | 1.7419 | 1.7314 | 0.994x | 0.2061 | 0.2027 | 0.984x |
| f32_large_2d | 0.8701 | 0.8588 | 0.987x | 0.3435 | 0.3461 | 1.008x |
| f16_large_3d | 0.4402 | 0.4314 | 0.980x | 0.1719 | 0.1856 | 1.080x |
| f32_large_3d | 0.8701 | 0.8637 | 0.993x | 0.3429 | 0.3461 | 1.009x |
| f64_large_3d | 1.7351 | 1.7006 | 0.980x | 0.2071 | 0.2037 | 0.984x |
| f32_large_out_1d | 0.8683 | 0.8544 | 0.984x | 0.3438 | 0.3459 | 1.006x |
| f64_large_out_2d | 1.7376 | 1.7061 | 0.982x | 0.2064 | 0.2030 | 0.983x |
| f16_large_out_3d | 0.4390 | 0.4256 | 0.969x | 0.1717 | 0.1853 | 1.079x |
| f32_mid_1d | 0.8653 | 0.8627 | 0.997x | 0.3431 | 0.3458 | 1.008x |
| f16_mid_1d | 0.4342 | 0.4333 | 0.998x | 0.1717 | 0.1851 | 1.078x |
| f64_mid_1d | 1.7364 | 1.6907 | 0.974x | 0.2063 | 0.2028 | 0.983x |
| f32_small_1d | 0.8669 | 0.8550 | 0.986x | 0.3429 | 0.3457 | 1.008x |
| f32_broadcast_rect_2048x8192 | 0.2730 | 0.2636 | 0.966x | 0.2138 | 0.1969 | 0.921x |
| f32_broadcast_4096 | 0.2729 | 0.2631 | 0.964x | 0.2137 | 0.1966 | 0.920x |
| f16_noncontig_4096 | 0.4405 | 0.4304 | 0.977x | 0.1714 | 0.1852 | 1.080x |
| f32_noncontig_4096 | 0.8733 | 0.8581 | 0.983x | 0.3434 | 0.3454 | 1.006x |
| f64_noncontig_2048 | 0.4366 | 0.4244 | 0.972x | 0.0545 | 0.0532 | 0.977x |
| f32_permute3d_out_256x256x128 | 0.4406 | 0.4314 | 0.979x | 0.1731 | 0.1757 | 1.015x |

lcm Performance Cases

| Case | NVIDIA ntops | NVIDIA torch | NVIDIA ratio | Iluvatar ntops | Iluvatar torch | Iluvatar ratio |
| --- | --- | --- | --- | --- | --- | --- |
| i32_large_1d | 0.8278 | 0.8608 | 1.040x | 0.6197 | 1.1983 | 1.934x |
| i32_large_positive_1d | 0.8312 | 0.8620 | 1.037x | 0.4445 | 0.9836 | 2.213x |
| i32_large_2d | 0.8319 | 0.8606 | 1.034x | 0.6211 | 1.1986 | 1.930x |
| i32_large_positive_2d | 0.8330 | 0.8596 | 1.032x | 0.4467 | 0.9649 | 2.160x |
| i32_large_3d | 0.8346 | 0.8636 | 1.035x | 0.6019 | 1.1599 | 1.927x |
| i32_large_positive_3d | 0.8369 | 0.8645 | 1.033x | 0.4310 | 0.9513 | 2.207x |
| i32_large_out_1d | 0.8248 | 0.8610 | 1.044x | 0.6026 | 1.1603 | 1.925x |
| i32_large_out_2d | 0.8298 | 0.8596 | 1.036x | 0.6017 | 1.1239 | 1.868x |
| i32_broadcast_8192 | 2.2853 | 2.1833 | 0.955x | 1.6410 | 4.9773 | 3.033x |
| i32_large_low_1d | 0.8297 | 0.8630 | 1.040x | 0.4136 | 0.7197 | 1.740x |
| i16_mid_1d | 0.4895 | 0.4489 | 0.917x | 0.5302 | 0.7479 | 1.411x |
| i16_large_1d | 0.4893 | 0.5124 | 1.047x | 0.5322 | 0.7482 | 1.406x |
| i64_mid_1d | 1.6408 | 1.7197 | 1.048x | 1.2196 | 4.3822 | 3.593x |
| i64_large_1d | 1.6399 | 1.7363 | 1.059x | 1.2213 | 4.3816 | 3.588x |
| u8_mid_1d | 0.4326 | 0.3951 | 0.913x | 0.4418 | 0.5511 | 1.247x |
| i8_mid_1d | 0.4016 | 0.3784 | 0.942x | 0.3831 | 0.7199 | 1.879x |
| i32_noncontig_4096 | 0.8372 | 0.8614 | 1.029x | 0.5854 | 1.1240 | 1.920x |
| i32_noncontig_out_4096 | 0.8325 | 0.8581 | 1.031x | 0.5845 | 1.1235 | 1.922x |
| i16_noncontig_6144 | 1.3412 | 1.2987 | 0.968x | 1.1766 | 1.6669 | 1.417x |
| i32_permute3d_out_256x256x128 | 0.4243 | 0.4366 | 1.029x | 0.2960 | 0.5721 | 1.933x |

lgamma Performance Cases

| Case | NVIDIA ntops | NVIDIA torch | NVIDIA ratio | Iluvatar ntops | Iluvatar torch | Iluvatar ratio |
| --- | --- | --- | --- | --- | --- | --- |
| f16_large_1d | 0.2782 | 0.2721 | 0.978x | 0.3862 | 0.3690 | 0.955x |
| f32_large_1d | 0.5859 | 0.5873 | 1.002x | 0.3871 | 0.3608 | 0.932x |
| f64_large_1d | 11.3853 | 11.3149 | 0.994x | 4.3499 | 9.9664 | 2.291x |
| f32_large_2d | 0.5937 | 0.5878 | 0.990x | 0.3631 | 0.3386 | 0.932x |
| f16_large_3d | 0.2796 | 0.2733 | 0.977x | 0.3623 | 0.3462 | 0.955x |
| f32_large_3d | 0.5952 | 0.5867 | 0.986x | 0.3630 | 0.3386 | 0.933x |
| f64_large_3d | 11.4016 | 11.3285 | 0.994x | 4.2140 | 9.9007 | 2.350x |
| f32_large_out_1d | 0.5876 | 0.5786 | 0.985x | 0.3630 | 0.3389 | 0.934x |
| f64_large_out_2d | 11.3951 | 11.2864 | 0.990x | 4.2122 | 9.9390 | 2.360x |
| f16_large_out_3d | 0.2807 | 0.2656 | 0.946x | 0.3622 | 0.3464 | 0.956x |
| f32_mid_1d | 0.5897 | 0.5873 | 0.996x | 0.3631 | 0.3386 | 0.933x |
| f16_mid_1d | 0.2756 | 0.2727 | 0.989x | 0.3625 | 0.3461 | 0.955x |
| f64_mid_1d | 11.3817 | 11.3126 | 0.994x | 4.2144 | 9.8350 | 2.334x |
| f32_small_1d | 0.5857 | 0.5841 | 0.997x | 0.3633 | 0.3385 | 0.932x |
| f16_noncontig_4096 | 0.2819 | 0.2726 | 0.967x | 0.3626 | 0.3462 | 0.955x |
| f32_noncontig_4096 | 0.5897 | 0.5854 | 0.993x | 0.3630 | 0.3385 | 0.933x |
| f64_noncontig_2048 | 2.9903 | 2.8783 | 0.963x | 1.0565 | 2.2345 | 2.115x |
| f32_noncontig_out_4096 | 0.5915 | 0.5799 | 0.980x | 0.3632 | 0.3389 | 0.933x |
| f32_permute3d_256x256x128 | 0.2793 | 0.2718 | 0.973x | 0.1849 | 0.1747 | 0.945x |
| f32_permute3d_out_256x256x128 | 0.2806 | 0.2672 | 0.952x | 0.1848 | 0.1747 | 0.945x |

nextafter Performance Cases

| Case | NVIDIA ntops | NVIDIA torch | NVIDIA ratio | Iluvatar ntops | Iluvatar torch | Iluvatar ratio |
| --- | --- | --- | --- | --- | --- | --- |
| f16_large_1d | 0.4247 | 0.4273 | 1.006x | 0.1704 | 0.1884 | 1.106x |
| f32_large_1d | 0.8311 | 0.8528 | 1.026x | 0.3336 | 0.3444 | 1.032x |
| f64_large_1d | 1.7096 | 1.7009 | 0.995x | 0.7306 | 0.9712 | 1.329x |
| f32_large_2d | 0.8369 | 0.8550 | 1.022x | 0.3331 | 0.3446 | 1.034x |
| f16_large_3d | 0.4327 | 0.4307 | 0.995x | 0.1704 | 0.1873 | 1.099x |
| f32_large_3d | 0.8436 | 0.8550 | 1.013x | 0.3335 | 0.3448 | 1.034x |
| f64_large_3d | 1.7191 | 1.7335 | 1.008x | 0.7299 | 0.9530 | 1.306x |
| f32_large_out_1d | 0.8450 | 0.8567 | 1.014x | 0.3332 | 0.3444 | 1.034x |
| f64_large_out_2d | 1.7280 | 1.6992 | 0.983x | 0.7305 | 0.9507 | 1.301x |
| f16_large_out_3d | 0.4332 | 0.4319 | 0.997x | 0.1697 | 0.1875 | 1.105x |
| f32_mid_1d | 0.8608 | 0.8561 | 0.995x | 0.3336 | 0.3451 | 1.034x |
| f16_mid_1d | 0.4207 | 0.4314 | 1.025x | 0.1702 | 0.1880 | 1.104x |
| f64_mid_1d | 1.7149 | 1.7144 | 1.000x | 0.7304 | 0.9537 | 1.306x |
| f32_small_1d | 0.8297 | 0.8550 | 1.030x | 0.3335 | 0.3441 | 1.032x |
| f32_broadcast_rect_2048x8192 | 0.2689 | 0.2631 | 0.978x | 0.2075 | 0.2593 | 1.250x |
| f32_broadcast_4096 | 0.2680 | 0.2631 | 0.982x | 0.2073 | 0.2595 | 1.252x |
| f16_noncontig_4096 | 0.4316 | 0.4328 | 1.003x | 0.1700 | 0.1871 | 1.101x |
| f32_noncontig_4096 | 0.8586 | 0.8672 | 1.010x | 0.3336 | 0.3448 | 1.034x |
| f64_noncontig_2048 | 0.4391 | 0.4253 | 0.969x | 0.1866 | 0.2451 | 1.314x |
| f32_permute3d_out_256x256x128 | 0.4394 | 0.4268 | 0.971x | 0.1699 | 0.1760 | 1.035x |

Notes

  • PyTorch is used only as the test reference, not as the runtime implementation.
  • The latest Iluvatar nextafter float16 fix is gated on half and iluvatar; NVIDIA paths remain on the existing ntops/NineToothed kernels.
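
The gating described in the last note can be pictured roughly as below. This is a hypothetical sketch, not the code in this PR: the function names, the string-valued `dtype` and `device` arguments, and the path labels are all assumptions made for illustration.

```python
def uses_iluvatar_half_fix(dtype, device):
    """Hypothetical gate: the fix applies only when both conditions hold."""
    return dtype == "float16" and device == "iluvatar"


def select_nextafter_path(dtype, device):
    """Pick a (hypothetical) code path label for nextafter."""
    if uses_iluvatar_half_fix(dtype, device):
        return "iluvatar_half_fix"
    # Every other combination, including all NVIDIA cases, keeps the default path.
    return "default_kernel"


if __name__ == "__main__":
    for dtype in ("float16", "float32"):
        for device in ("nvidia", "iluvatar"):
            print(dtype, device, select_nextafter_path(dtype, device))
```

Requiring both conditions keeps the change narrow: float16 on NVIDIA and float32 on Iluvatar both stay on the default path.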

HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from e7ccc3b to 1053c1c on May 18, 2026 12:58
HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from 1053c1c to 2824162 on May 18, 2026 13:08
HyosungSink force-pushed the 2026-spring-HyosungSink-T1-1-1 branch from ec69e6f to 0ccbed2 on May 19, 2026 16:44