Skip to content

fix(theta): add ThetaHashable trait to fix inconsistent hashing across types#111

Open
ZENOTME wants to merge 1 commit intoapache:mainfrom
ZENOTME:theta-canonicalization
Open

fix(theta): add ThetaHashable trait to fix inconsistent hashing across types#111
ZENOTME wants to merge 1 commit intoapache:mainfrom
ZENOTME:theta-canonicalization

Conversation

@ZENOTME
Copy link
Copy Markdown
Contributor

@ZENOTME ZENOTME commented Mar 24, 2026

Rust ThetaSketch::update previously accepted arbitrary T: Hash and fed Rust-native Hash semantics directly into the theta hash path. This diverges from the Java/C++ datasketches implementations, which use explicit canonical update semantics for numeric, floating-point, string, and byte-oriented inputs.

Concretely, this caused several cross-language mismatches:

  • narrow integers were not canonicalized through the same signed-widening path as Java/C++
  • str / byte-oriented inputs followed Rust structural Hash semantics instead of raw-byte semantics
  • the default integer-stream path (for example for i in 0..n { sketch.update(i) }) could produce different retained hashes / estimates from the C++ implementation

@ZENOTME ZENOTME force-pushed the theta-canonicalization branch from adb49dd to 5591bcd Compare March 24, 2026 17:10
@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Mar 24, 2026

Seems CPC update and HLL update may have the same problem. cc @tisonkun @PsiACE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant