Skip to content

[FLINK-39399] Fix integer overflow in HyperLogLogPlusPlus causing APPROX_COUNT_DISTINCT undercount#27895

Open
jnh5y wants to merge 1 commit intoapache:masterfrom
jnh5y:FLINK-39399-hll-integer-overflow
Open

[FLINK-39399] Fix integer overflow in HyperLogLogPlusPlus causing APPROX_COUNT_DISTINCT undercount#27895
jnh5y wants to merge 1 commit intoapache:masterfrom
jnh5y:FLINK-39399-hll-integer-overflow

Conversation

@jnh5y
Copy link
Copy Markdown
Contributor

@jnh5y jnh5y commented Apr 3, 2026

What is the purpose of the change

In HyperLogLogPlusPlus.query(), the expression "1 << mIdx" uses int shift which wraps modulo 32. At high cardinality
(~100M+ distinct values), some registers accumulate values >= 32, causing the shift to produce incorrect zInverse contributions and wildly wrong estimates.

Brief change log

In HyperLogLogPlusPlus.java, change "1 << mIdx" to "1L << mIdx" to use long shift arithmetic.

Verifying this change

This change added a test in HyperLogLogPlusPlusTest.java.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

Generated-by: Claude Opus 4.6 (1M context)

@flinkbot
Copy link
Copy Markdown
Collaborator

flinkbot commented Apr 3, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@snuyanzin
Copy link
Copy Markdown
Contributor

snuyanzin commented Apr 3, 2026

Thanks for contribution

since you use AI, can you use Generated-by: instead of Co-authored-by: as mentioned in Apache recommendations https://www.apache.org/legal/generative-tooling.html

also the doc about this is in progress https://github.com/apache/flink/pull/27776/changes#r2979585031

@jnh5y
Copy link
Copy Markdown
Contributor Author

jnh5y commented Apr 3, 2026

Thanks for contribution

since you use AI, can you use Generated-by: instead of Co-authored-by: as mentioned in Apache recommendations https://www.apache.org/legal/generative-tooling.html

also the doc about this is in progress https://github.com/apache/flink/pull/27776/changes#r2979585031

Is it sufficient to have that on the PR Description and keep the Co-Authored tag as-is in the Git commit?

@snuyanzin
Copy link
Copy Markdown
Contributor

quote from ASF doc

This should be included as a token in the source control commit message, for example including the phrase “Generated-by: ”.

so it should be changed in commit

…ROX_COUNT_DISTINCT undercount

In HyperLogLogPlusPlus.query(), the expression "1 << mIdx" uses int
shift which wraps modulo 32. At high cardinality (~100M+ distinct
values), some registers accumulate values >= 32, causing the shift to
produce incorrect zInverse contributions and wildly wrong estimates.

Change "1 << mIdx" to "1L << mIdx" to use long shift arithmetic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generated-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jnh5y jnh5y force-pushed the FLINK-39399-hll-integer-overflow branch from 43626cd to b40601d Compare April 6, 2026 13:56
@jnh5y
Copy link
Copy Markdown
Contributor Author

jnh5y commented Apr 6, 2026

@flinkbot run azure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants