Merged
58 changes: 57 additions & 1 deletion README.md
@@ -22,11 +22,67 @@ The following additional types are implemented, but less tested:

## Reference

* Thomas Mueller Graf, Daniel Lemire, [Binary Fuse Filters: Fast and Smaller Than Xor Filters](http://arxiv.org/abs/2201.01174), Journal of Experimental Algorithmics 27, 2022. DOI: 10.1145/3510449
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122

## Usage


To use the XOR and Binary Fuse filters, first prepare an array of keys, then construct the filter:

```java
import org.fastfilter.xor.XorBinaryFuse8;
import org.fastfilter.xor.XorBinaryFuse16;

// Example keys
long[] keys = {1, 2, 3, 4, 5};

// Construct binary fuse filters
XorBinaryFuse8 xorBinaryFuse8 = XorBinaryFuse8.construct(keys);
XorBinaryFuse16 xorBinaryFuse16 = XorBinaryFuse16.construct(keys);

// Check membership
boolean mightContain = xorBinaryFuse8.mayContain(1L); // true
boolean mightContain2 = xorBinaryFuse8.mayContain(6L); // false (with high probability)
```

All filters implement the `Filter` interface and support the `mayContain(long key)` method to check if a key might be in the set. Note that false positives are possible, but false negatives are not.

### Generating the Hash Values

The library is written to process `long` values that are meant to be hash values. Though you do not need to use
cryptographically strong hashing, you should make sure that your hash functions are reasonable: they should
not generate too many collisions (two objects mapping to the same `long` value).
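
As an illustration, here is a minimal sketch of one way to turn strings into `long` keys, using the 64-bit FNV-1a hash. The class name `KeyHashing` and its methods are hypothetical and not part of this library; FNV-1a is only one reasonable choice among many, and any well-distributed 64-bit hash would do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of fastfilter): maps strings to 64-bit keys.
final class KeyHashing {

    // Standard 64-bit FNV-1a parameters.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // Hashes the UTF-8 bytes of the string with 64-bit FNV-1a.
    static long hash64(String s) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);   // mix in the next byte (as an unsigned value)
            h *= FNV_PRIME;    // multiplication wraps modulo 2^64
        }
        return h;
    }

    // Converts an array of items into the long[] expected by construct(...).
    static long[] toKeys(String[] items) {
        long[] keys = new long[items.length];
        for (int i = 0; i < items.length; i++) {
            keys[i] = hash64(items[i]);
        }
        return keys;
    }

    public static void main(String[] args) {
        long[] keys = toKeys(new String[] {"apple", "banana", "cherry"});
        for (long key : keys) {
            System.out.println(Long.toHexString(key));
        }
    }
}
```

The resulting array can be passed to `construct`, and a lookup then hashes the candidate the same way before calling `mayContain`.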

### Serialization and Deserialization

Filters can be serialized to and deserialized from a `ByteBuffer` for persistence or transmission:

```java
import java.nio.ByteBuffer;

// Assuming you have a constructed filter, e.g. the xorBinaryFuse8 instance from above

// Get the serialized size
int size = xorBinaryFuse8.getSerializedSize();

// Allocate a ByteBuffer
ByteBuffer buffer = ByteBuffer.allocate(size);

// Serialize the filter
xorBinaryFuse8.serialize(buffer);

// Prepare buffer for reading (flip)
buffer.flip();

// Deserialize the filter
XorBinaryFuse8 deserializedXorBinaryFuse8 = XorBinaryFuse8.deserialize(buffer);

// The deserialized filter behaves identically to the original
```

This allows saving filters to files, databases, or sending them over networks.
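
For the file case, a minimal sketch using only the JDK might look as follows. It works on a plain `ByteBuffer`, so it makes no assumption about the filter API; the class `BufferFileSketch` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of fastfilter): stores a ByteBuffer in a file.
final class BufferFileSketch {

    // Writes the remaining bytes of the buffer to the file.
    static void writeBuffer(Path path, ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        // Read through a duplicate so the caller's position is left untouched.
        buffer.duplicate().get(bytes);
        try {
            Files.write(path, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file into a buffer that is ready for reading.
    static ByteBuffer readBuffer(Path path) {
        try {
            return ByteBuffer.wrap(Files.readAllBytes(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the buffer to a temporary file and reads it back.
    static ByteBuffer roundTrip(ByteBuffer buffer) {
        try {
            Path path = Files.createTempFile("filter", ".bin");
            try {
                writeBuffer(path, buffer);
                return readBuffer(path);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.allocate(16);
        original.putLong(42L).putLong(7L);
        original.flip(); // switch from writing to reading
        System.out.println(roundTrip(original).equals(original));
    }
}
```

The buffer returned by `readBuffer` starts at position 0, so it can be handed directly to a `deserialize` method.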

### Maven

When using Maven: The latest version, 1.0.4, is not yet available on Maven central, see [issue #48](https://github.com/FastFilter/fastfilter_java/issues/48). However, it is available at https://jitpack.io/:
27 changes: 27 additions & 0 deletions fastfilter/src/main/java/org/fastfilter/xor/Deduplicator.java
@@ -0,0 +1,27 @@
package org.fastfilter.xor;

import java.util.Arrays;

public class Deduplicator {

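    /**
     * Sorts the first {@code length} elements of the keys array and removes
     * duplicates in place. The unique values end up, in ascending order, at the
     * start of the array; elements beyond the returned length should be ignored.
     *
     * @param keys the array of keys to deduplicate
     * @param length the number of leading elements of the array to consider
     * @return the number of unique elements
     */
    public static int sortAndRemoveDup(long[] keys, int length) {
        Arrays.sort(keys, 0, length);
        if (length == 0) {
            // an empty range has no unique elements
            return 0;
        }
        // keys[0] is always kept; j is the next write position
        int j = 1;
        for (int i = 1; i < length; i++) {
            if (keys[i] != keys[i - 1]) {
                keys[j] = keys[i];
                j++;
            }
        }
        return j;
    }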

}
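
The in-place contract of `sortAndRemoveDup` means that the caller has to keep track of the returned length. The following standalone sketch illustrates this; `DedupSketch` and `uniqueKeys` are hypothetical names, and the method body mirrors the logic above with an extra guard for an empty range (an assumption of this sketch).

```java
import java.util.Arrays;

// Hypothetical standalone sketch; it mirrors the logic of Deduplicator.
final class DedupSketch {

    // Sorts keys[0..length) and compacts the unique values to the front.
    static int sortAndRemoveDup(long[] keys, int length) {
        Arrays.sort(keys, 0, length);
        if (length == 0) {
            return 0; // guard added in this sketch for an empty range
        }
        int j = 1; // next write position; keys[0] is always kept
        for (int i = 1; i < length; i++) {
            if (keys[i] != keys[i - 1]) {
                keys[j] = keys[i];
                j++;
            }
        }
        return j;
    }

    // Returns a new, sorted array of unique keys and leaves the input untouched.
    static long[] uniqueKeys(long[] keys) {
        long[] copy = keys.clone();
        int uniqueCount = sortAndRemoveDup(copy, copy.length);
        return Arrays.copyOf(copy, uniqueCount);
    }

    public static void main(String[] args) {
        long[] keys = {5, 3, 5, 1, 3};
        System.out.println(Arrays.toString(uniqueKeys(keys))); // prints [1, 3, 5]
    }
}
```

Callers that do not want their key array reordered can deduplicate a copy in this way before constructing a filter.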
3 changes: 3 additions & 0 deletions fastfilter/src/main/java/org/fastfilter/xor/Xor16.java
@@ -6,7 +6,10 @@
import org.fastfilter.utils.Hash;

/**
* The Xor16 filter implementation is experimental. We recommend using XorBinaryFuse16 instead. Use at your own risk.
*
* The xor filter, a new algorithm that can replace a Bloom filter.
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
*
* It needs 1.23 log(1/fpp) bits per key. It is related to the BDZ algorithm [1]
* (a minimal perfect hash function algorithm).
4 changes: 4 additions & 0 deletions fastfilter/src/main/java/org/fastfilter/xor/Xor8.java
@@ -6,8 +6,12 @@
import org.fastfilter.Filter;
import org.fastfilter.utils.Hash;


/**
* The Xor8 filter implementation is experimental. We recommend using XorBinaryFuse8 instead. Use at your own risk.
*
* The xor filter, a new algorithm that can replace a Bloom filter.
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
*
* It needs 1.23 log(1/fpp) bits per key. It is related to the BDZ algorithm [1]
* (a minimal perfect hash function algorithm).
105 changes: 59 additions & 46 deletions fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse16.java
@@ -7,6 +7,7 @@

/**
* The xor binary fuse filter, a new algorithm that can replace a Bloom filter.
* Thomas Mueller Graf, Daniel Lemire, [Binary Fuse Filters: Fast and Smaller Than Xor Filters](http://arxiv.org/abs/2201.01174), Journal of Experimental Algorithmics 27, 2022. DOI: 10.1145/3510449
*/
public class XorBinaryFuse16 implements Filter {

@@ -78,6 +79,15 @@ private static int mod3(int x) {
return x;
}

/**
* Constructs a new XorBinaryFuse16 filter from the given array of keys.
* The filter is designed to have a low false positive rate while being space-efficient.
* The keys array should contain unique values. The array may be mutated during construction
* (e.g., sorted and deduplicated) if the algorithm detects that there are likely too many duplicates.
*
* @param keys the array of long keys to add to the filter
* @return a new XorBinaryFuse16 filter containing all the keys
*/
public static XorBinaryFuse16 construct(long[] keys) {
int size = keys.length;
int segmentLength = calculateSegmentLength(ARITY, size);
@@ -102,6 +112,7 @@ private void addAll(long[] keys) {
long[] reverseOrder = new long[size + 1];
byte[] reverseH = new byte[size];
int reverseOrderPos = 0;
boolean duplicated = false;

// the lowest 2 bits are the h index (0, 1, or 2)
// so we only have 6 bits for counting;
@@ -117,7 +128,6 @@ private void addAll(long[] keys) {
blockBits++;
}
int block = 1 << blockBits;
mainloop:
while (true) {
reverseOrder[size] = 1;
int[] startPos = new int[block];
@@ -126,7 +136,8 @@ private void addAll(long[] keys) {
}
// counting sort

// iterate over the first `size` keys only; `size` may shrink after deduplication
for (int i = 0; i < size; i++) {
long key = keys[i];
long hash = Hash.hash64(key, seed);
int segmentIndex = (int) (hash >>> (64 - blockBits));
// We only overwrite when the hash was zero. Zero hash values
@@ -150,60 +161,62 @@ private void addAll(long[] keys) {
}
}
startPos = null;

// proceed only when no possible counter overflow was detected
if (countMask >= 0) {
reverseOrderPos = 0;
int alonePos = 0;
for (int i = 0; i < arrayLength; i++) {
alone[alonePos] = i;
int inc = (t2count[i] >> 2) == 1 ? 1 : 0;
alonePos += inc;
}

while (alonePos > 0) {
alonePos--;
int index = alone[alonePos];
if ((t2count[index] >> 2) == 1) {
// It is still there!
long hash = t2hash[index];
byte found = (byte) (t2count[index] & 3);

reverseH[reverseOrderPos] = found;
reverseOrder[reverseOrderPos] = hash;

h012[0] = getHashFromHash(hash, 0);
h012[1] = getHashFromHash(hash, 1);
h012[2] = getHashFromHash(hash, 2);

int index3 = h012[mod3(found + 1)];
alone[alonePos] = index3;
alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
t2count[index3] -= 4;
t2count[index3] ^= mod3(found + 1);
t2hash[index3] ^= hash;

index3 = h012[mod3(found + 2)];
alone[alonePos] = index3;
alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
t2count[index3] -= 4;
t2count[index3] ^= mod3(found + 2);
t2hash[index3] ^= hash;

reverseOrderPos++;
}
}
}

if (reverseOrderPos == size) {
break;
}
hashIndex++;
Arrays.fill(t2count, (byte) 0);
Arrays.fill(t2hash, 0);
Arrays.fill(reverseOrder, 0);

// If a possible counter overflow was detected (countMask < 0), we assume
// that there are too many duplicates in the input key set. We then sort
// and remove duplicates in place, at most once. This should almost never happen.
if (countMask < 0 && !duplicated) {
size = Deduplicator.sortAndRemoveDup(keys, size);
duplicated = true;
}
if (hashIndex > 100) {
// if construction doesn't succeed eventually,
// then there is likely a problem with the hash function.
@@ -7,7 +7,8 @@
import org.fastfilter.utils.Hash;

/**
* The xor binary fuse filter, a new algorithm that can replace a Bloom filter.
* The XorBinaryFuse32 filter is experimental. We recommend using XorBinaryFuse8 or XorBinaryFuse16 instead.
* Use at your own risk.
*/
public class XorBinaryFuse32 implements Filter {
