137 reads

Bloom Filter Basics in Go

by Dmitrii AntonovMarch 6th, 2024

Too Long; Didn't Read

Learn about Bloom filters: memory-efficient data structures using hashing for fast set membership queries.

featured image - Bloom Filter Basics in Go

Introduction

Bloom filter is a probabilistic data structure based on hashing. It can help us answer whether a key is present in a given set. The Bloom filter is based on a set of bits and hash functions. As opposed to a hash table we don’t store a key-value pair, we only store several bits that equal the number of hash functions. The Bloom filter cannot accurately answer the question of representing a key as a bit set because bits can match other keys. This image below must help you figure out the base idea.

First, we insert "key one” into our bit set and mark two bits as 1. Then we insert another key "key two” into our bit set and also mark other bits as 1. When we want to check for another key in our set of bits, we see that they are already populated, even though the key is not present in our filter. It is called a false positive occurrence.

Optimal size of bit set and number of hash functions

Our implementation of the Bloom filter will take two arguments. First, an expected number of elements that the Bloom filter will store. Second, a probability of the false positive occurrence.

We can compute the optimal size of the bit set by the function -((n * ln(p)) / (ln(2) * ln(2))) where n is the expected number of elements and p is the probability of a false positive occurrence.

For the calculation number of hash functions, we can use the function m / n * ln(2) where m is the size of the bit set, and n is the expected number of elements.

type Bloom struct {
	m      uint32        // m is size of bitset
	k      uint32        // k is number of hash functions
	bitset []bool        // bitset represent an array of bits
	hashes []hash.Hash32 // hashes is preset hash functions
}

// New create a new instance of the bloom filter
// n is expected number of elements
// p is false alarm probability
func New(n int, p float64) *Bloom {
	m := M(n, p)
	k := K(m, uint32(n))
	bitset := make([]bool, m)
	hashes := make([]hash.Hash32, k)

	var i uint32

	for ; i < k; i++ {
		hashes[i] = murmur.New32WithSeed(i)
	}

	return &Bloom{
		m:      m,
		k:      k,
		bitset: bitset,
		hashes: hashes,
	}
}

// M calculates size of bitset
// n is an expected number of elements
// p is a false alarm probability
func M(n int, p float64) uint32 {
	return uint32(math.Ceil(-((float64(n) * math.Log(p)) / (math.Pow(math.Log(2), 2)))))
}

// K calculates number of hash functions
// m is a size of bitset
// n is an expected number of elements
func K(m, n uint32) uint32 {
	return uint32(math.Ceil(math.Log(2) * float64(m) / float64(n)))
}

I used murmur3 hash functions. If you are interested in how this hash function works, you can read this Wikipedia article.

Add and contain operations

Our implementation of adding the key into the Bloom filter is very easy. We find all the positions of our key in the same way as in the hash table. We have to calculate a hash and take the remainder of the division by the bit set size.

// Add the key to the bloom filter
// k is the key of the element
func (b *Bloom) Add(k []byte) {
	for _, h := range b.hashes {
		h.Reset()
		_, _ = h.Write(k)
		idx := h.Sum32() % b.m
		b.bitset[idx] = true
	}
}

The contain operation is similar to the add operation we must check all possible occurrences have positive values. If all the occurrences are true we will return true otherwise false.

func (b *Bloom) Contains(k []byte) bool {
	for _, h := range b.hashes {
		h.Reset()
		_, _ = h.Write(k)
		idx := h.Sum32() % b.m
		if !b.bitset[idx] {
			return false
		}
	}
	return true
}

Union operation

We can union different Bloom filters, but we have a few restrictions. Our Bloom filters must have the same size and number of hash functions. Also, keep in mind that if the sum of the elements is greater than the expected sum, we will lose accuracy.

// Union two different bloom filters with the same size and number of hash functions
func (b *Bloom) Union(a *Bloom) (err error) {
	if b.m != a.m {
		return errors.New("the bloom filters have the different sizes")
	}

	if b.k != a.k {
		return errors.New("the bloom filters have the different number of hash functions")
	}

	n := int(b.m)

	for i := 0; i < n; i++ {
		if b.bitset[i] || a.bitset[i] {
			b.bitset[i] = true
		}
	}

	return nil
}

Props and cons

Props

The bloom filter ensures that the add and contain operation requires O(k) time complexity, where k is the number of hash functions.
The bloom filter requires less memory than the hash table because it stores neither keys nor values.
The bloom filter is very useful when we want to reduce the number of database or cache queries.

Cons

The bit set size and the number of hash functions cannot be changed once the Bloom filter has been created.
If we add more keys than expected, the Bloom filter will lose accuracy.