The Problem

Let's say you are building a sign-up service for an application. You need to ensure that the user doesn't choose an username that has already been taken by someone else.

You might think of a solution which looks like this:

Send a GET request like /username?radioactive11
Query your database to check if a user with username radioactive11 already exists
If the query result is None (or empty), you can allot this username to the new user
Else ask the user to choose a different username

While this approach will work perfectly, it is very slow. Considering the popularity of a sign-up endpoint, this approach is highly unoptimised as you will be doing several read requests to the database for just one write request.

A possible solution

A better solution would be to introduce a caching layer, using some fast database like Redis. Here is what this solution will look like.

Send the same GET request: /username?radioactive11
Check the cache for an existing username
If the username exists, then ask the user to choose a different one. (Cache Hit)
If the username doesn't exist in the cache, query the primary database. (Cache Miss)
If the username doesn't exist in the primary database, allow the user to take this name.
Else ask the user to choose a different username

However, this approach too has it's problems. First, we increased our memory usage. Secondly, we need to ensure that the data in the cache is up to date with out primary database and not stale.

A better solution: Bloom Filters

By definition, a bloom filter is a probabilistic data structure which tells us whether an element might be in a database or definitely isn't. They are only prone to false-positives, that is, searching for an element that does not exists might return incorrect results.

How do bloom filters work?

Inserting an item

We take a Bit Vector of size m with all the bits set to false by default. in this example, let's take a vector with size m = 10.

Next, we need k efficient hash functions. In this example, let's take 3 hash functions, HF1, HF2 & HF3.

In order to insert an element in bloom filter, we hash it using multiple hash functions.

For example, we want to add radioactive11 using the 3 hash functions. Using our arbitrary hash functions, we get the following values:

HF1("radioactive11") = 1988
HF2("radioactive11") = 2022
HF3("radioactive11") = 2001

Now we take mod(m) (10 in this case) of the values obtained by each hash function. We get

1988 % 10 = 8
2022 %10 = 2
2001 % 10 = 1

Now we set 1st, 2nd and 9th bit of our vector to True.

Blog Diagrams.jpg

Testing presence of an item

If we want to verify the existence of an item in the database, we hash it using the same hash functions and calculate the mod. If all the k indices in the vector are already set to true, the item might be already present in the database.

However, if all the k bits are not set to true, the item is definitely not in the database.

Trying it out

Redis provides a module called RedisBloom which provides an easy to use, scalable Bloom Filter.

You can run a local instance of Redis Stack using Docker by following these steps.

Once you have Redis Stack running, you can connect to Redis using the `redis-cli.

Connect to Redis
```
redis-cli
```
Check if the server is running or not
```
127.0.0.1:6379> PING
PONG
```
Create a new bloom filter using the BF.ADD command. This will create a small bloom filter suitable for small number of items.
```
127.0.0.1:6379> BF.ADD bloom taylorswift
(integer) 1
```

You can add more elements using BF.ADD

127.0.0.1:6379> BF.ADD bloom arianagrande
(integer) 1
127.0.0.1:6379> BF.ADD bloom hardwell
(integer) 1
127.0.0.1:6379> BF.ADD bloom avicii
(integer) 1

To check if an element exists, you can do BF.EXISTS. If the output is 1, the element might exist. If the output it 0, the element definitely doesn't exist.

127.0.0.1:6379> BF.EXISTS bloom selenagomez
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom taylorswift
(integer) 1
127.0.0.1:6379> BF.EXISTS bloom edsheeran
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom arianagrande
(integer) 1

Conclusion

Bloom filters are used in a variety of applications. Some of them are:

Avoid caching data that is not frequently used
Avoid suggesting users content that they have interacted with before
Identify malicious URLs

Reduce Database Queries with Bloom Filters

The Problem

A possible solution

A better solution: Bloom Filters

How do bloom filters work?

Inserting an item

Testing presence of an item

Trying it out

Conclusion

Comments

More from this blog

Getting Started with gRPC in Python: A Practical Approach

A Beginner's Guide to gRPC: Faster, Smarter APIs

So, what are ACID Properties?

Introducing Vinyl, scribble for Music 💿

Command Palette

The Problem

A possible solution

A better solution: Bloom Filters

How do bloom filters work?

Inserting an item

Testing presence of an item

Trying it out

Conclusion

Comments

More from this blog