The Problem
Let's say you are building a sign-up service for an application. You need to ensure that the user doesn't choose an username that has already been taken by someone else.
You might think of a solution which looks like this:
- Send a GET request like
/username?radioactive11
- Query your database to check if a user with username
radioactive11
already exists - If the query result is None (or empty), you can allot this username to the new user
- Else ask the user to choose a different username
While this approach will work perfectly, it is very slow. Considering the popularity of a sign-up endpoint, this approach is highly unoptimised as you will be doing several read requests to the database for just one write request.
A possible solution
A better solution would be to introduce a caching layer, using some fast database like Redis. Here is what this solution will look like.
- Send the same GET request:
/username?radioactive11
- Check the cache for an existing username
- If the username exists, then ask the user to choose a different one. (Cache Hit)
- If the username doesn't exist in the cache, query the primary database. (Cache Miss)
- If the username doesn't exist in the primary database, allow the user to take this name.
- Else ask the user to choose a different username
However, this approach too has it's problems. First, we increased our memory usage. Secondly, we need to ensure that the data in the cache is up to date with out primary database and not stale.
A better solution: Bloom Filters
By definition, a bloom filter is a probabilistic data structure which tells us whether an element might be in a database or definitely isn't. They are only prone to false-positives, that is, searching for an element that does not exists might return incorrect results.
How do bloom filters work?
Inserting an item
We take a Bit Vector
of size m
with all the bits set to false
by default. in this example, let's take a vector with size m = 10
.
Next, we need k
efficient hash functions. In this example, let's take 3 hash functions, HF1, HF2 & HF3
.
In order to insert an element in bloom filter, we hash it using multiple hash functions.
For example, we want to add radioactive11
using the 3 hash functions. Using our arbitrary hash functions, we get the following values:
HF1("radioactive11")
= 1988HF2("radioactive11")
= 2022HF3("radioactive11")
= 2001
Now we take mod(m)
(10 in this case) of the values obtained by each hash function. We get
- 1988 % 10 = 8
- 2022 %10 = 2
- 2001 % 10 = 1
Now we set 1st, 2nd and 9th bit of our vector to True
.
Testing presence of an item
If we want to verify the existence of an item in the database, we hash it using the same hash functions and calculate the mod
. If all the k
indices in the vector are already set to true, the item might be already present in the database.
However, if all the k
bits are not set to true, the item is definitely not in the database.
Trying it out
Redis provides a module called RedisBloom
which provides an easy to use, scalable Bloom Filter.
You can run a local instance of Redis Stack using Docker by following these steps.
Once you have Redis Stack running, you can connect to Redis using the `redis-cli.
- Connect to Redis
redis-cli
- Check if the server is running or not
127.0.0.1:6379> PING PONG
- Create a new bloom filter using the
BF.ADD
command. This will create a small bloom filter suitable for small number of items.127.0.0.1:6379> BF.ADD bloom taylorswift (integer) 1
- You can add more elements using BF.ADD
127.0.0.1:6379> BF.ADD bloom arianagrande (integer) 1 127.0.0.1:6379> BF.ADD bloom hardwell (integer) 1 127.0.0.1:6379> BF.ADD bloom avicii (integer) 1
- To check if an element exists, you can do
BF.EXISTS
. If the output is 1, the element might exist. If the output it 0, the element definitely doesn't exist.
127.0.0.1:6379> BF.EXISTS bloom selenagomez
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom taylorswift
(integer) 1
127.0.0.1:6379> BF.EXISTS bloom edsheeran
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom arianagrande
(integer) 1
Conclusion
Bloom filters are used in a variety of applications. Some of them are:
- Avoid caching data that is not frequently used
- Avoid suggesting users content that they have interacted with before
- Identify malicious URLs