Reduce Database Queries with Bloom Filters

Reduce Database Queries with Bloom Filters

ft. Redis Bloom

·

4 min read

The Problem

Let's say you are building a sign-up service for an application. You need to ensure that the user doesn't choose an username that has already been taken by someone else.

You might think of a solution which looks like this:

  1. Send a GET request like /username?radioactive11
  2. Query your database to check if a user with username radioactive11 already exists
  3. If the query result is None (or empty), you can allot this username to the new user
  4. Else ask the user to choose a different username

While this approach will work perfectly, it is very slow. Considering the popularity of a sign-up endpoint, this approach is highly unoptimised as you will be doing several read requests to the database for just one write request.

A possible solution

A better solution would be to introduce a caching layer, using some fast database like Redis. Here is what this solution will look like.

  1. Send the same GET request: /username?radioactive11
  2. Check the cache for an existing username
  3. If the username exists, then ask the user to choose a different one. (Cache Hit)
  4. If the username doesn't exist in the cache, query the primary database. (Cache Miss)
  5. If the username doesn't exist in the primary database, allow the user to take this name.
  6. Else ask the user to choose a different username

However, this approach too has it's problems. First, we increased our memory usage. Secondly, we need to ensure that the data in the cache is up to date with out primary database and not stale.

A better solution: Bloom Filters

By definition, a bloom filter is a probabilistic data structure which tells us whether an element might be in a database or definitely isn't. They are only prone to false-positives, that is, searching for an element that does not exists might return incorrect results.

How do bloom filters work?

Inserting an item

We take a Bit Vector of size m with all the bits set to false by default. in this example, let's take a vector with size m = 10.

Next, we need k efficient hash functions. In this example, let's take 3 hash functions, HF1, HF2 & HF3.

In order to insert an element in bloom filter, we hash it using multiple hash functions.

For example, we want to add radioactive11 using the 3 hash functions. Using our arbitrary hash functions, we get the following values:

  • HF1("radioactive11") = 1988
  • HF2("radioactive11") = 2022
  • HF3("radioactive11") = 2001

Now we take mod(m) (10 in this case) of the values obtained by each hash function. We get

  • 1988 % 10 = 8
  • 2022 %10 = 2
  • 2001 % 10 = 1

Now we set 1st, 2nd and 9th bit of our vector to True.

Blog Diagrams.jpg

Testing presence of an item

If we want to verify the existence of an item in the database, we hash it using the same hash functions and calculate the mod. If all the k indices in the vector are already set to true, the item might be already present in the database.

However, if all the k bits are not set to true, the item is definitely not in the database.

Trying it out

Redis provides a module called RedisBloom which provides an easy to use, scalable Bloom Filter.

You can run a local instance of Redis Stack using Docker by following these steps.

Once you have Redis Stack running, you can connect to Redis using the `redis-cli.

  1. Connect to Redis
    redis-cli
    
  2. Check if the server is running or not
    127.0.0.1:6379> PING
    PONG
    
  3. Create a new bloom filter using the BF.ADD command. This will create a small bloom filter suitable for small number of items.
    127.0.0.1:6379> BF.ADD bloom taylorswift
    (integer) 1
    
  4. You can add more elements using BF.ADD
    127.0.0.1:6379> BF.ADD bloom arianagrande
    (integer) 1
    127.0.0.1:6379> BF.ADD bloom hardwell
    (integer) 1
    127.0.0.1:6379> BF.ADD bloom avicii
    (integer) 1
    
  5. To check if an element exists, you can do BF.EXISTS. If the output is 1, the element might exist. If the output it 0, the element definitely doesn't exist.
127.0.0.1:6379> BF.EXISTS bloom selenagomez
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom taylorswift
(integer) 1
127.0.0.1:6379> BF.EXISTS bloom edsheeran
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom arianagrande
(integer) 1

Conclusion

Bloom filters are used in a variety of applications. Some of them are:

  • Avoid caching data that is not frequently used
  • Avoid suggesting users content that they have interacted with before
  • Identify malicious URLs