# Reduce Database Queries with Bloom Filters

# The Problem

Let's say you are building a sign-up service for an application. You need to ensure that the user doesn't choose an username that has already been taken by someone else. 

You might think of a solution which looks like this:


1. Send a GET request like `/username?radioactive11`
2. Query your database to check if a user with username `radioactive11` already exists
3. If the query result is None (or empty), you can allot this username to the new user
4. Else ask the user to choose a different username

While this approach will work perfectly, it is very slow. Considering the popularity of a sign-up endpoint, this approach is highly unoptimised as you will be doing **several read** requests to the database for just **one write request**.

# A possible solution

A better solution would be to introduce a caching layer, using some fast database like Redis. Here is what this solution will look like.

1. Send the same GET request: `/username?radioactive11`
2. Check the cache for an existing username
3. If the username exists, then ask the user to choose a different one. (Cache Hit)
4. If the username doesn't exist in the cache, query the primary database. (Cache Miss)
5. If the username doesn't exist in the primary database, allow the user to take this name.
6. Else ask the user to choose a different username
 
However, this approach too has it's problems. First, we increased our memory usage. Secondly, we need to ensure that the data in the cache is up to date with out primary database and not stale.

# A better solution: Bloom Filters

By definition, a bloom filter is a probabilistic data structure which tells us whether an element **might be** in a database or **definitely isn't**. They are only prone to false-positives, that is, searching for an element that does not exists might return incorrect results. 

## How do bloom filters work?
### Inserting an item

We take a `Bit Vector` of size `m` with all the bits set to `false` by default. in this example, let's take a vector with size `m = 10`.

Next, we need `k` **efficient** hash functions. In this example, let's take 3 hash functions, `HF1, HF2 & HF3`.

In order to insert an element in bloom filter, we hash it using multiple hash functions. 

For example, we want to add `radioactive11` using the 3 hash functions. Using our arbitrary hash functions, we get the following values:

- `HF1("radioactive11")` = 1988
- `HF2("radioactive11")` = 2022
- `HF3("radioactive11")` = 2001

Now we take `mod(m)` (10 in this case) of the values obtained by each hash function. We get 

- 1988 % 10 = 8
- 2022 %10 = 2
- 2001 % 10 = 1

Now we set 1st, 2nd and 9th bit of our vector to `True`.


![Blog Diagrams.jpg](https://cdn.hashnode.com/res/hashnode/image/upload/v1662010788341/SsSZFHWyM.jpg align="left")

### Testing presence of an item

If we want to verify the existence of an item in the database, we hash it using the same hash functions and calculate the `mod`. If all the `k` indices in the vector are already set to true, the item **might** be already present in the database.

However, if all the `k` bits are not set to true, the item is **definitely** not in the database.

# Trying it out

Redis provides a module called [`RedisBloom`](https://docs.redis.com/latest/modules/redisbloom/) which provides an easy to use, scalable Bloom Filter.

You can run a local instance of Redis Stack using Docker by following [these steps](https://redis.io/docs/stack/get-started/install/docker/).

Once you have Redis Stack running, you can connect to Redis using the `redis-cli.

1. Connect to Redis
```sh
redis-cli
``` 
2. Check if the server is running or not
```sh
127.0.0.1:6379> PING
PONG
```
3. Create a new bloom filter using the `BF.ADD` command. This will create a small bloom filter suitable for small number of items.
```sh
127.0.0.1:6379> BF.ADD bloom taylorswift
(integer) 1
```
4. You can add more elements using BF.ADD
```sh
127.0.0.1:6379> BF.ADD bloom arianagrande
(integer) 1
127.0.0.1:6379> BF.ADD bloom hardwell
(integer) 1
127.0.0.1:6379> BF.ADD bloom avicii
(integer) 1
```
5. To check if an element exists, you can do `BF.EXISTS `. If the output is 1, the element **might** exist. If the output it 0, the element **definitely** doesn't exist.

```sh
127.0.0.1:6379> BF.EXISTS bloom selenagomez
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom taylorswift
(integer) 1
127.0.0.1:6379> BF.EXISTS bloom edsheeran
(integer) 0
127.0.0.1:6379> BF.EXISTS bloom arianagrande
(integer) 1
```
 # Conclusion

Bloom filters are used in a variety of applications. Some of them are:

- Avoid caching data that is not frequently used
- Avoid suggesting users content that they have interacted with before
- Identify malicious URLs