And here comes bloom filter. This one’s not from any OSS project, but found here.

<<bloom.h>>= #ifndef __BLOOM_H__ #define __BLOOM_H__ #include<stdlib.h> typedef unsigned int (*hashfunc_t)(const char *); typedef struct { size_t asize; unsigned char *a; size_t nfuncs; hashfunc_t *funcs; } BLOOM; BLOOM *bloom_create(size_t size, size_t nfuncs, ...); int bloom_destroy(BLOOM *bloom); int bloom_add(BLOOM *bloom, const char *s); int bloom_check(BLOOM *bloom, const char *s); #endif

Again, the data structure doesn’t tell the story, but hints some things. Infact, I think this is one of those “more algorithms than data structure” category math object, even though it’s more often called a data structure.

Here’s the code that actually does the clever bit.

#define SETBIT(a, n) (a[n/CHAR_BIT] |= (1<<(n%CHAR_BIT))) #define GETBIT(a, n) (a[n/CHAR_BIT] & (1<<(n%CHAR_BIT))) int bloom_add(BLOOM *bloom, const char *s) { size_t n; for(n=0; n<bloom->nfuncs; ++n) { SETBIT(bloom->a, bloom->funcs[n](s)%bloom->asize); } return 0; } int bloom_check(BLOOM *bloom, const char *s) { size_t n; for(n=0; n<bloom->nfuncs; ++n) { if(!(GETBIT(bloom->a, bloom->funcs[n](s)%bloom->asize))) return 0; } return 1; }

Let’s take closer look at those two functions above.

1. bloom_add, Let’s ignore the for loop, as it seems to apply a function to the given value before setting the BF’s bit.

For the moment assume it’s one function.

The macro definition, finds the character corresponding(i.e: by taking modulo of the value/systems’ default CHAR_BIT value, as defined by limits.h.)

What does SETBIT do?(It takes in a pointer(array of chars, a value(after applying one of the hash functions defined to original value and taking a modulo of asize(filter size )

So far it has all been stuff, that’s unrelated to the interesting math part, but necessary for the <side effects galore> real world computation.

At the core of the logic, it’s straightforward:

a, Left shift the Char value( after finding modulo, for splitting the characters inside the char array)

b, Or it with the already existing char value.

2. For the bloom_check function,Ignoring the for loop, we find the GETBIT macro as the core functionality.

The GETBIT, simply returns the result of ‘AND’ operation on the existing value, and the new value.

So this is all fine and dandy, but why is this special? or for that matter, how’s it different from some other scheme.

We the modulo,(i.e splitting the input value into <sizeof(char)> sized parts ) does nothing to limit the size of the input value.

Here’s where that for loop we ignored comes in. It applies one hash function(or many) to the input. These hash functions do the job of transforming/mapping the actual unlimited sized input into a limited sized char array.

Here’s the hash function code that comes along with the example program at the site.

unsigned int sax_hash(const char *key) { unsigned int h=0; while(*key) h^=(h<<5)+(h>>2)+(unsigned char)*key++; return h; } unsigned int sdbm_hash(const char *key) { unsigned int h=0; while(*key) h=(unsigned char)*key++ + (h<<6) + (h<<16) - h; return h; }

So how exactly does this hash function, take in a random set of characters and always produces a finite size/limited number of characters? And what are the trade offs? How does it maintain unique mapping under that condition? when does it fail?

Stay Tuned.