The Perfect Hash Function

In the article from 1984, Fredman and Komlos (1) mention the perfect hash function, that actually can store a sparse table with 0(1) worst case access time. And as they mention that both the data structure and the query algorithm were easy to implement, I have decided to give it a try.

The magic is actually simple. They are using a simple function for hashing, that is nothing fancy:

Simple hash function.

Where:

k is a random number between 1 and p-1
p is a prime number, bigger than the biggest number in the set, that is being hashed
n is the number of positions on level 1 in our table, equal to the number of numbers that we are going to hash

Ok, but that function has one flaw – as you can imagine from the birthday paradox, there would easily be collisions with n bigger than 5. Which is usually the case with these tables. So what do we do? Or what did the authors do? They needed a good way to avoid collisions and they actually found one – they insert “buckets” withing “buckets”, and the level-2 buckets have positions, equal to the square of the items in them. Thus, they guarantee that a solution with no collisions with level 2 will be found easily – the chance to have a collision with positions equal to the square of items is actually 0.5 for any random k.

If the last sentence was something that you did not completely understand – go to the YouTube video and take a look.

Generally, the code is here:

class FKSPerfectHash:
    def __init__(self, numbers: List[int]):
        self.elements = list(set(numbers))
        self.n = len(self.elements)

        if self.n == 0:
            self.p = 1
            self.buckets = []
            self.level2_tables = []
            return

        max_val = max(self.elements)
        self.p = next_prime(max_val)

        self._build_level_1()
        self._build_level_2()

    def _build_level_1(self):
        while True:
            self.k_main = random.randint(1, self.p - 1)
            self.buckets = [[] for _ in range(self.n)]

            for x in self.elements:
                idx = ((self.k_main * x) % self.p ) % self.n
                self.buckets[idx].append(x)
    
            sum_squares = sum(len(b)**2 for b in self.buckets)

            if sum_squares < 3 * self.n:
                break
    
    def _build_level_2(self):
        self.level2_tables = [None] * self.n

        for i, bucket in enumerate(self.buckets):
            b_size = len(bucket)

            if b_size == 0:
                self.level2_tables[i] = (0, 0, None)
                continue
            
            m_j = b_size * b_size
            while True:
                k_prim = random.randint(1, self.p-1)
                temp_storage = [None] * m_j
                collision_found = False

                for x in bucket:
                    idx = ((k_prim * x) % self.p) % m_j

                    if temp_storage[idx] is not None:
                        collision_found = True
                        break
                    temp_storage[idx] = x
                
                if not collision_found:
                    self.level2_tables[i] = (k_prim, m_j, temp_storage)
                    break
                    
    def query(self, x:int) -> bool:
        if self.n == 0:
            return False
        idx_1 = ((self.k_main * x) % self.p) %self.n
        params = self.level2_tables[idx_1]
        k_prim, m_j, storage = params

        if storage is None:
            return False

        idx_2 = ((k_prim * x) % self.p ) % m_j
        return storage[idx_2] == x

data = [2, 4, 5, 15, 18, 30]
fks = FKSPerfectHash(data)
fks.query(5)

Implementing a Perfect Hash Function (FKS)

Watch this video on YouTube

GitHub code – https://github.com/Vitosh/Python_personal/tree/master/YouTube/051_Python-Hashing

1 – Michael L. Fredman, János Komlós, and Endre Szemerédi. 1984. Storing a Sparse Table with 0(1) Worst Case Access Time. J. ACM 31, 3 (July 1984), 538–544. https://doi.org/10.1145/828.1884

The Perfect Hash Function

Related posts: