/dev/null: Last Week's Coding Practice: Hash Tables

In an effort to keep myself in tip-top coding shape, I've started doing the weekly coding problems at http://codingforinterviews.com/. Each week, they set a topic, ask you a few questions and send you a simple problem to solve. They don't really check your answers or solution, but the main point is to get you thinking about important data structures, algorithms, etc. This is my second week in, and so far it seems worthwhile. Today I'd like to briefly mention last week's topic: Hash Tables, and document some of the things that I've learnt.

To me, hash tables have been the bread and butter of programming ever since learning about them. Constant-time lookups and insertion, linear storage overhead, complete freedom to choose the key and value types... what's there not to like, right? Unfortunately, while I was using hash tables practically everywhere, I neglected to learn about how they are actually implemented. Last week, that changed.

Back to School

First, I had to make some adjustments to my vocabulary. A hash table (or hashtable, hashmap, hash map, ...) is one way of implementing an associative array, which is a mapping a mapping between keys and values. Other implementations of associative arrays are possible: for example, you could do it using a linked list of (key, value) pairs, but it wouldn't be a particularly good implementation since insertion and retrieval would be O(n).

Next, a hash table consists of a hash function and a fixed-length array. The hash function takes a key, which can be of any data type, and returns an integer (the hash value). This hash function is deterministic (the same key always gets hashed to the same integer). Additionally, one of the most desirable properties of a hash function is that it returns unique hashes for unique keys -- different keys never get the same hash value. When two different keys hash to the same value, it is known as a collision. For good hash functions, collisions are rare in practice. The hash value is mod'ed (for lack of a better word) by the length of the fixed-length array, and the result is used to index into the array.

In a trivial case, each element (bin) of the array can simply hold the value that corresponds to a specific key. Due to collisions (and the mod operation), however, different keys can end up going into the same bin. While collisions are rare, they do occur, and a strategy to handle or reliably avoid them becomes necessary.

There are several ways to deal with a collision. I'll describe one of the simplest: each bin contains a linked list that holds (key, value) pairs. When inserting a (key, value) pair into the hash table you just append the pair to the linked list of that particular bin. When retrieving a value for a particular key, you traverse the entire list for the right bin until you hit the pair with the identical key. Appending a list and traversing it adds overhead to the hash table insertion and retrieval. However, this overhead is O(n) of the length of the linked list. Since collisions are rare, the linked lists will be fairly short, and this overhead will be minimal. Other collision handling strategies reduce this overhead, but are slightly more involved.

Lastly, while hash functions sound like magic, in essence they are quite simple: convert the key into a numeric representation of arbitrary length, and then reduce it to an integer. For numbers or numeric-like data types such as characters, you don't really need to do anything -- the hash is simply equal to the key. For compound data types (arrays, structs, strings), you can obtain the numeric representation by adding the individual elements. Addition alone doesn't evenly distribute the hash value over the entire domain of the hash function (integers), so add-and-multiply is often used instead.

Email me teh codez?

After learning all the stuff above, I was ready to have a try at implementing my own hash table. Here it is, in C, complete with Makefile and all:

	#include <stdlib.h>
	#include <stdint.h>
	#include <assert.h>
	#include <string.h>
	#include <stdio.h>

	/*
	* Lessons learnt:
	* - need semicolon after struct declaration
	* - calloc accepts size_t and number of elements as separate parameters
	* - check levels of indirections in Hashmap struct
	* - strlen doesn't include the null termination character
	*/

	/*
	* A bin represents a unique (key, value) pair stored in a single bin of
	* a Hashmap. Since multiple pairs can be assigned to a single bin, they are
	* connected by a linked list structure.
	*/
	struct Bin
	{
	char *key;
	char *value;
	struct Bin *next;
	};

	/*
	* A Hashmap is a linked list of bins. The capacity determines the maximum
	* value of the key.
	*/
	struct Hashmap
	{
	size_t capacity;
	struct Bin *bins;
	};

	/*
	* Allocate a new bin on the heap. Will trigger an assertion on
	* out-of-memory.
	*/
	struct Bin *
	bin_new(char key, char value);

	/*
	* Allocate the bins for the specified Hashmap. Assumes the Hashmap is
	* allocated, but currently uninitialized.
	*/
	void
	hashmap_init(struct Hashmap *map, size_t capacity);

	/*
	* Insert the specified (key, value) pair into the Hashmap.
	* If an item with the specified key already exists, it will be overwritten.
	*/
	void
	hashmap_insert(struct Hashmap map, char key, char *value);

	/*
	* Retrieve the value with the specified key from the Hashmap. If it is not
	* found, returns 0. If it is found, sets the value.
	*/
	int
	hashmap_retrieve(struct Hashmap map, char key, char **value);

	/*
	* Calculate the hash value for a key.
	*/
	size_t
	hash_key(char *key);

	struct Bin *
	bin_new(char key, char value)
	{
	struct Bin *p = calloc(sizeof(struct Bin), 1);
	assert(p);
	p->key = key;
	p->value = value;
	return p;
	}

	void
	hashmap_init(struct Hashmap *map, size_t capacity)
	{
	assert(map);
	map->bins = calloc(sizeof(struct Bin), capacity);
	assert(map->bins);
	map->capacity = capacity;
	}

	void
	hashmap_insert(struct Hashmap map, char key, char *value)
	{
	size_t idx;
	struct Bin *current = NULL;
	char *heap_value = NULL;
	char *heap_key = NULL;

	assert(map);
	assert(key);
	assert(value);

	heap_value = calloc(sizeof(char), strlen(value) + 1);
	assert(heap_value);

	strcpy(heap_value, value);
	idx = hash_key(key) % map->capacity;
	if (!map->bins[idx].key)
	{
	//
	// This is an empty bin. Initialize it.
	// The +1 is to account for the null termination character, which isn't
	// counted by strlen.
	//
	map->bins[idx].key = calloc(sizeof(char), strlen(key) + 1);
	assert(map->bins[idx].key);
	strcpy(map->bins[idx].key, key);
	map->bins[idx].value = heap_value;
	return;
	}
	for (current = &map->bins[idx]; current->next; current = current->next)
	{
	//
	// Overwrite existing value.
	//
	if (strcmp(current->key, key) == 0)
	{
	free(current->value);
	current->value = heap_value;
	return;
	}
	}
	heap_key = calloc(sizeof(char), strlen(key) + 1);
	strcpy(heap_key, key);
	current->next = bin_new(heap_key, heap_value);
	}

	int
	hashmap_retrieve(struct Hashmap map, char key, char **value)
	{
	size_t idx = hash_key(key) % map->capacity;
	struct Bin *current = &map->bins[idx];
	if (!current->key)
	{
	//
	// This is an empty bin.
	//
	return 0;
	}
	for (; current; current = current->next)
	{
	if (strcmp(current->key, key) == 0)
	{
	*value = current->value;
	return 1;
	}
	}
	return 0;
	}

	size_t
	hash_key(char *key)
	{
	size_t hash = 0;
	int i;
	for (i = 0; i < strlen(key); ++i)
	hash = (hash << 4) + key[i];
	return hash;
	}

	int
	main(void)
	{
	/*
	* Assign a random integer to each word in the system dictionary.
	* For a single word ("hello"), print the random integer during assignment.
	* Finally, fetch the integer for that word from the hashmap.
	* Since the fetched integer is the same, the map functions correctly.
	*/
	char random[256];
	char line[256];
	FILE *fin = NULL;
	char *value = NULL;
	struct Hashmap map;
	int ret;
	hashmap_init(&map, 65535);

	fin = fopen("/usr/share/dict/words", "r");
	assert(fin);
	for (;;)
	{
	fgets(line, 256, fin);
	if (feof(fin))
	break;
	sprintf(random, "%d", rand());
	if (strcmp("hello\n", line) == 0)
	printf("in:\tmap[\"hello\"] = \"%s\"\n", random);
	hashmap_insert(&map, line, random);
	}
	fclose(fin);

	ret = hashmap_retrieve(&map, "hello\n", &value);
	assert(ret);
	printf("out:\tmap[\"hello\"] = \"%s\"\n", value);
	return 0;
	}

view raw hashmap.c hosted with ❤ by GitHub

	CFLAGS=-ggdb -Wall

	all: hashmap.out

	# $< the dependencies
	# $@ the target
	hashmap.out: hashmap.o
	gcc -Wall -ggdb $< -o$ @

	clean:
	rm -rf .out .o

view raw Makefile hosted with ❤ by GitHub

/dev/null

Thursday, December 20, 2012

Last Week's Coding Practice: Hash Tables

Back to School

Email me teh codez?

No comments:

Post a Comment

About Me