Friday, February 15, 2013

Coding Practice: Tries

I was 16 when I first got my hands on a mobile phone -- it was the year of the Sydney Olympics, and since I was a volunteer, I was spending a lot of time outside and needed a way to stay in touch with home base. I'm not 100% positive, but I think it was an Ericsson SH888. Calls weren't cheap back then, so SMS was the way to go. I remember the pain of having to enter English text through a numeric keypad without using a predictive text engine like T9. Of course, these days, predictive text (also known as auto-completion) is everywhere: mobile devices, web browsers, and office software. In today's article, I'll briefly discuss a data structure that can be used to implement auto-completion effectively and efficiently -- the trie.

A trie is a tree-like data structure. The main difference between tries and BSTs is that a trie node can have an arbitrary number of children (one per possible next character of the key), whereas a BST node can have at most two. This allows faster lookups: the cost of a trie lookup is proportional to the length of the key and does not depend on how many items are stored, whereas a BST lookup needs a number of key comparisons that grows with the number of items in the tree (logarithmically if the tree is balanced, linearly in the worst case). Another advantage of tries is that they allow prefix-based lookups. For this reason, they are used in text processing for tasks such as word auto-completion.
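To make the lookup cost concrete, here is a minimal sketch of a trie in Python. It is only an illustration: the names TrieNode, Trie, insert, contains and has_prefix are my own choices for this example, and children are kept in a plain dict keyed by character.

```python
class TrieNode:
    """One node of the trie; children are keyed by a single character."""

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_word = False   # True if a stored key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        """Add key to the trie, creating a node for each new character."""
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the node reached, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, key):
        """Exact lookup: at most len(key) steps, whatever the trie's size."""
        node = self._walk(key)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        """True if at least one stored key starts with prefix."""
        return self._walk(prefix) is not None


trie = Trie()
for word in ["tree", "trie", "try", "retrieval"]:
    trie.insert(word)

print(trie.contains("trie"))    # True
print(trie.contains("tri"))     # False: "tri" is only a prefix
print(trie.has_prefix("tri"))   # True
```

Both contains and has_prefix visit one node per character of their argument, so their cost does not change as more words are inserted.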

Interestingly, a trie can also be used as an alternative to a hash table when implementing associative arrays. It has a couple of advantages over a hash table: it does not require a hash function, which can make it faster for short keys, and it keeps its keys in sorted order. This alternative is particularly useful when many of the associative array keys are prefixes of, or share prefixes with, other keys (for example, when the keys are English words).
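As a rough sketch of the associative-array idea, the class below stores a value at the node where each key ends and yields its items in sorted key order. TrieMap and its _MISSING sentinel are hypothetical names used only for this example.

```python
class TrieMap:
    """Associative array backed by a trie; keys are strings."""

    _MISSING = object()  # sentinel, so that None can be stored as a value

    def __init__(self):
        self.children = {}             # character -> TrieMap
        self.value = TrieMap._MISSING  # value of the key ending here, if any

    def __setitem__(self, key, value):
        node = self
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieMap()
            node = node.children[ch]
        node.value = value

    def __getitem__(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                raise KeyError(key)
        if node.value is TrieMap._MISSING:
            raise KeyError(key)
        return node.value

    def items(self, path=""):
        """Yield (key, value) pairs in sorted key order.

        path is the string spelled out on the way down to this node.
        """
        if self.value is not TrieMap._MISSING:
            yield path, self.value
        for ch in sorted(self.children):
            yield from self.children[ch].items(path + ch)


counts = TrieMap()
counts["try"] = 3
counts["tree"] = 1
counts["trie"] = 2
counts["tr"] = 0

print(counts["trie"])        # 2
print(list(counts.items()))
# [('tr', 0), ('tree', 1), ('trie', 2), ('try', 3)]
```

No hash function is involved: a key is located purely by following its characters, and the sorted traversal falls out of visiting children in character order.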

For this week's coding practice, I coded up an autocompleter in Python. The autocompleter works in two steps: in the first step, the script initializes a trie by reading a dictionary of English words (on Linux systems, one can often be found at /usr/share/dict/american-english). In the second step, the script prompts the user to input a prefix, and uses the trie to locate all words beginning with that prefix. The initialization step takes a little while (approximately 2 seconds for 100,000 words), but only needs to be done once. The lookup step is pretty much instantaneous: as long as the trie is in memory, the cost of locating a prefix is linear in the length of the prefix, plus the time needed to collect the matching words.
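The two steps can be sketched as follows. This is a minimal illustration of the approach rather than the script itself: it assumes a nested-dict trie, the names END, load_words, build_trie and complete are made up for the example, and the interactive prompt is left out so that the snippet runs on a small in-memory word list.

```python
END = ""  # marker key for "a word ends here"; cannot clash with a character


def load_words(path):
    """Read one word per line from a dictionary file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_trie(words):
    """Step 1: build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def complete(root, prefix):
    """Step 2: return all stored words that start with prefix, sorted."""
    # Walk down to the node for the prefix: one step per character.
    node = root
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return []
    # Collect every word below that node with an explicit stack.
    results = []
    stack = [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if END in node:
            results.append(word)
        # Push children in reverse so that they are popped alphabetically.
        for ch in sorted((k for k in node if k != END), reverse=True):
            stack.append((node[ch], word + ch))
    return results


# For a real dictionary, something like
# load_words("/usr/share/dict/american-english") could supply the words.
words = ["try", "tree", "trie", "trial", "tea"]
root = build_trie(words)

print(complete(root, "tri"))   # ['trial', 'trie']
print(complete(root, "x"))     # []
```

The walk to the prefix node is the cheap part. Gathering the results visits every node below it, so a very short prefix over a large dictionary returns (and costs) a lot more than a long one.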

The entire american-english file consists of roughly 100K words and is approximately 1MiB. The trie used to hold this dictionary has over 200K nodes, and occupies a whopping 85MiB in memory. That's approximately 425 bytes per node, which can very likely be reduced. Personally, I suspect that an implementation in C would use less memory.
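Node counts and a per-node figure like the ones above can be estimated in miniature. The sketch below assumes a nested-dict trie and uses sys.getsizeof, which reports only the size of each dict object itself and not the objects it refers to, so it gives a rough lower bound and is not the way the 85MiB figure was measured. The names build_trie, count_nodes and approx_bytes are illustrative.

```python
import sys

END = ""  # marker key for "a word ends here"


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(
        count_nodes(child) for key, child in node.items() if key != END
    )


def approx_bytes(node):
    """Sum sys.getsizeof over every dict in the trie (a rough lower bound)."""
    return sys.getsizeof(node) + sum(
        approx_bytes(child) for key, child in node.items() if key != END
    )


trie = build_trie(["tree", "trie", "try"])
nodes = count_nodes(trie)
total = approx_bytes(trie)

print(nodes)                   # 8
print(total / nodes)           # average bytes per node; varies by Python version
```

Even in this tiny example the shared prefix "tr" is stored only once, yet every node carries the overhead of a full dict, which is where most of the per-node cost comes from in a representation like this one.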

Here is the code:

Finally, many a battle has been fought over how to pronounce the word "trie". One camp insists on pronouncing it the same as "tree", since the word comes from "retrieval". Since a "tree" and a "trie" are slightly different things, another camp insists on pronouncing it as "try". Still others sidestep the issue completely and refer to tries as "prefix trees".
