Lab Exercise 9: Spellchecking using Trees and Hash-Tables

Duration: 3 sessions

Aims

To encourage you to find out about, and use, ordered-binary-trees and hash-tables for search.

Learning outcomes

On successful completion of this exercise, a student will:

Summary

Write C code to complete two different versions of a simple spell-checking program: one using ordered-binary-trees and the other using hash-tables.

The completed programs will consist of several ".c" and ".h" files, combined together using make. You are given these components to start with:

You are also given a makefile and a series of data-files to use for testing your code. You should copy all the files from /opt/info/courses/COMP26120/problems/ex9 (use this path) to start.

You are expected to complete dict-tree.c for part 1 and dict-hash.c for part 2. You should not change any of the other .c or .h files.

For this assignment, you can only get full credit if you do not use code from outside sources.

Description

Start by copying the starting code (described above) from /opt/info/courses/COMP26120/problems/ex9 (use this path) into your ex9 directory.

The interface - Table

The simple spelling checker you are given uses a "Table" data-type (data structures and functions) defined by dict.h, in which to store all the words in a dictionary. (The only data about each word is the word itself, stored as a string, and used as the key to locate the word in the data-structures). For each part of this exercise, you implement the Table data-type in different ways, so it can then be used by the code provided, and the results compared.

dict.h requires the following functions to be implemented:

(You do not need to implement a function to delete a given key from a Table as this is not needed by the spell-checker.)

Running your code

To test your implementation, you will need a sample dictionary and a sample text file with some text that contains the words from the sample dictionary (and perhaps also some spelling mistakes). You are given several such files, and you will probably need to create some more to help debug your code.

You should also test your program using a larger dictionary. One example is the Linux dictionary that can be found in /usr/share/dict/words.

Compile and link your code using make tree (for part 1) or make hash (for part 2).

When you run your spell-checker program, you can:

e.g.:
tree -d sample-dictionary -m 1 -vv sample-file
or:
hash -d /usr/share/dict/words -s 1000003 -m 2 -v sample-file

There are several such tests already provided in the makefile, and you should read it to see what they are, and also how to modify them by changing some of the variables in the makefile such as MODE. Then you can "make testtree" or "make test1" etc.

Part 1: Tree-based implementation

Implement the Table type using ordered-binary-trees (see, for example: M.T. Goodrich, R. Tamassia: Algorithm Design p. 145). Your code should be go in: dict-tree.c

You can start by using a simple insert technique to build the tree that constitutes the Table, although in this example it will lead to extremely sub-optimal tree shape (Why is this?).

If you have time, you should then improve your algorithm to produce more balanced trees using e.g. AA trees or AVL trees. (See, for example: M.T. Goodrich, R. Tamassia: Algorithm Design p. 152).

Write your code so that you can use the -m parameter, which sets the mode variable, to select the different algorithms (e.g. unbalanced or balanced) that you have implemented.

During the marking of this part you will be asked to explain why you chose the particular algorithm you used, and to discuss the asymptotic algorithm complexity of the function find that you implemented. You need to discuss the best and the worst case. You also need to compare this complexity to that when the dictionary is implemented as a list (see the lecture notes on complexity).

Part 2: Hash table-based implementation

Reimplement the Table data type using hash tables. Your code should go in: dict-hash.c

The hash-value(s) needed for inserting a string into your hash-table should be derived from the string. For example, you can consider a simple summation key based on ASCII values of the individual characters, or a more sophisticated polynomial hash code, in which different letter positions are associated with different weights.
(Warning: if your algorithm is too simple, you may find that your program is very slow when using a realistic dictionary.)

The hashing strategy to be adopted is that of open addressing, so that collisions are dealt with by using a collision resolution function. You should attempt to implement several different instances of linear probing for collision handling using different values of the hash table size N (where N is a prime number). Then try to improve the collision handling procedure by implementing quadratic probing.

If you have time, you should try to implement double hashing.

Write your code so that you can use the -m parameter, which sets the mode variable, to select the different hashing algorithms that you have implemented.

Your code should keep the num_entries field of the table up to date.
Your insert function should check for a full table, and exit the program cleanly should this occur.

An optional extra could increase (double?) the hash table size when the table is getting full and then rehash into a larger table.

You should experiment with the various hash functions described in the lectures, with different Table sizes and different collision resolution functions. You will need to add code to print_stats (and other places) to report your findings.

During the marking of this part you will be asked to discuss the asymptotic algorithmic complexity of your function find, and the potential problems that linear and quadratic probing may cause with respect to clustering of the elements in a hash table.

Marking Process

You must use labprint and submit as normal.

labprint and submit will look for: dict-tree.c and dict-hash.c

There are 15 marks for each part (30 in total):

You will get no credit for parts of the exercise that you have not implemented yourself.