Duration: 3 sessions
The completed programs will consist of several "
.h" files, combined together using
You are given these components to start with:
speller.h- defines the facilities provided by
speller.c- the driver for each spell-checking program, which:
inserts them in your data-structure (ordered-binary-tree or hash-table)
finds them in your data-structure
/usr/share/dict/wordsis intended to contain 479829 words (1 per line) but is read as 526065 words of which 418666 are unique and 107399 are duplicates.
dict.h- defines the dictionary facilities that must be provided by
dict-hash.cfor the spell-checking program to function correctly.
dict-tree.c- the starting point for your ordered-binary-tree version of the dictionary
dict-hash.c- the starting point for your hash-table version of the dictionary
makefileand a series of data-files to use for testing your code. You should copy all the files from
/opt/info/courses/COMP26120/problems/ex9(use this path) to start.
You are expected to complete
dict-tree.c for part 1 and
dict-hash.c for part 2. You should not change any of the
For this assignment, you can only get full credit if you do not use code from outside sources.
Start by copying the starting code (described above) from
/opt/info/courses/COMP26120/problems/ex9 (use this path) into your
The simple spelling checker you are given uses a "Table" data-type (data
structures and functions) defined by
dict.h, in which to store
all the words in a dictionary. (The only data about each word is
the word itself, stored as a string, and used as the key to locate the word
in the data-structures). For each part of this exercise, you implement the
Table data-type in different ways, so it can then be used by the code
provided, and the results compared.
dict.h requires the following functions to be implemented:
mallocspace for the data structure(s), and set fields to e.g. zero. For the hash-table this includes setting up an array of empty cells, whereas for the ordered-binary-tree you only need to initialise the pointer to the head (root) of the tree (e.g. to NULL) - the nodes of the tree are created one-by-one by
finda given key (i.e. string, alphabetic word) in a Table.
inserta given key in a Table.
insertis called with a duplicate key (i.e. it is already in your Table) you should ignore it, so that every key in your Table is different.
strdup) before you save it in your Table.
print_tableto list the contents of a Table. (You can also use this to help with debugging.)
print_statsto output statistical information to show how well your code is working.
find, so you can compare them with the theoretical minimum. You should add fields to the data structures in the code you are given to collect the information you need.
To test your implementation, you will need a sample dictionary and a sample text file with some text that contains the words from the sample dictionary (and perhaps also some spelling mistakes). You are given several such files, and you will probably need to create some more to help debug your code.
You should also test your program using a larger dictionary. One example is the
Linux dictionary that can be found in
Compile and link your code using
make tree (for part 1) or
make hash (for part 2).
When you run your spell-checker program, you can:
-dflag to specify a dictionary.
-sflag to specify a hash table size: try a prime number greater than twice the size of the dictionary (e.g. 1,000,003 for
-mflag to specify a particular mode of operation for your code (e.g. to use a particular hashing algorithm). You can access the mode setting by using the corresponding
modevariable in the code you write.
-vflag to turn on diagnostic printing, or
-vvfor more printing (or
-vvvetc. - the more "v"s, the higher the value of the corresponding variable
tree -d sample-dictionary -m 1 -vv sample-file
hash -d /usr/share/dict/words -s 1000003 -m 2 -v sample-file
There are several such tests already provided in the
and you should read it to see what they are, and also how to modify them by
changing some of the variables in the
MODE. Then you can "make testtree" or "make test1" etc.
Implement the Table type using ordered-binary-trees
(see, for example: M.T. Goodrich, R. Tamassia: Algorithm Design p. 145).
Your code should be go in:
You can start by using a simple insert technique to build the tree that constitutes the Table, although in this example it will lead to extremely sub-optimal tree shape (Why is this?).
If you have time, you should then improve your algorithm to produce more balanced trees using e.g. AA trees or AVL trees. (See, for example: M.T. Goodrich, R. Tamassia: Algorithm Design p. 152).
Write your code so that you can use the
-m parameter, which
mode variable, to select the different algorithms
(e.g. unbalanced or balanced) that you have implemented.
During the marking of this part you will be asked to explain why you chose
the particular algorithm you used, and to discuss the asymptotic algorithm
complexity of the function
find that you implemented. You need
to discuss the best and the worst case. You also need to compare this
complexity to that when the dictionary is implemented as a list (see the
lecture notes on complexity).
Reimplement the Table data type using hash tables.
Your code should go in:
The hash-value(s) needed for inserting a string into your hash-table should
be derived from the string. For example, you can consider a simple summation
key based on ASCII values of the individual characters, or a more
sophisticated polynomial hash code, in which different letter positions are
associated with different weights.
(Warning: if your algorithm is too simple, you may find that your program is very slow when using a realistic dictionary.)
The hashing strategy to be adopted is that of open addressing, so that collisions are dealt with by using a collision resolution function. You should attempt to implement several different instances of linear probing for collision handling using different values of the hash table size N (where N is a prime number). Then try to improve the collision handling procedure by implementing quadratic probing.
If you have time, you should try to implement double hashing.
Write your code so that you can use the
-m parameter, which
mode variable, to select the different hashing
algorithms that you have implemented.
Your code should keep the
num_entries field of the table up
insert function should check for a full table, and
exit the program cleanly should this occur.
An optional extra could increase (double?) the hash table size when the table is getting full and then rehash into a larger table.
You should experiment with the various hash functions described in the
lectures, with different Table sizes and different collision resolution
functions. You will need to add code to
(and other places) to report your findings.
During the marking of this part you will be asked to discuss the asymptotic
algorithmic complexity of your function
find, and the potential
problems that linear and quadratic probing may cause with respect to
clustering of the elements in a hash table.
You must use
submit as normal.
submit will look for:
There are 15 marks for each part (30 in total):
You will get no credit for parts of the exercise that you have not implemented yourself.