IUP Computer Science
COSC 310 Fall 2002
babynames.dat is a binary
file containing 2319 records of the following form:
struct baby
{ char
name[12];
int
origin;
float
frequency;
int
source;
char
sex;
char
meaning[35];
};
The name is a baby's name (a c-string); all names are a single word in all uppercase (a few names are hyphenated). The origin is a code for the language or culture where the name originates. The frequency indicates how often the name is used to name babies. The source is a zip code of a baby with that name. The sex is M (male), F (female) or E (either). The meaning gives the derivation or significance of the name. Each record is the same size. Because the file is binary, you will not have an easy time looking at it with an editor or wordprocessor. I have made the file babies.txt available for you to look at; it contains some of the same information (name, meaning, sex, and origin) and in the same order as babynames.dat but is in a text form. Note: these are both fairly large files, babies.txt at 76K and babynames.dat at 136K.
The origin code is a number between 0 and 29. Its intended use is as an index to the names of 30 cultures which appear in the file origins.txt. This is an ordinary text file in which each culture appears as one word on a line by itself. The beginning of this file looks like this:
Unknown
African
African-American
American
so that code 0 is for Unknown, code 1 is for African, code 2 is for African-American, and so on.
You are to write two programs.
In the first program, read the records of babynames.dat and write them to a hash file that can contain no more than 3479 records. The record form for the hash file is the same as for babynames.dat, so the hash file is also a binary file. You may devise any hash function that you want to distribute the records into the hash file, as long as the hash function uses the baby name as the key and the function is different from the sample hash function given in the textbook. Your program must resolve collisions using some open addressing approach, not chaining. You should try to create a hash function that gives a good record distribution in the file storage area.
The second program should test your hash function and hash file by looking up in the hash file the 500 baby names that are contained in the file queries.txt. This is an ordinary text file in which each line contains one baby name. For the first fifteen query names, your program should produce a display that shows the record number where the name is found, the number of reads it took to find the record, and all of the information from the baby's record. Your program must also display the total number of file records read in order to handle the 500 queries. Below is sample output similar to what your program should produce. This output was produced by using the string hash function in the book with simple linear probing to resolve conflicts. Notice that your program needs to handle queries for names that are not in the hash file (like ANNAMAE).
Record Name
Origin Freq. Loc. Sex Meaning
207 1 JEWEL
French 0.0142 85612 F Joy
1866 2 DARNELL
Unknown 0.0093 80764 M Hidden Nook
786 1 FAIRFAX
British 0.0113 2498 M Fair Haired
1138 1 AMENA
Unknown 0.0096 13242 F Honest Woman
1760 1 CHEYENNE Native-American 0.0149
13988 F People of Alien Speech
1268 1 BONA
Unknown 0.0057 1006 F Good
2702 1 MAHARI
African 0.0102 78358 E Forgiver
2040 1 DIALLO
African 0.0048 11882 M Bold
Could not find ANNAMAE
2363 2 CORINA
French 0.0062 20650 F Maiden
64 1 AYAME
Japanese 0.0028 49666 F Iris
5 1 MORSE
Unknown 0.0044 15184 M Son of Morris
637 2 ANDY
British 0.0100 66780 M From the name ANDREW
29 1 FLETCHER
British 0.0144 93236 M Arrow Maker
564 5 CARON
Vietnamese 0.0079 13560 F Loving, Kind
Total reads: 1229
Both of your programs must deal with the baby name records one at a time. That is, your programs are not allowed to create huge arrays to hold the records and then read or write to the binary files in huge blocks. Each read or write on either babynames.dat or your hash file must be done one record at a time. Your programs are not allowed to hold more that 2 or 3 baby name records at any one time - you are to make them as space efficient as possible.
Strictly speaking, this assignment does not really involve object oriented programming. The idea is for you to develop an effective hash function and collision handling technique. These could then be used to form the basis of a Hash class like the one shown in the book.
Hand in printouts of both of your programs
and a printout showing the first fifteen queries and the total number of
reads. Name the source programs after your own name and copy them
to the hand-in folder on the P:\ drive for your section.
Extra Credit will be awarded for the best hash function and collision handling mechanism. The total number of reads gives an accurate measure of how good your function is. The lower the number of reads, the better the function is. The hash file is required to have space for no more than 3479 records because putting 2319 records into that much space will cause it to be 2/3 full. Using a good hash function (one that avoids clustering, gives a uniform distribution) and simple linear probing, we should expect the total number of reads for the 500 queries to be 1000, not the 1229 that I got. A perfect hash function would be able to do the 500 queries in 500 reads.
Currently, I expect to award extra credit
on a declining scale. That is, the best hash function will get the
most extra credit, with lesser amounts of extra credit for second best,
third best, etc. At least 5 people will get extra credit if they
beat my 1229 reads. The more they beat it by, the more the extra
credit.