IUP Computer Science
COSC 310    Fall 2002
 

Project #6
(Due  9 December 2002)


babynames.dat is a binary file containing 2319 records of the following form:

struct   baby
{     char        name[12];
      int         origin;
      float       frequency;
      int         source;
      char        sex;
      char        meaning[35];
};

The name is a baby's name (a c-string); all names are a single word in all uppercase (a few names are hyphenated).  The origin is a code for the language or culture where the name originates.  The frequency indicates how often the name is used to name babies.  The source is a zip code of a baby with that name.  The sex is M (male), F (female) or E (either).  The meaning gives the derivation or significance of the name.  Each record is the same size.  Because the file is binary, you will not have an easy time looking at it with an editor or wordprocessor.  I have made the file babies.txt available for you to look at; it contains some of the same information (name, meaning, sex, and origin) and in the same order as babynames.dat but is in a text form.  Note:  these are both fairly large files, babies.txt at 76K and babynames.dat at 136K.

The origin code is a number between 0 and 29.  Its intended use is as an index to the names of 30 cultures which appear in the file origins.txt.  This is an ordinary text file in which each culture appears as one word on a line by itself.  The beginning of this file looks like this:

Unknown
African
African-American
American

so that code 0 is for Unknown, code 1 is for African, code 2 is for African-American, and so on.

You are to write two programs.

In the first program, read the records of babynames.dat and write them to a hash file that can contain no more than 3479 records.  The record form for the hash file is the same as for babynames.dat, so the hash file is also a binary file.  You may devise any hash function that you want to distribute the records into the hash file, as long as the hash function uses the baby name as the key and the function is different from the sample hash function given in the textbook.  Your program must resolve collisions using some open addressing approach, not chaining.  You should try to create a hash function that gives a good record distribution in the file storage area.

The second program should test your hash function and hash file by looking up in the hash file the 500 baby names that are contained in the file queries.txt.  This is an ordinary text file in which each line contains one baby name.  For the first fifteen query names, your program should produce a display that shows the record number where the name is found, the number of reads it took to find the record, and all of the information from the baby's record.  Your program must also display the total number of file records read in order to handle the 500 queries.  Below is sample output similar to what your program should produce.  This output was produced by using the string hash function in the book with simple linear probing to resolve conflicts.  Notice that your program needs to handle queries for names that are not in the hash file (like ANNAMAE).

Record         Name          Origin   Freq.  Loc. Sex Meaning
 207  1       JEWEL          French  0.0142 85612  F  Joy
1866  2     DARNELL         Unknown  0.0093 80764  M  Hidden Nook
 786  1     FAIRFAX         British  0.0113  2498  M  Fair Haired
1138  1       AMENA         Unknown  0.0096 13242  F  Honest Woman
1760  1    CHEYENNE Native-American  0.0149 13988  F  People of Alien Speech
1268  1        BONA         Unknown  0.0057  1006  F  Good
2702  1      MAHARI         African  0.0102 78358  E  Forgiver
2040  1      DIALLO         African  0.0048 11882  M  Bold
Could not find ANNAMAE
2363  2      CORINA          French  0.0062 20650  F  Maiden
  64  1       AYAME        Japanese  0.0028 49666  F  Iris
   5  1       MORSE         Unknown  0.0044 15184  M  Son of Morris
 637  2        ANDY         British  0.0100 66780  M  From the name ANDREW
  29  1    FLETCHER         British  0.0144 93236  M  Arrow Maker
 564  5       CARON      Vietnamese  0.0079 13560  F  Loving, Kind

Total reads: 1229

Both of your programs must deal with the baby name records one at a time.  That is, your programs are not allowed to create huge arrays to hold the records and then read or write to the binary files in huge blocks.  Each read or write on either babynames.dat or your hash file must be done one record at a time.  Your programs are not allowed to hold more that 2 or 3 baby name records at any one time - you are to make them as space efficient as possible.

Strictly speaking, this assignment does not really involve object oriented programming.  The idea is for you to develop an effective hash function and collision handling technique.  These could then be used to form the basis of a Hash class like the one shown in the book.

Hand in printouts of both of your programs and a printout showing the first fifteen queries and the total number of reads.  Name the source programs after your own name and copy them to the hand-in folder on the P:\ drive for your section.
 

Extra Credit will be awarded for the best hash function and collision handling mechanism.  The total number of reads gives an accurate measure of how good your function is.  The lower the number of reads, the better the function is.  The hash file is required to have space for no more than 3479 records because putting 2319 records into that much space will cause it to be 2/3 full.  Using a good hash function (one that avoids clustering, gives a uniform distribution) and simple linear probing, we should expect the total number of reads for the 500 queries to be 1000, not the 1229 that I got.  A perfect hash function would be able to do the 500 queries in 500 reads.

Currently, I expect to award extra credit on a declining scale.  That is, the best hash function will get the most extra credit, with lesser amounts of extra credit for second best, third best, etc.  At least 5 people will get extra credit if they beat my 1229 reads.  The more they beat it by, the more the extra credit.