IUP Computer Science
COSC 300

Project #5
(Due 13 April 2001)

In the early 1950's, David Huffman developed a text compression method based on variable length bit patterns. These bit patterns are substituted for the standard representation of characters in the compressed text. The substitute patterns (known as Huffman codes) are devised in such a way that two conditions hold: 1) the length of a pattern was inversely proportional to the frequency of the character it substituted for (so, frequently used characters like 'e' or 't' or ' ' have short codes and seldom used characters like 'j' and 'z' have long codes); and 2) no code was the prefix of another code (so, if 010 is a code for some character, then no other code can have 010 as its first three bits).

To use a complete Huffman code that are adequate for an ordinary text file, about 80 codes are needed to accommodate upper and lowercase letters, digits and typical punctuation. With such a code, it should be possible to compress a text file to about 70% of its original size. However, if we are willing to accept something less that perfect decompression (i.e., if the decompressed file need not have exactly the same form as the original), a "short-cut" Huffman code can be used. Such a short- cut code can be used to compress a text file to close to 50% of its original size. Below is an example of such a short-cut code containing only 30 codes.

LF    111001
SPC   00
MISC  111101
.     1110001
a   11111
b   011110
c   10100
d   11001
e   010
f   110000
g   110001
h   10101
i   11011
j   0111111010
k   0111110
l   111100
m   110101
n   11101
o   0110
p   1110000
q   01111110111
r   1000
s   1001
t   1011
u   01110
v   01111111
w   1101001
x   011111100
y   1101000
z   01111110110
To use this code, any letter is represented by is lowercase Huffman code; linefeeds (ASCII code 10) has a code; space (SPC) has a code; period (.) has a code; and any other character is represented using the miscellaneous (MISC) code. Note: carriage return characters (13) are NOT compressed; they are ignored completely during the compression. When a file that was compressed with the short-cut code is decompressed (expanded), MISC codes are replaced with asterisks and the only other characters in the expanded file are those shown above. The expanded file is generally readable but lacks a few helpful features that would make it more readable (most punctuation, capitalization and numbers).

You are to write an assembly language program that uses the short-cut Huffman code above to compress text files. Your program must prompt for the name of the input (original) file and the name of the compressed file to be produced. You should be sure to use simple DOS file names in 8.3 format.

Procedures to read and write files are provided to you for this project. The procedure to read the original file is named ReadFileIn. Prior to calling it, your program must put the offset of a 10000 byte buffer in the BX register and the offset of a file specification in the DX register. This procedure will check the size of the file, open it, and read the entire file into the buffer. ReadFileIn holds the actual number of bytes that the file has in the AX register when it returns. The procedure to write compressed file is named WriteFileOut. Prior to calling it, your program must put the offset of a buffer large enough to hold the compressed text into the BX register, the offset of the file specification for the compressed file in the DX register, and the number of bytes to output in the CX register. If anything goes wrong with either of these procedures, they generally put some code in the AX register and terminate the program.

The two procedures are in a file named FILEIO.INC It is in the folder I:\JLWOLFE\COSC300 To use the procedures, copy the file to the folder where you have the .ASM file for this project. In the .ASM file, put the statement include fileio.inc between the statement that ends the main procedure (main endp) and the statement that ends the program (end main).

Hand in a source listing of your program and a floppy disk on which you have both the .ASM file and .EXE file for your program. These files must be named with your initials, as yourinitials- p5.asm and yourinitials-p5.exe You will find the program EXPAND.EXE in I:\JLWOLFE\COSC300 This program can be used to expand any file that was properly compressed with the short-cut code given here. Use it to check your results.

Here is a sample of what your program should do. Assume we have a file with the two lines on the left (18 bytes) in it. After going through compression and expansion, the lines would look like the two on the right.

It's not so                                       it*s not so
hard.                                             hard.
The compression would use these codes:
11011-1011-111101-1001-00-11101-0110-1011-00-1001-0110-111001-10101-11111-1000-11001-1110001
   i    t    misc   s       n     o    t       s    o    lf     h     a     r    d      .
When grouped into bytes we get 10 (9 1/2) compressed bytes, which in hex are:
DD FB 27 5A C9 6E 6B F8 CF 10
This is a compression to about 53%.