Project #3

IUP Computer Science
COSC 310 Spring 2008

Project #3
(Common Substrings)
(Due 27 Feb 2008)

Some of the classic problems in computer science involve strings: matching one string pattern to a part of another string, finding substrings that are common to two strings, finding subsequences that are common to two strings to name a few. In this project, we focus on finding the maximum length substrings that are common to two strings. The obvious but terribly inefficient way to solve this problem is to create every possible substring of one string and then try to do a pattern matching on the other string. Unfortunately, this approach is O(M²N²) where M and N are the string lengths. It is easy to find information on the Web to help with this problem. Wikipedia gives the following pseudocode algorithm which is O(M N) and consequently still takes a fair amount of work when the strings are long; but this algorithm is basically optimal. In the algorithm, S and T are the two strings and are indexed like arrays. L is a 2-D array of counters. There is also a set named result being used to collect the substrings of length z (the maximum length). result is what the pseudocode algorithm produces as the collection of maximal length substrings that are common to S and T. (Note: the U in the next to last line indicates set union.) This algorithm finds the maximal length and collects the substrings at the same time. It may be a little more efficient to separate these tasks because clearing and recollecting the sets on each increase in substring size takes some effort.

function LCSubstr(S[1..m], T[1..n])
    L := array(0..m, 0..n)      ; array of counters for substring lengths
    z := 0
    result := {}                ; make set empty
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]                    ; compare two characters
                L[i,j] := L[i-1,j-1] + 1      ; update count
                if L[i,j] > z                 ; check for new max count
                    z := L[i,j]               ; update max
                    result := {}              ; make set empty
                if L[i,j] = z                 ; add to set if at max
                    result := result U {S[i-z+1..i]}
    return result

Finding common substrings are useful in DNA research in which sequences labeled A, C, G, and T indicate the chemical components of a DNA strand. Finding common substrings is also useful in some forms of code breaking and in plagiarism detection. For your program, you are given two files each containing the letter sequences for two DNA strands - each strand occupies one line of the file. Your program is to produce a list of the maximal length substrings that occur in the two DNA strands. For example, if a file contained the sequences

ATGGGAAAACAGCCGAACACGACTCTTCAGATCTTGTAAGTTCTACCTGTCGTCCAAAAT
ATAAAAGTGCCAATGTTGTCCATGGCGCGTCACTTGATTCTGAACAGGAGGCC

your program should produce the following results (although perhaps in a somewhat modified output form).

Maximum common substring length: 5

String#1 String#2 Substring Found
   Start     Start
      52        18            GTCCA
      15        42            GAACA
       8        43            AACAG

As you can see, the program must report what the common substring is and the starting character positions in the two DNA sequences. The substrings are marked in bold above to show you where they are - note that in one case, the substrings overlap. The DNA sequences in strands1.txt are approximately 1000 characters long. The DNA sequences in strands2.txt are approximately 3000 characters long. Both files can be found on the I: drive in I:\jlwolfe\310\s08

There are two absolute requirements for your program. 1: the program must use ArrayLists rather than an arrays, such as for the L array in the algorithm; and 2: the program must use Sets to collect something representing the maximal substrings. The DNA sequences may be represented in Strings, arrays or ArrayLists.

Sets are not mentioned in your text book until section 9.1; but they are easy to use. Their use is outlined below. The Java API provides a generic Set interface which defines the following methods

boolean add(E object) Adds the object to the Set if it is not already there; returns
true if the object was not previously in the set, false otherwise

boolean remove(Object obj) Removes obj from the Set if it was present; returns true
if the object was removed.

int size() Returns the number of elements in the set.

void clear() Removes all objects from the Set.

Iterator<E> iterator() Returns an iterator for the Set, making it possible to access
each Set element in turn.

The Java API class HashSet implements the Set interface. There is a simple example of its use on pages 465-466, including the use of the Iterator.

When the program is finished, run it on each of the strands files and capture the output that is generated. Hand in a printout of your well-documented program, and a printout of thecaptured output. Capture the output by copying and pasting from the output window; do not capture a screen image. You must create a folder named p3 under the folder named after you on the P: drive for COSC 310. Copy into p3 all .java files that you created for this project.