Assignment #1: Longest Common Substring

Write a Java program that accepts two files (on the command line) and outputs the longest common substring of the two files. In other words, the longest substring that is present in both files.

Your program should collapse all consecutive white-space into a single space. An example of correct output:
FILE1 contains: I love
programming all day and all night.
FILE2 contains: If I have to write one more program, I will change majors.
OUTPUT: The longest common substring is 9 characters 'e program'

If the longest common substring is more than 30 characters, simply print out the output above, but after thirty characters, print ....

Your program should be reasonably efficient, so it should be able to find the longest common substring in two books: "Moby Dick" and "War and Peace" which I will provide as data files. Note that these data files will be about 1,000,000 characters for Moby Dick and 3,000,000 characters for War and Peace. If your algorithm works, you will find that phrases such as

'could not sleep for a long time'
'covered his face with his hands'
'his whole attention was absorbed'
'it was impossible to prevent it'

are in both books. But they are not the longest such phrase.

Efficient Algorithm

In class we described a very inefficient algorithm. Here we will use a reasonable algorithm, but one that is not efficient in all cases. For this particular input, it is good enough. You will need to use a suffix array. There are complicated algorithms to construct the suffix array in linear time, but also simpler algorithms that will work well enough for this input. You can use the simpler construction algorithm.

Begin, by forming a single string representing the input file, followed by some special character or characters that are not in the input, followed, by the second file. As an example, let us use two much smaller inputs:

INPUT #1: fashion
INPUT #2: crashes

Example for above:

String: fashion#crashes

Then compute all the possible suffixes of the resulting string

s
es
hes
shes
ashes
rashes
crashes
#crashes
n#crashes
on#crashes
ion#crashes
hion#crashes
shion#crashes
ashion#crashes
fashion#crashes

Then sort the suffixes.

#crashes
ashes
ashion#crashes
crashes
es
fashion#crashes
hes
hion#crashes
ion#crashes
n#crashes
on#crashes
rashes
s
shes
shion#crashes

Sorting is pretty easy to do; use Arrays.sort in package java.util.

Now look at each adjacent pair of elements, in which one contains a # and one does not (you don't have to search for the # in your code; just check if the starting point of the string is before the # (the string length will give you an idea). For each such adjacent pair, compute the length of the common prefix; the longest such common prefix (in this case the prefix ash in ashes and ashion#crashes is your answer.

Note that excessive string concatenation in your algorithm will be fatal. You may need to use StringBuilders. You will also need to write your own String class. Details will be discussed in a lecture.

Submission Requirement

On the moodle web site, you will (eventually) find several sets of data files to test your program. Your program should include code that details how long it takes to compute your answers, as I did in class for the maximum subsequence sum problem. You must upload a zip file that includes your complete source project (preferably Netbeans, ready to load), and also contains the output in separate data files. VERY IMPORTANT: If you do not provide output for the test cases that I specify in separate, easy to find data files, I will assume that your program does not work on those test cases, and grade accordingly. Do not embed the output in your source code.

Additionally, please answer the following question:

Suppose the two data files are both the text for Moby Dick. Obviously the longest common substring is the entire file. Explain as precisely as possible how your algorithm behaves and why. Please remember that all work, including the answer to this question is to be individual work.