String Matching

Yu Wang

 

Finding all occurrences of a pattern in a text is a fundamental problem in text-editing programs, so it is worthwhile to have an efficient algorithm for it.

Exact String Matching Problem:

Input: Text string T[1…n], pattern string P[1…m]

Output: Every location in T where P occurs, i.e., every s such that P[1…m] exactly matches the substring of T starting at location s, i.e., P[1…m] = T[s…(s+m−1)].

 

Naive-String-Matcher(T, P) first aligns P with the beginning of the text T and compares character by character. On a mismatch, P slides down one position and the comparison starts over. The algorithm repeats this until every alignment has been tried.

Naive-String-Matcher(T, P)

  1. n ← length(T)
  2. m ← length(P)
  3. for s ← 0 to n − m
  4. do if P[1…m] = T[s+1…s+m]
  5. then print "Pattern occurs with shift" s
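As a concrete illustration, here is a direct Python transcription of this pseudocode (a sketch; Python strings are 0-indexed, so the reported shifts are 0-based):

```python
def naive_string_matcher(T, P):
    """Try every alignment of P against T; report each matching shift."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):       # line 3: for s = 0 to n - m
        if T[s:s + m] == P:          # line 4: compare P[1..m] with T[s+1..s+m]
            shifts.append(s)         # line 5: pattern occurs with shift s
    return shifts

print(naive_string_matcher("ababaababa", "aba"))   # -> [0, 2, 5, 7]
```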

Time complexity:

The number of comparisons in the worst case is O(mn): at each of the O(n) possible shifts, the mismatch may occur only at the last character of P, costing m comparisons per shift.
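The worst case appears, for instance, with T = a…a and P = a…ab: every alignment matches all of P's leading a's and fails only on the final b. A small instrumented sketch (the comparison counter is my addition, not part of the algorithm itself):

```python
def naive_comparisons(T, P):
    """Run the naive matcher and count character comparisons."""
    n, m = len(T), len(P)
    count = 0
    for s in range(n - m + 1):
        for j in range(m):
            count += 1               # one character comparison
            if T[s + j] != P[j]:
                break                # mismatch: slide P down one cell
    return count

# n = 20, m = 5: 16 alignments, each performing all 5 comparisons.
print(naive_comparisons("a" * 20, "aaaab"))   # -> 80
```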

Naive-String-Matcher(T, P) is not an efficient algorithm for the string matching problem, especially when the pattern and text share many matching characters. The reason is that it throws away the information gained about the text in previous comparisons.

We now discuss an improved algorithm that takes into account the characters it has already seen. First we describe the idea behind it.

When T is a long string and P is much shorter, the algorithm preprocesses the pattern to determine how far P should slide when the i-th character of the pattern mismatches. The text plays no role in this preprocessing: the slide distances are independent of where the mismatched character lies in the text.

 

For example,

T:  a b a b a a a b a b a b c b c

P:  1 2 3 4 5 6 7
    a b a b a b a

 

When we run these two strings, positions 1–5 match perfectly, but position 6 mismatches. Instead of moving by one, the algorithm moves P two cells and checks again. Furthermore, it resumes comparing at text position 6, avoiding re-inspection of text characters that have already been matched. It uses the fact that the characters of the text in positions 3–5 are exactly the same as the first 3 characters of the pattern:

T:  a b a b a a a b a b a b c b c
P:  a b a b a b a
        a b a b a b a

 

When P consists of many distinct characters, the preprocessing is simple. For example, take P = a b c d e f g. If there is a mismatch at position 4, P can simply slide down three cells, past everything already matched. But if P is built from fewer distinct characters, the preprocessing is a little more complicated. For example, suppose P is

 

 1  2  3  4  5  6  7  8  9 10
 a  g  a  g  a  g  a  g  c  a

 

If the 1st or 2nd character mismatches the text, P slides down 1 position. If the mismatch is at position 3, P moves down 2 positions.

The number of cells moved is determined by the prefix function π of the pattern:

π[q] = the largest k < q such that P[1…k] = P[q−k+1…q],

where P[1…k] is the prefix of P of length k. The prefix function tells us that if a mismatch occurs at position q+1, we slide P by q − π[q], and we need not re-compare the first π[q] characters with the corresponding characters of T at the new shift.

(Note: we define π[q] as the largest such k, because if P slides any farther, an occurrence could be missed.)

KMP-Matcher(T, P)

begin

  1. n ← length[T]
  2. m ← length[P]
  3. π ← Compute-Prefix-Function(P)
  4. q ← 0
  5. for i = 1 to n do
  6. while ((q > 0) and (P[q+1] ≠ T[i])) do
  7. q ← π[q]
  8. if (P[q+1] = T[i]) then
  9. q ← q + 1
  10. if (q = m) then
  11. print "Pattern occurs at location" i − m
  12. q ← π[q]

end
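A Python sketch of KMP-Matcher (0-based indices, so pi[q-1] plays the role of π[q]; the prefix-function computation discussed in the next section is inlined here so the sketch runs on its own):

```python
def compute_prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[:q+1] that is also its suffix."""
    pi = [0] * len(P)
    k = 0
    for q in range(1, len(P)):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]            # fall back to the next-longest border
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(T, P):
    """Report every 0-based shift at which P occurs in T."""
    pi = compute_prefix_function(P)
    shifts = []
    q = 0                            # number of pattern characters matched so far
    for i, c in enumerate(T):
        while q > 0 and P[q] != c:   # mismatch: reuse prefix information
            q = pi[q - 1]
        if P[q] == c:
            q += 1
        if q == len(P):              # a full match ends at text position i
            shifts.append(i - len(P) + 1)
            q = pi[q - 1]            # keep looking for overlapping occurrences
    return shifts

print(kmp_matcher("abababababa", "abababa"))   # -> [0, 2, 4]
```

Note that after reporting a match, q is reset to pi[q-1] rather than 0, so overlapping occurrences are found.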

 

Text:

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
  a  g  a  g  g  a  g  g  a  g  a  g  g  a  g  a  g  g  a  g  g

Pattern:

  a  g  a  g  a  g  a  g  c  a

 

n = 21, m = 10. π is computed as follows:

 

 P:  a  g  a  g  a  g  a  g  c  a
 π:  0  0  1  2  3  4  5  6  0  1

 

 

 i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 T:  a  g  a  g  g  a  g  g  a  g  a  g  g  a  g  a  g  g  a  g  g
 P:  a  g  a  g  a  g  a  g  c  a
 π:  0  0  1  2  3  4  5  6  0  1

 

For i = 1 to i = 4, lines 8 and 9 execute (each character matches), so q reaches 4.

When i = 5, q = 4 and P[q+1] = P[5] = a ≠ T[5] = g, so line 7 executes: q = π[4] = 2. Effectively P moves down two cells:

 i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 T:  a  g  a  g  g  a  g  g  a  g  a  g  g  a  g  a  g  g  a  g  g
 P:        a  g  a  g  a  g  a  g  c  a
 π:        0  0  1  2  3  4  5  6  0  1

 

Now P[q+1] = P[3] = a ≠ T[5] = g, so line 7 executes again: q = π[2] = 0. Effectively P moves down again:

 i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 T:  a  g  a  g  g  a  g  g  a  g  a  g  g  a  g  a  g  g  a  g  g
 P:              a  g  a  g  a  g  a  g  c  a
 π:              0  0  1  2  3  4  5  6  0  1

 

With q = 0, the test on line 8 fails (P[1] = a ≠ T[5] = g), so q stays 0 and the for loop on line 5 advances to i = 6. Effectively P slides down once more:

 i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
 T:  a  g  a  g  g  a  g  g  a  g  a  g  g  a  g  a  g  g  a  g  g
 P:                 a  g  a  g  a  g  a  g  c  a
 π:                 0  0  1  2  3  4  5  6  0  1

 

Because P[1] = T[6] = a, lines 8 and 9 execute (q becomes 1), and the loop from line 5 continues…

The algorithm keeps sliding P down and comparing, sliding and comparing... In the end there is no complete match of the pattern (the text contains no c).

Time complexity: lines 5–12 make O(n) comparisons in the worst case (by an amortized argument like the one given below for the prefix function). To analyze the cost of line 3, we look at the algorithm Compute-Prefix-Function in detail.

First, let's look at a simple way to compute the prefix function:

Simple-Compute-Prefix-Function(P)

begin

  1. m ← length[P]
  2. π[1] ← 0
  3. for q = 2 to m do
  4. for k = q − 1 downto 0 do
  5. if (P[1…k] = P[q−k+1…q]) then
  6. π[q] ← k; stop the inner loop (the first k found is the largest)
  7. return π

end

 

Time complexity: this algorithm runs in O(m³) time: line 3 contributes O(m) iterations, line 4 another O(m), and the string comparison on line 5 costs O(m). That is poor preprocessing compared to KMP-Matcher(T, P) itself, whose comparison phase takes only O(n+m).

We can improve on this by introducing Compute-Prefix-Function, whose time complexity is only O(m).

Compute-Prefix-Function(P)

begin

  1. m ← length[P]
  2. π[1] ← 0
  3. k ← 0
  4. for q = 2 to m do
  5. while ((k > 0) and (P[k+1] ≠ P[q])) do
  6. k ← π[k]
  7. if (P[k+1] = P[q]) then
  8. k ← k + 1
  9. π[q] ← k
  10. return π

end
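As a sanity check, a direct Python transcription of this pseudocode (0-based indices, so pi[q-1] plays the role of π[q]) reproduces the π table of the example pattern above:

```python
def compute_prefix_function(P):
    """pi[q] = length of the longest proper prefix of P[:q+1] that is also its suffix."""
    m = len(P)
    pi = [0] * m
    k = 0                            # line 3: current candidate border length
    for q in range(1, m):            # line 4
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]            # line 6: shrink to the next-shorter border
        if P[k] == P[q]:
            k += 1                   # line 8
        pi[q] = k                    # line 9
    return pi

print(compute_prefix_function("agagagagca"))   # -> [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]
```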

This algorithm uses the same principle as KMP-Matcher(T, P), matching the pattern against itself.

 Time complexity:

Its running time can be analyzed with amortized analysis. Let the potential be the value of the variable k. The potential starts at 0, by line 3. Line 6 strictly decreases k whenever it executes, since π[k] < k; and because π[k] ≥ 0 for all k, k never becomes negative. The only other line affecting k is line 8, which increases k by at most one per iteration of the for loop; since q is incremented on every iteration, k < q always holds. Each execution of the while-loop body on line 6 is paid for by the corresponding drop in potential, and line 8 raises the potential by at most one, so the amortized cost of the loop body on lines 5–9 is O(1). The outer loop runs O(m) times, and the final potential is at least as great as the initial potential, so the total actual worst-case running time of this algorithm is O(m).

 So the time complexity of algorithm KMP-Matcher(T,P) is O(m+n).