Monday, August 6, 2012

Longest Common Substring Problem - Dynamic Programming (C#)

Given two strings, we need to find out the longest sub string of first string which is also present in second string. 

Let us understand what sub strings are, sub strings of a string "manu" are 'm', 'a','n','u','ma','man','an', 'anu', 'nu'. We are only concerned about sub strings but not all permutations of the given string . 

Let us take example of strings, "hello" and "yellow". The common sub strings for these two are many but we need to find out the longest one which is "ello" in this case. Yes, now we get the idea. This can easily be solved in a recursive way. This problem is where we need to use previously computed result and build solution over it. Common sub string will start with a character that is common in both string 1 and string 2. With compromise to time and space we can solve above problem in O(m*n) time and O(m*n) space using straight away approach. Since we use previously computed values, we can store them and avoid extra computations and thus make this solution belong to category of "Dynamic Programming". 

For each character str1[i] we loop with all the characters in str2. If str1[i] matches with any character in str2, we can say common sub string  exists here which is of length (length of longest common sub string of str1[1..i-1], str2[1..j-1] +1) or starts here which is of length 1. We will use a two dimensional array of dimensions [str1.len, str2.len] to store, longest common sub string at that point [i,j].

if str1[m] == str2[n] 

     LCS(Str1[1..m], str2[1..n]) = LCS(str1[1..m-1],str2[1..n-1]) +1
else
      LCS(Str1[1..m], str2[1..n]) = 0

For "hello" and "yellow" above mentioned LCS method can be explained with below table:










This wiki link explains this solution better using tables to store intermediate values and solving complete problem by solving smaller problems. We will code above explained solution using C#. We need to iterate over each character of string 2 inside each character looping of string 1 and make sure we update lcs_so_far whenever we come across lcs of common sub string of greater height. We can use the length and last index of common sub string to actually print the longest common sub string.

Below implementation does not handle boundary or any special checks. It is only to explain the longest common sub string algorithm. [Tested with multiple inputs]

Code:

        public static void PrintLongestCommonSubstring(char[] str1, char[] str2)
        {
            int[,] l = new int[str1.Length, str2.Length];
            int lcs = -1;
            string substr = string.Empty;
            int end = -1;
            
            for (int i = 0; i < str1.Length; i++)
            {
                for (int j = 0; j < str2.Length; j++)
                {
                    if (str1[i] == str2[j])
                    {
                        if (i == 0 || j == 0)
                        {
                            l[i, j] = 1;
                        }
                        else
                            l[i, j] = l[i - 1, j - 1] + 1;
                        if (l[i, j] > lcs)
                        {
                            lcs = l[i, j];
                            end = i;
                        }

                    }
                    else
                        l[i, j] = 0;
                }
            }

            for (int i = end - lcs + 1; i <= end; i++)
            {
                substr += str1[i];
            }
            
            Console.WriteLine("Longest Common SubString Length = {0}, Longest Common Substring = {1}", lcs, substr);
        } 


Longest Common sub string problem can be solved efficiently using suffix trees. We can achieve almost linear time with suffix trees or even "Trie". I will cover "Trie" and\or suffix trees in some other article. "Trie" allows you easy text search like suffix trees. They both allow a text to be searched in quicker time but may need space and time to create the tree.

Note: Using trees will be ideal here. 

No comments:

Post a Comment