Tuesday, October 8, 2013

Longest Common Substring and Longest Common Subsequence (LCS) - Code using C# (Part 1)

Before starting with algorithms, let us understand difference between a substring of a string and a susbsequence of a string. Substring and subsequence are confusing terms. 

Sub-string is the part of the input string itself. It has characters in the same order they appear in the main string. Whereas subsequence is the string which has sequence of characters of main string, which can be formed by deleting one or many characters randomly from the input string.

Example:
Input string    : "HELLO WELCOME EVERYONE"
Sub-string     : "EVERY" or "COME"
Subsequence : "LWCME"


Longest Common substring

Longest Common substring of two given strings is the string which is common substring of both given strings and longest of all such common substrings.

Example:
String 1   : "HELLO WELCOME EVERYONE"
String 2   : "NO TWEET NO NO WELEVERYONEWELHELL"
LCString : "EVERYONE"

One approach of solving this would be using Dynamic Programming.

a. Allocate a 2D array of size [str1.Length, str2.Length]
b. Each value of this matrix saves the information about length of longest substring that's ending at that position. e.g. L[3,4] gives length of longest common substring of first three characters of first input string and first four characters of second input string.
c. If we start with 1st character of string 1 and 1st character of string 2, if they match we can say L[0, 0] = 1
d. We then can build this array by seeing if LCS value at any postion is greater than cached LCS value.
e. We will also save the ending index of the longest common substring till every point when character do not match. This can be used to backtrack and get the actual LCS from given input string.
f. L[str1.Length, str2.Length] will have the final LCS length saved in it.
g. Each step uses LCS value calculated in previous steps and adds one to it if characters match at new indexes in both given strings.

In below C# code I'm assuming array of characters are passed as input instead of string.

        public static void PrintLongestCommonSubstring(char[] str1, char[] str2)
        {
            int[,] l = new int[str1.Length, str2.Length];
            int lcs = -1;
            string substr = string.Empty;
            int end = -1;
           
            for (int i = 0; i < str1.Length; i++)
            {
                for (int j = 0; j < str2.Length; j++)
                {
                    //if character at 'i' in string 1 and character at 'j' in string 2
                    matches then increment the length of lcs value for that position
                    if (str1[i] == str2[j])
                    {
                        //taking care of array indexes
                        if (i == 0 || j == 0)
                        {
                            l[i, j] = 1;
                        }
                        else
                        {
                            //characters match , lets update new lcs value
                            l[i, j] = l[i - 1, j - 1] + 1;
                        }

                        //This is to get the longest common substring
                        // use end index in the first string for longest common substring
                        if (l[i, j] > lcs)
                        {
                            lcs = l[i, j];
                            end = i;
                        }

                    }
                    else
                    {
                        //characters do not match use set length of LCS at that position
                        as 0
                        l[i, j] = 0;
                    }
                }
            }

            for (int i = end - lcs + 1; i <= end; i++)
            {
                substr += str1[i];
            }
           
            Console.WriteLine("Longest Common SubString Length = {0}, Longest Common  
               Substring = {1}", lcs, substr);
        }


NOTE: I hope comments in the above code helps in understanding the code block. 

I will continue this article and explain about finding Longest Common Subsequence in next article.

Algorithm and References: wikipedia (link provided above)


No comments:

Post a Comment