Wednesday, October 30, 2013

Strings in depth C# - Immutability, string interning, String.Intern, Intern Pool and StringBuilder

a. String are immutatble

Immutability means value of a string object can't be modified once its assigned. If a string object is modified to create a different value a new string object with that new value will be created.

Because of this immutable property of strings, any operation like Addition or Deletion  will result in new string creation and it will be performance penalty. 

In cases where we repeatedly add or delete from a string, StringBuilder is suggested. Unlike string any operation on StringBuilder object will work on single instance of string object and hence will be efficient in cases where repeated string modification operations are involved.

b. String is a reference type

I know it's a simple concept yet good to note it.

c. What is string interning?

String interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

C# also uses string interning. It interns string based on the literal value. However any operation that results in same literal value as a literal in intern pool does not get to use the existing intern reference.  

Intern Pool:

The common language runtime(CLR) automatically maintains a table, called the "intern pool", which contains a single instance of each unique literal string constant declared in a program, as well as any unique instance of String you add programmatically.
The intern pool conserves string storage. If you assign a literal string constant to several variables, each variable is set to reference the same constant in the intern pool instead of referencing several different instances of String that have identical values.
There are two methods where you can get the intern of a string i.e. get the reference for a string if it's already in the Intern pool.
a. IsInterned
Returns reference to a string if the literal string is already in the intern pool otherwise it returns null.
string s1 = "hello";
string s2 = String.IsInterned("Hello") ;
IsInterned will check intern pool for string with value "hello" and returns reference to that string.
Hence s1 will have reference as s2.
Note: IsInterned does not return bool but returns "null" when value is not present in intern pool.
b. Intern
Returns reference to a string if the literal string is already in the intern pool otherwise create a string with given literal value.
Below block is straight from MSDN about performance considerations when working with intern pool.

Performance Considerations with intern pool:

If you are trying to reduce the total amount of memory your application allocates, keep in mind that interning a string has two unwanted side effects. First, the memory allocated for interned String objects is not likely be released until the common language runtime (CLR) terminates. The reason is that the CLR's reference to the interned String object can persist after your application, or even your application domain, terminates. Second, to intern a string, you must first create the string. The memory used by the String object must still be allocated, even though the memory will eventually be garbage collected.
The .NET Framework version 2.0 introduces CompilationRelaxations.NoStringInterning enumeration member. The NoStringInterning member marks an assembly as not requiring string-literal interning. You can apply NoStringInterning to an assembly using the CompilationRelaxationsAttribute attribute. Also, when you use the Native Image Generator (Ngen.exe) to compile an assembly in advance of run time, strings are not interned across modules.

Tuesday, October 8, 2013

Longest Common Substring and Longest Common Subsequence (LCS) - Code using C# (Part 1)

Before starting with algorithms, let us understand difference between a substring of a string and a susbsequence of a string. Substring and subsequence are confusing terms. 

Sub-string is the part of the input string itself. It has characters in the same order they appear in the main string. Whereas subsequence is the string which has sequence of characters of main string, which can be formed by deleting one or many characters randomly from the input string.

Example:
Input string    : "HELLO WELCOME EVERYONE"
Sub-string     : "EVERY" or "COME"
Subsequence : "LWCME"


Longest Common substring

Longest Common substring of two given strings is the string which is common substring of both given strings and longest of all such common substrings.

Example:
String 1   : "HELLO WELCOME EVERYONE"
String 2   : "NO TWEET NO NO WELEVERYONEWELHELL"
LCString : "EVERYONE"

One approach of solving this would be using Dynamic Programming.

a. Allocate a 2D array of size [str1.Length, str2.Length]
b. Each value of this matrix saves the information about length of longest substring that's ending at that position. e.g. L[3,4] gives length of longest common substring of first three characters of first input string and first four characters of second input string.
c. If we start with 1st character of string 1 and 1st character of string 2, if they match we can say L[0, 0] = 1
d. We then can build this array by seeing if LCS value at any postion is greater than cached LCS value.
e. We will also save the ending index of the longest common substring till every point when character do not match. This can be used to backtrack and get the actual LCS from given input string.
f. L[str1.Length, str2.Length] will have the final LCS length saved in it.
g. Each step uses LCS value calculated in previous steps and adds one to it if characters match at new indexes in both given strings.

In below C# code I'm assuming array of characters are passed as input instead of string.

        public static void PrintLongestCommonSubstring(char[] str1, char[] str2)
        {
            int[,] l = new int[str1.Length, str2.Length];
            int lcs = -1;
            string substr = string.Empty;
            int end = -1;
           
            for (int i = 0; i < str1.Length; i++)
            {
                for (int j = 0; j < str2.Length; j++)
                {
                    //if character at 'i' in string 1 and character at 'j' in string 2
                    matches then increment the length of lcs value for that position
                    if (str1[i] == str2[j])
                    {
                        //taking care of array indexes
                        if (i == 0 || j == 0)
                        {
                            l[i, j] = 1;
                        }
                        else
                        {
                            //characters match , lets update new lcs value
                            l[i, j] = l[i - 1, j - 1] + 1;
                        }

                        //This is to get the longest common substring
                        // use end index in the first string for longest common substring
                        if (l[i, j] > lcs)
                        {
                            lcs = l[i, j];
                            end = i;
                        }

                    }
                    else
                    {
                        //characters do not match use set length of LCS at that position
                        as 0
                        l[i, j] = 0;
                    }
                }
            }

            for (int i = end - lcs + 1; i <= end; i++)
            {
                substr += str1[i];
            }
           
            Console.WriteLine("Longest Common SubString Length = {0}, Longest Common  
               Substring = {1}", lcs, substr);
        }


NOTE: I hope comments in the above code helps in understanding the code block. 

I will continue this article and explain about finding Longest Common Subsequence in next article.

Algorithm and References: wikipedia (link provided above)