Saturday, March 9, 2013

Introduction to Strings in Mathematica

A format convention in Mathematica is to show quotation marks around strings in input cells, but not in output cells. To show them in output cells, use FullForm.

In[18]:= "I've entered a string by using quotation marks."

Out[18]= I've entered a string by using quotation marks.

In[19]:= Head@%

Out[19]= String

In[20]:= FullForm@%%

Out[20]//FullForm= "I've entered a string by using quotation marks."

In programming languages, "String" is the name for the data type of text expressions. Unlike a symbol in a general expression, a character inside a String, such as a, b, or c, does not refer to a place in memory that stores a value or a function definition. The characters in a String do not refer to anything; you could say they refer only to themselves as "literals", that is, the literal characters or tokens "a", "b", "c". An "a" is just a bare token with nothing behind it, hence the terms "literal" and "token" for the characters of a String.

In[21]:= a = 5534.7; a

Out[21]= 5534.7

Trying to use a String as a variable gives an error message.

In[22]:= "a" = "Lookee here."

During evaluation of In[22]:= Set::setraw: Cannot assign to raw object a. >>

Out[22]= Lookee here.
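
As a small added sketch of the difference (it assumes a still holds the value 5534.7 assigned above), compare the symbol a, the String "a", and the result of ToExpression, which converts a String into an expression and evaluates it:

In[30]:= {a, "a", ToExpression["a"]}

Out[30]= {5534.7, a, 5534.7}

The middle element is the one-character String, shown without its quotation marks. It never picks up the value stored in the symbol a.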

Consequently, Strings are not evaluated in the usual way that other expressions are. Wellin et al., in their excellent book, suggest that we think of Mathematica as breaking a String up into a sequence of its characters and then applying functions to that sequence.

In[8]:= aSentence = "The quick brown fox jumped over the 742 lazy white dogs.";

In[11]:= aSentenceCharacters = aSentence // Characters

Out[11]= {T,h,e, ,q,u,i,c,k, ,b,r,o,w,n, ,f,o,x, ,j,u,m,p,e,d, ,o,v,e,r, ,t,h,e, ,7,4,2, ,l,a,z,y, ,w,h,i,t,e, ,d,o,g,s,.}

Functions such as MatchQ, Cases, and Position, which work on non-String expressions, do not look inside a String: they give no error, but they find nothing. They do work on the String's list of characters.

In[14]:= Position[aSentence, "e"]

Out[14]= {}
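
One way to see why Position finds nothing, offered as a supplementary sketch rather than as Wellin et al.'s own example: to the general-purpose functions a String is an atom with no parts, so only the String-specific functions see its characters.

In[31]:= {AtomQ[aSentence], Length[aSentence], StringLength[aSentence]}

Out[31]= {True, 0, 56}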

So perhaps, as Wellin et al. suggest, Mathematica first breaks a String into a sequence of characters and then applies some of the regular built-in functions, as well as some specialized String functions, to that sequence.


In[25]:= MatchQ[#, "e"] & /@ aSentenceCharacters

Out[25]= {False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False}

In[24]:= StringMatchQ[aSentenceCharacters, "e"]

Out[24]= {False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, True, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False}
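
Cases, mentioned above but not yet shown, behaves the same way. As an illustrative sketch using the aSentence and aSentenceCharacters defined earlier, Cases picks the matching elements out of the character list, while StringCases is its String-specific counterpart and works on the String directly:

In[32]:= Cases[aSentenceCharacters, "e"]

Out[32]= {e, e, e, e, e}

In[33]:= StringCases[aSentence, "e"]

Out[33]= {e, e, e, e, e}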


In[15]:= Position[aSentenceCharacters, "e"]

Out[15]= {{3}, {25}, {30}, {35}, {50}}

In[16]:= StringPosition[aSentence, "e"]

Out[16]= {{3, 3}, {25, 25}, {30, 30}, {35, 35}, {50, 50}}

Each index appears twice in the StringPosition output because StringPosition reports the beginning and ending positions of each match, and a one-character match begins and ends at the same position.

In[20]:= StringPosition[aSentence, "lazy"]

Out[20]= {{41, 44}}
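
The pairs returned by StringPosition can be handed straight to StringTake, which accepts a {start, end} range. A brief sketch, added here for illustration, that recovers the matched word and then shows a two-character match whose beginning and ending positions differ:

In[34]:= StringTake[aSentence, {41, 44}]

Out[34]= lazy

In[35]:= StringPosition[aSentence, "he"]

Out[35]= {{2, 3}, {34, 35}}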

When working with Strings, you do not need to enclose a String in List brackets:

In[3]:= aString = {"abcd"}

Out[3]= {abcd}

In[4]:= aString2 = "abcd"

Out[4]= abcd

In[5]:= Head@aString2

Out[5]= String
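
Wrapping a String in a List is harmless but changes what comes back. In this added sketch (StringLength is used only as a convenient example of a built-in String function), the function threads over the List and returns a List, whereas the bare String gives a bare result:

In[36]:= Head@aString

Out[36]= List

In[37]:= StringLength[aString]

Out[37]= {4}

In[38]:= StringLength[aString2]

Out[38]= 4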

A String is delimited by quotation marks that output cells do not display. As usual, FullForm and InputForm reveal the way Mathematica sees the String.

In[6]:= FullForm@aString2

Out[6]//FullForm= "abcd"

In[7]:= InputForm@aString2

Out[7]//InputForm= "abcd"

Next we will look at the variety of String functions in Mathematica.