Wednesday, April 16, 2014

String Operations: The Essential Function StringSplit

Split a String and Gather its Words by Their First Letter

Let's process a nice sonnet I just read in Ray Kurzweil's excellent book on what he thinks is the heart of cognitive science and AI, How to Create a Mind: The Secret of Human Thought Revealed.
To start I copy and paste the text from the web between quotation marks.

In[377]:= sonnet73=
"That time of year thou mayst in me behold,
When yellow leaves, or none, or few, do hang
Upon those boughs which shake against the cold,
Bare ruined choirs, where late the sweet birds sang.
In me thou seest the twilight of such day,
As after sunset fadeth in the west,
Which by and by black night doth take away,
Death's second self, that seals up all in rest.
In me thou seest the glowing of such fire,
That on the ashes of his youth doth lie,
As the death-bed whereon it must expire,
Consumed with that which it was nourished by.
This thou perceiv'st, which makes thy love more strong,
To love that well, which thou must leave ere long.";

Can we use ToLowerCase without Map? Yes, since it's Listable.

In[380]:= sonnet73LowerCase=ToLowerCase@%377

Out[380]= that time of year thou mayst in me behold,
when yellow leaves, or none, or few, do hang
upon those boughs which shake against the cold,
bare ruined choirs, where late the sweet birds sang.
in me thou seest the twilight of such day,
as after sunset fadeth in the west,
which by and by black night doth take away,
death's second self, that seals up all in rest.
in me thou seest the glowing of such fire,
that on the ashes of his youth doth lie,
as the death-bed whereon it must expire,
consumed with that which it was nourished by.
this thou perceiv'st, which makes thy love more strong,
to love that well, which thou must leave ere long.

We need StringSplit, not SplitBy, since we are working with a String. StringSplit is a powerful and essential function when importing and transforming large files to the format you need. Examples in the Doc Center like this one show its versatility:

In[1]:= StringSplit["a-b:c-d:e-f-g",{":","-"}]

Out[1]= {a,b,c,d,e,f,g}

Using WordBoundary as the pattern test for where to split works, but leaves much more "noise". Here is the output. Using InputForm is often handy to see what is going in while processing Strings.

In[409]:= StringSplit[sonnet73LowerCase,WordBoundary]//InputForm

Out[409]//InputForm=
{"that", " ", "time", " ", "of", " ", "year", " ", "thou", " ", "mayst", " ", "in",
 " ", "me", " ", "behold", ",\n", "when", " ", "yellow", " ", "leaves", ", ", "or",
 " ", "none", ", ", "or", " ", "few", ", ", "do", " ", "hang", "\n", "upon", " ",
 "those", " ", "boughs", " ", "which", " ", "shake", " ", "against", " ", "the", " ",
 "cold", ",\n", "bare", " ", "ruined", " ", "choirs", ", ", "where", " ", "late", " ",
 "the", " ", "sweet", " ", "birds", " ", "sang", ".\n", "in", " ", "me", " ", "thou",
 " ", "seest", " ", "the", " ", "twilight", " ", "of", " ", "such", " ", "day", ",\n",
 "as", " ", "after", " ", "sunset", " ", "fadeth", " ", "in", " ", "the", " ", "west",
 ",\n", "which", " ", "by", " ", "and", " ", "by", " ", "black", " ", "night", " ",
 "doth", " ", "take", " ", "away", ",\n", "death", "'", "s", " ", "second", " ",
 "self", ", ", "that", " ", "seals", " ", "up", " ", "all", " ", "in", " ", "rest",
 ".\n", "in", " ", "me", " ", "thou", " ", "seest", " ", "the", " ", "glowing", " ",
 "of", " ", "such", " ", "fire", ",\n", "that", " ", "on", " ", "the", " ", "ashes",
 " ", "of", " ", "his", " ", "youth", " ", "doth", " ", "lie", ",\n", "as", " ", "the",
 " ", "death", "-", "bed", " ", "whereon", " ", "it", " ", "must", " ", "expire",
 ",\n", "consumed", " ", "with", " ", "that", " ", "which", " ", "it", " ", "was", " ",
 "nourished", " ", "by", ".\n", "this", " ", "thou", " ", "perceiv", "'", "st", ", ",
 "which", " ", "makes", " ", "thy", " ", "love", " ", "more", " ", "strong", ",\n",
 "to", " ", "love", " ", "that", " ", "well", ", ", "which", " ", "thou", " ", "must",
 " ", "leave", " ", "ere", " ", "long", "."}

What a mess! And using this lengthy construct to DeleteCases of punctuation still left an imperfect result (which I won't show in case you're about to eat).

%//DeleteCases[#,""|" "|","|", "|" ,"|"'"|"s"|"st"|",\n"|",\n"|".\n"|"\n"|"-"|"."|" . "|"  . "]&

Compare the result from proper use of StringSplit.

In[405]:= sonnet73Split=StringSplit[sonnet73LowerCase,{Whitespace,".",","}]//DeleteCases[#,""]&

Out[405]= {that,time,of,year,thou,mayst,in,me,behold,when,yellow,leaves,or,none,or,few,do,hang,upon,those,boughs,which,shake,against,the,cold,bare,ruined,choirs,where,late,the,sweet,birds,sang,in,me,thou,seest,the,twilight,of,such,day,as,after,sunset,fadeth,in,the,west,which,by,and,by,black,night,doth,take,away,death's,second,self,that,seals,up,all,in,rest,in,me,thou,seest,the,glowing,of,such,fire,that,on,the,ashes,of,his,youth,doth,lie,as,the,death-bed,whereon,it,must,expire,consumed,with,that,which,it,was,nourished,by,this,thou,perceiv'st,which,makes,thy,love,more,strong,to,love,that,well,which,thou,must,leave,ere,long}

Here GatherBy groups sub-lists by the first character of each word.

In[411]:= sonnet73SplitGathered=GatherBy[sonnet73Split,First@Characters@#&]

Out[411]= {{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{of,or,or,of,of,on,of},{year,yellow,youth},{mayst,me,me,me,must,makes,more,must},{in,in,in,in,in,it,it},{behold,boughs,bare,birds,by,by,black,by},{when,which,where,west,which,whereon,with,which,was,which,well,which},{leaves,late,lie,love,love,leave,long},{none,night,nourished},{few,fadeth,fire},{do,day,doth,death's,doth,death-bed},{hang,his},{upon,up},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{against,as,after,and,away,all,ashes,as},{cold,choirs,consumed},{ruined,rest},{glowing},{expire,ere},{perceiv'st}}

To see the Lists sorted by their initial letter, we should use SortBy, but this didn't work.

In[412]:= SortBy[%,First@Characters@#&]

We need to add another First to apply the sorting function to the First word in each sublist.

In[413]:= SortBy[%,First@Characters@First@#&]

Out[413]= {{against,as,after,and,away,all,ashes,as},{behold,boughs,bare,birds,by,by,black,by},{cold,choirs,consumed},{do,day,doth,death's,doth,death-bed},{expire,ere},{few,fadeth,fire},{glowing},{hang,his},{in,in,in,in,in,it,it},{leaves,late,lie,love,love,leave,long},{mayst,me,me,me,must,makes,more,must},{none,night,nourished},{of,or,or,of,of,on,of},{perceiv'st},{ruined,rest},{shake,sweet,sang,seest,such,sunset,second,self,seals,seest,such,strong},{that,time,thou,those,the,the,thou,the,twilight,the,take,that,thou,the,that,the,the,that,this,thou,thy,to,that,thou},{upon,up},{when,which,where,west,which,whereon,with,which,was,which,well,which},{year,yellow,youth}}

File Operations: Add the Total Byte Size of Files in Folders

File Operations: Add the Total Byte Size of Files in Folders

Mostly from the Carnegie-Mellon Pronouncing Dictionary (link available here: http://www.speech.cs.cmu.edu/cgi-bin/cmudict), I have created 138,418 files that I wanted to group into 3 folders of equal size. The files are in sub-folders according to the first letter of their filename, and a first pass at dividing them equally by dividing the Length of the original List by 3 split the last two Lists in the middle of files beginning with "p". I want to verify that the 3 sub-folders contain files of equal size or even them out. This involves some file operations and Select.

In[312]:= fileNames=FileNames["*.zip",{"C:\\Users\\kwcarlso\\Documents\\Kris\\Megapedia-Local\\Reference-English\\Target Files\\Reference-English"}]

Here and below I've abbreviated the output with "{filename, ..., filename}".

Out[312]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ... ,C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

How to best tally the file byte size in each sub-folder? I considered using GatherBy to group the FileNames into the same groups as the sub-folders, and that probably would have worked, but would have been complicated to use three predicates, each with a range of letters in it. Discretion is always the better part of valor in programming, even to the point of just adding up the byte sizes by hand if this task were a one-off and to let me get on to the next task.

I decided to apply the principle of "divide-and-conquer" and use Select to group each set of files individually, then add byte sizes for each result. The predicate in Select is a set-theoretic operation, which indicated using a set-theoretic function, MemberQ. An original design principle of Mathematica's pre-cursor, Symbolic Manipulation Program (SMP), was to transparently map to all common mathematical functions and syntax, and it is simplest to do just that if possible. I use the infix functional style for MemberQ because it seems a bit more readable than functional bracketed style (but that's a trivial personal preference).

My first attempt failed since I forgot that CharacterRange is case-sensitive. I changed "A" and "F" to "a" and "f" and it worked perfectly. We need the final #& since the Select test works by simply applying the predicate to each element of the first argument List that you provide it.

In[316]:= fileNamesAF=Select[fileNames,CharacterRange["a","f"]~MemberQ~First@Characters@FileNameTake@#&]

Out[316]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\f.zip}

In[317]:= fileNamesGP=Select[fileNames,CharacterRange["g","p"]~MemberQ~First@Characters@FileNameTake@#&]

Out[317]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\g.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\p.zip}

In[318]:= fileNamesQZ=Select[fileNames,CharacterRange["q","z"]~MemberQ~First@Characters@FileNameTake@#&]
Out[318]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\q.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

At this point all we need to do is Map FileByteCount onto the filename in each folder, which means at Level 2 only (note the brackets around 2).

In[319]:= Map[FileByteCount,{fileNamesAF,fileNamesGP,fileNamesQZ},{2}]

Out[319]= {{11154319,14865868,17087067,12155371,7279606,8313716},{8802167,9846927,5719874,2458146,6091626,8605940,14729204,4588698,4529191,13363263},{795821,11363572,22685379,9138068,3902314,3644541,6431739,119495,1027761,1314296}}

Replacing List with Plus to add the numbers in each List is a common operation. When referring to Output I often use the line number instead of % since I can then change the function and re-Evaluate without modifiying it (to %%, %3, or then using the line number).

In[321]:= Plus@@#&/@%319

Out[321]= {70855947,78735036,60422986}

The result told me that moving the files beginning with "P" from the middle set into the last set would even those last two groups out.

Friday, April 11, 2014

An Advanced Predicate and Analysis of How It Works with Trace and Timing

Wagner gives a different version of allodd@x that is more efficient than the one shown here. This function stops testing as soon as it finds a number that is not odd. Contrary to his predisposition, Wagner uses a procedural loop, "de-constructing," as he calls it, the List to operate on its Parts.

In[269]:= allOddProcedural@aList_List:=Module[{i},
For[i=1,i<=Length@aList,i++,If[!OddQ[aList[[i]]],Break[]]];i>Length@aList]

But I bet OddQ is Listable and therefore its C compilation would be much faster than the For loop. First let's compare the Wagner allOddQ from above with his procedural one.

Wagner's function Breaks after OddQ returns False when testing the third argument, 6 (near the end of the Trace). Note the 100-fold difference in Timing!

In[275]:= list2={3,5,6,7};

In[276]:= allOdd@list2//Trace

Out[276]= {{list2,{3,5,6,7}},allOdd[{3,5,6,7}],Length[Select[{3,5,6,7},OddQ]]==Length[{3,5,6,7}],{{Select[{3,5,6,7},OddQ],{OddQ[3],True},{OddQ[5],True},{OddQ[6],False},{OddQ[7],True},{3,5,7}},Length[{3,5,7}],3},{Length[{3,5,6,7}],4},3==4,False}

In[277]:= allOddProcedural@list2//Trace

Out[277]= {{list2,{3,5,6,7}},allOddProcedural[{3,5,6,7}],Module[{i$},For[i$=1,i$<=Length[{3,5,6,7}],i$++,If[!OddQ[{3,5,6,7}[[i$]]],Break[]]];i$>Length[{3,5,6,7}]],{For[i$41099=1,i$41099<=Length[{3,5,6,7}],i$41099++,If[!OddQ[{3,5,6,7}[[i$41099]]],Break[]]];i$41099>Length[{3,5,6,7}],{For[i$41099=1,i$41099<=Length[{3,5,6,7}],i$41099++,If[!OddQ[{3,5,6,7}[[i$41099]]],Break[]]],{i$41099=1,1},{{i$41099,1},{Length[{3,5,6,7}],4},1<=4,True},{{{{{i$41099,1},{3,5,6,7}[[1]],3},OddQ[3],True},!True,False},If[False,Break[]],Null},{i$41099++,{i$41099,1},{i$41099=2,2},1},{{i$41099,2},{Length[{3,5,6,7}],4},2<=4,True},{{{{{i$41099,2},{3,5,6,7}[[2]],5},OddQ[5],True},!True,False},If[False,Break[]],Null},{i$41099++,{i$41099,2},{i$41099=3,3},2},{{i$41099,3},{Length[{3,5,6,7}],4},3<=4,True},{{{{{i$41099,3},{3,5,6,7}[[3]],6},OddQ[6],False},!False,True},If[True,Break[]],Break[]}},{{i$41099,3},{Length[{3,5,6,7}],4},3>4,False},False},False}

In[262]:= list1=Range[1,10^7,2];

Timing[#@list1]&/@{allOddProcedural,allodd}

{{9.297660,True},{0.109201,<<1>>}}

Why Use Built-in Functions? They're Already Written, De-bugged, and Fast

Now let's try my premise - first, OddQ is Listable as I thought.

In[260]:= Attributes@OddQ

Out[260]= {Listable,Protected}

Let's write the simple function:

In[279]:= allOddFunctional@aList_List:=If[!OddQ@aList,i>Length@aList]

In[280]:= Timing[allOddFunctional@list1]

Out[280]= {0.093601,<<1>>}

allOddFunctional is 10X faster than Wagner's procedural function, even though it appears to check every argument:

In[281]:= allOddFunctional@list2//Trace

Out[281]= {{list2,{3,5,6,7}},allOddFunctional[{3,5,6,7}],If[!OddQ[{3,5,6,7}],i>Length[{3,5,6,7}]],{{OddQ[{3,5,6,7}],{OddQ[3],OddQ[5],OddQ[6],OddQ[7]},{OddQ[3],True},{OddQ[5],True},{OddQ[6],False},{OddQ[7],True},{True,True,False,True}},!{True,True,False,True}},If[!{True,True,False,True},i>Length[{3,5,6,7}]]}

The lesson is simply that built-in functions are almost always faster than ones we make if for no other reason than they are compiled into C.

Here Wagner shows the use of While instead of For to stop when it finds an odd number.

allodd@aList_List:=Module[{i=1},While[i<=Length@aList&&OddQ@aList[[i]],i++];i>Length@aList]

Here is a Dropbox link to Wagner's book. High kudos to "Mr Wizard Todd" for asking McGraw-Hill for the right to distribute Wagner's out-of-print book. Here's his post on the StackExchange Mathematica forum.

https://www.dropbox.com/s/kllwg6y44p8va5g/Wagner%20All%20Parts-RC.pdf

Here is a link to a compressed file: http://www.verbeia.com/mathematica/PowerProgMa.zip


More Complex Predicates: allOddQ, allIdenticalQ, subsetQ

"Mr Wizard Todd", Manfred Plagmann, and Sophia Scheibe have done Mathematica users a nice favor by getting permission from McGraw-Hill to distribute licensed copies of David Wagner' s superb book, Power Programming in Mathematica. (Dropbox link: https://www.dropbox.com/s/j2dsyvptnxjd369/Wagner%20All%20Parts-RC.pdf)

A related post is How to See the Equivalence of Select and Cases.

Here are some nice examples of predicates from Wagner (re - written in my Prefix/Postfix dialect).Test a list to see if all of its entries are odd integers.

allOdd@aList_List:=Length@Select[aList,OddQ]==Length@aList

allOdd@{2,3,5,6}

False

allOdd@{3,5,7,9}

True

This predicate tests a List to see if its parts are identical. Count returns all Parts that are Equal to the First Part.

identicalListPartsQ@aList_List:=Count[aList, First@aList] == Length@aList

identicalListPartsQ@{a,b,c,4}

False

identicalListPartsQ@{a,a,a,a}

True

SameQ vs. Equal  and Testing for Lists

SubsetQ determines if its first argument is a subset of its second argument and returns True or False. Wagner's function did not include a List test and he notes on Equal vs.SameQ : "The use of === rather than == makes SubsetQ return False if either set1 or set2 is not a list. Try it with Equal."

I use the Head test: set1_List, which rejects a non-List argument before evaluation. And more importantly, SameQ should always be used to test non-numerical equivalence. In fact using Equal here can give screwy results in part because of its usage as the mathematical "equals" (=) in Mathematica's syntax for equations.

When using abbreviated operators, we should first determine their Precedence:

Precedence/@{Union,Equal,SameQ}

{300.,290.,290.}

Since Union is slightly stickier than SameQ, the Union will be performed before the SameQ test.

subsetQ[set1_,set2_]:=set1∪set2===set2

subsetQ[{a,b,c},{a,b,d}]

False

subsetQ[{a,b},{a,b,2}]

False

Union Sorts Its Result

Whoops! Wagner's function didn't work. Let's find out why:

subsetQ[{a,b},{a,b,2}]//Trace

{subsetQ[{a,b},{a,b,2}],{a,b}∪{a,b,2}==={a,b,2},{{a,b}∪{a,b,2},{2,a,b}},{2,a,b}==={a,b,2},False}

The fact that Union returns a sorted List fouls up the comparison. Let's try a fix with Sort:

Clear@subsetQ;subsetQ[set1_,set2_]:=set1∪set2===Sort@set2

And it works - you see in the last step Sort fixes the order.

subsetQ[{a,b},{a,b,2}]//Trace

{subsetQ[{a,b},{a,b,2}],{a,b}∪{a,b,2}===Sort[{a,b,2}],{{a,b}∪{a,b,2},{2,a,b}},{Sort[{a,b,2}],{2,a,b}},{2,a,b}==={2,a,b},True}

Greater (>) is a "non-Q" predicate so the last statement (lacking a return-suppressing semi-colon), returns True or False.

More on predicates: http://mathematica-guide.blogspot.com/2015/07/how-to-see-equivalence-of-select-and.html

Create Predicates to Test Strings vs Numerics: Test for File Extensions

Elsewhere I tell why predicates are important to use, show how to find all predicates in Mathematica including ones not suffixed with "Q"and create ones of your own to test numerical expressions. Here I cover predicates for testing Strings, and create some tests for file types, as judged by their extensions. Note that FileExtension does not capture the period before the extension.

Equal (==) and SameQ (===) are valid predicates for Strings as well as numerics, and work fine for simple applications.

In[206]:= uncusFileQ@fileName_String := FileExtension@fileName == "unc"

In[207]:= uncusFileQ@"afile.unc"

Out[207]= True

In[208]:= uncusFileQ@"afile.txt"

Out[208]= False

In[209]:= textFileQ@fileName_String := FileExtension@fileName == "txt"

In[215]:= textFileQ@"sampleFile.txt"

Out[215]= True

To test for two alternative file extensions is a little trickier. Since we are testing for String equivalence between String patterns, we use the general String predicate function StringMatchQ and Alternatives (|), not logical Or (||). An example is testing for HTML files, since their extensions come in .htm and .html flavors.

In[210]:= Clear@htmlFileQ;
htmlFileQ@fileName_String := StringMatchQ[FileExtension@fileName, "html" | "htm"]

In[212]:= htmlFileQ@"www.jognog.com/index.html"

Out[212]= True

In[213]:= htmlFileQ@"www.jognog.com/index.htm"

Out[213]= True

In[214]:= htmlFileQ@"www.jognog.com/sitemap.xml"

Out[214]= False