Wednesday, April 16, 2014

File Operations: Add the Total Byte Size of Files in Folders

Mostly from the Carnegie-Mellon Pronouncing Dictionary (available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict), I have created 138,418 files that I want to divide into 3 folders of roughly equal total size. The files are in sub-folders according to the first letter of their filename. A first pass, which simply divided the Length of the original List by 3, put the boundary between the last two Lists in the middle of the files beginning with "p". I want to verify that the 3 folders hold about the same number of bytes, or even them out if they do not. This involves some file operations and Select.

In[312]:= fileNames=FileNames["*.zip",{"C:\\Users\\kwcarlso\\Documents\\Kris\\Megapedia-Local\\Reference-English\\Target Files\\Reference-English"}]

Here and below I've abbreviated the output with "{filename, ..., filename}".

Out[312]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ... ,C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

How best to tally the file byte size in each sub-folder? I considered using GatherBy to group the FileNames into the same groups as the sub-folders. That probably would have worked, but it would have been complicated to write three predicates, each with a range of letters in it. Discretion is the better part of valor in programming: if this task were a one-off, even adding up the byte sizes by hand would have been reasonable, just to let me get on to the next task.
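For comparison, one possible shape for the GatherBy approach is a single classifier function rather than three separate predicates. This is only a sketch: sampleNames is made-up data standing in for the real paths, and groupIndex is a hypothetical helper name that does not appear in the session.

```mathematica
(* Hypothetical sample names standing in for the real file paths *)
sampleNames = {"a.zip", "f.zip", "g.zip", "p.zip", "q.zip", "z.zip"};

(* groupIndex is an illustrative helper: 1 for a-f, 2 for g-p, 3 otherwise *)
groupIndex[name_String] :=
  With[{c = StringTake[FileNameTake[name], 1]},
    Which[
      MemberQ[CharacterRange["a", "f"], c], 1,
      MemberQ[CharacterRange["g", "p"], c], 2,
      True, 3]]

(* GatherBy collects elements that give the same value of groupIndex *)
gathered = GatherBy[sampleNames, groupIndex]
(* {{"a.zip", "f.zip"}, {"g.zip", "p.zip"}, {"q.zip", "z.zip"}} *)
```

The letter ranges still have to be written out once each, so this is not obviously simpler than three Select calls.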

I decided to apply the principle of "divide-and-conquer" and use Select to pick out each set of files individually, then add the byte sizes for each result. The test in Select is a question of set membership, which suggested a set-theoretic function, MemberQ. An original design principle of Mathematica's precursor, the Symbolic Manipulation Program (SMP), was to map transparently to all common mathematical functions and syntax, and it is simplest to do just that if possible. I use the infix form for MemberQ because it seems a bit more readable than the bracketed functional form (but that's a trivial personal preference).

My first attempt failed since I forgot that the comparison is case-sensitive: CharacterRange["A","F"] yields upper-case letters, while the filenames begin with lower-case ones. I changed "A" and "F" to "a" and "f" and it worked perfectly. We need the # and the final & since Select works by applying the test, as a pure function, to each element of the List given as its first argument.
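A minimal sketch of both points, using made-up filenames rather than the real paths. The names upperHit, lowerHit, normalizedHit, and inAF are introduced here purely for illustration.

```mathematica
(* CharacterRange["A","F"] gives {"A","B","C","D","E","F"}, so "a" is not a member *)
upperHit = MemberQ[CharacterRange["A", "F"], "a"]   (* False *)
lowerHit = MemberQ[CharacterRange["a", "f"], "a"]   (* True *)

(* ToLowerCase is one way to make the test insensitive to case *)
normalizedHit =
  MemberQ[CharacterRange["a", "f"], ToLowerCase[StringTake["Apple.zip", 1]]]   (* True *)

(* The same test as in the session, written in bracketed form; # is the slot
   that receives each element and & marks the end of the pure function *)
inAF = MemberQ[CharacterRange["a", "f"], First[Characters[FileNameTake[#]]]] &;

selected = Select[{"a.zip", "q.zip", "f.zip"}, inAF]
(* {"a.zip", "f.zip"} *)
```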

In[316]:= fileNamesAF=Select[fileNames,CharacterRange["a","f"]~MemberQ~First@Characters@FileNameTake@#&]

Out[316]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\a.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\f.zip}

In[317]:= fileNamesGP=Select[fileNames,CharacterRange["g","p"]~MemberQ~First@Characters@FileNameTake@#&]

Out[317]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\g.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\p.zip}

In[318]:= fileNamesQZ=Select[fileNames,CharacterRange["q","z"]~MemberQ~First@Characters@FileNameTake@#&]

Out[318]= {C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\q.zip, ..., C:\Users\kwcarlso\Documents\Kris\Megapedia-Local\Reference-English\Target Files\Reference-English\z.zip}

At this point all we need to do is Map FileByteCount onto each filename in each group, which means mapping at Level 2 only (note the braces around 2).

In[319]:= Map[FileByteCount,{fileNamesAF,fileNamesGP,fileNamesQZ},{2}]

Out[319]= {{11154319,14865868,17087067,12155371,7279606,8313716},{8802167,9846927,5719874,2458146,6091626,8605940,14729204,4588698,4529191,13363263},{795821,11363572,22685379,9138068,3902314,3644541,6431739,119495,1027761,1314296}}
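The effect of the level specification is easiest to see with an undefined symbol in place of FileByteCount. In this sketch, f and nested are placeholders chosen for illustration.

```mathematica
nested = {{1, 2}, {3}};

(* Default: level 1, so f wraps each sublist as a whole *)
lvl1 = Map[f, nested]            (* {f[{1, 2}], f[{3}]} *)

(* {2} means level 2 only, so f wraps each innermost element *)
lvl2only = Map[f, nested, {2}]   (* {{f[1], f[2]}, {f[3]}} *)

(* A bare 2 means levels 1 through 2, so f is applied at both levels *)
lvl1to2 = Map[f, nested, 2]      (* {f[{f[1], f[2]}], f[{f[3]}]} *)
```

With FileByteCount, only the {2} form hands the function a single filename each time, which is why the braces matter.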

Replacing the head List with Plus to add the numbers in each List is a common operation. When referring to earlier output I often use the line number (here %319) instead of %, since I can then change the function and re-evaluate without modifying the reference (to %%, %3, and so on).

In[321]:= Plus@@#&/@%319

Out[321]= {70855947,78735036,60422986}
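Several equivalent spellings exist for summing each sublist. The sketch below uses a small made-up list, toy, rather than the real byte counts.

```mathematica
toy = {{1, 2, 3}, {4, 5}, {6}};

(* The form used in the session: replace the head List with Plus in each sublist *)
viaApply = Plus @@ # & /@ toy    (* {6, 9, 6} *)

(* @@@ applies Plus at level 1 directly, without the pure function *)
viaApply1 = Plus @@@ toy         (* {6, 9, 6} *)

(* Total mapped over the sublists gives the same sums *)
viaTotal = Total /@ toy          (* {6, 9, 6} *)
```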

The result told me that moving the files beginning with "p" from the middle set into the last set would even those last two groups out.
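The effect of that move can be checked from the numbers already shown. In this sketch the totals are copied from Out[321], and pBytes is the last entry of the middle list in Out[319], on the assumption that it corresponds to p.zip, the final file in the "g" to "p" group. The variable names are illustrative.

```mathematica
totals = {70855947, 78735036, 60422986};   (* from Out[321] *)
pBytes = 13363263;                         (* assumed to be p.zip, from Out[319] *)

(* Subtract p from the middle group and add it to the last group *)
newTotals = totals + {0, -pBytes, pBytes}
(* {70855947, 65371773, 73786249} *)

(* Gap between the last two groups before and after the move *)
gapBefore = Abs[totals[[2]] - totals[[3]]]        (* 18312050 *)
gapAfter = Abs[newTotals[[2]] - newTotals[[3]]]   (* 8414476 *)
```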
