Monday, May 26, 2014

How to Speed Up Processing Large Files

Here are five useful tips to speed up Mathematica, especially when working with large files.

First, use simple file formats such as .txt or .csv, thereby avoiding the complicated superstructure of, for example, Microsoft Excel. I have processed files containing between 10^5 and 10^6 elements after first saving the Excel file as CSV.
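
If Excel itself is not at hand, the conversion can also be done once from within Mathematica. The following is only a minimal sketch, assuming the data sits on the first sheet of the workbook. The file names are hypothetical, and the demo workbook is generated here just so the snippet runs on its own.

```mathematica
(* Hypothetical demo workbook, created only so that the sketch is self-contained *)
dir = CreateDirectory[];
xlsx = FileNameJoin[{dir, "demo.xlsx"}];
Export[xlsx, {{"id", "value"}, {1, 2.5}, {2, 3.5}}, "XLSX"];

(* One-off conversion: read the first sheet and write it back out as CSV *)
csv = FileNameJoin[{dir, "demo.csv"}];
Export[csv, Import[xlsx, {"Data", 1}], "CSV"];

(* Later sessions work with the lightweight CSV file only *)
data = Import[csv, "CSV"]
```

The slow Excel import is then paid once, and every later run reads the CSV.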

Second, memory is cheap, so the simplest fix is often to increase the memory allocated to Mathematica. (RAM, or random access memory, is roughly 100 to 1000 times faster than a standard hard drive, although solid state drives are narrowing that gap.) To increase the allocation, the command is

ReinstallJava[JVMArguments -> "-Xmx6000m"]

where the final piece, "6000m", sets the maximum heap size of the Java Virtual Machine that Mathematica uses, in this case 6000 MB (about 6 GB). It can be as large as your RAM will tolerate.

I have several machines with 16 GB of RAM and one with 32 GB. I've needed up to 9 GB ("9000m") allocated to Mathematica for generating large plots of trees as well as processing large files. 

To apply the setting, load the JLink package and then reinstall Java with a memory specification:

Needs@"JLink`";
ReinstallJava[JVMArguments -> "-Xmx3000m"]

Whatever you get for Output is fine as long as there is no error message.
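
One way to confirm that the setting took effect is to ask the Java Virtual Machine for its heap ceiling. This is a minimal sketch that relies on the standard Java method Runtime.maxMemory(). The variable name maxHeapMB is only illustrative, and the reported figure may differ somewhat from the requested size.

```mathematica
Needs["JLink`"];
ReinstallJava[JVMArguments -> "-Xmx3000m"];

(* Load java.lang.Runtime so that its static methods can be called from Mathematica *)
LoadJavaClass["java.lang.Runtime"];

(* Maximum heap the JVM will attempt to use, converted from bytes to megabytes *)
maxHeapMB = N[java`lang`Runtime`getRuntime[]@maxMemory[]/2^20]
```

If the number is close to the value passed after -Xmx, the new limit is in place.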

Third, instead of using Import, I have found that the lower-level ReadList function is one or two orders of magnitude faster. See http://reference.wolfram.com/mathematica/ref/ReadList.html. For instance, Import never finished on a 150 GB file, whereas ReadList read it in 72 seconds!
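
As an illustration of the ReadList approach, here is a minimal sketch for a comma-separated file. The demo file and the variable names are hypothetical, and the option settings shown are one reasonable choice rather than the only one.

```mathematica
(* Hypothetical demo file: 10^4 rows of three comma-separated integers *)
file = FileNameJoin[{$TemporaryDirectory, "readlist-demo.csv"}];
Export[file, RandomInteger[1000, {10^4, 3}], "CSV"];

(* Read each line as one record and split it on commas; the fields arrive as strings *)
fields = ReadList[file, Word, WordSeparators -> {","}, RecordLists -> True];

(* Convert the strings to numbers in a separate step *)
data = Map[ToExpression, fields, {2}];

Dimensions[data]
```

For whitespace-separated numeric files, ReadList[file, Number, RecordLists -> True] reads the numbers directly and skips the string conversion.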

Fourth, if getting all the records at once is slow, get the data in parts. See the Elements section of http://reference.wolfram.com/mathematica/ref/format/XLSX.html for details on getting parts of data. For example, I've found that breaking a list of 10^6 records into sublists of 50K records works wonders.
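
The idea of working in parts can be sketched in two ways: reading a file a fixed number of lines at a time, and splitting a list that is already in memory. The demo file, the chunk size of 10, and the variable names are assumptions chosen to keep the example small. With real data the chunk size would be something like the 50K mentioned above.

```mathematica
(* Hypothetical demo file with 25 lines, standing in for a much larger data file *)
file = FileNameJoin[{$TemporaryDirectory, "chunk-demo.txt"}];
Export[file,
  StringJoin[Riffle[Table["record " <> ToString[i], {i, 25}], "\n"]], "Text"];

chunkSize = 10;  (* kept small for the demo *)

(* Read the file chunkSize lines at a time instead of all at once *)
stream = OpenRead[file];
chunkLengths = {};
While[(lines = ReadList[stream, String, chunkSize]) =!= {},
  (* process the current chunk here; this demo only records its length *)
  AppendTo[chunkLengths, Length[lines]]];
Close[stream];

(* A list already in memory can be split the same way; the last sublist may be shorter *)
sublists = Partition[Range[25], chunkSize, chunkSize, 1, {}];
```

Reading through a stream keeps only one chunk in memory at a time, which is the main reason to prefer it for very large files.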

Fifth, I don't know why, but Mathematica is slow at exporting HTML format. When working with thousands of files, it is far quicker to save them as text and then change the file extensions with a shell command such as the following (Windows Command Prompt):

ren *.txt *.html
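
The same renaming can also be done from within Mathematica, which avoids depending on a particular shell. This is a minimal sketch, assuming all the text files sit in one directory. The directory and file names are hypothetical.

```mathematica
(* Hypothetical demo directory holding a few exported text files *)
dir = CreateDirectory[];
Do[
  Export[FileNameJoin[{dir, "page" <> ToString[i] <> ".txt"}], "<html></html>", "Text"],
  {i, 3}];

(* Rename every .txt file in dir so that it carries an .html extension instead *)
Scan[
  RenameFile[#, FileNameJoin[{DirectoryName[#], FileBaseName[#] <> ".html"}]] &,
  FileNames["*.txt", dir]];

(* List the renamed files *)
FileNameTake /@ FileNames["*.html", dir]
```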