Thursday, February 21, 2013

Saving data into cPickle format (in Python)

I recently wrote a Python script that generated a huge-ass dictionary, which I wanted to save and use later in another program. The simple solution was simply to print the dictionary and pipe it into a .txt file. However, the .txt file ended up being almost 500MB, and using eval() to parse the text back into a Python object took around 5 minutes. Enter the cPickle module. cPickle is a C implementation of Python's pickle serialization module, which effectively means that it's faster. In my case, loading time dropped to a few seconds, and the cPickle'd file was 50% of the size of the plain .txt file.
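To make the two approaches concrete, here is a minimal sketch that round-trips a small dictionary both ways. The "sample" dictionary is a made-up stand-in for the real data, and ast.literal_eval() is swapped in for eval() as a safer way to parse a printed literal. The import falls back to plain pickle because cPickle only exists in Python 2; in Python 3, pickle picks up its C implementation automatically.

```python
import ast

try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

# Hypothetical stand-in for the real (much larger) dictionary.
sample = {i: [i, i * 2, "value %d" % i] for i in range(1000)}

# Text approach: store the printed form, parse it back later.
as_text = repr(sample)
from_text = ast.literal_eval(as_text)  # safer stand-in for eval()

# Pickle approach: serialize to bytes with protocol 2, then load it back.
as_pickle = pickle.dumps(sample, 2)
from_pickle = pickle.loads(as_pickle)

print("text: %d characters, pickle: %d bytes" % (len(as_text), len(as_pickle)))
```

The actual sizes and timings depend heavily on what is in the dictionary, so treat the printed numbers as something to measure on your own data rather than a fixed ratio.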

Here are a couple of snippets to get you started using cPickle. First, import cPickle. The module is part of the Python 2 standard library, so there is nothing extra to install (in Python 3 it was merged into the plain pickle module):
import cPickle
In this example, we have an array called "large_array" created by the "create_large_array()" function. I want to store "large_array" in a file called "large_array.cpickle".
large_array = create_large_array()
output_filename = "large_array.cpickle"

# The with-block closes the file automatically, even if dump() raises an error
with open(output_filename, "wb") as f:
    cPickle.dump(large_array, f, protocol=2)
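In this example "open()" is called with the second argument "wb", which tells Python to open the file in write/binary mode. "cPickle.dump()" dumps "large_array" into "f" using "protocol=2". cPickle supports several protocols (0, 1 and 2) that all store the same data in different encodings. Protocol 0 is the ASCII default, while protocol 2, added in Python 2.3, is binary and generally the most compact and the fastest.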
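If you want to check how the protocols behave on your own data, a rough sketch like the following pickles the same object with each protocol and prints the resulting sizes. The "sample" dictionary is invented for illustration, and the import falls back to pickle so the snippet also runs on Python 3.

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# Hypothetical sample data; replace with your own object.
sample = {i: [float(i), "value %d" % i] for i in range(1000)}

sizes = {}
for protocol in (0, 1, 2):
    data = pickle.dumps(sample, protocol)
    # Every protocol must give back an identical object.
    assert pickle.loads(data) == sample
    sizes[protocol] = len(data)

for protocol in sorted(sizes):
    print("protocol %d: %d bytes" % (protocol, sizes[protocol]))
```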

If you want to load the array from "large_array.cpickle" in another script, you can use something like this:
def load_pickle(filename):
    # "rb" = read/binary mode; the with-block closes the file for us
    with open(filename, "rb") as f:
        return cPickle.load(f)


large_array = load_pickle("large_array.cpickle")
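Putting the two halves together, here is a self-contained sketch of the full save/load round trip. The save_pickle() helper, the list standing in for create_large_array(), and the temporary directory are all assumptions made for the example; the fallback import lets it run on Python 3 as well, where cPickle no longer exists.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def save_pickle(obj, filename):
    # Hypothetical helper mirroring load_pickle(); writes with protocol 2.
    with open(filename, "wb") as f:
        pickle.dump(obj, f, 2)


def load_pickle(filename):
    with open(filename, "rb") as f:
        return pickle.load(f)


# Stand-in for create_large_array(), which is not shown in this post.
large_array = list(range(100000))

# Write into a fresh temporary directory so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "large_array.cpickle")
save_pickle(large_array, path)
restored = load_pickle(path)

print(restored == large_array)
```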