<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize &#187; Python</title>
	<atom:link href="http://metaoptimize.com/blog/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://metaoptimize.com/blog</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Sat, 17 Mar 2012 21:25:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=abc</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API</title>
		<link>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/</link>
		<comments>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 16:08:00 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[API]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://metaoptimize.com/blog/?p=135</guid>
		<description><![CDATA[Until there is better documentation for Lucene 3.0, I recommend you use Lucene 2.4 or 2.9. Nonetheless, I provide basic indexing and retrieval code using the PyLucene 3.0 API, perhaps the first such example code on the web.]]></description>
			<content:encoded><![CDATA[
<h2>Summary</h2>
<p>I provide basic indexing and retrieval code using the PyLucene 3.0 API. <a href="http://manning.com/lucene">Lucene In Action (2nd Ed)</a> covers Lucene 3.0, but the PyLucene code samples for it have not been updated for the 3.0 API, only the Java ones. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<p><em>Update 20100810: In light of discussions with others, this post has been substantially rewritten since it was first posted.</em></p>
<hr />
<h2>Background</h2>
<p>Historically, I have found it easy to write basic PyLucene 2.4 (or 2.9?) code. PyLucene includes Lucene In Action code samples ported from Java to Python, and these code samples are correct and easy to adapt. I recently was developing a new project based upon Lucene (<a href="http://github.com/turian/biased-text-sample">biased-text-sample</a>), and I decided to try PyLucene 3.0.2–1. I was surprised to find that PyLucene code samples in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/"><tt>samples/LuceneInAction/</tt></a> are out-of-date, and use the 2.x API. (Note: The code in <a href="http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/"><tt>samples/*.py</tt></a> appears to be updated to the 3.0 API.)</p>
<p>I was able to find no Lucene 3.0 tutorials or code samples on the web, except for this one article:</p>
<ul>
<li>
<a href="http://ikaisays.com/2010/04/24/lucene-in-memory-search-example-now-updated-for-lucene-3-0-1/">Lucene In-Memory Search Example: Now updated for Lucene 3.0.1</a></li>
</ul>
<p>If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.</p>
<hr />
<h2>Sample PyLucene 3.0 code</h2>
<p>In the spirit of Lingpipe’s <a href="http://lingpipe-blog.com/2009/02/18/lucene-24-in-60-seconds/">Lucene 2.4 in 60 seconds</a>, here are relevant PyLucene 3.0 code snippets from my <a href="http://github.com/turian/biased-text-sample">biased-text-sample</a> project, for indexing and retrieval. </p>
<h3>Indexing</h3>
<pre>import sys

import lucene
from lucene import \
    SimpleFSDirectory, System, File, \
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))

    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()

    print >> sys.stderr, "Reading lines from sys.stdin..."
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)

    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % writer.numDocs()
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % writer.numDocs()
</pre>
<h3>Retrieval</h3>
<pre>
import lucene
from lucene import \
    SimpleFSDirectory, System, File, \
    Document, Field, StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)

    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)

    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")
</pre>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2010/08/09/pylucene-3-0-in-60-seconds-tutorial-sample-code-for-the-3-0-api/feed/</wfw:commentRss>
		<slash:comments>43</slash:comments>
		</item>
		<item>
		<title>Why can’t you pickle generators in Python? A pattern for saving training state</title>
		<link>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/</link>
		<comments>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 08:52:13 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[experimental control]]></category>
		<category><![CDATA[Generator]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[persistance]]></category>
		<category><![CDATA[pickling]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[training state]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=72</guid>
		<description><![CDATA[

Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.

Generators for streaming training examples
For machine learning, python generators are a [...]]]></description>
			<content:encoded><![CDATA[
<h1>Summary</h1>
<p><a href="http://flickr.com/photos/28402283@N07/3186143355" title="Moon Rise behind the San Gorgonio Pass Wind Farm"><img align=right src="http://farm4.static.flickr.com/3118/3186143355_4840fb7620_t.jpg" /></a></p>
<p>A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.</p>
<p>I would also try <a href="http://www.fiber-space.de/generator_tools/doc/generator_tools.html">generator_tools</a>, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.</p>
<hr />
<h2>Generators for streaming training examples</h2>
<p>For machine learning, Python <a href="http://www.ibm.com/developerworks/library/l-pycon.html">generators</a> are a simple idiom that makes it easy to generate a stream of training examples. Moreover, you can nest generators:</p>
<ul>
<li>The inner generator can be used to read one example at a time.</li>
<li>The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.</li>
</ul>
<p>Here is some example code:</p>
<p>[Update: The example holds without the ALL CAPS magic variable names, “HYPERPARAMETERS”. However, I include HYPERPARAMETERS because I am including the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can’t say that I have the best solution to this particular aspect of experimental control (hyperparameters). I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of more “clean” approaches.]</p>
<pre>def get_train_example():
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        for w in string.split(l):
            w = string.strip(w)
            id = None
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                prevwords = []

def get_train_minibatch():
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []
</pre>
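<p>For readers who want to run the nesting idea without the surrounding project, here is a minimal self-contained sketch written for Python 3. The names <tt>windows</tt>, <tt>minibatches</tt>, <tt>vocab</tt> and <tt>sentences</tt> are hypothetical stand-ins (a plain dict replaces <tt>wordmap</tt>, and function arguments replace the hyperparameters); it illustrates the idiom and is not the code I actually use.</p>

```python
def windows(sentences, vocab, window_size):
    """Inner generator: yield one window of word ids at a time."""
    for sentence in sentences:
        prev = []
        for word in sentence.split():
            if word in vocab:
                prev.append(vocab[word])
                if len(prev) >= window_size:
                    yield prev[-window_size:]
            else:
                # An unknown word breaks the context, so start over.
                prev = []


def minibatches(examples, batch_size):
    """Outer generator: group examples into full minibatches."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []


# Toy data, invented for this sketch ("dog" is deliberately out of vocabulary).
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
sentences = ["the cat sat on the mat", "the dog sat on the mat"]

batches = list(minibatches(windows(sentences, vocab, window_size=2), batch_size=2))
print(batches)
```

<p>As in the code above, a final partial minibatch is silently dropped. Whether that is acceptable depends on your training loop.</p>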
<h2>You can’t persist training state by pickling your generators</h2>
<p>However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, <a href="http://bugs.python.org/issue1092962">you can’t pickle generators in Python</a>. And it can be a bit of a <a href="http://en.wiktionary.org/wiki/pain_in_the_ass">PITA</a> to work around this, in order to save the training state.</p>
<h2>Pattern to work around this annoyance</h2>
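<p>The failure is easy to reproduce. The following sketch (Python 3; <tt>count_up</tt> is a hypothetical toy generator) advances a generator and then tries to pickle it. The exact error message varies between Python versions, so the sketch only relies on the exception type.</p>

```python
import pickle


def count_up(n):
    """Toy generator standing in for a stream of training examples."""
    for i in range(n):
        yield i


gen = count_up(3)
next(gen)  # advance the generator so it has some state worth saving

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError as err:
    # Generator objects are rejected by pickle.
    pickled_ok = False
    print("pickling failed:", err)
```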
<p>Following useful discussion on <a href="http://groups.google.com/group/pylearn-dev/browse_thread/thread/c4e4dd3496bbbf08">pylearn-dev</a> and stackoverflow <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator">[1]</a> <a href="http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator">[2]</a>, I propose the following pattern for converting generators to pickle-able class objects:</p>
<ol>
<li>Convert the generator to a class in which the generator code is the <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator/1942387#1942387">__iter__</a> method</li>
<li>Add <a href="http://docs.python.org/library/pickle.html#object.__getstate__">__getstate__</a> and <a href="http://docs.python.org/library/pickle.html#object.__setstate__">__setstate__</a> methods to the class, to handle pickling. Remember that you can’t pickle file objects. So __setstate__ will have to re-open files, as necessary.</li>
</ol>
<p>Here is the updated code, after applying this pattern:</p>
<pre>
class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0
        pass

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        self.count = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in string.split(l):
                w = string.strip(w)
                id = None
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        self.count += 1
                        yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
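        filename, count = state
        print >> sys.stderr, ("__setstate__(%s)..." % `state`)
        # Unpickling does not call __init__, so initialize the state variables here.
        self.filename = None
        self.count = 0
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        # Replay the stream until we are back at the saved position.
        iter = self.__iter__()
        while count != self.count:
            iter.next()
        if self.filename != filename:
            assert self.filename == HYPERPARAMETERS["TRAIN_SENTENCES"]
            print >> sys.stderr, ("self.filename %s != filename given to __setstate__ %s" % (self.filename, filename))
        print >> sys.stderr, ("...__setstate__(%s)" % `state`)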

class TrainingMinibatchStream(object):
    def __init__(self):
        pass

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        self.get_train_example = TrainingExampleStream()
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
</pre>
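<p>Here is a smaller, self-contained variation on the same pattern, written for Python 3 so that it can be run as-is. It is a sketch under my own assumptions, not the code above: <tt>LineStream</tt> is a hypothetical class, it streams raw lines instead of word windows, and it makes the object its own iterator so that the restored position lives in the object and iteration continues from where it stopped.</p>

```python
import os
import pickle
import tempfile


class LineStream(object):
    """Iterate over the lines of a file; can be pickled mid-stream."""

    def __init__(self, filename):
        self.filename = filename
        self.count = 0      # number of lines already returned
        self._file = None   # opened lazily; never pickled

    def __iter__(self):
        return self

    def __next__(self):
        if self._file is None:
            self._file = open(self.filename)
            # Fast-forward past the lines consumed before pickling.
            for _ in range(self.count):
                self._file.readline()
        line = self._file.readline()
        if not line:
            self._file.close()
            self._file = None
            raise StopIteration
        self.count += 1
        return line.rstrip("\n")

    def __getstate__(self):
        # File objects cannot be pickled, so save only the name and position.
        return {"filename": self.filename, "count": self.count}

    def __setstate__(self, state):
        self.filename = state["filename"]
        self.count = state["count"]
        self._file = None   # re-opened on the next call to __next__


# Hypothetical training file, created only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w") as f:
    f.write("a\nb\nc\nd\n")

stream = LineStream(path)
first_two = [next(stream), next(stream)]
restored = pickle.loads(pickle.dumps(stream))
rest = list(restored)
print(first_two, rest)
```

<p>Replaying <tt>count</tt> lines is the simplest way to restore the position, but it costs time proportional to how far training had progressed. Saving a byte offset and seeking to it might be cheaper for large files.</p>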

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Fast deserialization in Python</title>
		<link>http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/</link>
		<comments>http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 02:48:31 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[YAML]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=5</guid>
		<description><![CDATA[

All standard YMMV disclaimers apply.
Update (20090324–2): According to John Millikin, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin’s comment.
Summary:
For quickly deserializing data in Python, use [...]]]></description>
			<content:encoded><![CDATA[
<p><em>All standard YMMV disclaimers apply</em>.</p>
<p><b>Update (20090324–2):</b> According to <a href="http://news.ycombinator.com/item?id=529104">John Millikin</a>, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin’s comment.</p>
<h1>Summary:</h1>
<p><del>For quickly deserializing data in Python, use <a href="http://pypi.python.org/pypi/python-cjson">cjson</a>.</del><br />
simplejson is mysteriously slow on certain installations.</p>
<p><b>Update (20090324):</b> According to <a href="http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html" rel="nofollow">Extra Cheese</a>, cjson 1.0.5 has an incompatibility with simplejson in processing slashes. A fix is available from <a href="http://www.vazor.com/cjson.html" rel="nofollow">Matt Billenstein</a>. However, Dan Pascu, the author of cjson, deprecates Matt Billenstein’s cjson 1.0.6 because Matt’s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alternatives in certain circumstances. You will not find Matt’s cjson on the cheeseshop, only on Matt’s site.
</p>
<h1>Abstract:</h1>
<p>We were initially using simplejson for our work, because the <a href="http://json.org/">JSON</a> format is human-readable and because anecdotal evidence from the blogosphere touted simplejson’s new C speedups.  We observed that simplejson was actually quite slow on one of our installation environments. This observation prompted us to do this study.  We found that cjson consistently achieves the fastest deserialization performance.  We still do not understand why simplejson is slow in certain installation environments.</p>
<h1>Approach:</h1>
<p>We compared the following serialization approaches:</p>
<ul>
<li><a href="http://pypi.python.org/pypi/simplejson">simplejson</a> 2.0.9, with C speedups</li>
<li><a href="http://pypi.python.org/pypi/jsonlib/">jsonlib</a> 1.3.10</li>
<li><a href="http://pypi.python.org/pypi/python-cjson">cjson</a> 1.0.5</li>
<li><a href="http://pyyaml.org/wiki/PyYAML">PyYAML</a> 3.05 with <a href="http://pyyaml.org/wiki/LibYAML">libyaml</a> 0.1.1/0.1.2 C bindings. (We used 0.1.1 on dormeur and 0.1.2 on mammouth.)</li>
<li><a href="http://pyyaml.org/wiki/PySyck">PySyck</a> 0.61.2 with <a href="http://whytheluckystiff.net/syck/" class="broken_link">syck</a> 0.55 C bindings. Note that PySyck did not compile until we followed the advice in <a href="http://pyyaml.org/ticket/67">this ticket</a>.</li>
<li>Google <a href="http://code.google.com/p/protobuf/">protobuf</a> 2.0.3</li>
<li>Python <a href="http://docs.python.org/library/pickle.html">pickle</a>, protocol=-1 (binary)</li>
<li>Python pickle, protocol=0 (text)</li>
</ul>
<p>We have not tried the following serialization approaches:</p>
<ul>
<li>Python <a href="http://docs.python.org/library/marshal.html">marshal</a>, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.</li>
<li>Native Python, i.e. reading the repr() of the data as a module</li>
<li>XML implementations</li>
<li>Facebook <a href="http://incubator.apache.org/thrift/">thrift</a></li>
<li>Hand-coding C serialization</li>
</ul>
<h1>Experiments:</h1>
<h2>Data:</h2>
<p>We were working with a data structure we call the “vocabulary”. The vocabulary is a list of vocabulary terms. Each vocabulary term in turn contained a list of term forms. An example vocabulary term is as follows:</p>
<pre><code>{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}
</code></pre>
<p>We perform all our deserialization experiments on a vocabulary file that contained 502K fields, as computed using:</p>
<pre><code>zcat vocabulary.json.gz | grep ':' | wc -l
</code></pre>
<p>We use gzip on all serialized files, both when writing them and when reading them.  The size of the vocabulary in different serialization formats was as follows:</p>
<p><center></p>
<table>
<tr>
<td><b>Format</b></td>
<td><b>gzip’ed size</b></td>
</tr>
<tr>
<td>protobuf</td>
<td>1.7 MB</td>
</tr>
<tr>
<td>JSON</td>
<td>1.9 MB</td>
</tr>
<tr>
<td>pickle (protocol –1)</td>
<td>4.0 MB</td>
</tr>
<tr>
<td>pickle (protocol 0)</td>
<td>4.3 MB</td>
</tr>
</table>
<p></center></p>
<p>gzip’ed JSON uses only about 10% more disk space than the gzip’ed protobuf format, which is the most compact serialization format we tested.  JSON has the advantage of being human-readable, unlike protocol buffers.</p>
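<p>A size comparison of this kind can be sketched with the standard library alone. The snippet below is written for Python 3 and uses a hypothetical stand-in vocabulary built from a shortened version of the example term above. It covers only JSON and pickle, not protobuf, and its byte counts are not the figures in the table.</p>

```python
import copy
import gzip
import json
import pickle

term = {
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        {"form": "the proposed deletion", "count": 7153.333333333333},
        {"form": "the proposed deletions", "count": 13.666666666666666},
    ],
}

# Hypothetical stand-in for the real vocabulary: 1000 variations of one term.
vocabulary = []
for rank in range(1000):
    entry = copy.deepcopy(term)
    entry["rank"] = rank
    vocabulary.append(entry)

encoded = {
    "JSON": json.dumps(vocabulary).encode("utf-8"),
    "pickle (protocol -1)": pickle.dumps(vocabulary, protocol=-1),
    "pickle (protocol 0)": pickle.dumps(vocabulary, protocol=0),
}

# Compress each serialization in memory and compare the resulting sizes.
sizes = {name: len(gzip.compress(data)) for name, data in encoded.items()}
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print("%-22s %7d bytes gzip'ed" % (name, size))
```

<p>Because the stand-in data is highly repetitive, the ranking it prints may differ from the one we observed on the real vocabulary.</p>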
<h2>Setup:</h2>
<p>We tested on two different eight core x86-64 Linux installation environments.</p>
<p><center></p>
<table>
<tr>
<td><b>Name</b></td>
<td><b>Python version</b></td>
<td><b>CPU model name</b></td>
<td><b>OS version</b></td>
</tr>
<tr>
<td>dormeur</td>
<td>2.5</td>
<td>Intel® Core™2 Duo CPU     E8400  @ 3.00GHz</td>
<td>2.6.23.17–88.fc7</td>
</tr>
<tr>
<td>mammouth</td>
<td>2.6.1</td>
<td>Intel® Xeon® CPU           E5462  @ 2.80GHz</td>
<td>2.6.18–92.1.10.el5_lustre.1.6.6smp</td>
</tr>
</table>
<p></center></p>
<h2>Results:</h2>
<p>We read in the vocabulary using a particular deserialization approach.  We measure real time, as well as the combined user time and system time, using the Unix ‘time’ command.  For each experiment, we ran the deserialization of the vocabulary three times, and averaged the times over these three runs. Variance appeared to be low, but we did not compute it.  We present all times in seconds.  Some experiments were not performed on mammouth.</p>
<p>The first result line in the table, ‘read’, is when we read the vocabulary json.gz file into memory, but do not deserialize it. It provides a lower bound on the time any deserializer can take, i.e. an upper bound on deserializer performance.</p>
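<p>The measurement procedure can be approximated in-process as follows. This is a Python 3 sketch with assumptions of its own: it times with <tt>time.perf_counter</tt> instead of the Unix ‘time’ command (so it reports only wall-clock time and excludes interpreter startup), it uses the standard-library <tt>json</tt> module, and the file it reads is a small hypothetical one written by the sketch itself.</p>

```python
import gzip
import json
import os
import tempfile
import time


def average_seconds(func, repeats=3):
    """Average wall-clock time of func() over a few runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / repeats


# Hypothetical small vocabulary file, written only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "vocabulary.json.gz")
records = [{"form": "term %d" % i, "count": float(i)} for i in range(20000)]
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(records, f)


def read_only():
    """The 'read' baseline: decompress into memory, do not deserialize."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()


def read_and_parse():
    """Decompress and deserialize with the json module."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.loads(f.read())


baseline = average_seconds(read_only)
parsed = average_seconds(read_and_parse)
print("read          %.3fs" % baseline)
print("read + json   %.3fs" % parsed)
```

<p>Subtracting the baseline from a deserializer’s time gives a rough estimate of the parsing cost alone.</p>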
<p>The following table presents the results, sorted by real time on dormeur.</p>
<p><center></p>
<table cellpadding="2" border="1">
<tr>
<td><b>deserializer</b></td>
<td colspan="2" align="center"><b>dormeur</b></td>
<td colspan="2" align="center"><b>mammouth</b></td>
</tr>
<tr>
<td></td>
<td>real</td>
<td>user+sys</td>
<td>real</td>
<td>user+sys</td>
</tr>
<tr>
<td>read</td>
<td>0.76</td>
<td>0.24</td>
<td>0.18</td>
<td>0.18</td>
</tr>
<tr>
<td>cjson</td>
<td>2.17</td>
<td>1.04</td>
<td>0.93</td>
<td>0.91</td>
</tr>
<tr>
<td>jsonlib</td>
<td>7.88</td>
<td>6.59</td>
<td>3.77</td>
<td>3.77</td>
</tr>
<tr>
<td>cPickle (protocol –1)</td>
<td>13.3</td>
<td>9.9</td>
<td>10.2</td>
<td>10.2</td>
</tr>
<tr>
<td>PySyck</td>
<td>19.1</td>
<td>18.2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>simplejson</td>
<td>24.7</td>
<td>16.2</td>
<td>1.10</td>
<td>1.04</td>
</tr>
<tr>
<td>cPickle (protocol 0)</td>
<td>25.1</td>
<td>20.4</td>
<td>20.7</td>
<td>20.7</td>
</tr>
<tr>
<td>protobuf</td>
<td>42.3</td>
<td>32.4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>PyYAML</td>
<td>89.3</td>
<td>80.5</td>
<td>319</td>
<td>318</td>
</tr>
</table>
<p></center></p>
<p>Observe that simplejson is more than an order of magnitude slower than cjson on dormeur (24.7s vs. 2.17s real time), yet nearly as fast as cjson on mammouth (1.10s vs. 0.93s).</p>
<h1>Conclusions:</h1>
<p>gzip’ed JSON uses only about 10% more disk space than the most compact serialization format we tested (gzip’ed protocol buffers).  JSON has the advantage of being human-readable, unlike protocol buffers.</p>
<p>cjson has the fastest deserialization time of all packages we tested.  We have not measured serialization time in the experiments above, but we do so in the next section.</p>
<p>We did not realize that simplejson was far slower on one of our installs until we did speed tests.  simplejson should be avoided unless you specifically determine that it is comparable in speed to cjson.  On certain installs, simplejson deserialization is as fast as cjson.  On other installs, simplejson deserialization is an order of magnitude slower than cjson.  On “slow” installs, the user is led to believe that C speedups have been compiled into simplejson. Indeed, evidence indicates that our “slow” simplejson installation was, nonetheless, using C speedups:</p>
<pre><code>&gt;&gt;&gt; simplejson.decoder.make_scanner
&lt;type 'simplejson._speedups.Scanner'&gt;
&gt;&gt;&gt; simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True
</code></pre>
<p>The user might not detect that simplejson is slow without a direct speed comparison to cjson.</p>
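<p>A similar check can be sketched for the standard-library <tt>json</tt> module, which is derived from simplejson. The attribute names below (<tt>c_scanstring</tt>, <tt>c_make_scanner</tt>) are CPython implementation details rather than documented API, so this Python 3 sketch looks them up defensively and treats a missing attribute as "no C accelerator".</p>

```python
import json.decoder
import json.scanner

# In CPython these are None when the C extension could not be imported.
c_scanstring = getattr(json.decoder, "c_scanstring", None)
c_make_scanner = getattr(json.scanner, "c_make_scanner", None)

# The module-level names point at the C versions only when they are available.
uses_c_scanstring = (
    c_scanstring is not None
    and getattr(json.decoder, "scanstring", None) is c_scanstring
)
uses_c_scanner = (
    c_make_scanner is not None
    and getattr(json.scanner, "make_scanner", None) is c_make_scanner
)

print("C scanstring in use:", uses_c_scanstring)
print("C scanner in use:   ", uses_c_scanner)
```

<p>As our experience with simplejson shows, a positive result here does not guarantee good performance. Only a timing comparison can tell you that.</p>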
<p>protobuf is interesting because it requires one to declare the protocol schema. This is useful for documenting your data format. Unfortunately, the Python implementation of Google’s Protocol Buffers is very slow because it is <a href="http://news.ycombinator.com/item?id=498982">pure Python</a>.</p>
<p>Generating C++ Protocol Buffers and wrapping them with swig, as suggested by this <a href="http://news.ycombinator.com/item?id=499040">commentator</a>, might be faster than cjson.  Hand-coding C serialization routines is another option if one must eke out every last bit of speed.</p>
<h1>Related work:</h1>
<p><a href="http://bouncybouncy.net/ramblings/posts/json_vs_thrift_and_protocol_buffers_round_2/">This study</a> and <a href="http://gist.github.com/72412">this followup</a> provide supporting evidence that cjson is faster than alternatives. Neither of these studies experienced any simplejson slowness.</p>
<p>We used bouncybouncy’s <a href="http://bouncybouncy.net/ramblings/files/sertest2.tgz">sertest2 code</a>, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML.  We also modified their code to create 100K records.</p>
<p>Here is the output of sertest2 running on dormeur, which we have modified slightly for improved readability:</p>
<pre><code>100000 total records        (0.830s)

get_thrift                  (0.300s)
get_protobuf                (5.010s)

Serialize:
ser_cjson                   (0.270s) 6807019 bytes
ser_simplejson              (2.210s) 6807019 bytes
ser_yaml                    (31.590s) 6107019 bytes
ser_protobuf                (19.760s) 1716519 bytes

Serialize to a gzip'ed file:
ser_cjson_compressed        (0.520s) 1245257 bytes
ser_simplejson_compressed   (2.440s) 1245257 bytes
ser_protobuf_compressed     (19.920s) 980508 bytes
ser_yaml_compressed         (31.610s) 1205509 bytes

Deserialize:
serde_cjson                 (0.510s)
serde_simplejson            (12.370s)
serde_protobuf              (36.740s)
serde_yaml                  [slow, got tired of waiting for it]
</code></pre>
<p>bouncybouncy’s related study also compares with <a href="http://incubator.apache.org/thrift/">thrift</a>, which we do not use.  bouncybouncy finds that thrift is faster than protobuf but slower than cjson.  When we installed thrift (SVN revision 757299) on dormeur, sertest2 thrift routines crashed with the following traceback:</p>
<pre><code>Traceback (most recent call last):
  File "./test_speed.py", line 169, in &lt;module&gt;
    print 'serde_thrift        (%0.3fs)' % t(serde_thrift)[0]
  File "./test_speed.py", line 138, in t
    ret = f()
  File "./test_speed.py", line 108, in serde_thrift
    s = _ser_thrift()
  File "./test_speed.py", line 73, in _ser_thrift
    return thrift_to_bytes(ret)
  File "./test_speed.py", line 59, in thrift_to_bytes
    var.write(protocolOut)
  File "gen-py/passivedns/ttypes.py", line 146, in write
    iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'
</code></pre>
<p>The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.</p>

]]></content:encoded>
			<wfw:commentRss>http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>

