Fast deserialization in Python

All stan­dard YMMV dis­claimers apply.

Update (20090324−2): Accord­ing to John Mil­likin, the author of json­lib, cjson is buggy and unmain­tained. I will eval­u­ate fur­ther and post a fol­lowup blog entry. My dis­cus­sion with Dan Pascu, the author of cjson, cor­rob­o­rates these claims. I urge read­ers to read John Millikin’s comment.

Sum­mary:

For quickly dese­ri­al­iz­ing data in Python, use cjson.
sim­ple­j­son is mys­te­ri­ously slow on cer­tain installations.

Update (20090324): Accord­ing to Extra Cheese, cjson 1.0.5 has an incom­pat­i­bil­ity with sim­ple­j­son in pro­cess­ing slashes. A fix is avail­able from Matt Bil­len­stein. How­ever, Dan Pascu, the author of cjson, dep­re­cates Matt Billenstein’s cjson 1.0.6 because Matt’s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alter­na­tives in cer­tain cir­cum­stances. You will not find Matt’s cjson on the cheese­shop, only on Matt’s site.

Abstract:

We were initially using simplejson for our work, because the JSON format is human-readable and because anecdotal evidence from the blogosphere touted simplejson’s new C speedups. We observed that simplejson was actually quite slow on one of our installation environments. This observation prompted us to do this study. We found that cjson consistently achieves the fastest deserialization performance. We still do not understand why simplejson is slow in certain installation environments.

Approach:

We com­pared the fol­low­ing seri­al­iza­tion approaches:

We have not tried the fol­low­ing seri­al­iza­tion approaches:

  • Python marshal, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.
  • Native Python, i.e. read­ing the repr() of the data as a module
  • XML imple­men­ta­tions
  • Face­book thrift
  • Hand-coding C serialization
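For readers who want to try the first untested option, here is a minimal sketch of a marshal round trip using only the standard library. It was not part of this study and we report no timings for it; the sample term is abbreviated from the example later in this post.

```python
import marshal

# Abbreviated sample term (illustrative data only).
term = {
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [{"form": "the proposed deletion", "count": 7153.333333333333}],
}

# marshal handles the same basic types as this data: dict, list, str, int, float.
blob = marshal.dumps(term)
restored = marshal.loads(blob)

print(restored == term)
# The format version is tied to the interpreter, which is why files written
# by one Python version are not guaranteed to load in another.
print("marshal format version:", marshal.version)
```

Because of that version coupling, marshal is probably best limited to caches that can be regenerated, rather than long-lived data files.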

Exper­i­ments:

Data:

We were work­ing with a data struc­ture we call the “vocab­u­lary”. The vocab­u­lary is a list of vocab­u­lary terms. Each vocab­u­lary term in turn con­tained a list of term forms. An exam­ple vocab­u­lary term is as follows:

{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}
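As a minimal sketch of what this structure looks like once deserialized, the snippet below loads the example term with Python's standard json module and walks its term forms. The json module is a stand-in here, not one of the libraries timed in this study. The check that the form counts add up to the term count is an observation about this one example, not a documented invariant of the data.

```python
import json
import math

# The example vocabulary term from above, as JSON text.
TERM_JSON = """
{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}
"""

term = json.loads(TERM_JSON)

# Each term is a dict; "term forms" is a list of dicts.
form_total = sum(form["count"] for form in term["term forms"])

print(term["canonical form"], term["rank"])
print(len(term["term forms"]), "forms")
# Compare with a tolerance, since the counts are floats.
print(math.isclose(form_total, term["count"]))
```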

We performed all our deserialization experiments on a vocabulary file containing 502K fields, as counted using:

zcat vocabulary.json.gz | grep ':' | wc -l
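The same count can be sketched in Python for systems without zcat. The function name count_fields and the demo file are hypothetical. Like the shell pipeline, it counts lines that contain a colon, so it is only a field count when the JSON is pretty-printed with one field per line; a line holding two fields counts once.

```python
import gzip
import os
import tempfile

def count_fields(path):
    """Count lines containing ':' in a gzip'ed text file.

    Mirrors `zcat FILE | grep ':' | wc -l`.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if ":" in line)

# Demo on a tiny throwaway file (illustrative data, not our vocabulary).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write('{\n    "rank": 3590,\n    "count": 7180.0\n}\n')
    demo_count = count_fields(demo_path)

print(demo_count)
```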

We use gzip on all seri­al­ized files, both when writ­ing them and when read­ing them. The size of the vocab­u­lary in dif­fer­ent seri­al­iza­tion for­mats was as follows:

Format                  gzip’ed size
protobuf                1.7 MB
JSON                    1.9 MB
pickle (protocol -1)    4.0 MB
pickle (protocol 0)     4.3 MB

gzip’ed JSON uses only about 10% more disk space than gzip’ed protobuf, which is the most compact serialization format we tested. JSON has the advantage of being human-readable, unlike protocol buffers.
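A size comparison of this kind can be reproduced along the following lines. This is a sketch using only the standard library, so protobuf is left out, and the records are synthetic stand-ins for our vocabulary. The byte counts it prints will not match the table above.

```python
import gzip
import json
import pickle

def gzipped_size(payload):
    """Return the size in bytes of `payload` after gzip compression."""
    return len(gzip.compress(payload))

# Hypothetical stand-in data: 1000 small vocabulary-like terms.
vocabulary = [
    {
        "canonical form": "term %d" % i,
        "rank": i,
        "count": float(i),
        "term forms": [{"form": "term %d" % i, "count": float(i)}],
    }
    for i in range(1000)
]

sizes = {
    "JSON": gzipped_size(json.dumps(vocabulary).encode("utf-8")),
    # protocol=-1 selects the highest pickle protocol available.
    "pickle (protocol -1)": gzipped_size(pickle.dumps(vocabulary, protocol=-1)),
    "pickle (protocol 0)": gzipped_size(pickle.dumps(vocabulary, protocol=0)),
}

# Print smallest first, like the table above.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print("%-22s %d bytes" % (name, size))
```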

Setup:

We tested on two dif­fer­ent eight core x86-64 Linux instal­la­tion environments.

Name        Python version    CPU model name                           OS version
dormeur     2.5               Intel® Core™2 Duo CPU E8400 @ 3.00GHz    2.6.23.17-88.fc7
mammouth    2.6.1             Intel® Xeon® CPU E5462 @ 2.80GHz         2.6.18-92.1.10.el5_lustre.1.6.6smp

Results:

We read in the vocab­u­lary using a par­tic­u­lar dese­ri­al­iza­tion approach. We mea­sure real time, as well as the com­bined user time and sys­tem time, using the Unix ‘time’ com­mand. For each exper­i­ment, we ran the dese­ri­al­iza­tion of the vocab­u­lary three times, and aver­aged the times over these three runs. Vari­ance appeared to be low, but we did not com­pute it. We present all times in sec­onds. Some exper­i­ments were not per­formed on mammouth.

The first result line in the table, ‘read’, is when we read the vocab­u­lary json.gz file into mem­ory, but do not dese­ri­al­ize it. It pro­vides an upper-bound on the per­for­mance of the deserializer.
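The measurement procedure can be approximated in-process as sketched below. This is not the harness used for the numbers in this post, which came from the Unix ‘time’ command and therefore include interpreter startup. The helper names (time_call, read_only, read_and_decode) are hypothetical, the data is synthetic, and the standard json module stands in for the deserializers under test.

```python
import gzip
import json
import os
import tempfile
import time

def time_call(func, runs=3):
    """Run func() `runs` times; return (mean real, mean user+sys) in seconds."""
    real = cpu = 0.0
    for _ in range(runs):
        t0, c0 = time.perf_counter(), os.times()
        func()
        t1, c1 = time.perf_counter(), os.times()
        real += t1 - t0
        cpu += (c1.user - c0.user) + (c1.system - c0.system)
    return real / runs, cpu / runs

def read_only(path):
    """Baseline: decompress the file into memory without deserializing."""
    with gzip.open(path, "rb") as f:
        return f.read()

def read_and_decode(path):
    """Decompress, then deserialize with the stand-in JSON decoder."""
    return json.loads(read_only(path).decode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocabulary.json.gz")
    data = [{"rank": i, "count": float(i)} for i in range(10000)]
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f)
    baseline = time_call(lambda: read_only(path))
    decoded = time_call(lambda: read_and_decode(path))

print("read        real=%.4f user+sys=%.4f" % baseline)
print("read+decode real=%.4f user+sys=%.4f" % decoded)
```

Swapping another library's decode function into read_and_decode gives a like-for-like comparison against the same ‘read’ baseline.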

The fol­low­ing table presents the results, sorted by real time on dormeur.

deserializer             dormeur             mammouth
                         real    user+sys    real    user+sys
read                     0.76    0.24        0.18    0.18
cjson                    2.17    1.04        0.93    0.91
jsonlib                  7.88    6.59        3.77    3.77
cPickle (protocol -1)    13.3    9.9         10.2    10.2
PySyck                   19.1    18.2        -       -
simplejson               24.7    16.2        1.10    1.04
cPickle (protocol 0)     25.1    20.4        20.7    20.7
protobuf                 42.3    32.4        -       -
PyYAML                   89.3    80.5        319     318

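Observe that, on dormeur, simplejson is more than an order of magnitude slower than cjson (24.7 s vs. 2.17 s real time), whereas on mammouth the two are nearly equal (1.10 s vs. 0.93 s).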

Con­clu­sions:

gzip’ed JSON uses only about 10% more disk space than the most compact serialization format we tested (gzip’ed protocol buffers). JSON has the advantage of being human-readable, unlike protocol buffers.

cjson has the fastest dese­ri­al­iza­tion time of all pack­ages we tested. We have not mea­sured seri­al­iza­tion time in the exper­i­ments above, but we do so in the next section.

We did not real­ize that sim­ple­j­son was far slower on one of our installs until we did speed tests. sim­ple­j­son should be avoided unless you specif­i­cally deter­mine that it is com­pa­ra­ble in speed to cjson. On cer­tain installs, sim­ple­j­son dese­ri­al­iza­tion is as fast as cjson. On other installs, sim­ple­j­son dese­ri­al­iza­tion is an order of mag­ni­tude slower than cjson. On “slow” installs, the user is led to believe that C speedups have been com­piled into sim­ple­j­son. Indeed, evi­dence indi­cates that our “slow” sim­ple­j­son instal­la­tion was, nonethe­less, using C speedups:

>>> simplejson.decoder.make_scanner
<type 'simplejson._speedups.Scanner'>
>>> simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True

The user might not detect that simplejson is slow without a direct speed comparison to cjson.
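The standard library's json module, which derives from simplejson, can be checked in a similar way. The sketch below assumes CPython's json package layout, where json.decoder and json.scanner expose c_scanstring and c_make_scanner (set to None when the C accelerator is missing). The function name c_speedups_in_use is hypothetical.

```python
import json.decoder
import json.scanner

def c_speedups_in_use():
    """Report whether the stdlib json module is using its C accelerator.

    Each value is True only when the active function is the C version;
    if the accelerator failed to import, the c_* name is None and the
    pure-Python fallback is active instead.
    """
    return {
        "scanstring": json.decoder.scanstring is json.decoder.c_scanstring,
        "make_scanner": json.scanner.make_scanner is json.scanner.c_make_scanner,
    }

print(c_speedups_in_use())
```

As our experience on dormeur shows, a check like this only confirms that the C code is loaded. It says nothing about how fast it runs, so a timing comparison is still needed.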

pro­to­buf is inter­est­ing because it requires one to declare the pro­to­col schema. This is use­ful for doc­u­ment­ing your data for­mat. Unfor­tu­nately, the Python imple­men­ta­tion of Google’s Pro­to­col Buffers is very slow because it is pure Python.

Gen­er­at­ing C++ Pro­to­col Buffers and wrap­ping them with swig, as sug­gested by this com­men­ta­tor, might be faster than cjson. Hand-coding C seri­al­iza­tion rou­tines is another option if one must eke out every last bit of speed.

Related work:

This study and this fol­lowup pro­vide sup­port­ing evi­dence that cjson is faster than alter­na­tives. Nei­ther of these stud­ies expe­ri­enced any sim­ple­j­son slowness.

We used bouncybouncy’s sertest2 code, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML. We also modified the code to create 100K records.

Here is the out­put of sertest2 run­ning on dormeur, which we have mod­i­fied slightly for improved readability:

100000 total records        (0.830s)

get_thrift                  (0.300s)
get_protobuf                (5.010s)

Serialize:
ser_cjson                   (0.270s) 6807019 bytes
ser_simplejson              (2.210s) 6807019 bytes
ser_yaml                    (31.590s) 6107019 bytes
ser_protobuf                (19.760s) 1716519 bytes

Serialize to a gzip'ed file:
ser_cjson_compressed        (0.520s) 1245257 bytes
ser_simplejson_compressed   (2.440s) 1245257 bytes
ser_protobuf_compressed     (19.920s) 980508 bytes
ser_yaml_compressed         (31.610s) 1205509 bytes

Deserialize:
serde_cjson                 (0.510s)
serde_simplejson            (12.370s)
serde_protobuf              (36.740s)
serde_yaml                  [slow, got tired of waiting for it]

bouncybouncy’s related study also com­pares with thrift, which we do not use. boun­cy­bouncy finds that thrift is faster than pro­to­buf but slower than cjson. When we installed thrift (SVN revi­sion 757299) on dormeur, sertest2 thrift rou­tines crashed with the fol­low­ing traceback:

Traceback (most recent call last):
  File "./test_speed.py", line 169, in <module>
    print 'serde_thrift        (%0.3fs)' % t(serde_thrift)[0]
  File "./test_speed.py", line 138, in t
    ret = f()
  File "./test_speed.py", line 108, in serde_thrift
    s = _ser_thrift()
  File "./test_speed.py", line 73, in _ser_thrift
    return thrift_to_bytes(ret)
  File "./test_speed.py", line 59, in thrift_to_bytes
    var.write(protocolOut)
  File "gen-py/passivedns/ttypes.py", line 146, in write
    iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'

The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.

  • http://joseph.turian.com Joseph Turian

    This red­dit thread has some good dis­cus­sion of an ear­lier study.

    This author points out that thrift as a net­work pro­to­col is much faster than JSON over HTTP.

    haber­man points out that he is writ­ing C bind­ings for Python protobuf.

  • http://joseph.turian.com Joseph Turian

    An older bench­mark, show­ing that mar­shal might be the fastest.

  • http://joseph.turian.com Joseph Turian

    Accord­ing to Extra Cheese, cjson has an incom­pat­i­bil­ity with sim­ple­j­son in pro­cess­ing slashes. A fix is avail­able from vazor.

  • http://jasper.es/ Jasper Spaans

    Check if the slower sim­ple­j­son install does some­thing with locales? I’ve seen grep go really slow when try­ing to do utf-8 stuff, which dis­ap­peared after set­ting LANG=C / LC_ALL=C…

  • http://www.bouncybouncy.net/ Justin

    Nice writeup :-) Good to see that you get the same results on a more com­pli­cated data structure.

    I still have high hopes for pro­to­buf: it can get faster, but json can’t get any smaller. At some point pro­to­buf will be both the fastest and most com­pact method.

  • http://joseph.turian.com Joseph Turian

    I am excited for a faster pro­to­buf. In par­tic­u­lar, haberman’s C exten­sions look promising.

    Com­pact­ness is very impor­tant for trans­fer­ring data over a net­work.
    How­ever, dur­ing the devel­op­ment cycle, human read­abil­ity is impor­tant and often over­looked. If all you need to do to read your data is type ‘zcat’, you are much more likely to be look­ing at your data, and hence more likely to catch bugs.

  • John Mil­likin

    (repost­ing a com­ment from Hacker News, at Joseph Turian’s request)

    I’m the author of json­lib, and I reg­is­tered specif­i­cally to post this mes­sage. Please, please, please do not use cjson!

    First, it is unmain­tained. The lat­est ver­sion avail­able was posted on August 24, 2007. When you encounter one of its myr­iad bugs, you’ll either have to patch it your­self or pick another JSON library. Just skip the inter­me­di­ate step and use another library to begin with.

    Sec­ond, it is buggy. In some cases, pars­ing text it just gen­er­ated will return a dif­fer­ent value from what you passed in! It’s almost entirely igno­rant of Uni­code, and what lit­tle it tries to parse it gets wrong.

    Third, it’s exceed­ingly non-compliant. The text it parses and gen­er­ates bears only a pass­ing resem­blance to JSON. There are vary­ing degrees of con­for­mance to the spec between libraries, based on per­sonal pref­er­ence of the authors — I pre­fer strict con­for­mance, oth­ers less strict — but cjson is so dif­fer­ent as to be sim­ply unusable.

    Yes, it’s fast. I know. I wrote json­lib partly because I was unsat­is­fied with simplejson’s per­for­mance, and one goal (never truly achieved) was always to sur­pass cjson. How­ever, speed isn’t every­thing. As the say­ing goes, “if I want my math per­formed fast and wrong I’ll ask my cat”.

    In my opin­ion, the only Python JSON libraries worth con­sid­er­ing are:

    * sim­ple­j­son — it’s in the stan­dard library, and should there­fore be con­sid­ered first and most thoroughly.

    * json­lib — it’s fast, well-tested, and standards-compliant.

    * demj­son — has sev­eral options for reli­able pars­ing of invalid input.

    Last time I checked, json­lib and simplejson’s C exten­sions are neck-and-neck performance-wise. In some quick, unsci­en­tific tests, json­lib reads faster and sim­ple­j­son writes faster. How­ever, simplejson’s exten­sions are only used for cer­tain sub­sets of input — if you want to use an uncom­mon fea­ture, per­for­mance will degrade. json­lib has an imple­men­ta­tion in pure C, which avoids this prob­lem at the cost of complexity.

    Apolo­gies for the brain-dump, but even if you skip right over it, please remem­ber: don’t use cjson.

  • Nir

    Seems that Bob Ippolito fixed sim­ple­j­son slow­ness.
    Retry with lat­est version.

  • http://twitter.com/aigarius Aigars Mahi­novs

    Please try creating a custom reader/writer in Python (if you don’t want to bother with C). If your data structure is so limited and is not recursive, then you should be able to very easily express it in a simple comma separated value line (one line per vocabulary term).

    It could look like this:

    the pro­pos delet,the pro­posed deletion,3590,7180.0,the pro­posed deletion,7153.333333333333,the pro­posed deletions,13.666666666666666,The pro­posed deletion,12.0,the pro­posed deletes,1.0

    And that is it — this will be the most com­pact stor­age data for­mat, because all the repeated data, that describes the struc­ture of the dicts and lists inside a term will be con­tained in the code that will parse this. I believe that this might be faster than json read function.

  • http://metaoptimize.com Joseph Turian

    Aigars, the ques­tion is not which is the most com­pact data set, but which is the fastest to read in (dese­ri­al­ize). Text pro­cess­ing with native Python tends to be much slower than using C, so I would be sur­prised if your pro­posal is faster than a JSON library with C imple­men­ta­tion. How­ever, I encour­age you to post bench­marks that prove me wrong!

  • http://twitter.com/ricardobarroso/status/33992166548180992 Ricardo Bar­roso

    Fast Dese­ri­al­iza­tion in Python (Per­for­mance Com­par­i­son): http://bit.ly/hpMVs1
    #python #Web­Dev #JSON #XML (RT @turian)
