All standard YMMV disclaimers apply.
Update (20090324−2): According to John Millikin, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin’s comment.
Summary:
For quickly deserializing data in Python, use cjson.
simplejson is mysteriously slow on certain installations.
Update (20090324): According to Extra Cheese, cjson 1.0.5 has an incompatibility with simplejson in processing slashes. A fix is available from Matt Billenstein. However, Dan Pascu, the author of cjson, deprecates Matt Billenstein’s cjson 1.0.6 because Matt’s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alternatives in certain circumstances. You will not find Matt’s cjson on the cheeseshop, only on Matt’s site.
Abstract:
We were initially using simplejson for our work, because the JSON format is human-readable and because anecdotal evidence from the blogosphere touted simplejson’s new C speedups. We observed that simplejson was actually quite slow in one of our installation environments. This observation prompted us to do this study. We found that cjson consistently achieves the fastest deserialization performance. We still do not understand why simplejson is slow in certain installation environments.
Approach:
We compared the following serialization approaches:
- simplejson 2.0.9, with C speedups
- jsonlib 1.3.10
- cjson 1.0.5
- PyYAML 3.05 with libyaml 0.1.1/0.1.2 C bindings. (We used 0.1.1 on dormeur and 0.1.2 on mammouth.)
- PySyck 0.61.2 with syck 0.55 C bindings. Note that PySyck did not compile until we followed the advice in this ticket.
- Google protobuf 2.0.3
- Python pickle, protocol=-1 (binary)
- Python pickle, protocol=0 (text)
We have not tried the following serialization approaches:
- Python marshal, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.
- Native Python, i.e. reading the repr() of the data as a module
- XML implementations
- Facebook thrift
- Hand-coding C serialization
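For reference, the marshal option in the list above has the same dumps/loads shape as pickle, so it would slot into a comparison like this one with little effort. The sketch below only demonstrates a round trip and prints the marshal format version; it makes no speed claim, and the sample record is made up for illustration.

```python
import marshal
import pickle

# Hypothetical record, loosely shaped like one of our term forms.
record = {"form": "the proposed deletion", "count": 7153.333333333333, "rank": 3590}

marshal_bytes = marshal.dumps(record)
pickle_bytes = pickle.dumps(record, protocol=-1)

# marshal.version identifies the format, which may change between Python versions.
print("marshal format version:", marshal.version)
print("marshal round trip ok:", marshal.loads(marshal_bytes) == record)
print("pickle round trip ok: ", pickle.loads(pickle_bytes) == record)
```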
Experiments:
Data:
We were working with a data structure we call the “vocabulary”. The vocabulary is a list of vocabulary terms. Each vocabulary term in turn contained a list of term forms. An example vocabulary term is as follows:
{
"term class": "the propos delet",
"canonical form": "the proposed deletion",
"rank": 3590,
"count": 7180.0,
"term forms": [
{ "form": "the proposed deletion", "count": 7153.333333333333 },
{ "form": "the proposed deletions", "count": 13.666666666666666 },
{ "form": "The proposed deletion", "count": 12.0 },
{ "form": "the proposed deletes", "count": 1.0 }
]
}
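To make the structure concrete, here is a minimal sketch that parses the example term above and walks its term forms. It uses Python's standard-library json module purely for illustration (an assumption; the experiments used the third-party libraries listed earlier). In this one example the form counts add up to the term's count; the sketch does not assume that holds for every term.

```python
import json
import math

# The example vocabulary term from above, as a JSON string.
TERM_JSON = """
{
  "term class": "the propos delet",
  "canonical form": "the proposed deletion",
  "rank": 3590,
  "count": 7180.0,
  "term forms": [
    {"form": "the proposed deletion", "count": 7153.333333333333},
    {"form": "the proposed deletions", "count": 13.666666666666666},
    {"form": "The proposed deletion", "count": 12.0},
    {"form": "the proposed deletes", "count": 1.0}
  ]
}
"""

term = json.loads(TERM_JSON)
forms = term["term forms"]
forms_total = sum(form["count"] for form in forms)

print(term["canonical form"], "has", len(forms), "forms")
# Compare with a tolerance, since the counts are floating-point values.
print("counts agree:", math.isclose(forms_total, term["count"]))
```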
We performed all our deserialization experiments on a vocabulary file containing 502K fields, as computed using:
zcat vocabulary.json.gz | grep ':' | wc -l
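A rough Python equivalent of that pipeline is sketched below. Like grep ':' piped to wc -l, it counts lines that contain a colon, so it is only a proxy for the number of fields and presumably relies on the file being written one field per line. The function name count_colon_lines and the throwaway demo file are illustrative assumptions.

```python
import gzip
import os
import tempfile

def count_colon_lines(path):
    """Count the lines containing ':' in a gzip'ed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if ":" in line)

# Tiny demonstration on a throwaway file (not the real vocabulary).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('{\n"rank": 3590,\n"count": 7180.0,\n"term forms": [\n]\n}\n')

demo_count = count_colon_lines(demo_path)
print(demo_count)
```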
We use gzip on all serialized files, both when writing them and when reading them. The size of the vocabulary in different serialization formats was as follows:
| Format | gzip’ed size |
| protobuf | 1.7 MB |
| JSON | 1.9 MB |
| pickle (protocol -1) | 4.0 MB |
| pickle (protocol 0) | 4.3 MB |
gzip’ed JSON uses only roughly 10% more disk space than gzip’ed protobuf (1.9 MB vs. 1.7 MB), which is the most compact serialization format we tested. JSON has the advantage of being human-readable, unlike protocol buffers.
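As a rough illustration of how such sizes can be measured, the sketch below serializes a small made-up vocabulary to JSON and to both pickle protocols, gzips each result, and reports the compressed byte counts. It uses the standard-library json and pickle modules as stand-ins (an assumption), leaves out protobuf, and will not reproduce the ratios in the table. Note that protocol=-1 selects the highest binary protocol available, which differs between Python versions.

```python
import gzip
import json
import pickle

# Made-up records shaped like term forms; not the real vocabulary.
vocabulary = [
    {"form": "term %d" % i, "count": i / 3.0, "rank": i}
    for i in range(5000)
]

def gzipped_size(payload):
    """Return the size in bytes of payload after gzip compression."""
    return len(gzip.compress(payload))

sizes = {
    "JSON": gzipped_size(json.dumps(vocabulary).encode("utf-8")),
    "pickle (protocol -1)": gzipped_size(pickle.dumps(vocabulary, protocol=-1)),
    "pickle (protocol 0)": gzipped_size(pickle.dumps(vocabulary, protocol=0)),
}

for name, size in sizes.items():
    print("%-22s %8d bytes" % (name, size))
```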
Setup:
We tested on two different eight-core x86-64 Linux installation environments.
| Name | Python version | CPU model name | OS version |
| dormeur | 2.5 | Intel® Core™2 Duo CPU E8400 @ 3.00GHz | 2.6.23.17-88.fc7 |
| mammouth | 2.6.1 | Intel® Xeon® CPU E5462 @ 2.80GHz | 2.6.18-92.1.10.el5_lustre.1.6.6smp |
Results:
We read in the vocabulary using a particular deserialization approach. We measured real time, as well as the combined user time and system time, using the Unix ‘time’ command. For each experiment, we ran the deserialization of the vocabulary three times and averaged the times over these three runs. Variance appeared to be low, but we did not compute it. We present all times in seconds. Some experiments were not performed on mammouth, so those cells are blank.
The first result line in the table, ‘read’, is the time to read the vocabulary json.gz file into memory without deserializing it. It provides an upper bound on the performance of any deserializer, that is, a lower bound on its running time.
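The measurement procedure can be approximated in-process, as in the sketch below. This is not the method used for the numbers in this post, which came from the Unix ‘time’ command on whole runs. The sketch substitutes time.perf_counter for real time and os.times for user+sys time, and uses the standard-library json module as a placeholder deserializer. The helper name average_times and the small in-memory payload are illustrative assumptions.

```python
import gzip
import json
import os
import time

def average_times(func, runs=3):
    """Run func `runs` times; return (mean real, mean user+sys) in seconds."""
    real_total = 0.0
    cpu_total = 0.0
    for _ in range(runs):
        cpu_before = os.times()
        start = time.perf_counter()
        func()
        real_total += time.perf_counter() - start
        cpu_after = os.times()
        cpu_total += ((cpu_after.user - cpu_before.user)
                      + (cpu_after.system - cpu_before.system))
    return real_total / runs, cpu_total / runs

# Stand-in for vocabulary.json.gz, built in memory from made-up records.
vocabulary = [{"form": "term %d" % i, "count": float(i)} for i in range(10000)]
blob = gzip.compress(json.dumps(vocabulary).encode("utf-8"))

def read_only():
    # Baseline: decompress into memory without deserializing.
    return gzip.decompress(blob)

def read_and_deserialize():
    return json.loads(gzip.decompress(blob).decode("utf-8"))

for name, func in [("read", read_only), ("json", read_and_deserialize)]:
    real, cpu = average_times(func)
    print("%-6s real=%.4fs user+sys=%.4fs" % (name, real, cpu))
```

Swapping a different loads function into read_and_deserialize is enough to compare another library under the same harness.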
The following table presents the results, sorted by real time on dormeur.
| deserializer | dormeur real | dormeur user+sys | mammouth real | mammouth user+sys |
| read | 0.76 | 0.24 | 0.18 | 0.18 |
| cjson | 2.17 | 1.04 | 0.93 | 0.91 |
| jsonlib | 7.88 | 6.59 | 3.77 | 3.77 |
| cPickle (protocol -1) | 13.3 | 9.9 | 10.2 | 10.2 |
| PySyck | 19.1 | 18.2 | | |
| simplejson | 24.7 | 16.2 | 1.10 | 1.04 |
| cPickle (protocol 0) | 25.1 | 20.4 | 20.7 | 20.7 |
| protobuf | 42.3 | 32.4 | | |
| PyYAML | 89.3 | 80.5 | 319 | 318 |
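Observe that simplejson is more than an order of magnitude slower than cjson on dormeur (24.7 s vs. 2.17 s real time), whereas on mammouth the two are nearly equal (1.10 s vs. 0.93 s).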
Conclusions:
gzip’ed JSON uses only roughly 10% more disk space than the most compact serialization format we tested (gzip’ed protocol buffers). JSON has the advantage of being human-readable, unlike protocol buffers.
cjson has the fastest deserialization time of all packages we tested. We have not measured serialization time in the experiments above, but we do so in the next section.
We did not realize that simplejson was far slower on one of our installs until we did speed tests. simplejson should be avoided unless you specifically determine that it is comparable in speed to cjson. On certain installs, simplejson deserialization is as fast as cjson. On other installs, simplejson deserialization is an order of magnitude slower than cjson. On “slow” installs, the user is led to believe that C speedups have been compiled into simplejson. Indeed, evidence indicates that our “slow” simplejson installation was, nonetheless, using C speedups:
>>> simplejson.decoder.make_scanner
<type 'simplejson._speedups.Scanner'>
>>> simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True
The user might not detect that simplejson is slow without a direct speed comparison to cjson.
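Such a direct comparison is easy to script. The sketch below uses the standard-library json module, which descends from simplejson and keeps similarly named internals (c_scanstring, py_scanstring, py_make_scanner). These are CPython implementation details rather than documented API, so treat this as an assumption-laden illustration and not as a diagnostic for simplejson 2.0.9 itself. It repeats the style of identity check shown above, then times the default decoder against a decoder forced onto the pure-Python code path.

```python
import json
import json.decoder
import json.scanner
import time

# Same style of check as above: is the C string scanner in use?
c_active = (json.decoder.c_scanstring is not None
            and json.decoder.scanstring is json.decoder.c_scanstring)
print("C speedups active:", c_active)

# Build a decoder that is forced onto the pure-Python code path.
# parse_string must be set before the scanner is rebuilt, because
# py_make_scanner captures the decoder's attributes when it is called.
py_decoder = json.JSONDecoder()
py_decoder.parse_string = json.decoder.py_scanstring
py_decoder.scan_once = json.scanner.py_make_scanner(py_decoder)

# Made-up document; not the real vocabulary.
doc = json.dumps([{"form": "term %d" % i, "count": i / 3.0} for i in range(20000)])

def timed(decode):
    """Return (elapsed seconds, decoded result) for one call of decode(doc)."""
    start = time.perf_counter()
    result = decode(doc)
    return time.perf_counter() - start, result

default_secs, default_result = timed(json.loads)
python_secs, python_result = timed(py_decoder.decode)
print("default decoder: %.3fs" % default_secs)
print("pure Python:     %.3fs" % python_secs)
```

If the two timings are close on an install where the identity check reports C speedups, that would be the kind of symptom we saw with simplejson on dormeur.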
protobuf is interesting because it requires one to declare the protocol schema. This is useful for documenting your data format. Unfortunately, the Python implementation of Google’s Protocol Buffers is very slow because it is pure Python.
Generating C++ Protocol Buffers and wrapping them with swig, as suggested by this commentator, might be faster than cjson. Hand-coding C serialization routines is another option if one must eke out every last bit of speed.
Related work:
This study and this followup provide supporting evidence that cjson is faster than alternatives. Neither of these studies experienced any simplejson slowness.
We used bouncybouncy’s sertest2 code, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML. We also modified their code to create 100K records.
Here is the output of sertest2 running on dormeur, which we have modified slightly for improved readability:
100000 total records (0.830s)
get_thrift (0.300s)
get_protobuf (5.010s)
Serialize:
ser_cjson (0.270s) 6807019 bytes
ser_simplejson (2.210s) 6807019 bytes
ser_yaml (31.590s) 6107019 bytes
ser_protobuf (19.760s) 1716519 bytes
Serialize to a gzip'ed file:
ser_cjson_compressed (0.520s) 1245257 bytes
ser_simplejson_compressed (2.440s) 1245257 bytes
ser_protobuf_compressed (19.920s) 980508 bytes
ser_yaml_compressed (31.610s) 1205509 bytes
Deserialize:
serde_cjson (0.510s)
serde_simplejson (12.370s)
serde_protobuf (36.740s)
serde_yaml [slow, got tired of waiting for it]
bouncybouncy’s related study also compares with thrift, which we do not use. bouncybouncy finds that thrift is faster than protobuf but slower than cjson. When we installed thrift (SVN revision 757299) on dormeur, sertest2 thrift routines crashed with the following traceback:
Traceback (most recent call last):
File "./test_speed.py", line 169, in <module>
print 'serde_thrift (%0.3fs)' % t(serde_thrift)[0]
File "./test_speed.py", line 138, in t
ret = f()
File "./test_speed.py", line 108, in serde_thrift
s = _ser_thrift()
File "./test_speed.py", line 73, in _ser_thrift
return thrift_to_bytes(ret)
File "./test_speed.py", line 59, in thrift_to_bytes
var.write(protocolOut)
File "gen-py/passivedns/ttypes.py", line 146, in write
iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'
The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.