Speeding up JSON encoding in PyPy
Hi
Recently I spent a bit of effort speeding up JSON in PyPy. I started by writing a benchmark, which is admittedly not a very good one, but it's better than nothing (suggestions on how to improve it are welcome!).
For this particular benchmark, the numbers are as follows. Note that CPython by default uses the optimized C extension, while PyPy uses the pure-Python one. PyPy trunk contains another pure-Python version, which has been optimized specifically for the PyPy JIT. The detailed optimizations are described later in this post.
The number reported is the time taken for the third run, when things are warmed up. Full session here.
CPython 2.6                | 22s
CPython 2.7                | 3.7s
CPython 2.7 no C extension | 44s
PyPy 1.5                   | 34s
PyPy 1.6                   | 22s
PyPy trunk                 | 3.3s
Lessons learned:
Expectations are high
A lot of performance-critical stuff in the Python world is already written in hand-optimized C. Writing C (especially when you interface with the CPython C API) is ugly and takes significant effort, and this approach does not scale well when there is a lot of code to be written or when there is very tight coupling between the part to be rewritten and the rest of the code. Still, people expect PyPy to be better at "tasks", not merely at running equivalent code, hence a comparison between the C extension and the pure-Python version is sound. Fortunately it's possible to outperform the C extension, but it requires a bit of effort on the programmer's side as well.
Often the interface between the C and Python parts is ugly
This is very clear if you look at the json module as implemented in CPython's standard library. Not everything is in C (it would probably be just too much effort) and the interface to what is in C is guided by profiling, not by what kind of interface makes sense. This is especially evident when comparing CPython 2.6 to 2.7: just adapting the code to an interface with C made the Python version slower. Removing this clutter improves readability a lot and improves PyPy's version a bit, although I don't have hard numbers.
JitViewer is crucial
In case you're fighting with PyPy's performance, jitviewer is worth a shot. While it's not completely trivial to understand what's going on, it'll definitely show you what kind of loops got compiled and how.
No nice and fast way to build strings in Python
PyPy has a custom thing called __pypy__.builders.StringBuilder. It has a few features that make it much easier to optimize than other ways like str.join() or cStringIO (a minimal usage sketch follows the list):
- You can specify the starting size, which helps a lot if you can provide even a rough estimate of the size of the string (less copying)
- Only append and build are allowed. While the string is being built you can't seek or do anything else, and after it's built you can never append any more.
- A Unicode version is available as well: __pypy__.builders.UnicodeBuilder.
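For illustration, here is a minimal sketch of how the builder is used (the join_pieces helper and its size estimate are made up for this example; the __pypy__ module is only available on PyPy):

    from __pypy__.builders import StringBuilder  # PyPy-only built-in module

    def join_pieces(pieces):
        # A rough size hint up front means fewer reallocations (less copying).
        builder = StringBuilder(sum(len(p) for p in pieces))
        for p in pieces:
            builder.append(p)   # append is the only mutation allowed
        return builder.build()  # after build(), no further appends are possible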
Method calls are ok, immutable globals are ok
PyPy's JIT seems to be good enough for at least the simple cases. Calling methods on common infrastructure or loading globals (instead of rebinding them as locals) is fast enough and improves code readability.
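As a hypothetical illustration (the names below are invented), the classic CPython trick of rebinding a global as a local buys nothing under the JIT, so the plain, readable version is preferable:

    ESCAPE_MAP = {'\\': '\\\\', '"': '\\"'}

    def escape_cpython_style(s, _map=ESCAPE_MAP):
        # CPython-style micro-optimization: rebind the global as a local
        return ''.join(_map.get(c, c) for c in s)

    def escape_plain(s):
        # Plain global lookup; the JIT treats the unchanging global as constant
        return ''.join(ESCAPE_MAP.get(c, c) for c in s)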
String copying is expensive
Edit: see the comment at the end
If you use re.sub, the current implementation will always create a copy of the string, even if there was no match to replace. If you know your regexp is simple, first check whether there is anything to replace at all. This is a pretty hard optimization to do automatically -- simply matching the regular expression can be too costly for it to make sense. In our particular example, however, the regexp is really simple: it just checks ranges of characters. It also seems that this is by far the fastest way to escape characters as of now.
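A sketch of the pattern, assuming a regexp that only checks character ranges (this is an approximation for illustration, not the json module's actual escape table):

    import re

    ESCAPE = re.compile(r'[\x00-\x1f\\"]')  # simple character-range regexp

    def escape(s):
        if ESCAPE.search(s) is None:
            return s  # nothing to replace: avoid the copy re.sub would make
        return ESCAPE.sub(lambda m: '\\u%04x' % ord(m.group(0)), s)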
Generators are slower than they should be
I changed the entire thing to simply call builder.append instead of yielding to the main loop, where the pieces would be gathered. It's something of a PyPy bug that using generators extensively is slower than it should be, but it's a bit hard to fix. Especially in cases where relatively little data is passed around (a few bytes), it makes sense to gather it first. If I were to implement an efficient version of iterencode, I would probably yield chunks of a predetermined size, about 1000 bytes, instead of yielding data every few bytes.
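A hypothetical sketch of that chunking idea (iterencode_chunked and chunk_size are made-up names):

    def iterencode_chunked(pieces, chunk_size=1000):
        # Gather the small strings and yield them in roughly 1000-byte chunks.
        buf, size = [], 0
        for piece in pieces:
            buf.append(piece)
            size += len(piece)
            if size >= chunk_size:
                yield ''.join(buf)
                buf, size = [], 0
        if buf:
            yield ''.join(buf)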
I must admit I worked around PyPy's performance bug
For obscure (although eventually fixable) reasons, this:
    for c in s:  # s is a string
        del c
is faster than:
    for c in s:
        pass
This is a PyPy performance bug and should be fixed, but on a different branch ;-)
PyPy's JIT is good
I was pretty surprised, but the JIT actually did make stuff work nicely. The changes were relatively minor and straightforward once the module was cleaned up to a normal "pythonic" state. It is worth noting that it's possible to write code in Python and make it run really fast, but you have to be a bit careful. Again, jitviewer is your friend when determining why things are slow. I hope we can write more tools in the future that would more automatically guide people through potential performance pitfalls.
Cheers, fijal
Edit: I was wrong about re.sub. It just seems that the JIT handles match better than sub; this will be fixed soon.
Comments
It would be neat to get UnicodeBuilder and StringBuilder into mainline Python. They'd be more efficient in CPython than existing string-construction methods, and it would be easier to write performant, portable Python.
Can you elaborate a bit on the slowness of generators?
Ian: yes, it would. python-ideas/dev has had this discussion many times; if you want to convince them of the merit of this idea, feel free to try, but I've grown weary of this discussion.
This is not meant to derail the rather nice performance numbers, but I wouldn't call the json/simplejson code "pythonic" in the first place.
I wonder if using a constant object to dump in each iteration doesn't skew the benchmark in favor of PyPy, whereas the JIT couldn't optimize as much with a varying object (which is what usually happens in real-life scenarios).
@Gaetan it certainly could in theory. In practice it does not occur here, but I only know that from looking at the traces. However, creating a new object each time would make the benchmark more of an object-creation one (probably GC-related).
@Maciej: not if you build the list of objects to dump out of the timed loop, or did I miss something?
True, though that might be a bit biggish. Anyway, as I said, it's good enough; the JIT does not assume such things are constant. In fact it would execute exactly the same code for similarly shaped objects (different code if all the objects slightly differ in shape, though).
Interfacing Python to C isn't ugly if you use Cython.
That is probably a matter of taste, which we should not discuss among gentlemen. I, however, find pure Python better than a Python-Cython-C combination. Also, parsing JSON in C is not fun at all.
The guys from ultrajson have a benchmark here https://github.com/esnme/ultrajson/blob/master/python/benchmark.py
and the results are in the README https://github.com/esnme/ultrajson/blob/master/README
would be interesting to run those benchmarks (of course, first warming up the jit), and comparing the results to ultrajson.
feel free leonardo :)
It was just a suggestion on how to improve it, like you asked. If it was just going to be ignored I would not have bothered.