Call for Benchmarks
As you know, a lot of PyPy's recent development effort has gone into speeding up execution of Python programs. However, an additional good property of PyPy's Python interpreter is that most objects are represented in a much more compact way than in CPython. We would like to investigate some more advanced techniques to reduce the memory usage of Python programs further.
To do this it is necessary to investigate the memory behaviour of real programs with large heaps. For speed measurements there are standard benchmarks, but for memory improvements there is nothing comparable; the memory behaviour of large programs is not that well understood. Therefore we are looking for programs that we can study and use as benchmarks.
Specifically we are looking for Python programs with the following properties:
- large heaps of about 10MB-1GB
- should have non-trivial runtime as well (in the range of a few seconds), to judge the speed impact of optimizations
- ideally pure-Python programs that don't use extension modules, so that they run under both CPython and PyPy (this is optional, but makes my life much easier); a rough way to check the heap size and runtime of a candidate program is sketched right after this list
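For anyone unsure whether a program qualifies, here is a minimal, hedged sketch (Unix-only, standard library only) of how one might check both numbers; the workload function is a hypothetical placeholder to be replaced by the real program:

    # Rough check of peak memory and wall-clock time for a workload.
    # Note: ru_maxrss is reported in kilobytes on Linux but in bytes on
    # Mac OS X, so treat the number as an estimate.
    import resource
    import time

    def example_workload():
        # hypothetical placeholder: build a biggish data structure
        return [str(i) * 10 for i in range(10 ** 6)]

    start = time.time()
    data = example_workload()       # keep the result alive so it counts toward the peak
    elapsed = time.time() - start
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("runtime: %.2f s, peak RSS: ~%d KB" % (elapsed, peak))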
We are also rather interested in programs that do a lot of string/unicode processing.
We would be grateful for all ideas. Telling us about a program also has the advantage that we will work on optimizing PyPy for it :-).
Comments
I haven't thought about it very much, but ZODB, Durus, or Dobbin could be useful.
portage, the official Gentoo Linux package manager, does package dependency resolution and can take a few seconds for large updates. It parses package metadata from text files.
You could run a program that determinizes a large NFA. Given an existing Python program that can determinize an NFA, you could feed it an expanded version of the NFA on page 15 here: https://www.springerlink.com/content/cq16j1uv511g793g/fulltext.pdf. Another way is to take some complex NFAs, concatenate them, and determinize the result.
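This is not any particular existing tool, just a minimal sketch of the textbook subset construction, to show why determinization makes a naturally memory-hungry benchmark: the resulting DFA can have exponentially many states, each represented here as a frozenset.

    # Subset construction: turn an NFA (without epsilon transitions) into a DFA.
    # delta maps (state, symbol) -> set of successor states.
    def determinize(alphabet, delta, start, accepting):
        start_set = frozenset([start])
        dfa_states = set([start_set])
        worklist = [start_set]
        dfa_delta = {}
        while worklist:
            current = worklist.pop()
            for symbol in alphabet:
                target = frozenset(s for q in current
                                   for s in delta.get((q, symbol), ()))
                dfa_delta[(current, symbol)] = target
                if target not in dfa_states:
                    dfa_states.add(target)
                    worklist.append(target)
        dfa_accepting = set(S for S in dfa_states if S & accepting)
        return dfa_states, dfa_delta, start_set, dfa_accepting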
Bazaar and mercurial take a lot of memory (time as well) when updating/merging etc. large repositories, especially if they contain large files.
Pylint (https://www.logilab.org/project/pylint) could be a nice target. Pure Python, the size of the heap and run time depend on what kind of code you throw at it.
You could try loading and manipulating a large graph with NetworkX. Pure Python, and the size and runtime could be tuned by varying the size of the graph and the algorithms that are run.
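A hedged sketch of that kind of workload, assuming a reasonably current NetworkX (the generator and algorithm names are from its documented API, though return types differ a little between versions):

    # Build a large random graph and run a couple of pure-Python algorithms
    # on it; heap size and runtime scale with n and m.
    import networkx as nx

    n, m = 100000, 500000      # tune to adjust memory use and runtime
    G = nx.gnm_random_graph(n, m)

    components = list(nx.connected_components(G))
    lengths = dict(nx.single_source_shortest_path_length(G, 0))
    print("components: %d, nodes reachable from 0: %d"
          % (len(components), len(lengths)))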
The CDPedia creates and manipulates its index with a pure-Python inverted index implementation.
It could be extracted and made into a benchmark - there are other pure-Python inverted indices around; those could also work.
They do tend to use lots and lots of memory. The CDPedia's implementation uses the builtin array module for byte-sequence manipulation and bare strings as its data store (it is highly optimized to lower CPython's memory usage), but there are still a few dict-heavy places.
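Not the CDPedia code itself, just a toy sketch in the same spirit, storing postings as arrays of unsigned integers rather than lists of Python objects:

    # Toy inverted index: term -> array of document ids.
    from array import array

    class TinyInvertedIndex(object):
        def __init__(self):
            self.postings = {}

        def add_document(self, doc_id, text):
            for term in set(text.split()):
                self.postings.setdefault(term, array('L')).append(doc_id)

        def lookup(self, term):
            return self.postings.get(term, array('L'))

    index = TinyInvertedIndex()
    index.add_document(1, "pypy speeds up pure python programs")
    index.add_document(2, "memory usage of python programs")
    print(list(index.lookup("python")))   # -> [1, 2]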
Agreed that Bazaar and Mercurial would be interesting use cases, especially for projects with large revision history graphs.
Memory usage analysis has come up recently on the bzr list:
https://lists.ubuntu.com/archives/bazaar/2010q3/069549.html
All great ideas, thanks a lot!
Python Natural Language Toolkit
https://www.nltk.org/
Give it a huge corpus (Wikipedia?) and do any operation on it -- NLTK will use huge amounts of memory in all kinds of custom objects, lists, and tuples.
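A hedged sketch of that kind of workload; 'corpus.txt' is a hypothetical large plain-text file (e.g. a Wikipedia extract), and the tokenizer data may need to be fetched first via nltk.download():

    # Tokenize a large text file and build a frequency distribution:
    # lots of small string objects plus one big dict-like FreqDist.
    import nltk

    text = open('corpus.txt').read()
    tokens = nltk.word_tokenize(text)
    freq = nltk.FreqDist(tokens)
    print("tokens: %d, vocabulary size: %d" % (len(tokens), len(freq)))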
From what I understand, PyExcelerator, a reader/writer for Excel files, uses huge amounts of memory for very large files. It creates a pure-Python object for each cell, which blows up memory use when you're writing many millions of cells.
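A hedged sketch of such a workload, assuming pyExcelerator's Workbook/add_sheet/write/save interface (the same API its xlwt fork kept); the file name and cell counts are arbitrary:

    # Write a sheet with over a million cells, each backed by a Python object.
    from pyExcelerator import Workbook

    wb = Workbook()
    ws = wb.add_sheet('bench')
    for row in range(60000):          # .xls sheets top out at 65536 rows
        for col in range(20):
            ws.write(row, col, row * col)
    wb.save('bench.xls')              # memory peaks while building/saving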
A couple of possibilities from my own OSS code:
https://stromberg.dnsalias.org/~strombrg/treap
https://stromberg.dnsalias.org/~strombrg/pyindex.html
I'd most likely be happy to relicense the treap code as needed to facilitate inclusion. The pyindex code is under a UCI (I believe it's BSDish) license, and would probably need to remain so.
I really didn't think about it much; I'm just chewing through my RSS backlog and ran into a post about pkgcore dealing with memory issues a few minutes after I read this call for benchmarks.
Maybe you could use that.
You might want to look at MiniLight:
https://www.hxa.name/minilight/#comparison
I'm the author of a scientific application that may suit your needs. It runs under both Python 2.x and PyPy, so I have bundled a distribution with some example benchmarks in case this interests you: https://dl.dropbox.com/u/7931953/pypy-bench.tar.bz2 (see bench.README)
An interesting observation in my opinion is that on small runs, CPython outperforms PyPy but this progressively reverses on longer runs.
@all: thanks for the proposals, I am looking at them.
@Franck: This is probably due to the JIT, which needs some time to compile at the beginning. Once the assembler has been generated, it executes quickly. I will look at your code; thanks for providing it.
Hello, I am the author of a chess program written entirely in Python. I haven't published it yet, because I am a bit ashamed of its poor quality. However, it should suffice for the sole purpose of benchmarking. Please drop me a note if you are interested. My email address is: larudwer at freenet dot de
Some Notes:
The program is console mode only (UCI), no GUI.
It eats up all the memory you have.
CPython is almost twice as fast as PyPy 1.3 on this program, and Psyco accelerates it by another factor of two.
You could consider Tahoe-LAFS. A good reason to use it is that it is a practicality-oriented, widely deployed tool with significant memory usage that we routinely spend engineering effort to track and manage.
Here are some graphs of the memory usage of different versions of Tahoe-LAFS over time:
32-bit machine:
https://tahoe-lafs.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats.html
64-bit machine:
https://tahoe-lafs.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_memstats_64.html
Here are some open tickets about memory usage in our issue tracker:
https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed&keywords=~memory&order=priority
The reason not to use Tahoe-LAFS as a subject is that it uses several native-code libraries for the CPU-intensive inner loops (cryptography, erasure coding). I really want those libraries, and hence Tahoe-LAFS, to be usable with cpyext as soon as possible, but I haven't tried it and I assume that cpyext isn't 100% there yet.
By the way, the easiest way to measure the performance of Tahoe-LAFS would be to run its unit tests and measure the memory usage and runtime. This is not only the easiest way, but it is also a pressing issue for us! Tahoe-LAFS unit tests take too long to run, which causes problems for us, and we would very much like it if they could run to completion faster.
https://tahoe-lafs.org/trac/tahoe-lafs/ticket/20 (unit tests take too long)
Here are our buildbots showing unit test runtime among other things:
https://tahoe-lafs.org/buildbot/waterfall?show_events=true&builder=Kyle+OpenBSD-4.6+amd64&builder=hardy-amd64&builder=Arthur+lenny+c7+32bit&builder=Eugen+lenny-amd64&builder=David+A.+OpenSolaris+i386&builder=Ruben+Fedora&builder=Zooko+zomp+Mac-amd64+10.6+py2.6&builder=FreeStorm+WinXP-x86+py2.6&builder=tarballs
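Not Tahoe-LAFS tooling, just a rough sketch of how one could time a test run and get a peak-memory estimate for it from the outside; the test command shown is a hypothetical placeholder for whatever the project actually uses:

    # Run a test suite as a child process and report wall-clock time plus
    # the peak RSS recorded for terminated children (Unix only, rough numbers).
    import resource
    import subprocess
    import time

    start = time.time()
    subprocess.call(["python", "setup.py", "test"])   # hypothetical command
    elapsed = time.time() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print("test run: %.1f s, children peak RSS: ~%d KB" % (elapsed, peak))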
rawdog (disclosure of bias: I wrote it) sounds like it might be of use. It's an RSS aggregator that generates static HTML. Pure Python 2, with lots of string processing, mostly in the feedparser module. Memory usage and runtime depend on how many feeds it's reading and how much history it keeps, since it does everything in memory at the moment, using pickle for persistent state. (With my 800-odd feeds and two-month history, writing the entire store to HTML will use a few hundred megabytes of memory and run for several minutes.)
A future redesign will use a more sensible database-backed approach...
Scapy would be a great one to benchmark. Depending on the size of the packet capture, it can consume quite a bit of CPU and memory when loading and dissecting large captures. I run it at work on CPython and would love to see it running and optimized under PyPy. The only problem is that I believe it uses some Python 2.6-isms.
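A hedged sketch of that workload; rdpcap and the TCP layer class are part of Scapy's API, while 'big_capture.pcap' is a hypothetical file name:

    # Load an entire capture into memory as packet objects and walk it.
    from scapy.all import rdpcap, TCP

    packets = rdpcap('big_capture.pcap')   # whole file held in memory
    tcp_count = sum(1 for p in packets if TCP in p)
    print("packets: %d, TCP packets: %d" % (len(packets), tcp_count))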
Thanks again for all the additional pointers. Still investigating all of them.
How about Nucular, a search engine written in Python by Aaron Watters?
https://nucular.sourceforge.net/
In my view, the natural competitors to PyPy (in the domain of fast interpreters for dynamic languages) are TraceMonkey and V8. Therefore, translations of the SunSpider, V8, and Dromaeo benchmarks would be appropriate.
BitBake looks like a good candidate. It's a derivative of Portage and is used to cross-compile Linux distributions for embedded devices.
With a non-trivial distro, it can use up to 400 MB. It already uses Psyco if available, and it would be interesting to compare speed/memory usage with PyPy.