Improving SyntaxError in PyPy
For the last year, my half-time job has been to teach non-CS uni students to program in Python. While doing that, I have been trying to see what common stumbling blocks exist for novice programmers. There are many things that could be said here, but a common theme that emerges is hard-to-understand error messages. One source of such error messages, particularly when starting out, is SyntaxErrors.
PyPy's parser (mostly following the architecture of CPython) uses a regular-expression-based tokenizer with some cleverness to deal with indentation, and a simple LL(1) parser. Both of these components obviously produce errors for invalid syntax, but the messages are not very helpful. Often, the message is just "invalid syntax", without any hint of what exactly is wrong. In the last couple of weeks I have invested a little bit of effort to make them a tiny bit better. These improvements will be part of the upcoming PyPy 6.0 release. Here are some examples of what changed.
Missing Characters
The first class of errors occurs when a token is missing. Often there is only one valid token that the parser expects at that point. This happens most commonly when the ':' after a control flow statement is left out (which is the syntax error I personally still make at least a few times a day). In such situations, the parser will now tell you which character it expected:
>>>> # before
>>>> if 1
  File "<stdin>", line 1
    if 1
       ^
SyntaxError: invalid syntax
>>>>
>>>> # after
>>>> if 1
  File "<stdin>", line 1
    if 1
       ^
SyntaxError: invalid syntax (expected ':')
>>>>
Another example of this feature:
>>>> # before
>>>> def f:
  File "<stdin>", line 1
    def f:
         ^
SyntaxError: invalid syntax
>>>>
>>>> # after
>>>> def f:
  File "<stdin>", line 1
    def f:
         ^
SyntaxError: invalid syntax (expected '(')
>>>>
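The idea behind the "expected" hint can be sketched in a few lines. This is only an illustration, not PyPy's parser code: the function name `syntax_error_message` and the notion of passing in the set of tokens the parser would accept in its current state are assumptions made for the example.

```python
def syntax_error_message(valid_next_tokens):
    """Build an error message, adding a hint if exactly one token would be valid.

    `valid_next_tokens` is a hypothetical set of the tokens the parser
    would accept in the state where it got stuck.
    """
    message = "invalid syntax"
    if len(valid_next_tokens) == 1:
        (expected,) = valid_next_tokens
        message += " (expected '%s')" % expected
    return message


# Assume the parser state reached after `if 1` accepts only ':'.
print(syntax_error_message({":"}))  # invalid syntax (expected ':')

# With several possible continuations there is no single useful hint.
print(syntax_error_message({"NAME", "NUMBER", "("}))  # invalid syntax
```

The hint is only added when there is exactly one candidate, since listing many possible tokens would probably confuse a beginner more than it helps.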
Parentheses
Unmatched parentheses are another source of errors. Here, PyPy has always had slightly better error messages than CPython:
>>> # CPython
>>> )
  File "<stdin>", line 1
    )
    ^
SyntaxError: invalid syntax
>>>
>>>> # PyPy
>>>> )
  File "<stdin>", line 1
    )
    ^
SyntaxError: unmatched ')'
>>>>
The same is true for parentheses that are never closed (the call to eval is needed to get the error, otherwise the repl will just wait for more input):
>>> # CPython
>>> eval('(')
  File "<string>", line 1
    (
    ^
SyntaxError: unexpected EOF while parsing
>>>
>>>> # PyPy
>>>> eval('(')
  File "<string>", line 1
    (
    ^
SyntaxError: parenthesis is never closed
>>>>
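To compare such messages across interpreters without the REPL waiting for more input, the source can be handed to the built-in `compile` and the resulting exception inspected. The helper name `describe_syntax_error` is made up for this sketch; `msg`, `lineno` and `offset` are standard attributes of `SyntaxError`. The exact message text depends on the interpreter and its version, so none is shown here.

```python
def describe_syntax_error(source):
    """Compile `source` as an expression.

    Returns (msg, lineno, offset) if it fails to parse, or None if it parses.
    """
    try:
        compile(source, "<string>", "eval")
    except SyntaxError as e:
        return e.msg, e.lineno, e.offset
    return None


# The unclosed parenthesis is reported immediately instead of prompting for more input.
print(describe_syntax_error("("))

# Valid input produces no error.
print(describe_syntax_error("(1, 2)"))
```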
What I have now improved is the case of parentheses that are matched wrongly:
>>>> # before
>>>> (1,
.... 2,
.... ]
  File "<stdin>", line 3
    ]
    ^
SyntaxError: invalid syntax
>>>>
>>>> # after
>>>> (1,
.... 2,
.... ]
  File "<stdin>", line 3
    ]
    ^
SyntaxError: closing parenthesis ']' does not match opening parenthesis '(' on line 1
>>>>
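All three parenthesis messages can be produced by keeping a stack of the currently open brackets together with the line each was opened on. The following is a minimal sketch of that technique, not PyPy's actual implementation: `check_brackets` is a hypothetical name, and the sketch works on raw characters, so brackets inside strings and comments are wrongly counted.

```python
OPENING = {"(": ")", "[": "]", "{": "}"}
CLOSING = {")", "]", "}"}


def check_brackets(source):
    """Return a message for the first bracket problem in `source`, or None.

    Simplification: brackets inside strings and comments are not skipped.
    """
    stack = []  # (opening character, line number) for each unclosed bracket
    for lineno, line in enumerate(source.splitlines(), start=1):
        for char in line:
            if char in OPENING:
                stack.append((char, lineno))
            elif char in CLOSING:
                if not stack:
                    return "unmatched '%s'" % char
                opener, open_line = stack.pop()
                if OPENING[opener] != char:
                    return (
                        "closing parenthesis '%s' does not match opening "
                        "parenthesis '%s' on line %d" % (char, opener, open_line)
                    )
    if stack:
        return "parenthesis is never closed"
    return None


print(check_brackets("(1,\n2,\n]"))
# closing parenthesis ']' does not match opening parenthesis '(' on line 1
print(check_brackets(")"))  # unmatched ')'
print(check_brackets("("))  # parenthesis is never closed
```

Remembering the line number alongside each opening bracket is what makes it possible to point back to line 1 when the mismatch is only discovered on line 3.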
Conclusion
Obviously these are just some very simple cases, and there is still a lot of room for improvement (one huge problem is that only a single SyntaxError is ever shown per parse attempt, but fixing that is rather hard).
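To give a feeling for the single-error limitation, here is a deliberately crude way to get more than one error out of a file: cut the source at every unindented line and compile each piece on its own. This is an assumption-laden sketch and not something PyPy does. The name `find_syntax_errors` is invented, and the splitting rule breaks on constructs whose continuation starts in column 0, such as `else:` or `except:` clauses at top level and multi-line expressions.

```python
def find_syntax_errors(source):
    """Compile each top-level chunk separately and collect (line, message) pairs.

    Crude assumption: every non-blank, unindented line starts a new statement.
    """
    lines = source.splitlines()
    starts = [i for i, line in enumerate(lines) if line and not line[0].isspace()]
    errors = []
    for start, end in zip(starts, starts[1:] + [len(lines)]):
        chunk = "\n".join(lines[start:end]) + "\n"
        try:
            compile(chunk, "<chunk>", "exec")
        except SyntaxError as e:
            # e.lineno is relative to the chunk; `start` is its 0-based offset.
            errors.append((start + e.lineno, e.msg))
    return errors


source = "if 1\n    pass\nx = 1\ndef f:\n    pass\n"
for lineno, msg in find_syntax_errors(source):
    print("line %d: %s" % (lineno, msg))
```

Both the missing ':' on line 1 and the missing '(' on line 4 are reported in one run, at the cost of false positives on the constructs listed above.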
If you have a favorite unhelpful SyntaxError message you love to hate, please tell us in the comments and we might try to improve it. Other kinds of non-informative error messages are also always welcome!
Comments
This is great, I've been thinking along these lines when it comes to python errors for a while.
This kind of improvements would be great for the long-suffering python web developers too.
Despite my typo-ridden comment, English is my first language :(
I've seen people struggle with lambda.
>>> lambda x:
  File "<stdin>", line 1
    lambda x:
            ^
SyntaxError: invalid syntax
Upon a syntax error, you might want to scan forward until the next line with the current(ly-broken) statement's indent (or maybe until there's a dedent to below that level (except when already at top level, obviously)), then resume parsing.
I applaud this initiative. This is something that I have attempted to do on https://reeborg.ca/reeborg.html (only for code run in the editor, not for the repl). I also tried to provide translations when using languages other than English. I think it would be great if you could somehow provide a hook to easily add translations.
Missing commas between elements in data structures is probably my most common syntax error, especially when dealing with nested data structures or structures split across multiple lines. And while they're something I can recognize very easily, the actual error message isn't especially helpful, particularly when the next element after a missing comma is on the following line.
Thanks for the explanation. It all makes sense now that I know Python uses regular expressions in its parser. When Idle points to a random space character within the indentation, off to the left of a code block implemented in compliance with every recognized convention, boldly proclaiming "syntax error", I know precisely which vestigial anti-Pythonic Bell Labs holdover to resent. Again.
Everybody thanks for the suggestions! I've added these to my collections of things I might want to fix.
@smurfix there is a huge amount of scientific papers on approaches how to do stuff like that, I am currently working through them (slowly)
@Unknown do you have an example for this behaviour?
Sorry for the 'unknown' status ... In fact, it happened again today. I can send a screenshot, if that will help, confirming the presence of a red highlighted space, among many seemingly non-offending spaces, within the left margin indentation. Let me see if it is still happening when I try to run that code ... No, that exact SNAFU has moved on, but I now have an example of a syntax error being highlighted within a comment. Is that interesting?
I would love to see this get updated to Python 3.6.5. I'm currently using that for my programs, and even after looking at the changelogs between Python versions, I'm not sure what I'd lose by moving down to 3.5.3 so that I could use PyPy.
I'm also curious about things like IdleX and Anaconda. Would those be, hypothetically speaking, mergeable with PyPy?