Hacking PHP syntax
Have you ever thought about how to extend the core of PHP? What does it take to create a new keyword or even design a whole new syntax? If you have some basic knowledge of C, you shouldn’t have any problem making small changes. Yes, I know it might be a little bit pointless, but that doesn’t matter because it’s fun.
Let’s create an alternative way to define a class. The simplest class definition allowed in PHP looks like this:
    <?php
    class ClassName {}
We can simplify the syntax and replace the curly brackets with a semicolon.
    <?php
    class ClassName;
If you try to execute this code it will obviously throw an error. That’s not a problem, we can fix it.
The first step is to install some software.
    $ sudo apt-get install bison re2c
PHP is written in C; however, the parser is generated with Bison. Bison is a parser generator. Its home page defines it as “a general-purpose parser generator that converts an annotated context-free grammar into a deterministic LR or generalized LR (GLR) parser employing LALR parser tables”.
It’s a very powerful piece of software and one could write a whole book about it. If you would like to learn more, I refer you to the documentation. It’s not a very easy read, but there is a good example. If you ever want to create a programming language, that might be a good place to start.
Go to http://php.net and get the latest PHP sources. The commands below use version 5.4.14.
    $ tar xvjf php-5.4.14.tar.bz2
    $ cd php-5.4.14
    $ ./configure
    $ cd Zend
    $ ls
Take your hat off. You are looking at the core of PHP. The code in these files powers a large share of the web. Let’s break it.
The conventional extension for Bison grammar files is “.y”.
    $ ls *.y
    zend_ini_parser.y  zend_language_parser.y
We don’t want to mess with the “ini” syntax, so the only choice is “zend_language_parser.y”. Open it with your editor of choice.
If you search for “class” you will find:
    %token T_CLASS "class (T_CLASS)"
Parsers like to operate on tokens. The token for the “class” keyword is “T_CLASS”. If you search for “T_CLASS” you will find something like this:
    class_entry_type:
            T_CLASS            { $$.u.op.opline_num = CG(zend_lineno); $$.EA = 0; }
        |   T_ABSTRACT T_CLASS { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_EXPLICIT_ABSTRACT_CLASS; }
        |   T_TRAIT            { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_TRAIT; }
        |   T_FINAL T_CLASS    { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_FINAL_CLASS; }
    ;
You are looking at four different ways to define a class.
- class
- abstract class
- trait
- final class
In the curly brackets you can see some low-level assignments. My guess is that they record the current line number and a set of class flags. Let’s ignore them.
We are on the right track, but it’s not exactly what we’re looking for. Search for “class_entry_type”, which groups those four definitions.
That takes you to the final destination. It’s simple, but not very readable at first.
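To make those assignments a little less mysterious, here is a small standalone C sketch of one plausible reading: each alternative of “class_entry_type” stores a different flag value in the node it produces. The DEMO_ACC_* names and their values are placeholders invented for this example (the real ZEND_ACC_* constants are defined in the Zend headers), and demo_class_entry_flags is a hypothetical helper, not part of PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder flag values for this sketch only; they are NOT the real
 * ZEND_ACC_* values. */
enum demo_class_flags {
    DEMO_ACC_NONE     = 0,
    DEMO_ACC_ABSTRACT = 1,
    DEMO_ACC_FINAL    = 2,
    DEMO_ACC_TRAIT    = 4
};

/* Mirrors the four alternatives of class_entry_type. `first` and `second`
 * are the leading keywords of a declaration; `second` may be NULL and is
 * only consulted for the two-keyword forms. Returns the flag for the
 * matching alternative, or -1 when none matches. */
int demo_class_entry_flags(const char *first, const char *second)
{
    assert(first != NULL);

    if (strcmp(first, "class") == 0) {
        return DEMO_ACC_NONE;                 /* T_CLASS */
    }
    if (strcmp(first, "trait") == 0) {
        return DEMO_ACC_TRAIT;                /* T_TRAIT */
    }
    if (second != NULL && strcmp(second, "class") == 0) {
        if (strcmp(first, "abstract") == 0) {
            return DEMO_ACC_ABSTRACT;         /* T_ABSTRACT T_CLASS */
        }
        if (strcmp(first, "final") == 0) {
            return DEMO_ACC_FINAL;            /* T_FINAL T_CLASS */
        }
    }
    return -1;                                /* no alternative matched */
}
```

With this helper, demo_class_entry_flags("abstract", "class") yields DEMO_ACC_ABSTRACT, while an incomplete prefix such as "final" on its own yields -1.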
    unticked_class_declaration_statement:
            class_entry_type T_STRING extends_from
                { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
                implements_list
                '{'
                    class_statement_list
                '}' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }
        |   interface_entry T_STRING
                { zend_do_begin_class_declaration(&$1, &$2, NULL TSRMLS_CC); }
                interface_extends_list
                '{'
                    class_statement_list
                '}' { zend_do_end_class_declaration(&$1, NULL TSRMLS_CC); }
    ;
There are two declarations here: one for a class and one for an interface. We are interested in the first one. It starts with “class_entry_type”, which resolves to: class | abstract class | trait | final class. The next element is the token T_STRING, which is going to be the class name. The element after that, “extends_from”, is a group: it can be “extends” followed by a class name, or nothing.
After that, the parser calls the Zend engine to begin the class declaration.
    { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
You can find this function in the zend_compiler.c file.
    void zend_do_begin_class_declaration(const znode *class_token, znode *class_name, const znode *parent_class_name TSRMLS_DC)
The first argument is the class token “class_entry_type”, the second is the class name “T_STRING” and the last one is the parent class “extends_from”.
Under that we have another group, “implements_list”. I’m sure you can guess it. Yes, it’s for assigning interfaces. The following lines define the mandatory class body: the opening bracket “{”, the “class_statement_list” group and the closing bracket “}”. Finally, the parser informs the Zend engine that the class declaration has ended.
    { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }
We need to duplicate that rule, but without the class body definition. To keep things simple, the new alternative also leaves out “implements_list”.
    unticked_class_declaration_statement:
            /* new alternative: a class without a body, terminated by ';' */
            class_entry_type T_STRING extends_from
                { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
                ';' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }
        |   class_entry_type T_STRING extends_from
                { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
                implements_list
                '{'
                    class_statement_list
                '}' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }
        |   interface_entry T_STRING
                { zend_do_begin_class_declaration(&$1, &$2, NULL TSRMLS_CC); }
                interface_extends_list
                '{'
                    class_statement_list
                '}' { zend_do_end_class_declaration(&$1, NULL TSRMLS_CC); }
    ;
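If Bison syntax is new to you, the two class alternatives can be pictured as a hand-written parser. The following C sketch is only an illustration under simplifying assumptions: tokens arrive already split into strings, only the plain “class” entry type is handled, “implements” and class members are left out, keywords are not excluded from names, and every demo_ name is made up for this example. It is not how the Bison-generated parser works internally.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Which alternative matched, if any. */
enum demo_result { DEMO_ERROR = 0, DEMO_SEMICOLON_FORM, DEMO_BRACED_FORM };

/* True when tok is present and spelled exactly like text. */
static int demo_is(const char *tok, const char *text)
{
    return tok != NULL && strcmp(tok, text) == 0;
}

/* Very loose stand-in for T_STRING: starts with a letter or underscore. */
static int demo_is_name(const char *tok)
{
    return tok != NULL && (isalpha((unsigned char)tok[0]) || tok[0] == '_');
}

/* Parse a NULL-terminated token array such as {"class", "FooBar", ";", NULL}. */
enum demo_result demo_parse_class(const char *const *tok)
{
    size_t i = 0;
    assert(tok != NULL);

    if (!demo_is(tok[i], "class")) return DEMO_ERROR;   /* class_entry_type */
    i++;
    if (!demo_is_name(tok[i])) return DEMO_ERROR;       /* T_STRING */
    i++;
    if (demo_is(tok[i], "extends")) {                   /* extends_from (optional) */
        i++;
        if (!demo_is_name(tok[i])) return DEMO_ERROR;
        i++;
    }
    /* The grammar calls zend_do_begin_class_declaration() at this point. */

    if (demo_is(tok[i], ";")) {                         /* new alternative */
        return tok[i + 1] == NULL ? DEMO_SEMICOLON_FORM : DEMO_ERROR;
    }
    if (demo_is(tok[i], "{") && demo_is(tok[i + 1], "}")) {  /* original form */
        return tok[i + 2] == NULL ? DEMO_BRACED_FORM : DEMO_ERROR;
    }
    return DEMO_ERROR;
}
```

Both alternatives share the same prefix, and the decision between them is made only when the next token turns out to be “;” or “{”. That is the same one-token lookahead the real parser relies on to pick between the two rules.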
It was quite simple, wasn’t it? Now you just have to compile it.
    $ cd ..
    $ make
The first compilation always takes a while.
    $ vim test.php
Paste the test code.
    <?php

    class FooBar;

    $a = new FooBar;
    $a->bar = 10;

    print_r( $a );
Go and test your hack.
    $ sapi/cli/php test.php
    FooBar Object
    (
        [bar] => 10
    )
Well done, you’ve hacked PHP!
Let’s add one more thing. In PHP you define a class with the “class” keyword. How about making it shorter? “cls” should do fine.
Look for the lexer files.
    $ cd Zend/
    $ ls *.l
    zend_ini_scanner.l  zend_language_scanner.l
The Bison file operates on tokens. The lexer defines how source code is converted into those tokens. PHP’s lexer files are processed by re2c, which is why we installed it earlier.
Open zend_language_scanner.l and search for “class”.
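To get a feel for what a lexer does, here is a minimal hand-written tokenizer in C. It is a sketch under stated assumptions, not the scanner PHP actually generates: the DEMO_* token names and the function demo_next_token are invented for this example, only one keyword is recognised, and matching is case-sensitive for brevity (PHP keywords are case-insensitive).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum demo_token { DEMO_END = 0, DEMO_T_CLASS, DEMO_T_STRING, DEMO_CHAR };

/* Read the next token from *src, advance *src past it, and copy its text
 * into buf (at most cap - 1 characters, always NUL-terminated). */
enum demo_token demo_next_token(const char **src, char *buf, size_t cap)
{
    const char *p;
    size_t n = 0;

    assert(src != NULL && *src != NULL && buf != NULL && cap > 1);
    p = *src;

    while (isspace((unsigned char)*p)) {      /* skip whitespace */
        p++;
    }
    if (*p == '\0') {                         /* end of input */
        buf[0] = '\0';
        *src = p;
        return DEMO_END;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        /* a word: either a keyword or an ordinary name */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < cap) {
                buf[n++] = *p;
            }
            p++;
        }
        buf[n] = '\0';
        *src = p;
        return strcmp(buf, "class") == 0 ? DEMO_T_CLASS : DEMO_T_STRING;
    }
    buf[0] = *p;                              /* single character such as ';' */
    buf[1] = '\0';
    *src = p + 1;
    return DEMO_CHAR;
}
```

Called repeatedly on the text "class FooBar;", this returns DEMO_T_CLASS, then DEMO_T_STRING with the text "FooBar", then DEMO_CHAR for the semicolon, and finally DEMO_END. That token sequence is the kind of input the grammar rules are matched against.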
    <ST_IN_SCRIPTING>"class" {
        return T_CLASS;
    }
Duplicate this block and change class to cls.
    <ST_IN_SCRIPTING>"cls" {
        return T_CLASS;
    }

    <ST_IN_SCRIPTING>"class" {
        return T_CLASS;
    }
Job done. Compile the code and you can use “cls” instead of the “class” keyword.
Wasn’t that fun? I hope you enjoyed it as much as I did. Play around, break it. If you really like it, think about closing some bugs on https://bugs.php.net/.
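The effect of the duplicated rule can be pictured as a lookup table in which two spellings map to the same token, so the parser cannot tell which one was typed. A minimal C sketch, with hypothetical names (demo_keywords, demo_classify) and made-up token ids; the real ids come from the Bison-generated header:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token ids for this sketch. */
enum demo_token { DEMO_T_STRING = 1, DEMO_T_CLASS = 2 };

struct demo_keyword {
    const char *spelling;
    enum demo_token token;
};

/* Two spellings, one token. */
static const struct demo_keyword demo_keywords[] = {
    { "class", DEMO_T_CLASS },
    { "cls",   DEMO_T_CLASS },
};

/* Classify one word: a known keyword yields its token, anything else is
 * treated as an ordinary name. Matching is case-sensitive for brevity. */
enum demo_token demo_classify(const char *word)
{
    size_t i;
    assert(word != NULL);

    for (i = 0; i < sizeof demo_keywords / sizeof demo_keywords[0]; i++) {
        if (strcmp(word, demo_keywords[i].spelling) == 0) {
            return demo_keywords[i].token;
        }
    }
    return DEMO_T_STRING;
}
```

One side effect worth keeping in mind: once “cls” is a keyword, the lexer no longer returns it as an ordinary name, so existing code that uses cls as a function or class name would likely stop parsing.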
4 Comments
Theodore R. Smith (PHP Experts, Inc.)
02/05/2013
What if I wanted to rename the function “strpos” to “string_position” and wanted to create an alias named “strpos”?
Lukasz Kujawa
02/05/2013
Hello Theodore. Your question is more related to extending PHP than hacking the Zend engine. The function “strpos” is part of the standard extension and is defined in “ext/standard/string.c” – grep for “PHP_FUNCTION(strpos)”. In this case I would rather create a new extension, define the “string_position” wrapper and call “php_strpos” from there. Extending PHP is quite well explained at Zend Devzone. Google for “writing php extension”; there are a few good articles. You can also find many examples in the “ext/” directory and on PECL. I hope that answers your question.
solu
10/05/2013
It’s fun!
I want to add array slice syntax like Python (list[1:2]),
but I found it’s too hard for me.
Lukasz Kujawa
11/05/2013
Thank you for your comment. I agree. It doesn’t sound like a super easy tweak.