Hacking PHP syntax

Have you ever thought about how to extend the core of PHP? What does it take to create a new keyword or even design a whole new syntax? If you have some basic knowledge of C, you shouldn’t have any problem making small changes. Yes, I know it might be a little bit pointless, but that doesn’t matter because it’s fun.

Let’s create an alternative way to define a class. The simplest class definition allowed in PHP looks like this:

<?php
class ClassName {}

We can simplify the syntax and replace the curly brackets with a semicolon.

<?php
class ClassName;

If you try to execute this code with a stock PHP build, it will fail with a parse error. That’s not a problem, we can fix it.

The first step is to install some software.

$ sudo apt-get install bison re2c

PHP is written in C, but its parser is generated with Bison, a parser generator. (re2c, the other package installed above, generates the lexer; we will get to it later.) The home page defines Bison as: a general-purpose parser generator that converts an annotated context-free grammar into a deterministic LR or generalized LR (GLR) parser employing LALR parser tables.

It’s a very powerful piece of software, and one could write a whole book about it. If you would like to learn more, I refer you to the documentation. It’s not a very easy read, but it contains a good example. If you ever want to create a programming language, that might be a good place to start.

Go to http://php.net and get the latest PHP sources (5.4.14 at the time of writing).

$ tar xvjf php-5.4.14.tar.bz2
$ cd php-5.4.14
$ ./configure
$ cd Zend
$ ls

Take your hat off. You are looking at the core of PHP. The code in these files powers a large share of the web. Let’s break it.

The default extension for Bison grammar files is “.y”.

$ ls *.y
zend_ini_parser.y zend_language_parser.y

We don’t want to mess with the “ini” syntax, so the only choice is “zend_language_parser.y”. Open it with your editor of choice.

If you search for “class” you will find:

%token T_CLASS      "class (T_CLASS)"

Parsers like to operate on tokens. The token for “class” is “T_CLASS”. If you search for “T_CLASS” you will find something like this:

class_entry_type:
    T_CLASS { $$.u.op.opline_num = CG(zend_lineno); $$.EA = 0; }
    | T_ABSTRACT T_CLASS { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_EXPLICIT_ABSTRACT_CLASS; }
    | T_TRAIT { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_TRAIT; }
    | T_FINAL T_CLASS { $$.u.op.opline_num = CG(zend_lineno); $$.EA = ZEND_ACC_FINAL_CLASS; }
    ;

You are looking at four different ways to define a class.

  • class
  • abstract class
  • trait
  • final class

In the curly brackets you can see some low-level assignments. I can only guess what they are for. Let’s ignore them 😉

We are on the right track, but it’s not exactly what we’re looking for. Search for “class_entry_type”, which groups those four definitions.

That takes you to the final destination. It’s easy, but not very readable at first.

unticked_class_declaration_statement:

    class_entry_type T_STRING extends_from
            { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
            implements_list
            '{'
            class_statement_list
            '}' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }

    | interface_entry T_STRING
            { zend_do_begin_class_declaration(&$1, &$2, NULL TSRMLS_CC); }
           interface_extends_list
           '{'
          class_statement_list
           '}' { zend_do_end_class_declaration(&$1, NULL TSRMLS_CC); }
    ;

There are two declarations here: one for a class and one for an interface. We are interested in the first one. It starts with “class_entry_type”, which resolves to class | abstract class | trait | final class. The next element is the token T_STRING, which is going to be the class name. The element after that, “extends_from”, is a group. It can be “extends T_STRING” or nothing.

After that, the parser calls the Zend engine to begin the class declaration.

{ zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }

You can find this function in the zend_compiler.c file.

void zend_do_begin_class_declaration(const znode *class_token, znode *class_name, const znode *parent_class_name TSRMLS_DC)

The first argument is the class token “class_entry_type”, the second is the class name “T_STRING” and the last one is the parent class “extends_from”.

Below that we have another group, “implements_list”. I’m sure you can guess what it does: it assigns interfaces. The following lines define the mandatory class body: the opening bracket “{”, the “class_statement_list” group and the closing bracket “}”. Finally, the parser informs the Zend engine that the class declaration has ended.

{ zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }

We need to duplicate that rule, but with a semicolon in place of the class body (this short form also drops “implements_list”).

unticked_class_declaration_statement:

    class_entry_type T_STRING extends_from
            { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
            ';' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }

    | class_entry_type T_STRING extends_from

            { zend_do_begin_class_declaration(&$1, &$2, &$3 TSRMLS_CC); }
            implements_list
            '{'
            class_statement_list
            '}' { zend_do_end_class_declaration(&$1, &$3 TSRMLS_CC); }

    | interface_entry T_STRING
            { zend_do_begin_class_declaration(&$1, &$2, NULL TSRMLS_CC); }
           interface_extends_list
           '{'
          class_statement_list
           '}' { zend_do_end_class_declaration(&$1, NULL TSRMLS_CC); }
    ;

It was quite simple, wasn’t it? Now you just have to compile it.

$ cd ..
$ make

The first compilation always takes a while.

$ vim test.php

Paste the test code.

<?php
class Bar;

$a = new Bar;
$a->bar = 10;

print_r( $a );

Go and test your hack.

$ sapi/cli/php test.php 
Bar Object
(
    [bar] => 10
)

Well done, you’ve hacked PHP!
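To see why one extra alternative in the grammar is enough, here is a minimal hand-written sketch in C of the same decision the parser now makes after the class name. It is not what Bison generates (Bison builds table-driven LALR parsers), and the token names and the function `parse_class_decl` are hypothetical, made up for this illustration. It also ignores “extends”, “implements” and any class body contents.

```c
#include <stddef.h>

/* Hypothetical token kinds for this sketch only; the token array
 * passed to parse_class_decl must always end with TK_END. */
enum tok { TK_CLASS, TK_NAME, TK_SEMI, TK_LBRACE, TK_RBRACE, TK_END };

/* Returns 1 if the tokens at *pos form `class Name;` or `class Name {}`,
 * otherwise 0. On success *pos is advanced past the declaration. */
int parse_class_decl(const enum tok *toks, size_t *pos)
{
    size_t p = *pos;

    if (toks[p] != TK_CLASS) {
        return 0;
    }
    p++;
    if (toks[p] != TK_NAME) {
        return 0;
    }
    p++;

    if (toks[p] == TK_SEMI) {
        /* the new alternative: a semicolon instead of a body */
        p++;
    } else if (toks[p] == TK_LBRACE && toks[p + 1] == TK_RBRACE) {
        /* the original alternative: an (empty) body in curly brackets */
        p += 2;
    } else {
        return 0;
    }

    *pos = p;
    return 1;
}
```

Both forms share the same prefix, and a single token of lookahead (“;” or “{”) is enough to pick the branch. That is the same property that lets the two Bison alternatives sit side by side.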

Let’s add one more thing. In PHP you define a class with the “class” keyword. How about making it shorter? “cls” should do fine.

Look for the lexer files.

$ cd Zend/
$ ls *.l
zend_ini_scanner.l zend_language_scanner.l

The Bison file operated on tokens. The lexer files (these are input for re2c) define how source code is converted into those tokens.

Open zend_language_scanner.l and search for “class”.

<ST_IN_SCRIPTING>"class" {
    return T_CLASS;
}

Duplicate this block and change “class” to “cls”.

<ST_IN_SCRIPTING>"cls" {
    return T_CLASS;
}

<ST_IN_SCRIPTING>"class" {
    return T_CLASS;
}

Job done. Compile the code again and you can use “cls” instead of “class”.
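The reason no grammar change was needed can be shown with a small C sketch of what a lexer rule boils down to: two spellings, one token. This is only an illustration, not the code re2c generates. The names `word_to_token`, `TOK_CLASS` and `TOK_STRING` are hypothetical, and the comparison here is case-sensitive, whereas PHP itself treats keywords case-insensitively.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token ids for this sketch; the real ones come from Bison. */
enum token { TOK_STRING = 0, TOK_CLASS = 1 };

/* Map one already-split word to a token id.
 * Both spellings return the same id, so the parser cannot tell them apart. */
enum token word_to_token(const char *word)
{
    static const char *const class_spellings[] = { "class", "cls" };
    size_t i;

    for (i = 0; i < sizeof class_spellings / sizeof class_spellings[0]; i++) {
        if (strcmp(word, class_spellings[i]) == 0) {
            return TOK_CLASS;
        }
    }
    /* anything else is treated as a plain identifier in this sketch */
    return TOK_STRING;
}
```

Because the parser only ever sees the token id, every rule that mentions T_CLASS, including the semicolon form added earlier, accepts the new spelling for free.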

Wasn’t that fun? I hope you enjoyed it as much as I did. Play around, break it. If you really like it, think about closing some bugs on https://bugs.php.net/.

4 thoughts on “Hacking PHP syntax”

    1. Hello Theodore. Your question is more related to extending PHP than to hacking the Zend engine. The function “strpos” is part of the standard extension and is defined in “ext/standard/string.c” (grep for “PHP_FUNCTION(strpos)”). In this case I would rather create a new extension, define the “string_position” wrapper and call “php_strpos” from there. Extending PHP is quite well explained at Zend Devzone. Google for “writing php extension”; there are a few good articles. You can also find many examples in the “ext/” directory and on PECL. I hope that answers your question.
