Parsing binary data in PHP, using the PCAP format as an example
Not every problem should be solved with PHP. For certain types of applications it’s simply more appropriate to select a different programming language. You might be glad to hear that parsing binary data is **NOT** one of them. If you don’t have to worry about microseconds and the biggest concern is development time, PHP might in fact be a very good tool for the job.
There are many different file formats which we potentially could investigate but for this post I will focus only on PCAP. It’s easy, fun to work with and holds an interesting kind of data.
The libpcap file format (full name) is the main capture format used in TcpDump/WinDump, Wireshark/TShark, snort, and many other networking tools. If you are not familiar with any of the above names, those programs are packet analysers (or network sniffers).
The format is well documented on the Wireshark website. It consists of 3 different data structures: a global header, a packet header and the packet data.
Packet data is an array of chars (in PHP it’s a string) and doesn’t have a defined size. It’s encapsulated by a packet header which defines its length.
A very simple (and useless) parser would have to find the first packet header, figure out the length of the packet data and move to the next packet header (which lies immediately after the data section).
Before we get to the first packet header we need to read the global header. It begins at the first byte and holds some basic information about the file.
The global header structure is defined as follows:
```c
typedef struct pcap_hdr_s {
    guint32 magic_number;   /* magic number */
    guint16 version_major;  /* major version number */
    guint16 version_minor;  /* minor version number */
    gint32  thiszone;       /* GMT to local correction */
    guint32 sigfigs;        /* accuracy of timestamps */
    guint32 snaplen;        /* max length of captured packets, in octets */
    guint32 network;        /* data link type */
} pcap_hdr_t;
```
If you are not familiar with C, please allow me to explain what the above notation means. The simplest way to think about a C structure is to imagine a PHP class with public attributes and no methods. A PHP translation can look like this:
```php
class pcap_hdr_t {
    public $magic_number;   /* magic number */
    public $version_major;  /* major version number */
    public $version_minor;  /* minor version number */
    public $thiszone;       /* GMT to local correction */
    public $sigfigs;        /* accuracy of timestamps */
    public $snaplen;        /* max length of captured packets, in octets */
    public $network;        /* data link type */
}
```
The biggest difference is that in PHP our attributes can hold any kind of data while in C they have strictly defined type and length.
- guint32 – Unsigned Integer 4 bytes long (32 bits)
- guint16 – Unsigned integer 2 bytes long (16 bits)
- gint32 – Signed integer 4 bytes long (32 bits)
Based on this information we can be sure that:
- the global header is exactly 24 bytes long
- the first 4 bytes are the “magic_number” and should be interpreted as an unsigned integer
- the following 2 bytes are the “version_major” and should be interpreted as an unsigned integer
- and so on… I’m sure you can recognise the pattern
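The 24-byte figure is just the sum of the field sizes, and the same table also gives the offset of every field. Here is a small sketch of that arithmetic; `$fieldSizes` and `$offsets` are names made up for this illustration, and the sizes come straight from the C structure above.

```php
<?php
// Field sizes in bytes, taken from the pcap_hdr_s structure.
// $fieldSizes is a hypothetical name used only for this illustration.
$fieldSizes = [
    'magic_number'  => 4, // guint32
    'version_major' => 2, // guint16
    'version_minor' => 2, // guint16
    'thiszone'      => 4, // gint32
    'sigfigs'       => 4, // guint32
    'snaplen'       => 4, // guint32
    'network'       => 4, // guint32
];

// Walk the fields in order and record the byte offset at which each one starts.
$offsets = [];
$offset  = 0;
foreach ($fieldSizes as $name => $size) {
    $offsets[$name] = $offset;
    printf("%-13s starts at byte %2d and is %d bytes long\n", $name, $offset, $size);
    $offset += $size;
}

// The header size is the sum of all field sizes.
$headerSize = array_sum($fieldSizes);
printf("Total header size: %d bytes\n", $headerSize);
```

The offsets are only valid because the structure has no padding between its fields, which holds for this particular layout.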
Let’s create an example pcap file on which we are going to experiment. The easiest way is to run tcpdump:
```
$ tcpdump -i eth0 -w example.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
```
(Make sure you specify a network interface with “-i eth0”. tcpdump can sniff on all interfaces, but the output file will then be saved as PCAP-NG instead of PCAP.)
Generate some network traffic by visiting any website or pinging google.com, then switch back to the terminal and press CTRL+C.
```
^C118 packets captured
118 packets received by filter
0 packets dropped by kernel
```
Run ls to make sure there is some data in the file.
```
$ ls -la
total 56
drwxr-xr-x   3 lukasz  staff    102  9 Mar 01:39 .
drwxr-xr-x  62 lukasz  staff   2108  9 Mar 01:38 ..
-rw-r--r--   1 lukasz  staff  25689  9 Mar 01:39 example.pcap
```
Now let’s open the file in PHP and read the first 24 bytes of data.
```php
<?php

$fh = fopen( 'example.pcap', 'rb');

if( ! $fh ) {
    throw new Exception( "Can't open the PCAP file" );
}

$buffer = fread( $fh, 24 );

fclose( $fh );
```
Please notice the ‘b’ flag passed to the fopen() function. It forces binary mode, so PHP will not try to be clever about the data (for example by translating line endings on Windows).
Now we have raw data in the buffer and we need to translate it into something that makes sense. In C one could simply cast the buffer to the desired structure (although it’s a discouraged practice because it makes the code less portable). In PHP (and in Perl, from which the function was borrowed) you can use unpack().
unpack() takes a format string and a binary string and translates the data into an array of values. To start simple, let’s read only the first 4 bytes, which represent the magic_number.
```php
<?php

/* ... */

$buffer = unpack( "Nmagic_number", fread( $fh, 24 ) );
print_r( $buffer );
```
The function should return something like this:
```
Array
(
    [magic_number] => 3569595041
)
```
You probably noticed the strange “N” character before “magic_number” in the format passed to unpack(). It tells the function how to parse the binary string. “N” stands for “unsigned long (always 32 bit, big endian byte order)”. You can find all codes in the pack() documentation.
Code | Description |
---|---|
a | NUL-padded string |
A | SPACE-padded string |
h | Hex string, low nibble first |
H | Hex string, high nibble first |
c | signed char |
C | unsigned char |
s | signed short (always 16 bit, machine byte order) |
S | unsigned short (always 16 bit, machine byte order) |
n | unsigned short (always 16 bit, big endian byte order) |
v | unsigned short (always 16 bit, little endian byte order) |
i | signed integer (machine dependent size and byte order) |
I | unsigned integer (machine dependent size and byte order) |
l | signed long (always 32 bit, machine byte order) |
L | unsigned long (always 32 bit, machine byte order) |
N | unsigned long (always 32 bit, big endian byte order) |
V | unsigned long (always 32 bit, little endian byte order) |
f | float (machine dependent size and representation) |
d | double (machine dependent size and representation) |
x | NUL byte |
X | Back up one byte |
Z | NUL-padded string (new in PHP 5.5) |
@ | NUL-fill to absolute position |
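To get a feel for a few of these codes, it helps to run them against hand-written bytes. The snippet below is only an illustration with made-up input; it reads the same two bytes with “n”, “v” and “H” to show how the code changes the interpretation.

```php
<?php
// Two hand-written bytes used as input for several format codes.
$twoBytes = "\x01\x02";

// C: a single unsigned char (one byte).
$char = unpack('Cvalue', "\x2a")['value'];

// n: unsigned 16-bit, big endian -> 0x0102.
$big = unpack('nvalue', $twoBytes)['value'];

// v: unsigned 16-bit, little endian -> 0x0201.
$little = unpack('vvalue', $twoBytes)['value'];

// H4: four hex digits (two bytes), high nibble first.
$hex = unpack('H4value', $twoBytes)['value'];

printf("C: %d, n: %d, v: %d, H4: %s\n", $char, $big, $little, $hex);
```

The text that follows each code (“value” here) becomes the key in the returned array.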
If you want to unpack more than one value you have to use the “/” character as a separator. To read the full header you can do:
```php
<?php

/* "l" uses machine byte order; pack() has no signed 32-bit little-endian code */
$buffer = unpack( "Nmagic_number/vversion_major/vversion_minor/lthiszone/Vsigfigs/Vsnaplen/Vnetwork", fread( $fh, 24 ) );
print_r( $buffer );
```
That will return all of the header’s values.
```
Array
(
    [magic_number] => 3569595041
    [version_major] => 2
    [version_minor] => 4
    [thiszone] => 0
    [sigfigs] => 0
    [snaplen] => 65535
    [network] => 1
)
```
You might have noticed that there are 3 different types for a 32 bit long:
- machine byte order
- big endian byte order
- little endian byte order
There are two ways of storing a number which is longer than a byte. For example, the number 2864434397 can be encoded as 0xAA 0xBB 0xCC 0xDD (big endian) or as 0xDD 0xCC 0xBB 0xAA (little endian). Different platforms might prefer a different order. If you want to stick to your platform’s default you can use the “machine byte order” codes, whose result depends on where the script runs.
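You can see the difference yourself by packing that number with each code and printing the resulting bytes as hex. This is a small sketch that assumes a 64-bit PHP build, where 2864434397 fits into a regular integer.

```php
<?php
// 2864434397 is 0xAABBCCDD; this assumes 64-bit PHP so the value is an int.
$number = 2864434397;

// N: 32-bit big endian, V: 32-bit little endian, L: 32-bit machine byte order.
$bigEndian    = bin2hex(pack('N', $number));
$littleEndian = bin2hex(pack('V', $number));
$machineOrder = bin2hex(pack('L', $number));

printf("big endian:    %s\n", $bigEndian);
printf("little endian: %s\n", $littleEndian);
printf("machine order: %s\n", $machineOrder);
```

The first two lines are always the same, while the third matches one of them depending on the CPU (little endian on x86).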
So why did I decide to use Little-endian for all values following the magic number?
The magic number in my PCAP is 3569595041. That’s 0xd4c3b2a1 in hex. The PCAP documentation states that:
The writing application writes 0xa1b2c3d4 with it’s native byte ordering format into this field. The reading application will read either 0xa1b2c3d4 (identical) or 0xd4c3b2a1 (swapped). If the reading application reads the swapped 0xd4c3b2a1 value, it knows that all the following fields will have to be swapped too.
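I read the magic number as big endian (“N”) and got the swapped value, so the file was written in little-endian order and all the following fields have to be read as little endian. A good parser should handle both cases, but for the sake of simplicity I will ignore the other possibility.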
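For the curious, handling both cases could look roughly like the sketch below. It picks the format codes after checking the magic number. The function name parsePcapGlobalHeader() and the extra “byte_order” key are hypothetical, and the code assumes 64-bit PHP (so 32-bit unsigned values stay integers). Since pack() has no signed 32-bit code with a fixed byte order, thiszone is read as unsigned and converted by hand.

```php
<?php
/**
 * Hypothetical helper: parse the 24-byte global header in either byte order.
 * Assumes 64-bit PHP.
 */
function parsePcapGlobalHeader(string $raw): array
{
    if (strlen($raw) !== 24) {
        throw new InvalidArgumentException('The global header must be exactly 24 bytes long');
    }

    // Read the magic number as big endian first, then decide.
    $magic = unpack('Nmagic', $raw)['magic'];

    if ($magic === 0xa1b2c3d4) {
        [$u16, $u32, $order] = ['n', 'N', 'big'];    // identical: big-endian file
    } elseif ($magic === 0xd4c3b2a1) {
        [$u16, $u32, $order] = ['v', 'V', 'little']; // swapped: little-endian file
    } else {
        throw new UnexpectedValueException(sprintf('Unknown magic number 0x%08x', $magic));
    }

    $header = unpack(
        "{$u32}magic_number/{$u16}version_major/{$u16}version_minor/"
        . "{$u32}thiszone/{$u32}sigfigs/{$u32}snaplen/{$u32}network",
        $raw
    );

    // thiszone is a signed 32-bit value: convert from unsigned manually.
    if ($header['thiszone'] >= 0x80000000) {
        $header['thiszone'] -= 0x100000000;
    }

    $header['byte_order'] = $order;

    return $header;
}

// Two synthetic headers with the same values, one per byte order.
$le = parsePcapGlobalHeader(pack('VvvVVVV', 0xa1b2c3d4, 2, 4, 0, 0, 65535, 1));
$be = parsePcapGlobalHeader(pack('NnnNNNN', 0xa1b2c3d4, 2, 4, 0, 0, 65535, 1));

printf("%s endian, version %d.%d, snaplen %d\n", $le['byte_order'], $le['version_major'], $le['version_minor'], $le['snaplen']);
printf("%s endian, version %d.%d, snaplen %d\n", $be['byte_order'], $be['version_major'], $be['version_minor'], $be['snaplen']);
```

Note that once the right byte order is used, magic_number always comes out as 0xa1b2c3d4. The packet headers would have to be read with the same “V” or “N” choice.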
At this moment you should have a script which can read and understand the first 24 bytes of a PCAP file. The next step is to read the first packet header.
```c
typedef struct pcaprec_hdr_s {
    guint32 ts_sec;    /* timestamp seconds */
    guint32 ts_usec;   /* timestamp microseconds */
    guint32 incl_len;  /* number of octets of packet saved in file */
    guint32 orig_len;  /* actual length of packet */
} pcaprec_hdr_t;
```
This structure is simpler and is always 16 bytes long. Now we have all the information we need to finish the parser.
```php
<?php

$fh = fopen( 'example.pcap', 'rb');

if( ! $fh ) {
    throw new Exception( "Can't open the PCAP file" );
}

/* Reading global header */
$buffer = unpack(
    "Nmagic_number/vversion_major/vversion_minor/lthiszone/Vsigfigs/Vsnaplen/Vnetwork",
    fread( $fh, 24 )
);

printf(
    "Magic number: 0x%s, Version: %d.%d, Snaplen: %d\n",
    dechex( $buffer['magic_number']),
    $buffer['version_major'],
    $buffer['version_minor'],
    $buffer['snaplen']
);

/* Reading packets */
$frame = 1;

/* Stop at the end of the file or on a header shorter than 16 bytes */
while( ( $data = fread( $fh, 16 ) ) !== false && strlen( $data ) === 16 ) {

    /* Read packet header */
    $buffer = unpack( "Vts_sec/Vts_usec/Vincl_len/Vorig_len", $data );

    /* Read packet raw data (fread() in PHP 8 rejects a length of 0) */
    $packetData = $buffer['incl_len'] > 0 ? fread( $fh, $buffer['incl_len'] ) : '';

    printf(
        "Frame: %d, Packetlen: %d, Captured: %d\n",
        $frame,
        $buffer['orig_len'],
        $buffer['incl_len']
    );

    $frame++;
}

fclose( $fh );
```
Processing PCAP is very straightforward. You have to read the 16 bytes of a packet header and get “incl_len” to find out where the next packet header starts.
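The other fields of the packet header are just as easy to use. As a small sketch, the hypothetical helper below (describePacketHeader() is a made-up name) turns ts_sec and ts_usec into a readable UTC timestamp and flags packets that were cut off by the snaplen. It assumes a little-endian file, like the rest of this post, and is fed a synthetic header instead of real capture data.

```php
<?php
/**
 * Hypothetical helper: decode a 16-byte packet header (little endian)
 * and add a human-readable UTC timestamp.
 */
function describePacketHeader(string $raw): array
{
    if (strlen($raw) !== 16) {
        throw new InvalidArgumentException('A packet header must be exactly 16 bytes long');
    }

    $header = unpack('Vts_sec/Vts_usec/Vincl_len/Vorig_len', $raw);

    // Seconds since the Unix epoch plus a microsecond fraction.
    $header['time'] = sprintf(
        '%s.%06d',
        gmdate('Y-m-d H:i:s', $header['ts_sec']),
        $header['ts_usec']
    );

    // Fewer bytes saved than seen on the wire means the packet was truncated.
    $header['truncated'] = $header['incl_len'] < $header['orig_len'];

    return $header;
}

// Synthetic header: one day after the epoch plus 250 ms, 60 of 1514 bytes captured.
$example = describePacketHeader(pack('VVVV', 86400, 250000, 60, 1514));

printf(
    "%s, captured %d of %d bytes%s\n",
    $example['time'],
    $example['incl_len'],
    $example['orig_len'],
    $example['truncated'] ? ' (truncated)' : ''
);
```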
Although the parser doesn’t do much, this example should give you a good understanding of how to deal with binary data in PHP. If you would like to push it further and find out an IP address or a TCP payload, open the PCAP file in Wireshark and try to figure it out. If you need some extra help, have a look at this great article about programming with libpcap. There you will find all the structures you need (sniff_ethernet, sniff_ip and sniff_tcp) with a solid explanation.
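As a starting point, here is a rough sketch of how the packet data could be unpacked to get the IP addresses. It is not a complete decoder: it assumes the link type is Ethernet (network = 1 in the global header), that there is no VLAN tag, and that the packet carries IPv4. The function name parseEthernetIpv4() is made up for this example, and 64-bit PHP is assumed.

```php
<?php
/**
 * Hypothetical helper: read the Ethernet and IPv4 headers of one packet.
 * Returns null when the packet is too short or is not IPv4.
 */
function parseEthernetIpv4(string $packetData): ?array
{
    // 14 bytes of Ethernet header plus at least 20 bytes of IPv4 header.
    if (strlen($packetData) < 34) {
        return null;
    }

    // Ethernet: destination MAC, source MAC, EtherType (network byte order).
    $eth = unpack('H12dst_mac/H12src_mac/nethertype', $packetData);

    if ($eth['ethertype'] !== 0x0800) { // 0x0800 = IPv4
        return null;
    }

    // Fixed part of the IPv4 header; all multi-byte fields are big endian.
    $ip = unpack(
        'Cversion_ihl/Ctos/ntotal_length/nid/nfrag/Cttl/Cprotocol/nchecksum/Nsrc/Ndst',
        substr($packetData, 14, 20)
    );

    return [
        'src_mac'       => $eth['src_mac'],
        'dst_mac'       => $eth['dst_mac'],
        'version'       => $ip['version_ihl'] >> 4,          // upper nibble
        'header_length' => ($ip['version_ihl'] & 0x0f) * 4,  // lower nibble, in 32-bit words
        'protocol'      => $ip['protocol'],                  // 6 = TCP, 17 = UDP
        'src_ip'        => long2ip($ip['src']),
        'dst_ip'        => long2ip($ip['dst']),
    ];
}

// Synthetic packet: Ethernet header followed by a minimal IPv4 header.
$packet = pack('H12H12n', 'ffffffffffff', '001122334455', 0x0800)
        . pack('CCnnnCCnNN', 0x45, 0, 20, 0, 0, 64, 6, 0, ip2long('192.168.0.1'), ip2long('10.0.0.2'));

$info = parseEthernetIpv4($packet);
printf("%s -> %s, protocol %d\n", $info['src_ip'], $info['dst_ip'], $info['protocol']);
```

In the parser above you would pass $packetData to such a function inside the loop. The TCP header starts header_length bytes after the beginning of the IP header.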
As you can see, PHP is capable of doing more than generating dynamic HTML. It’s obviously not as fast as C, but there are many occasions when that’s not a big problem.