Even in today’s world, with fast networks and almost unlimited storage, data compression is still relevant, especially for mobile devices and countries with poor Internet connections. This post covers the de facto lossless compression method for text data on websites: GZIP.

Morse code is one of the first lossless compression standards. More frequent letters get shorter codes.

GZIP compression

GZIP provides lossless compression, that is, we can recover the original data when decompressing it. It is based on the DEFLATE algorithm, a combination of LZ77 and Huffman coding.

The LZ77 algorithm replaces repeated occurrences of data with references. Each reference has two values: the length of the repeated data and the jump, i.e. how far back in the text the previous occurrence starts. Let’s see an example:

Original text: "ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects" (81 bytes)
LZ77: "ServerGrove, the PHP hosting company, p<3,32>ides<9,26>solutions for<5,52><3,35>jects" (73 bytes, assuming that each reference is 3 bytes)

As you can see, the strings “ hosting ” and “ PHP ” are repeated, so the second time a substring is found it is replaced by a reference. There are other repetitions too, like “er”, but as there would not be any gain in replacing them, the original text is kept.
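To make the references more concrete, here is a minimal PHP sketch that decodes the example above. The token format (literal strings mixed with [length, jump] pairs) is invented for illustration; it is not the actual DEFLATE bitstream.

<?php

// Minimal LZ77-style decoder: each reference copies $length bytes starting
// $jump positions back in the already-decoded output.
function lz77Decode(array $tokens)
{
    $output = '';
    foreach ($tokens as $token) {
        if (is_string($token)) {
            $output .= $token; // literal data
            continue;
        }
        list($length, $jump) = $token; // back-reference
        $start = strlen($output) - $jump;
        for ($i = 0; $i < $length; $i++) {
            $output .= $output[$start + $i]; // byte by byte, so overlapping copies work
        }
    }

    return $output;
}

echo lz77Decode([
    'ServerGrove, the PHP hosting company, p', [3, 32], 'ides', [9, 26],
    'solutions for', [5, 52], [3, 35], 'jects',
]);
// ServerGrove, the PHP hosting company, provides hosting solutions for PHP projects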

Huffman coding is a variable-length coding method that assigns shorter codes to more frequent “characters”. The usual problem with variable-length codes is that we need a way to know where one code ends and the next one starts in order to decode the stream. Huffman coding solves this by creating a prefix code, in which no codeword is a prefix of another one. It is easier to understand with an example:

Original text: "ServerGrove"
ASCII encoding: "01010011 01100101 01110010 01110110 01100101 01110010 01000111 01110010 01101111 01110110 01100101" (88 bits)

ASCII is a fixed-length character encoding, so the letter “e”, which appears three times and is also the most frequent letter in the English language, takes as many bits as the letter “G”, which only appears once. Using this statistical information, Huffman coding can create a more optimized scheme:

Huffman: "1110 00 01 10 00 01 1111 01 110 10 00" (27 bits)

The Huffman method allows us to assign shorter codes to “e”, “r” and “v”, while “S” and “G” get the longest ones. Explaining how to build the Huffman tree is out of the scope of this post, but if you are interested I recommend checking this great video from Computerphile.
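As a quick check, here is a small PHP sketch that encodes “ServerGrove” with the code table from the example above (the exact codes depend on how the tree is built, so treat this as one possible assignment):

<?php

// One possible prefix code for "ServerGrove": frequent letters get short codes
$codes = [
    'e' => '00',   // 3 occurrences
    'r' => '01',   // 3 occurrences
    'v' => '10',   // 2 occurrences
    'o' => '110',  // 1 occurrence
    'S' => '1110', // 1 occurrence
    'G' => '1111', // 1 occurrence
];

$encoded = '';
foreach (str_split('ServerGrove') as $char) {
    $encoded .= $codes[$char];
}

var_dump($encoded);         // string(27) "111000011000011111011101000"
var_dump(strlen($encoded)); // int(27), instead of 88 bits in ASCII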

DEFLATE, the algorithm used for GZIP compression, is a combination of these two techniques.

Is GZIP the best compression method?

The answer is NO. There are other compression methods that achieve higher compression ratios, but there are a few good reasons to keep using GZIP.

First, even though GZIP is not the best compression method, it provides a good tradeoff between speed and ratio. Compressing and decompressing data with GZIP is fast, and the ratio is quite decent.

Second, it is not easy to introduce a new compression method that everyone can use. Browsers would need to be updated, which today is much simpler thanks to self-update mechanisms. However, browsers are not the only problem: Chromium tried to add support for bzip2, a better compression method based on the Burrows–Wheeler transform, but had to cancel it because some old intermediate proxies corrupted the data; they did not understand the bzip2 header and tried to gzip the contents anyway. You can see the bug report here.

GZIP + HTTP

The process between the client (browser) and the server to get the content gzipped is simple. If the browser supports GZIP/DEFLATE, it lets the server know through the “Accept-Encoding” request header. The server can then choose whether to send the contents gzipped or raw, and announces its choice in the “Content-Encoding” response header.
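A simplified exchange could look like this (the host and file are made up for illustration):

GET /css/style.css HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: text/css
Content-Encoding: gzip
Vary: Accept-Encoding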


Implementations

The DEFLATE specification provides some freedom to developers to implement the algorithm using different approaches, as long as the resulting stream is compatible with the specification.

GNU GZIP

The GNU implementation is the most common one and was designed as a replacement for the compress utility, free from patented algorithms. To compress a file using the GNU gzip utility:

$ gzip -c file.txt > file.txt.gz

There are 9 levels of compression, “1” being the fastest with the lowest compression ratio and “9” the slowest with the best compression ratio. By default, “6” is used. If we want maximum compression at the cost of using more memory and time, the -9 flag (or --best) can be used:

$ gzip -9 -c file.txt > file.txt.gz
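To get the original file back, decompress with the -d flag (or the gunzip command):

$ gzip -d file.txt.gz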

7-zip

7-zip implements the DEFLATE algorithm differently and usually achieves higher compression ratios. To compress a file with the maximum compression:

7z a -mx9 file.txt.gz file.txt

7-zip is also available on Windows and provides implementations of other compression methods such as 7z, XZ, BZIP2, ZIP and others.

Zopfli

Zopfli is ideal for one-time compression, for example in build processes where a file is compressed once and served many times. It is around 100x slower, but compresses around 5% better than the other compressors, while producing output that any standard GZIP/DEFLATE decompressor can read.
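Assuming the zopfli command-line tool is installed, usage is similar to gzip; by default it writes the compressed copy next to the original file:

$ zopfli file.txt

which should produce file.txt.gz while keeping the original file in place.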

Enabling GZIP compression

Apache

The mod_deflate module provides support for GZIP compression, so the response is compressed on the fly before being sent to the client over the network.

To enable it for text files, add this in your .htaccess file:

AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE application/rss+xml
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/x-javascript

There are some known bugs with older versions of some browsers, so it is also recommended to add these lines in case you support them:

BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
Header append Vary User-Agent

It is also possible to serve pre-gzipped files instead of compressing them on the fly for every request. This is especially useful for files that do not change between requests, such as JavaScript or CSS files, which can be compressed once using a slow algorithm and then served directly. In your .htaccess, include this:

RewriteEngine On
AddEncoding gzip .gz
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteRule ^(.*)$ $1.gz [QSA,L]

What we are doing here is telling Apache that files with the .gz extension should be served with the gzip encoding type (line 2), checking that the browser accepts gzip (line 3) and that the gzipped file exists (line 4), and finally appending .gz to the requested filename (line 5).
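The .gz files themselves can be generated at build or deploy time with any of the tools above, for example (the filename is just an example):

$ gzip -9 -c scripts.js > scripts.js.gz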

nginx

The ngx_http_gzip_module module allows compressing files with GZIP on the fly, while the ngx_http_gzip_static_module one allows sending precompressed files with the “.gz” filename extension instead of regular files.

An example configuration looks like this:

gzip            on;
gzip_min_length 1000;
gzip_types      text/plain application/xml;
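To serve the pre-compressed .gz files described in the Apache section, the static module only needs one extra directive (assuming nginx was compiled with ngx_http_gzip_static_module):

gzip_static on;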

GZIP + PHP

While it is not usually recommended to compress data with PHP, as it is slower than letting the web server do it, it is possible using the zlib module.

For example, to compress the jQuery minified library using the maximum compression:

<?php

$originalFile = __DIR__ . '/jquery-1.11.0.min.js';
$gzipFile = __DIR__ . '/jquery-1.11.0.min.js.gz';

// Read the original file into memory
$originalData = file_get_contents($originalFile);

// Compress with the maximum level (9) and write the gzipped file
$gzipData = gzencode($originalData, 9);
file_put_contents($gzipFile, $gzipData);

var_dump(filesize($originalFile)); // int(96380)
var_dump(filesize($gzipFile)); // int(33305)

Data can be decompressed with gzdecode(). The zlib module also defines a few stream wrappers to compress data.
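For example, the compress.zlib:// wrapper can read (and decompress) the gzipped file from the previous script transparently:

<?php

$gzipFile = __DIR__ . '/jquery-1.11.0.min.js.gz';

// The stream wrapper decompresses on the fly while reading
$decompressed = file_get_contents('compress.zlib://' . $gzipFile);

var_dump(strlen($decompressed)); // int(96380), the original size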

More information

Photo: Morse Code Straight Key J-38, by Anthony Catalano