What is GZIP Compression?

Introduction

GZIP compression is an extremely popular technique for compressing web content. From web pages to the text-based assets (CSS, JavaScript, and so on) referenced on them, GNU Zip’s lossless compression is used by over fifty percent of all websites on the Internet (source: https://w3techs.com/technologies/details/ce-gzipcompression). Despite GZIP’s popularity, its compression ratio is often worse than that of Brotli, a newer format that offers a modest improvement. As a result, GZIP’s adoption is slowly trending downward as websites move to more modern technologies.

That is not to say that GZIP, or other compression formats such as bz2 and xz, are going away: each has its own advantages and disadvantages. For example, gzip-compressed pages still offer slightly lower decompression times on the client, which can be useful for lower-end devices. Crucially, GZIP is much faster than Brotli at compressing on the server, particularly when Brotli runs at its higher compression levels, so lower-end servers will run better with the older technique.

As for bz2 and xz, their disadvantage is not their compression ratios (in fact, these are often better than gzip’s); the issue is that they take much longer to decompress client-side, which could ruin the end-user experience.
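The trade-off described above can be measured directly. The following is a minimal sketch using the gzip, bz2, and lzma (xz) modules from the Python standard library; the sample payload and the helper name measure are assumptions made for illustration, and the numbers will vary with the input and the machine.

```python
import bz2
import gzip
import lzma
import time

# Hypothetical payload: repetitive HTML-like text (real pages compress less well).
SAMPLE = ('<li class="item">Hello, world! Hello, bunny.net!</li>\n' * 2000).encode("utf-8")

CODECS = [
    ("gzip", gzip.compress, gzip.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("xz", lzma.compress, lzma.decompress),
]


def measure(name, compress, decompress, data):
    """Compress data, then time a single decompression of the result."""
    packed = compress(data)
    start = time.perf_counter()
    restored = decompress(packed)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if restored != data:
        raise ValueError(f"{name} did not round-trip the data")
    return {"codec": name, "bytes": len(packed), "decompress_ms": elapsed_ms}


if __name__ == "__main__":
    print(f"original: {len(SAMPLE)} bytes")
    for name, compress, decompress in CODECS:
        row = measure(name, compress, decompress, SAMPLE)
        print(f"{row['codec']:>5}: {row['bytes']} bytes, "
              f"{row['decompress_ms']:.3f} ms to decompress")
```

A single timing run is noisy; repeating the measurement and using realistic page content gives a fairer comparison.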

Outside of web technologies, GZIP is also commonly used for transferring files. You may have seen the large number of Linux packages that are distributed as “tarballs.” Their file names typically end in .gz (for example, .tar.gz), which indicates that the archive was compressed with GNU Zip (GZIP).
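As a small illustration of the tarball case, the sketch below builds a .tar.gz archive in memory with Python's standard tarfile module and then lists its contents. The file names and contents are made up for the example, and make_tarball and list_tarball are hypothetical helper names.

```python
import io
import tarfile


def make_tarball(files: dict) -> bytes:
    """Pack a {name: bytes} mapping into an in-memory gzip-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_tarball(blob: bytes) -> list:
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    blob = make_tarball({"README.txt": b"hello\n", "src/main.c": b"int main(void){return 0;}\n"})
    print(blob[:2].hex())      # gzip streams start with the magic bytes 1f 8b
    print(list_tarball(blob))
```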

The Purpose of Web Compression

Latency (which in this context refers to page load time) is an important metric to keep an eye on, because slow websites drive away traffic. It is measured in milliseconds (ms), and one way to lower it is to send less data: gzip-compressed text content is often several times smaller than the original file, and sometimes close to an order of magnitude smaller.

With gzip, or even Brotli, page load times tend to be far lower than if websites were sending uncompressed data. Smaller files mean less data to encrypt and transfer (reducing SSL overhead) and a shorter time to first render, so it is clear why compression is still used even with rising Internet speeds around the world.
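To put rough numbers on this, the sketch below compresses a sample page with gzip and estimates how long each version would take to transfer. The page content and the 8 Mbit/s link speed are assumptions, the estimate ignores round trips and protocol overhead, and the repetitive sample overstates the savings a real page would see.

```python
import gzip


def transfer_ms(size_bytes: int, bandwidth_mbps: float) -> float:
    """Estimate transfer time in ms over a link of bandwidth_mbps megabits per second."""
    bits = size_bytes * 8
    return bits / (bandwidth_mbps * 1_000_000) * 1000


# Hypothetical page body; real HTML is less repetitive than this.
page = ("<p>Hello, world! Hello, bunny.net!</p>\n" * 5000).encode("utf-8")
packed = gzip.compress(page)

if __name__ == "__main__":
    for label, blob in (("uncompressed", page), ("gzip", packed)):
        print(f"{label:>12}: {len(blob)} bytes, ~{transfer_ms(len(blob), 8):.1f} ms at 8 Mbit/s")
```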

How It Works

Given the nature of gzip’s lossless compression, one popular topic in computer science comes up: Huffman coding. After gzip identifies repetition in the content being transmitted (through the LZ77 algorithm), a Huffman code (tree) is built to compress the result further. Take the short string “Hello, world! Hello, bunny.net!” as an example:

First, the LZ77 algorithm replaces repeated sequences with references back to an earlier occurrence:

[Figure: How does lossless compression work]

(The second “Hello” is stored as a reference that points back to the first.)
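The back-reference idea can be sketched with a toy LZ77 tokenizer. This is a naive illustration, not gzip's actual matcher: it searches the whole preceding text for the longest match and emits either a literal character or a (distance, length) pair. The minimum match length of 3 mirrors DEFLATE; the function names are hypothetical.

```python
def lz77_tokens(text: str, min_match: int = 3, window: int = 32768) -> list:
    """Greedy toy LZ77: emit literals or (distance, length) back-references."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Extend the match while the earlier text agrees with the upcoming text.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def lz77_decode(tokens: list) -> str:
    """Rebuild the text by copying `length` characters from `distance` back."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(token)
    return "".join(out)


if __name__ == "__main__":
    message = "Hello, world! Hello, bunny.net!"
    tokens = lz77_tokens(message)
    print(tokens)  # the second "Hello, " appears as the pair (14, 7)
    print(lz77_decode(tokens) == message)
```

Here the pair (14, 7) means "go back 14 characters and copy 7", which reproduces the second "Hello, " without storing it again.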

Second, gzip’s final processing step (Huffman coding) is applied. Note that any character-level picture of this step is only a visualization, not an accurate representation of how gzip compresses; individual characters won’t be represented in this way.
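For readers who want to see a Huffman code being built, here is a minimal character-level sketch using Python's heapq. In keeping with the caveat above, this is a teaching illustration rather than gzip's real behavior: gzip assigns Huffman codes to literal/length and distance symbols produced by the LZ77 step, not to raw characters. The function name huffman_codes is hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a prefix-free code in which frequent symbols get shorter bit strings."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps heapq from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code beneath them.
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


if __name__ == "__main__":
    message = "Hello, world! Hello, bunny.net!"
    codes = huffman_codes(message)
    encoded = "".join(codes[ch] for ch in message)
    for sym, code in sorted(codes.items(), key=lambda item: len(item[1])):
        print(repr(sym), code)
    print(f"{len(encoded)} bits instead of {8 * len(message)}")
```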

With that complete, the data can be sent to the client as a binary stream. The client, or browser, then reverses the process: it reads the stream, decodes the Huffman codes using the tree, and finally expands the LZ77 references to yield the original, untouched content (HTML/images/videos/etc.).
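The round trip can be sketched with Python's standard library. The example below compresses a payload with gzip and then decodes it in small chunks, roughly as a client might while bytes arrive over the network. The chunk size and payload are assumptions, and decode_in_chunks is a hypothetical helper name.

```python
import gzip
import zlib

original = b"Hello, world! Hello, bunny.net!" * 100
stream = gzip.compress(original)


def decode_in_chunks(blob: bytes, chunk_size: int = 64) -> bytes:
    """Decompress a gzip stream incrementally, one chunk at a time."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    for offset in range(0, len(blob), chunk_size):
        parts.append(decoder.decompress(blob[offset:offset + chunk_size]))
    parts.append(decoder.flush())
    return b"".join(parts)


if __name__ == "__main__":
    restored = decode_in_chunks(stream)
    print(len(original), "->", len(stream), "bytes on the wire")
    print("lossless round trip:", restored == original)
```

Because the compression is lossless, the decoded bytes are identical to the original payload.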

When GZIP Should Be Used

Given the speed of GZIP’s compression algorithm, it becomes clear that it is designed to run on virtually any client or server while still providing an acceptable level of compression for both static and dynamic content (live streams, etc.). In essence, GZIP works well with all types of content, while technologies such as bz2, xz, and Brotli work best with static content that does not change (e.g. videos, images, CSS, static HTML pages, JS).
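GZIP itself exposes this speed-versus-size choice through its compression level. The sketch below, which assumes a made-up sample of pseudo-random words, compresses the same data at levels 1, 6, and 9 and reports the size and time for each; a low level may suit content generated per request, while a high level may suit content that is compressed once and cached. The helper names are hypothetical and timings depend on the machine.

```python
import gzip
import random
import time


def make_sample(words: int = 20000, seed: int = 0) -> bytes:
    """Generate a reproducible, text-like sample from a small vocabulary."""
    rng = random.Random(seed)
    vocab = ["gzip", "brotli", "cache", "header", "stream",
             "client", "server", "bunny", "hello", "world"]
    return " ".join(rng.choice(vocab) for _ in range(words)).encode("utf-8")


def level_report(data: bytes, levels=(1, 6, 9)) -> list:
    """Return (level, compressed_size, elapsed_ms) for each gzip level."""
    rows = []
    for level in levels:
        start = time.perf_counter()
        packed = gzip.compress(data, compresslevel=level)
        elapsed_ms = (time.perf_counter() - start) * 1000
        rows.append((level, len(packed), elapsed_ms))
    return rows


if __name__ == "__main__":
    data = make_sample()
    print(f"original: {len(data)} bytes")
    for level, size, elapsed_ms in level_report(data):
        print(f"level {level}: {size} bytes in {elapsed_ms:.2f} ms")
```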

Conclusion

While usage of GNU Zip on the web is slowly trending downward, it still has many uses that will keep it around for years to come. Even with newer compression technologies, the fundamental limitations of lossless compression mean that a better compression ratio always comes as a trade-off against server-side and client-side processing time. Brotli, for example, performs nearly the same as GZIP for client-side decompression but can take orders of magnitude more time to compress server-side at its highest levels, making those levels realistic only for content that is compressed once and then cached for subsequent requests.

Glossary

Compression

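Compression involves running an algorithm to make a file, image, or other data smaller. There are two modes of compression: lossy and lossless. Lossless compression, which GZIP uses, allows the original data to be restored exactly; lossy compression discards some information to save more space.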

GZIP

GZIP stands for GNU Zip.