232 lines
11 KiB
Markdown
232 lines
11 KiB
Markdown
|
# Puffin: A deterministic deflate re-compressor (for patching purposes)
|
|||
|
|
|||
|
## Table of Contents
|
|||
|
|
|||
|
[TOC]
|
|||
|
|
|||
|
## Puffin
|
|||
|
|
|||
|
Puffin is a deterministic deflate recompressor. It is mainly used for patching
|
|||
|
deflate compressed images (zip, gzip, etc.) because patches created from deflate
|
|||
|
files/streams are usually large (deflate has a bit-aligned format, hence,
|
|||
|
changing one byte in the raw data can cause the entire deflate stream to change
|
|||
|
drastically.)
|
|||
|
|
|||
|
Puffin has two tools (operations): `puffdiff` and `puffpatch` (shown
|
|||
|
[here](puffin).) The purpose of `puffdiff` operation is to create a patch
|
|||
|
between a source and a target file with one or both of them having some deflate
|
|||
|
streams. This patch is used in the `puffpatch` operation to generate the target
|
|||
|
file from the source file deterministically. The patch itself is created by
|
|||
|
`bsdiff` library (but can be replaced by any other diffing mechanism). But,
|
|||
|
before it uses `bsdiff` to create the patch, we have to transform both the
|
|||
|
source and target files into a specific format so `bsdiff` can produce smaller
|
|||
|
patches. We define `puff` operation to perform such a transformation. The
|
|||
|
resulting stream is called a `puff` stream. The reverse transformation (from
|
|||
|
`puff` stream to deflate stream) is called a `huff` operation. `huff` is used in
|
|||
|
the client to transform the `puff` stream back into its original deflate stream
|
|||
|
deterministically.
|
|||
|
|
|||
|
![puffin](images/puffin.png "Puffin architecture")
|
|||
|
|
|||
|
For information about deflate format see [RFC 1951].
|
|||
|
|
|||
|
## puff and huff
|
|||
|
|
|||
|
`puff` is an operation that decompresses only the Huffman part of the deflate
|
|||
|
stream and keeps the structure of the LZ77 coding unchanged. This is roughly
|
|||
|
equivalent of decompressing ‘half way’.
|
|||
|
|
|||
|
`huff` is the exact opposite of `puff` and it deterministically converts the
|
|||
|
`puff` stream back to its original deflate stream. This deterministic conversion
|
|||
|
is based on two facts:
|
|||
|
|
|||
|
* There is no need to perform LZ77 algorithm. This means the deflate stream
|
|||
|
could be built by any LZ77 algorithm.
|
|||
|
* The dynamic Huffman tables can be recreated uniquely from the code length
|
|||
|
array stored inside the `puff` stream. This means the deflate stream can be
|
|||
|
reconstructed deterministically using the Huffman tables.
|
|||
|
|
|||
|
The inclusion of Huffman tables in the `puff` stream has minuscule space burden
|
|||
|
(on average maximum 320 bytes for each block. There is roughly one block per
|
|||
|
32KB of uncompressed data).
|
|||
|
|
|||
|
`bsdiff` of two `puffed` streams has much smaller patch in comparison to their
|
|||
|
deflate streams, but it is larger than uncompressed streams.
|
|||
|
|
|||
|
**The major benefits**
|
|||
|
|
|||
|
* Size of the patch is smaller than deflate stream’s patch.
|
|||
|
* `puff` and `huff` are deterministic operations.
|
|||
|
* `huff` is in order of 10X faster than full recompression. It is even faster
|
|||
|
than Huffman algorithm because it already has the Huffman tables and does
|
|||
|
not need to reconstruct it.
|
|||
|
* Both algorithms have very low memory footprint.
|
|||
|
* The underlying LZ77 algorithm can be anything (as long as it is deflate
|
|||
|
compatible). This includes google’s [zopfli]
|
|||
|
|
|||
|
**The drawbacks**
|
|||
|
|
|||
|
* The need to define a file format for the puffed stream and stay with this
|
|||
|
format forever. If format needs to be changed in the future, then some
|
|||
|
versioning mechanism should be there to handle it and backward compatibility
|
|||
|
should be maintained.
|
|||
|
* The need to define a payload format and stay with it forever. Similarly
|
|||
|
needs to be versioned if required later change.
|
|||
|
* Does not reduces the patch size as full recompression.
|
|||
|
|
|||
|
|
|||
|
## puffdiff and puffpatch
|
|||
|
|
|||
|
A deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at
|
|||
|
different locations. When this file is puffed, the resulting `puff` stream
|
|||
|
contains both puffs and the raw data that existed between the deflates in the
|
|||
|
compressed file. When performing `huff` operation, the location of puffs in the
|
|||
|
`puff` stream and deflates in the deflate stream should be passed upon. This is
|
|||
|
necessary as `huff` operation has to know exactly where the locations of both
|
|||
|
puffs and deflates are. (See the following [image](puffin-stream))
|
|||
|
|
|||
|
![puffin-stream](images/puffin-stream.png "Puffin stream")
|
|||
|
|
|||
|
Similarly `puffpatch` requires deflates and puffs locations in both the source
|
|||
|
and target streams. These location information is saved using Protobufs in the
|
|||
|
patch generated by `bsdiff`. One requirement for these two operations are that
|
|||
|
`puffpatch` has to be efficient in the client. Client devices have limited
|
|||
|
memory and CPU bandwidth and it is necessary that each of these operations are
|
|||
|
performed with the most efficiency available. In order to achieve this
|
|||
|
efficiency a new operation can be added to `bspatch`, that reads and writes into
|
|||
|
a `puff` streams using special interfaces for puffing and huffing on the fly.
|
|||
|
|
|||
|
|
|||
|
## Puffin Patch Format
|
|||
|
|
|||
|
* Magic (4 bytes) - The string literal "PUF1".
|
|||
|
* Header Length (4 bytes) - The size of the header (length of the generated
|
|||
|
Protobuf).
|
|||
|
* Header - Lengths and locations of deflate and puffs streams in source and
|
|||
|
target files in a Protobuf format. See [puffin.proto].
|
|||
|
* Patch - This is a binary array directly generated by the `bsdiff` program.
|
|||
|
|
|||
|
## Puffin Stream Format
|
|||
|
|
|||
|
* [Block Header](#block-header-format) (3+ bytes) - Defines the type of the
|
|||
|
block.
|
|||
|
* Data - A mix of literals list and copy length/distances.
|
|||
|
* End of Block (2 bytes) - The end of block symbol. Similar to the symbols for
|
|||
|
copy length/distance but without the distance bytes. The actual copy length
|
|||
|
value is 259 (0x81FF).
|
|||
|
|
|||
|
### Block Header Format
|
|||
|
|
|||
|
* Length (2 Bytes) - The size of the block header excluding the two Length
|
|||
|
bytes itself - 1.
|
|||
|
* Final Block (1 Bit) - Whether the block is the last one or not.
|
|||
|
* 1 - Final block
|
|||
|
* 0 - Middle block
|
|||
|
* Type (2 Bits)
|
|||
|
* 0 - Uncompressed. Immediately after the header, zero or one literals
|
|||
|
list is present which defines the content of the uncompressed blocks.
|
|||
|
* 1 - Fixed Huffman table. The list of literals and length/distances comes
|
|||
|
immediately after the header. Fixed Huffman table is defined the deflate
|
|||
|
specification and will not change.
|
|||
|
* 2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of
|
|||
|
the block header.
|
|||
|
* 3 - Undefined. This is an error.
|
|||
|
* Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its
|
|||
|
value is zero). In an uncompressed block, the [RFC 1951] skips any bits
|
|||
|
after reading final block and type bits until the byte boundary in the
|
|||
|
input. However, it does not define what the value of these bits should
|
|||
|
be. Most implementations assume 0, but some implementations may use any
|
|||
|
value. This segment contains those bits as a five-bits integer. When writing
|
|||
|
the block header back to the deflate format, the actual number of bits which
|
|||
|
where skipped will be calculated easily.
|
|||
|
* Huffman Table - It only comes for dynamic Huffman table.
|
|||
|
|
|||
|
#### Dynamic Huffman Table Format
|
|||
|
|
|||
|
Depending on the scheme for storing the Huffman tables, the payload size can
|
|||
|
change. We discovered that the following scheme does not produce the smallest
|
|||
|
payload, but it is the most deterministic one. In a deflate stream, Huffman
|
|||
|
tables for literals/length and distances are themselves Huffman coded. In this
|
|||
|
format, we also `puff` the Huffman tables into the `puff` stream instead of
|
|||
|
completely decompressing them.
|
|||
|
|
|||
|
There are three tables stored in this structure very similar to the one defined
|
|||
|
in [RFC 1951]. A Huffman table can be defined as an array of unsigned integer
|
|||
|
code length values. Three Puffed Huffman tables appear like the following
|
|||
|
scheme. The first table (codes) is the Huffman table used to decode the next two
|
|||
|
Huffman tables. The second Huffman table is used to decode literals and lengths,
|
|||
|
and the third Huffman table is used to decode distances.
|
|||
|
|
|||
|
|
|||
|
* Literal/Length Count (1 byte) - Number of alphabets used for literal/length
|
|||
|
Huffman codes - 257.
|
|||
|
* Distance Count (1 byte) - Number of alphabets used for distance Huffman
|
|||
|
codes - 1.
|
|||
|
* Alphabet Count (1 byte) - Number of alphabets for coding two previous
|
|||
|
Huffman tables - 4.
|
|||
|
* Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading
|
|||
|
the next two Huffman tables. Each byte contains two codes. If the number of
|
|||
|
codes is odd, the last four Bits will be zero.
|
|||
|
* Literal/Length/Distance Code Lengths - List of code lengths for encoding
|
|||
|
literals/lengths/distance followed The encoding is as follows:
|
|||
|
* [0..15] - Represent code [0..15]
|
|||
|
* [16..19] - Copy the last code length [3..6] times.
|
|||
|
* [20..27] - Repeat code length of 0 for [3..10] times.
|
|||
|
* [28..155] - Repeat code length of 0 for [11..138] times.
|
|||
|
|
|||
|
### Literals List
|
|||
|
Literals lists are constructed by a “length” value followed by “length” bytes of
|
|||
|
literals. The Puffer should try to merge adjacent literals lists as much as
|
|||
|
possible into one literals list in the `puff` stream. This Is a length value
|
|||
|
followed by length bytes of literals (Even if there is only one literal.)
|
|||
|
|
|||
|
* Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not
|
|||
|
a copy length/distance).
|
|||
|
* Length - The number of literals that would follow in the list.
|
|||
|
* (7 Bits) Length = [1 .. 127], The value is: Length - 1
|
|||
|
* (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127,
|
|||
|
Length - 127.
|
|||
|
Conserves size by using only one byte if the number of upcoming
|
|||
|
literals is smaller or equal to 127 (About 99% of literals length in a
|
|||
|
normal deflate stream fit into this category.) We should never have zero
|
|||
|
length literals. Otherwise it will use three bytes.
|
|||
|
* Literals (Length Bytes) - A sequence of Length number of literals.
|
|||
|
|
|||
|
### Copy Length/Distance
|
|||
|
|
|||
|
This Is a Length value followed by a Distance value.
|
|||
|
|
|||
|
* Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance
|
|||
|
field.
|
|||
|
* Length - Conserves size by using only one byte if the length value is
|
|||
|
smaller or equal to 129. Otherwise two bytes are used.
|
|||
|
* (7 Bits) - Length = [3 .. 129], The value is: Length - 3
|
|||
|
* (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length -
|
|||
|
130:
|
|||
|
* Distance (2 Bytes) - Distance = [1 .. 32768], The value is:
|
|||
|
Distance - 1. The distance value as an unsigned integer.
|
|||
|
|
|||
|
## Building Puffin
|
|||
|
Currently Puffin is being used in both Android and Chrome OS and is built
|
|||
|
differently in each of them. There is also a Makefile build, but it is not
|
|||
|
comprehensive.
|
|||
|
|
|||
|
## Usage
|
|||
|
|
|||
|
Puffin builds an executable `puffin` which can be used to perform diffing and
|
|||
|
patching algorithms using Puffin format. To get the list of options available
|
|||
|
run:
|
|||
|
|
|||
|
```shell
|
|||
|
puffin --help
|
|||
|
```
|
|||
|
|
|||
|
It can also be used as a library (currently used by update_engine) that provides
|
|||
|
different APIs.
|
|||
|
|
|||
|
## References
|
|||
|
|
|||
|
[RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt
|
|||
|
[puffin.proto]: /src/puffin.proto
|
|||
|
[zopfli]: https://en.wikipedia.org/wiki/Zopfli
|