Protocol Definition for HTTP rsync Encoding

Martin Pool

rproxy.samba.org


Introduction

HTTP rsync encoding is a set of compatible extensions to HTTP/1.1 that allow the efficient caching and update of frequently changed or dynamically generated resources. This protocol description should be read in conjunction with RFC2616 and the rsync in HTTP paper.

The major protocol extension is a new transfer coding identified by the token "rsync", used in the Transfer-Encoding and TE headers. This encoding has some associated encoding parameters and implied behaviour.

As for standard HTTP, rsync encoding is concerned with the transfer of resources, comprising content and metadata. These are normally retrieved from servers to clients (the GET and POST methods), but in some cases may be submitted from clients to servers (the PUT method).

The purpose of rsync encoding is to allow more efficient download of entity bodies when the recipient already has a related basis instance. The rsync algorithm is used to compute a binary delta that describes the changes from a signature of the basis instance to the new instance.

The technique performs particularly well against HTML pages where the content is dynamically generated within an unchanging framework, such as news portals or bug-tracking databases. It is applicable to any kind of content, and works well for resources that are in fact unchanged but not known to be so, and for interrupted transfers of dynamic or pseudo-dynamic data.

HTTP Transfer-Encoding is always performed hop-by-hop.

For a typical HTTP GET request, rsync encoding is used as follows.

  1. The client checks its cache for an old version of the requested document to be used as a basis, or one suitably related according to a client-defined heuristic.
  2. The client calculates the signature of the basis, or retrieves it from the cache.
  3. The client makes an HTTP request as usual, including the TE: rsync line.
  4. The upstream server, which may be either a proxy or an origin server, prepares the response. The upstream server rsync-encodes the response body relative to the supplied signature, and sends the delta back as the body of the response, with a Transfer-Encoding: rsync header.

Integration with HTTP

rsync coding in HTTP requires that the immediate client understand HTTP/1.1, as transfer-coding is only supported in that version of the protocol. However, proxies may support transfer coding on behalf of clients that do not themselves understand the encoding.

rproxy fits into the RFC2616 definition of a ``non-transparent proxy'', in that it acts as both a client and a server, and modifies the requests and responses passing through it.

Methods

The method of the HTTP request is identified in the first line. The most common values are GET, POST, and PUT, but other methods are described in RFC2616.

In general, HTTP methods are passed through to the upstream server because they do not need intervention from the proxy server.

One exception is the CONNECT method, which requires the proxy to specifically implement connection tunnelling. This method is not supported at the moment, even if there is an upstream proxy.

Transfer-Encoding

rproxy implements a Transfer-Encoding, rather than a Content-Encoding, because it is an encoding transformation applied to the content for carriage through the network and not part of the content itself (rfc2616 s3.6).

As noted in rfc2616 s3.6, the set of transfer-codings applied to a message body MUST include "chunked" unless the message is terminated by closing the connection. Chunked transfer coding SHOULD be used in conjunction with rsync coding.

Clients that wish to receive rsync-coded responses should include the request-header:

TE: rsync, chunked

rsync-coded responses should include a header such as

Transfer-Encoding: rsync, chunked
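
As an illustration of the server side of this negotiation, the following sketch checks whether a TE request-header lists the rsync coding. The function name accepts_rsync is hypothetical and is not taken from rproxy; the handling of a q=0 parameter follows the general TE syntax in RFC2616 rather than anything this document specifies.

```python
def accepts_rsync(te_header: str) -> bool:
    """Return True if a TE header value lists the "rsync" transfer-coding.

    Hypothetical helper: a coding listed with q=0 is treated as not acceptable.
    """
    for item in te_header.split(","):
        token, _, params = item.strip().partition(";")
        if token.strip().lower() != "rsync":
            continue
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparsable q-value: be conservative
        return quality > 0
    return False
```

A server that gets False back would simply send the response without rsync coding.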

Request from decoder

The decoder must specify that it will accept the ``rsync'' transfer-encoding format, with a header like this:

Accept-encoding: rsync

If the client has a cache file suitable for use as a basis, then it attaches the signature of that basis file in a base64-encoded header. It need not specify the URL of the basis file.

Rsync-signature: BASE64-DATA

The client may of course also accept other encodings, such as gzip. The server gets to decide what encoding it will use; for this example we assume it uses rsync.

The signature is visible to all servers in the chain, including the origin server. This means that programs running on the server have the chance to directly rsync-encode their output if they wish.
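
The request headers described in this section can be assembled along the following lines. This is a minimal sketch: build_request_headers is a hypothetical name, and the signature bytes are assumed to have been computed or fetched from the cache already.

```python
import base64
from typing import Dict, Optional


def build_request_headers(signature: Optional[bytes]) -> Dict[str, str]:
    """Return the extra headers a decoder adds to an outgoing request."""
    headers = {"Accept-encoding": "rsync"}
    if signature is not None:
        # The signature is binary, so it is base64-encoded for the header.
        headers["Rsync-signature"] = base64.b64encode(signature).decode("ascii")
    return headers
```

When no suitable basis is cached, the function is called with None and only the Accept-encoding header is sent.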

Proxy

The rsync proxy rproxy transparently integrates rsync encoding into an existing network. Communications between pairs of rproxies are rsync-encoded, so the client and server at either end need not know the protocol.

rproxy normally acts as an http proxy similar to squid: it takes requests from ``downstream'' clients, and passes them through to an upstream server. The proxy can either forward requests to an upstream proxy, or can send them direct to the origin server. Requests to the proxy must specify the host to which they are directed.

rproxy can also act as an ``http accelerator'', where it stands directly in front of the origin server. The hostname in this case is implied: rproxy always forwards to the host it's configured to accelerate.

Direct connection

rproxy has to do a bit more work when it's making a direct connection. Firstly, it has to extract the hostname from the URL and look up its address. Secondly, it has to transform the full URL, like this:

GET http://fox.uq.net.au/foo.txt HTTP/1.0

into a request for just the path:

GET /foo.txt HTTP/1.0

HTTP/1.1 asks that all servers be able to handle full URLs for local requests, but it seems less than half the servers currently on the net can do this. Many tend to give ``file not found'' errors if presented with the former request.
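
The rewriting of the request line for a direct connection might look like the sketch below. The function name to_origin_request and the default port of 80 are assumptions made for illustration; rproxy's own code is not shown in this document.

```python
from typing import Tuple
from urllib.parse import urlsplit


def to_origin_request(request_line: str) -> Tuple[str, int, str]:
    """Split a proxy-style request line into (host, port, rewritten line)."""
    method, url, version = request_line.split()
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError("only absolute http:// URLs are handled here")
    # Keep the path and any query string; drop the scheme and host.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    port = parts.port or 80  # assumed default when the URL names no port
    return parts.hostname, port, f"{method} {path} {version}"
```

The returned host and port are what the proxy would look up and connect to; the rewritten line is what it would send on that connection.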

Chains of proxies

rproxy should play nicely with other proxies, and in fact will be happier this way. Other proxies can handle translation between http and ftp, for example, whereas rproxy will simply refuse to access ftp directories.

As a convenience, rproxy can run without an upstream proxy, and will enter this mode by default if no -u option is specified.

Interaction with other caching mechanisms

If-Modified-Since

The most common caching mechanism in use today is the If-Modified-Since (IMS) header, which specifies the date and size of the file in the proxy cache. For example,

hdr> If-Modified-Since: Thu, 16 Sep 1999 22:19:42 GMT; length=3085

rproxy never generates, modifies, or removes this header. It may be passed through from some other server in the chain to the origin server. If the origin server determines that the document has not been modified it will send back a

304 Not Modified

response, which passes through rproxy.

If the cache that matched is upstream of the rproxy chain, then the response back through to the client will be rsync-encoded as usual.

No-Cache

rproxy ignores no-cache directives, because the situations in which they are normally used do not apply to rproxy: rproxy will always forward a request upstream even if it has the resource in cache, and rproxy guarantees always to return to the user-agent a document identical to that provided by the origin server.

Server generated signatures

When a server sends down an rsync-encoded file, it also includes the signature of that file as it should be returned to the server on future requests.

The prototype of rproxy left the responsibility for generating signatures with the client. This may cause patent problems, so the released versions generate signatures on the server.

If the client indicates that it can accept the ``rsync'' transfer-encoding, then the server can decide to send content back in that encoding. If the client has an appropriate file

The server might also be intelligent about deciding what files to encode. For example, PNG and JPG image files are already compressed, and probably (yet to be tested) don't often change in a way that rsync can handle. Therefore, the server can know never to send these files rsync-encoded, or to put a signature in them.

The server might also be able to cache signatures for resources that change infrequently: pages that are updated every day, for example.

In general the resource being sent by the server may be dynamically generated and of arbitrary length, so the server will not know its signature until it has seen the whole resource.

It would not work well just to put the signature at the end, either: we want the client to be able to resume the transfer if it is interrupted part-way through.

Because signatures are always sent after the data they describe, the client always has at least as much data in its cache as it has signatures for. It should therefore be able to handle any response encoded relative to that signature.

Non-terminating responses should be handled OK by this approach.

It seems like the right place to put intelligence about signatures is upstream, and on the origin server if possible: it will know more than anyone about the structure and content of its resources.

The choice of which signature to send up is still left to the client. In future versions it might be useful for the server to give the client brief advice about this: perhaps an ``Rsync-Affinity'' header that suggests for what URLs this cached file should be used as an old-file.

Format

Two binary formats are used in rproxy; both are carried within standard HTTP transactions. The rproxy signature is a digest of the file that supports generating differences; signatures are stored in the cache in the decoder.

The rproxy encoded format is returned from the encoder to the decoder. In conjunction with the cached file, the encoded message regenerates the new file; it is conceptually similar to a ``diff'' or ``patch''.

Signatures

signature    := HS_SIG_MAGIC block_len strong_len block_sig*
HS_SIG_MAGIC := 0x72730136 /* r s 1 6 */
block_len    := uint32
block_sig    := weak_sum strong_sum
weak_sum     := uint32
strong_sum   := byte[strong_len]
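
The grammar above can be exercised with a small packer and parser such as the following sketch. Two details are assumptions, since the grammar does not state them: integers are taken to be big-endian, and strong_len is taken to be a uint32 like block_len. The function names are hypothetical.

```python
import struct
from typing import List, Tuple

HS_SIG_MAGIC = 0x72730136
# magic, block_len, strong_len; byte order and strong_len width are assumed.
_HEADER = struct.Struct(">III")

BlockSig = Tuple[int, bytes]  # (weak_sum, strong_sum)


def pack_signature(block_len: int, strong_len: int,
                   blocks: List[BlockSig]) -> bytes:
    """Serialise a signature: header followed by zero or more block_sigs."""
    out = [_HEADER.pack(HS_SIG_MAGIC, block_len, strong_len)]
    for weak_sum, strong_sum in blocks:
        if len(strong_sum) != strong_len:
            raise ValueError("strong_sum has the wrong length")
        out.append(struct.pack(">I", weak_sum) + strong_sum)
    return b"".join(out)


def parse_signature(data: bytes) -> Tuple[int, int, List[BlockSig]]:
    """Parse a signature into (block_len, strong_len, block_sigs)."""
    magic, block_len, strong_len = _HEADER.unpack_from(data, 0)
    if magic != HS_SIG_MAGIC:
        raise ValueError("not an rproxy signature")
    entry = 4 + strong_len
    body = data[_HEADER.size:]
    if len(body) % entry:
        raise ValueError("truncated block_sig")
    blocks = []
    for off in range(0, len(body), entry):
        (weak_sum,) = struct.unpack_from(">I", body, off)
        blocks.append((weak_sum, body[off + 4:off + entry]))
    return block_len, strong_len, blocks
```

The sketch only covers the layout; computing the weak and strong sums themselves is outside its scope.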

Choosing a signature size

The current implementation always limits the signature size to 512 bytes (before base64 encoding), by choosing a block size such that the cache file can be completely described in less than 512 bytes.

Choosing a signature size and block size involves all these factors:

With server-generated signatures the server may not know the length of the content until it's seen all of it. If it receives a Content-Length header, it can use that value to choose an optimal block length. Otherwise, it will have to just use a default block length.
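
The arithmetic behind the 512-byte limit can be sketched as follows. The numbers are illustrative assumptions rather than rproxy's actual parameters: a 12-byte header (three 32-bit fields), a 4-byte weak sum, and a strong sum of 8 bytes per block.

```python
def choose_block_len(file_len: int, strong_len: int = 8,
                     max_sig: int = 512, header_len: int = 12) -> int:
    """Smallest block length whose signature fits within max_sig bytes."""
    max_blocks = (max_sig - header_len) // (4 + strong_len)
    if max_blocks < 1:
        raise ValueError("signature limit too small for even one block")
    # Ceiling division: enough bytes per block to cover the whole file.
    return max(1, -(-file_len // max_blocks))


def signature_size(file_len: int, block_len: int, strong_len: int = 8,
                   header_len: int = 12) -> int:
    """Size in bytes of the signature for a file split into block_len blocks."""
    n_blocks = -(-file_len // block_len)
    return header_len + n_blocks * (4 + strong_len)
```

Under these assumptions a 3085-byte file allows at most 41 blocks, giving a block length of 76 bytes and a 504-byte signature. When the content length is not known in advance, the server cannot run this calculation and falls back to a default block length, as noted above.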

Encoded Differences

The encoded difference stream contains 32-bit integers and bytes. These integers are delta-encoded: each is written into the stream as its true value minus the last ID value written. (TODO: give an example; explain why this is useful.)
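
One possible reading of the delta-encoding rule is sketched below. It assumes that the "last ID value written" is simply the true value of the previous integer and that the running value starts at zero; neither detail is confirmed by this document, and the function names are hypothetical.

```python
from typing import Iterable, List


def delta_encode(values: Iterable[int], start: int = 0) -> List[int]:
    """Replace each value with its difference from the previous true value."""
    last = start
    out = []
    for value in values:
        out.append(value - last)
        last = value
    return out


def delta_decode(deltas: Iterable[int], start: int = 0) -> List[int]:
    """Invert delta_encode by keeping a running sum."""
    last = start
    out = []
    for delta in deltas:
        last += delta
        out.append(last)
    return out
```

Under this reading the block numbers 10, 11, 12, 40 would be written as 10, 1, 1, 28, so a run of consecutive blocks turns into a run of small, repeated values.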

The encoded stream begins with an integer giving the protocol version number. The current version is 666.

The second integer is the block size.

The remainder of the stream is a sequence of variable-length packets introduced by a 32-bit integer type.

If the type is positive, the block from the original file with that number is inserted.

If the type is negative, its absolute value gives the length of a chunk of literal data following.

If the integer is zero, it is followed by literal data to be appended to the signature for this resource. The integer immediately following is the length of the signature chunk. This is followed by literal bytes of signature data.

Any number of signature chunks can appear in the stream in any positions convenient to the server. The client should concatenate the signature data and send it back unchanged on requests where it intends to use this instance as a basis.

The stream of differences is terminated by the end of the underlying protocol. This could be the end of the TCP connection in a simple HTTP request, or the end signified by a multipart/chunked or other encoded message.
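
The packet rules in this section can be illustrated with a small decoder. This is a sketch under several stated assumptions, not rproxy's implementation: integers are read as big-endian signed 32-bit values, block numbers are 1-based, and for simplicity each integer is taken at face value, so the delta-encoding of integers is not applied. The names apply_delta and _read_int are hypothetical.

```python
import struct
from typing import Tuple

PROTOCOL_VERSION = 666


def _read_int(stream: bytes, pos: int) -> Tuple[int, int]:
    """Read one signed 32-bit integer (big-endian is an assumption)."""
    if pos + 4 > len(stream):
        raise ValueError("truncated integer")
    (value,) = struct.unpack_from(">i", stream, pos)
    return value, pos + 4


def apply_delta(basis: bytes, stream: bytes) -> Tuple[bytes, bytes]:
    """Rebuild (new_file, new_signature) from a basis and an encoded stream."""
    version, pos = _read_int(stream, 0)
    if version != PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version")
    block_size, pos = _read_int(stream, pos)
    out, sig = bytearray(), bytearray()
    while pos < len(stream):  # the stream ends when the transport ends
        kind, pos = _read_int(stream, pos)
        if kind > 0:
            # Copy block number `kind` from the basis (1-based: assumption).
            start = (kind - 1) * block_size
            out += basis[start:start + block_size]
            continue
        if kind < 0:
            length, target = -kind, out      # literal data follows
        else:
            length, pos = _read_int(stream, pos)
            target = sig                     # signature chunk follows
        if length < 0 or pos + length > len(stream):
            raise ValueError("truncated or invalid chunk")
        target += stream[pos:pos + length]
        pos += length
    return bytes(out), bytes(sig)
```

The second return value is the concatenated signature data, which the client would store alongside the new instance and send back unchanged on later requests.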

Differences from rsync

rsync always generates signatures on the client, and therefore does not need to mix the signature in with the encoded differences.

rsync uses the token 0 to indicate the end of the file, rather than signature data. There is no need to do this in the rsync Transfer-Encoding, because RFC2616 requires that the encoding be terminated by the end of the connection.

rsync compares a strong checksum of the entire file on completion, and restarts the transmission if they mismatch. This makes the program even more safe against checksum collisions, but is not viable for dynamically generated content, as we can't restart a web transaction.


$Id: protocol.latte,v 1.7 2001/02/26 12:50:21 mbp Exp $

Copyright (C) 1999-2001 by Martin Pool.