The rproxy protocol
Martin Pool
$Date: 2000/08/04 07:19:38 $

1. Introduction

rproxy is a set of compatible extensions to HTTP/1.1 that allow the efficient caching and update of frequently changed or dynamically generated resources.

The extensions are:

o A new encoding type, rsync, used in the Transfer-Encoding and TE headers.

o A binary encoding format for transferring signatures and differences relative to a cached file.

o A new request header, Rsync-signature.

o A new status code, 266 Delta, indicating that the response contains the encoded delta between an old instance and the new instance.

2. Principles of Operation

Four parties communicate over HTTP in the rproxy protocol:

o user agent

o decoding proxy

o encoding proxy

o origin server

These parties may be merged in some cases: for example, decoding may be handled by code inside the client, or encoding by code inside the origin server. For the time being the most common configuration will have them distinct.

HTTP proxies may be interposed at any point. Compliant HTTP proxies that do not understand the rproxy extensions will transparently pass through rproxy requests.

A typical request proceeds as follows:

o The client sends an HTTP request to a local rproxy.

o The local rproxy checks whether it has an old version of that document, or one like it, in cache.

o The local rproxy sends a `signature' of the old version across the Internet to a remote rproxy.

o The remote rproxy forwards the request to the origin server, which generates the content.

o The origin server returns the full content back to the remote rproxy.

o The remote rproxy computes the differences between the signature of the old cached document and the new version. It transmits a compressed description of those changes back to the client's rproxy.

o The client's rproxy applies the change list to its old cached copy, updates the cache, and forwards the regenerated document back to the client.

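The steps above can be sketched as a toy round trip. This is only an illustration under stated assumptions, not the rproxy wire format: the block size, the use of SHA-256 as the block digest, the tuple-based delta, and the function names make_signature, make_delta and apply_delta are all hypothetical. It also compares a digest at every byte offset instead of using a rolling checksum, which is simpler to read but slower.

```python
import hashlib

BLOCK = 4  # hypothetical block size, kept tiny for the demonstration


def make_signature(old: bytes, block: int = BLOCK) -> list:
    """Decoder side: digest each fixed-size block of the cached copy."""
    return [hashlib.sha256(old[i:i + block]).digest()
            for i in range(0, len(old), block)]


def make_delta(signature: list, new: bytes, block: int = BLOCK) -> list:
    """Encoder side: describe `new` using only the signature of the old copy."""
    index = {digest: n for n, digest in enumerate(signature)}
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        # Only full-length chunks are candidates for a block match.
        n = index.get(hashlib.sha256(chunk).digest()) if len(chunk) == block else None
        if n is None:
            literal.append(new[pos])  # no match here: emit one literal byte
            pos += 1
        else:
            if literal:  # flush pending literal data before the copy command
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", n))  # copy block n of the old copy
            pos += block
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block: int = BLOCK) -> bytes:
    """Decoder side: rebuild the new document from the cached copy and the delta."""
    out = bytearray()
    for op, arg in delta:
        out += old[arg * block:(arg + 1) * block] if op == "copy" else arg
    return bytes(out)


old = b"headAAAAfootBBBB"   # cached copy held by the local rproxy
new = b"headCCfootBBBB"     # fresh content from the origin server
delta = make_delta(make_signature(old), new)
rebuilt = apply_delta(old, delta)
```

Only the signature travels upstream and only the delta travels downstream; the encoder never needs the old document itself.
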
The result is that only a description of the old document and of the changes need be transferred across the slow network. The technique performs particularly well against web sites where the content is dynamically generated within an unchanging framework, such as news portals or bug-tracking databases.

3. Server generated signatures

When a server sends down an rsync-encoded file, it also includes the signature of that file as it should be returned to the server on future requests.

The prototype of rproxy left the responsibility for generating signatures with the client. This may cause patent problems, so the released version generates signatures on the server.

If the client indicates that it can accept the ``rsync'' transfer-encoding, then the server can decide to send content back in that encoding. If the client has an appropriate file

The server might also be intelligent about deciding what files to encode. For example, PNG and JPG image files are already compressed, and probably (yet to be tested) don't often change in a way that rsync can handle. Therefore, the server can choose never to send these files rsync-encoded, or to attach a signature to them.

The server might also be able to cache signatures for resources that change infrequently: pages that are updated every day, for example.

In general the resource being sent by the server may be dynamically generated, so the server will not know its signature until it has seen the whole resource. Since the resource may be of arbitrary length, it would not work well just to put the signature at the end, either: we want the client to be able to resume the transfer if it is interrupted part-way through.

Because signatures are always sent after the data they describe, the client always has at least as much data in its cache as it has signatures for. It should therefore be able to handle any response encoded relative to that signature.

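The ordering argument above can be illustrated with a small sketch. It assumes a hypothetical stream of tagged ("data", ...) and ("sig", ...) items, a truncated SHA-256 digest as the per-chunk signature, and the invented names interleave and receive; none of this is the real encoded format. The point is only that, wherever the transfer is cut, the receiver never holds a signature for data it has not received.

```python
import hashlib
from typing import Iterable, Iterator, List, Optional, Tuple


def chunk_signature(chunk: bytes) -> bytes:
    """Hypothetical per-chunk signature: a truncated SHA-256 digest."""
    return hashlib.sha256(chunk).digest()[:8]


def interleave(chunks: Iterable[bytes]) -> Iterator[Tuple[str, bytes]]:
    """Server side: emit each chunk, then the signature of that chunk."""
    for chunk in chunks:
        yield ("data", chunk)
        yield ("sig", chunk_signature(chunk))


def receive(stream: Iterator[Tuple[str, bytes]],
            stop_after: Optional[int] = None) -> Tuple[List[bytes], List[bytes]]:
    """Client side: collect items, optionally simulating an interrupted transfer."""
    data, sigs = [], []
    for n, (kind, payload) in enumerate(stream):
        if stop_after is not None and n >= stop_after:
            break  # connection dropped after `stop_after` items
        (data if kind == "data" else sigs).append(payload)
    return data, sigs


body = [b"aaa", b"bbb", b"ccc"]
# Interrupt after three items: two chunks of data, one signature.
data, sigs = receive(interleave(body), stop_after=3)
```

Because the server never has to know the whole signature before it starts sending, the same loop works for a resource of unknown length.
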
Non-terminating responses should be handled correctly by this approach.

It seems like the right place to put intelligence about signatures is upstream, and on the origin server if possible: it will know more than anyone about the structure and content of its resources.

The choice of which signature to send up is still left to the client. In future versions it might be useful for the server to give the client brief advice about this: perhaps an ``Rsync-Affinity'' header that suggests for what URLs this cached file should be used as an old file.

4. Format

Two binary formats are used in rproxy; both are carried within standard HTTP transactions.

The rproxy signature is a digest of the file that supports generating differences; signatures are stored in the cache in the decoder.

The rproxy encoded format is returned from the encoder to the decoder. In conjunction with the cached file, the encoded message regenerates the new file; it is conceptually similar to a ``diff'' or ``patch''.

4.1. Signatures

4.1.1. Choosing a signature size

The current implementation always limits the signature size to 512 bytes (before base64 encoding), by choosing a block size such that the cache file can be completely described in less than 512 bytes.

Choosing a signature size and block size involves all these factors:

o Longer signatures take up more space in the request. In particular, causing an additional packet in the request will increase latency.

o Smaller blocks are not much worse than larger blocks at encoding unchanged files because a sequence of blocks is very compressible.

With server-generated signatures the server may not know the length of the content until it has seen all of it. If it receives a Content-Length header, it can use that value to choose an optimal block length. Otherwise, it will have to use a default block length.

4.2. Encoded Differences

The Encoded Difference format is combined with the body of an old instance to generate a new instance and a new server-generated signature. The format is similar to the gdiff generic diff format, but is extended to allow the server-generated signature to be intermingled with the differences.

The signature is opaque to the client. The client need only store the signature and echo it back to the server on the next request in which it wishes to use this instance as the basis instance.

The encoded difference stream begins with a 32-bit magic version number. For the current version of the protocol, this is "gd\0x00\0x01".

The remainder of the encoded stream is a sequence of variable-length command packets. Each packet contains one of: inline code or signature data; an instruction to copy an extent of the basis instance; or a marker for the end of file. When executed in sequence by the decoding client, the commands reproduce the new instance as sent into the encoder.

Numbers are always transferred in network byte order.

5. Integration with HTTP

As of early 2000, the Internet as a whole is using something in between HTTP/1.0 and HTTP/1.1. Therefore rproxy is designed to require only HTTP/1.0 features, but to fit cleanly into the HTTP/1.1 draft.

rproxy fits into the RFC2616 definition of a ``non-transparent proxy'', in that it acts as both a client and a server, and modifies the requests and responses passing through it.

5.1. Methods

The method of the HTTP request is identified in the first line. The most common values are GET, POST, and PUT, but other methods are described in RFC2616.

In general, HTTP methods are passed through to the upstream server because they do not need intervention from the proxy server.

One exception is the CONNECT method, which requires the proxy to specifically implement connection tunnelling. This method is not supported at the moment, even if there is an upstream proxy.

5.2. Transfer-Encoding

rproxy implements a Transfer-Encoding, rather than a Content-Encoding, because it is an encoding transformation applied to the content for carriage through the network and not part of the content itself (RFC2616, section 3.6).

5.3. Via

RFC2616 states that the Via general-header field MUST be used by gateways and proxies to indicate the intermediate protocols and recipients between the user agent and the server on requests, and between the origin server and the client on responses.

rproxy adds a Via header to both responses and requests to show that it has handled the request. One instance is added for each proxy the request or response passes through. For example:

______________________________________________________________________
Via: 1.0 sanguine.linuxcare.com.au:3128 (rproxy/0.2pre)
______________________________________________________________________

5.4. Request from decoder

The decoder must specify that it will accept the ``rsync'' transfer-encoding format, with a header like this:

______________________________________________________________________
Accept-encoding: rsync
______________________________________________________________________

If the client has a cache file suitable for use as a basis, then it attaches the signature of that basis file in a base64-encoded header. It need not specify the URL of the basis file.

______________________________________________________________________
Rsync-signature: BASE64-DATA
______________________________________________________________________

The client may of course also accept other encodings, such as gzip. The server gets to decide what encoding it will use; for this example we assume it uses rsync.

The signature is visible to all servers in the chain, including the origin server. This means that programs running on the server have the chance to directly rsync-encode their output if they wish.

6. Proxy

The rsync proxy rproxy transparently integrates rsync encoding into an existing network. Communications between pairs of rproxies are rsync-encoded, so the client and server at either end need not know the protocol.

rproxy normally acts as an HTTP proxy similar to Squid: it takes requests from ``downstream'' clients, and passes them through to an upstream server. The proxy can either forward requests to an upstream proxy, or can send them direct to the origin server. Requests to the proxy must specify the host to which they are directed.

rproxy can also act as an ``HTTP accelerator'', where it stands directly in front of the origin server. The hostname in this case is implied: rproxy always forwards to the host it is configured to accelerate.

6.1. Direct connection

rproxy has to do a bit more work when it is making a direct connection. Firstly, it has to extract the hostname from the URL and look up its address. Secondly, it has to transform the full URL, like this

______________________________________________________________________
GET http://fox.uq.net.au/foo.txt HTTP/1.0
______________________________________________________________________

into a request for just the path:

______________________________________________________________________
GET /foo.txt HTTP/1.0
______________________________________________________________________

HTTP/1.1 asks that all servers be able to handle full URLs for local requests, but it seems less than half the servers currently on the net can do this. Many tend to give ``file not found'' errors if presented with the former request.

6.2. Chains of proxies

rproxy should play nicely with other proxies, and in fact will be happier this way. Other proxies can handle translation between HTTP and FTP, for example, whereas rproxy will simply refuse to access FTP directories.

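The request-line rewriting described in section 6.1, together with the refusal of non-HTTP URLs mentioned above, might look roughly like the following. This is a sketch rather than the rproxy source: the function name to_origin_request is hypothetical, the address lookup step is left out, and falling back to port 80 when the URL names no port is an assumption based on the usual HTTP default.

```python
from typing import Tuple
from urllib.parse import urlsplit


def to_origin_request(request_line: str) -> Tuple[str, int, str]:
    """Split a proxy-style request line into (host, port, origin-style request line)."""
    method, url, version = request_line.split()
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        # Only plain http URLs are handled; ftp and others are refused.
        raise ValueError("unsupported URL: " + url)
    path = parts.path or "/"          # an empty path means the root resource
    if parts.query:
        path += "?" + parts.query     # keep the query string with the path
    return parts.hostname, parts.port or 80, " ".join((method, path, version))


host, port, line = to_origin_request("GET http://fox.uq.net.au/foo.txt HTTP/1.0")
```

The host and port tell the proxy where to connect; the rewritten line is what it sends once connected.
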
As a convenience, rproxy can run without an upstream proxy, and will enter this mode by default if no -u option is specified.

6.3. Interaction with caching mechanisms

6.3.1. If-Modified-Since

The most common caching mechanism in use today is the If-Modified-Since (IMS) header, which specifies the date and size of the file in the proxy cache. For example,

______________________________________________________________________
hdr> If-Modified-Since: Thu, 16 Sep 1999 22:19:42 GMT; length=3085
______________________________________________________________________

rproxy never generates, modifies, or removes this header. It may be passed through from some other server in the chain to the origin server. If the origin server determines that the document has not been modified it will send back a 304 Not Modified response, which passes through rproxy.

If the cache that matched is upstream of the rproxy chain, then the response back through to the client will be rsync-encoded as usual.

6.3.2. No-Cache

rproxy ignores no-cache directives, because the situations in which they are normally used do not apply to rproxy: rproxy will always forward a request upstream even if it has the resource in cache, and rproxy guarantees always to return to the user agent a document identical to that provided by the origin server.
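
Pulling together sections 5.3, 5.4 and 6.3, the headers a decoding proxy puts on the request it forwards upstream could be assembled roughly as below. The function name decorate_request and the list-of-pairs header representation are hypothetical, and passing no-cache headers through unchanged is an assumption about what ``ignores'' means here; the header names and the Via value are the ones used in the examples above.

```python
import base64
from typing import List, Optional, Tuple

Header = Tuple[str, str]


def decorate_request(headers: List[Header],
                     signature: Optional[bytes],
                     via_entry: str) -> List[Header]:
    """Return the headers to forward upstream; the request is always forwarded."""
    # Client headers such as If-Modified-Since or no-cache directives are
    # copied through untouched: the proxy neither adds, edits nor drops them.
    out = list(headers)
    out.append(("Accept-encoding", "rsync"))
    if signature is not None:
        # A suitable basis file is cached: attach its signature, base64-encoded.
        out.append(("Rsync-signature", base64.b64encode(signature).decode("ascii")))
    out.append(("Via", via_entry))
    return out


client_headers = [
    ("If-Modified-Since", "Thu, 16 Sep 1999 22:19:42 GMT; length=3085"),
    ("Pragma", "no-cache"),
]
upstream = decorate_request(
    client_headers,
    signature=b"\x00\x01",  # placeholder bytes standing in for a real signature
    via_entry="1.0 sanguine.linuxcare.com.au:3128 (rproxy/0.2pre)",
)
```

When no basis file is cached, passing signature=None leaves out the Rsync-signature header while still advertising that the rsync encoding is accepted.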