DELTA HTTP AND RPROXY

Martin Pool

$Id: delta-http.txt,v 1.2 2000/12/15 00:05:45 mbp Exp $

A few other people are working on HTTP extensions with similar
goals.  This is documented in
internet-drafts/draft-mogul-http-delta-02.txt (or more recent
versions of that document).  They have a good summary of issues to
be considered:

 * Resources may have multiple variants which contain different
   content (e.g. different languages, different content-encodings),
   and it would be terrible to confuse them.

rproxy will probably have to be more careful about fitting into the
spirit of HTTP by the time it's properly released and documented.

They make the useful definition of

   instance
      The entity that would be returned in a status-200 response to
      a GET request, at the current time, for the selected variant
      of the specified resource, but without the application of any
      content-coding or transfer-coding.

in addition to the standard definitions from HTTP/1.1:

   resource
      A network data object or service that can be identified by a
      URI, as defined in section 3.2.  Resources may be available in
      multiple representations (e.g. multiple languages, data
      formats, size, resolutions) or vary in other ways.

   entity
      The information transferred as the payload of a request or
      response.  An entity consists of metainformation in the form
      of entity-header fields and content in the form of an
      entity-body, as described in section 7.

   variant
      A resource may have one, or more than one, representation(s)
      associated with it at any given instant.  Each of these
      representations is termed a `variant'.  Use of the term
      `variant' does not necessarily imply that the resource is
      subject to content negotiation.

The paper ``Potential Benefits of Delta Encoding and Data
Compression for HTTP'' is a good summary of why one might want
delta encoding, and of what's wrong with simple all-or-nothing
caches.  They refer to Banga et al., who tried to keep caches on
both ends, and to WebExpress.

There are some interesting references:

   Stephen Williams, Marc Abrams, Charles R. Standridge, Ghaleb
   Abdulla, and Edward A. Fox.  Removal Policies in Network Caches
   for World-Wide Web Documents.  In Proc. SIGCOMM '96, pages
   293-305.  Stanford, CA, August 1996.

   Barron C. Housel and David B. Lindquist.  WebExpress: A System
   for Optimizing Web Browsing in a Wireless Environment.  In Proc.
   2nd Annual Intl. Conf. on Mobile Computing and Networking, pages
   108-116.  ACM, Rye, New York, November 1996.
   http://www.networking.ibm.com/art/artwewp.htm

   Gaurav Banga, Fred Douglis, and Michael Rabinovich.  Optimistic
   Deltas for WWW Latency Reduction.  In Proc. 1997 USENIX Technical
   Conference, pages 289-303.  Anaheim, CA, January 1997.

draft-mogul-*-02 reckons that the only way to make sure that
HTTP/1.0 caches don't get the delta confused with the real thing is
to use a new 200-series response code.  It's an interesting idea,
but it is a shame to bend the server's response any more than is
necessary.  On the other hand, we already have response codes for
partial content and not-modified, so it's OK to put caching
information in there.

They prefer the `vdelta' differencing algorithm.  Perhaps it has
ideas that rsync can use.

We should perhaps listen to the ``Cache-Control'' header for
information about how long to retain the page.  But perhaps not:
people using that header today do so assuming all-or-nothing
replacement, and so their pages might not behave well under rproxy.

They reckon the server should use a tag like

   DCluster: "//bar.example.net/foo?"

to advise of clustering information, i.e. which cache entry to use.
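A minimal sketch of how a downstream cache might act on such a
hint, assuming the DCluster value is treated as a plain prefix on
request URLs.  The header comes from the draft, but the cache shape
and the function below are invented for illustration; this is not
rproxy code:

    # Hypothetical sketch only: choose a delta base from cache
    # entries in the cluster named by a DCluster hint.  `cache'
    # maps a URL to a (last_used_time, body) pair.

    def pick_base(cache, dcluster_prefix):
        """Return the URL of the most recently used cache entry
        that falls within the cluster, or None if nothing
        matches."""
        candidates = [(used, url)
                      for url, (used, _body) in cache.items()
                      if url.startswith(dcluster_prefix)]
        if not candidates:
            return None
        _used, best_url = max(candidates)
        return best_url

    # e.g. a request for //bar.example.net/foo?page=2 could be
    # delta-encoded against the cached //bar.example.net/foo?page=1.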
What if people try to use this to push other servers out of the
cache?  Could they even use it to interfere with content from other
sites?

They have to worry about unique ETags, and we don't.

Perhaps the server would like to fix the base instance, though it's
not terribly clear that it will ever know enough to do that.  The
most recently used data is probably enough.

   DTemplate: "http://bar.example.net/foo.tplt"

What if we did caching per block, and kept the blocks that were
most useful, even if they span several entities or instances?  That
would be pretty cool, though it's hard to see how we could build
that into server-side signatures.

We can't escape having naïve caches in the middle.  Consider people
who are required by their ISP or employer to use a proxy, but who
also want to run rproxy on their workstation.  If the response is
cacheable then we would like to let proxies outside of the rproxy
chain cache it.

They suggest representing the gzip encoding as a separate encoding
transformation, rather than implying it in the diff encoding.  This
might be a good idea: it more easily lets the upstream decide
whether to gzip or not.  For example (see the sketch at the end of
this note):

 * don't compress compressed data

 * don't compress if the load average is too high

 * choose the best compression algorithm understood by the client

How is multipart/byteranges to be handled?

gdiff: somewhat cleaner, but probably less compressible, because it
doesn't fold sequential blocks into constants.
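To make the folding point concrete: an encoder that emits one copy
command per matched block produces runs of adjacent copies that
could be merged into a single larger command, which encodes and
compresses better.  A rough sketch, using an invented
(offset, length) representation rather than real gdiff opcodes:

    # Hypothetical sketch only: fold runs of adjacent copy
    # commands.  Each copy is an (offset, length) pair into the
    # base instance.

    def fold_copies(copies):
        """Merge copies where one ends exactly where the next
        begins, so the delta carries fewer, larger commands."""
        folded = []
        for offset, length in copies:
            if folded and folded[-1][0] + folded[-1][1] == offset:
                prev_offset, prev_length = folded[-1]
                folded[-1] = (prev_offset, prev_length + length)
            else:
                folded.append((offset, length))
        return folded

    # Four adjacent 1k block matches collapse into one 4k copy:
    #   fold_copies([(0, 1024), (1024, 1024),
    #                (2048, 1024), (3072, 1024)])
    #   == [(0, 4096)]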
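And the compression-policy sketch promised above: one way an
upstream might decide whether and how to compress, given that gzip
is a separate transformation from the diff.  The content-type list,
load threshold, and function are all invented for illustration:

    # Hypothetical sketch only: upstream compression policy.
    import os

    ALREADY_COMPRESSED = {'image/jpeg', 'image/png',
                          'application/zip', 'application/x-gzip'}
    MAX_LOAD = 4.0   # invented threshold

    def choose_encoding(content_type, accepted_codings):
        """Return a content-coding name to apply, or None to send
        the body uncompressed.  `accepted_codings' is the set of
        codings the client advertised (e.g. from
        Accept-Encoding)."""
        if content_type in ALREADY_COMPRESSED:
            return None     # don't compress compressed data
        if os.getloadavg()[0] > MAX_LOAD:   # Unix-only call
            return None     # too busy to spend CPU compressing
        for coding in ('gzip', 'deflate'):  # best first
            if coding in accepted_codings:
                return coding
        return None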