-*- indented-text -*-    $Id: squid.txt,v 1.1 2000/08/04 07:19:38 mbp Exp $

                                                by Martin Pool

There's a pretty decent Programming Guide on the Squid CVS site.

Overall, I think we're well on track to integrate nicely with Squid.
However, this will not be trivial, largely because the Squid code,
though clean, is quite complicated.  So I think we should plan on
using a stand-alone rproxy for a while longer.  It should be
complete, reliable and functional, although it need not be tuned for
performance.

Here are some observations relevant to integrating libhsync:

* It's spelt `Squid' :)

* Squid is a single-process server based on select(2).  It does not
  use threads.  The code is often difficult to follow because there
  are no explicit state variables for the active requests.  Instead,
  each request progresses through a sequence of ``callback
  functions'' which get executed when I/O is ready to occur, or when
  some other event has happened.  As a callback function completes,
  it is responsible for registering the next callback function for
  subsequent I/O.  (The first sketch below shows the rough shape of
  this pattern.)

* The current SourceForge CVS head version (2.4DEVEL2) doesn't seem
  to include any space for rewriting bodies, but perhaps it's just
  well hidden.  Alternatively, perhaps the support is on another
  development branch.

* I wonder about interposing rproxy as an external process -- for
  Apache 1.3 this might be much easier than trying to filter the
  output from the inside.

* Squid must already filter data passing through to some extent,
  otherwise it couldn't save it to a cache.

* Squid seems to have a (soft?) 4kB limit on headers.  The limit is
  configurable, and over-long headers are written out to an error
  file for analysis (nice!).  In general we need to worry a lot
  about the expansion of the headers caused by the signature.

* The callback-based structure should allow us to hook in nicely at
  various points; it's more flexible than the simpler design we use
  of just storing state.

* InvokeHandlers is called when data is received into a StoreEntry.
  This would be the point to hook in, I suppose.

* Squid uses separate callbacks for each FD, so we'll need to do
  something intelligent for both read and write operations.  There's
  also a `defer' callback, which gives us the chance to avoid
  reading from an FD that has data ready; we can use that when data
  is arriving faster than it is being handled.  Does this mean we
  need two different ways to invoke the nad-encoding callback -- one
  that tries to accept more input, and one that tries to produce
  more output?  I wonder if that's too complicated.  An alternative,
  I suppose, is to drive everything off input, and just call back
  for output to let things flush out... but does that eventually
  become the same thing?  Hmm.  (The second sketch below is one
  possible shape.)
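To make the callback structure concrete, here's roughly the shape of
a callback-per-FD select(2) loop.  This is only an illustrative
sketch -- all the names here are made up, and Squid's real comm
layer is considerably more elaborate:

    #include <stddef.h>
    #include <sys/select.h>

    typedef void io_handler(int fd, void *data);

    struct fd_entry {
        io_handler *read_cb;     /* run when fd becomes readable */
        void       *data;        /* per-request state */
    };

    static struct fd_entry fd_table[FD_SETSIZE];

    /* Handlers call this to schedule the next step for their FD. */
    static void register_read(int fd, io_handler *cb, void *data)
    {
        fd_table[fd].read_cb = cb;
        fd_table[fd].data = data;
    }

    static void event_loop(void)
    {
        fd_set rfds;
        int fd;

        for (;;) {
            FD_ZERO(&rfds);
            for (fd = 0; fd < FD_SETSIZE; fd++)
                if (fd_table[fd].read_cb)
                    FD_SET(fd, &rfds);
            if (select(FD_SETSIZE, &rfds, NULL, NULL, NULL) <= 0)
                continue;
            for (fd = 0; fd < FD_SETSIZE; fd++) {
                if (FD_ISSET(fd, &rfds) && fd_table[fd].read_cb) {
                    io_handler *cb = fd_table[fd].read_cb;
                    /* One-shot: the handler re-registers itself (or
                     * a different handler) if it wants more I/O. */
                    fd_table[fd].read_cb = NULL;
                    cb(fd, fd_table[fd].data);
                }
            }
        }
    }

The point is that all per-request state has to travel through the
data pointer; nothing ever blocks, and nothing holds state on the
stack between events.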
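And here is one possible answer to the two-ways-to-invoke question:
a single pump step that is safe to call from either the read or the
write callback.  It drains pending output before accepting more
input, which is roughly what `defer' buys us.  Everything here is
hypothetical -- a real version would push the input through the nad
encoder rather than copying bytes straight across:

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    struct pump {
        int    src_fd, dst_fd;   /* upstream and downstream FDs */
        char   outbuf[4096];
        size_t out_len;          /* encoded output still pending */
    };

    static void pump_step(struct pump *p)
    {
        ssize_t n;

        /* Drain pending output first. */
        while (p->out_len > 0) {
            n = write(p->dst_fd, p->outbuf, p->out_len);
            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    return;   /* retry from the write callback */
                return;       /* real error: abort the request */
            }
            memmove(p->outbuf, p->outbuf + n,
                    p->out_len - (size_t) n);
            p->out_len -= (size_t) n;
        }

        /* Only read (and encode) once output has drained -- the
         * moral equivalent of deferring the read. */
        n = read(p->src_fd, p->outbuf, sizeof p->outbuf);
        if (n > 0)
            p->out_len = (size_t) n;  /* encode here in real life */
        /* n == 0 is EOF; n < 0 with EAGAIN just means wait for the
         * read callback to fire again. */
    }

If both callbacks funnel into the same step like this, then driving
everything off input and driving off both events converge, so
perhaps it isn't as complicated as it first looks.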
* The most important obstacle to integrating into Squid at the
  moment is that nad output is blocking, and that decoding is
  completely blocking.  Both of these will need to change.  Using
  blocking IO won't *stop* us putting it into Squid; it's just that
  performance will suck, because Squid will block while processing
  completes.  Worse, it won't even simply work, because Squid will
  leave the FDs in non-blocking mode, so we need to change something
  at least.  (The pump sketch above is the kind of restartable shape
  this implies.)

* Also, we'll need to store the basis cache on disk, and perhaps
  integrate it with the Squid cache; that would be much better than
  implementing our own cache.  Squid cache objects contain both the
  object (body and headers) and Squid metadata.  Perhaps the
  signature could be stored as part of the metadata, but we won't be
  able to determine it until the whole body has been received, and
  I'm not sure that's allowed: I think Squid wants to write the
  metadata before the content, so we probably can't get away with
  it.

* As you'd expect, the code for handling HTTP, and headers in
  particular, is quite nice.  We might borrow part of the design.

* Well-known HTTP headers are identified by a numeric ID as well as
  by their string name.  Since most code should only operate on
  `known' headers, most functions take an ID as a parameter.  Also,
  Squid knows what type (string, integer, list, cache-control,
  range, date, ...) a header is meant to be, so we get a validity
  check and centralization.

* Oh, and this lets them do httpHeaderHas as a bit-mask lookup,
  which is very fast.  Unnecessary for rproxy, but nice.  (There's a
  sketch of the idea below.)

* The Packer object is like a hs_litbuf_t, but cuter.  It's a
  general way to accumulate data before sending it out.  (Also
  sketched below.)
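The ID-plus-bit-mask idea boils down to something like the
following.  This is a simplified illustration, not Squid's actual
definitions -- the real table is much bigger and the mask is wider:

    /* Each well-known header gets a small integer ID... */
    typedef enum {
        HDR_CONTENT_LENGTH,
        HDR_CONTENT_TYPE,
        HDR_CACHE_CONTROL,
        HDR_DATE,
        HDR_ENUM_END        /* must stay <= number of mask bits */
    } http_hdr_id;

    /* ...and a parsed message keeps one presence bit per ID. */
    typedef struct {
        unsigned long bits;
    } http_hdr_mask;

    static void hdr_mask_set(http_hdr_mask *m, http_hdr_id id)
    {
        m->bits |= 1UL << id;
    }

    /* The moral equivalent of httpHeaderHas: presence becomes a
     * single bit test instead of a string comparison per header. */
    static int hdr_mask_has(const http_hdr_mask *m, http_hdr_id id)
    {
        return (m->bits & (1UL << id)) != 0;
    }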
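And the Packer/hs_litbuf_t idea, as I understand it, is essentially
a growable accumulation buffer.  A sketch with made-up names -- the
real interfaces in Squid and libhsync differ:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;
        size_t len;              /* bytes accumulated so far */
        size_t alloc;            /* bytes allocated */
    } litbuf;

    /* Append n bytes, doubling the allocation as needed; returns
     * -1 on allocation failure.  A zeroed litbuf is a valid empty
     * buffer, since realloc(NULL, ...) acts like malloc. */
    static int litbuf_append(litbuf *b, const char *p, size_t n)
    {
        if (b->len + n > b->alloc) {
            size_t want = b->alloc ? b->alloc * 2 : 256;
            char  *tmp;

            while (want < b->len + n)
                want *= 2;
            tmp = realloc(b->data, want);
            if (tmp == NULL)
                return -1;
            b->data = tmp;
            b->alloc = want;
        }
        memcpy(b->data + b->len, p, n);
        b->len += n;
        return 0;
    }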