-*- indented-text -*-    $Id: squid.txt,v 1.1 2000/08/04 07:19:38 mbp Exp $

                                                by Martin Pool

There's a pretty decent Programming Guide on the Squid CVS site.

Overall, I think we're well on track to integrate nicely with Squid.
However, this will not be trivial, largely because the Squid code,
though clean, is quite complicated.  So I think we should plan on
using a stand-alone rproxy for a while longer.  It should be
complete, reliable and functional, although it need not be tuned for
performance.

Here are some observations relevant to integrating libhsync:

* It's spelt `Squid' :)

* Squid is a single-process server based on select(2).  It does not
  use threads.  The code is often difficult to follow because there
  are no explicit state variables for the active requests.  Instead,
  each request progresses through a sequence of ``callback
  functions'' which get executed when I/O is ready to occur, or when
  some other event has happened.  As a callback function completes,
  it is responsible for registering the next callback function for
  subsequent I/O.  (The first sketch below shows the rough shape of
  this pattern.)

* The current SourceForge CVS head version (2.4DEVEL2) doesn't seem
  to include any space for rewriting bodies, but perhaps it's just
  well hidden.  Alternatively, perhaps the support is on another
  development branch.

* I wonder about interposing rproxy as an external process -- for
  Apache 1.3 this might be much easier than trying to filter the
  output from the inside.

* Squid must already filter data passing through to some extent,
  otherwise it couldn't save it to a cache.

* Squid seems to have a (soft?) 4kB limit on headers.  The limit is
  configurable, and over-long headers are written out to an error
  file for analysis (nice!).  In general we need to worry a lot
  about the expansion of the headers caused by the signature.

* The callback-based structure should allow us to hook in nicely at
  various points; it's more flexible than the simpler design we use
  of just storing state.

* InvokeHandlers is called when data is received into a StoreEntry.
  This would be the point to hook in, I suppose.

* Squid uses separate callbacks for each FD, so we'll need to do
  something intelligent for both read and write operations.  There's
  also a `defer' callback, which gives us the chance to avoid
  reading from an FD that has data ready; we can use that when data
  is arriving faster than it is being handled.  Does this mean we
  need two different ways to invoke the nad-encoding callback -- one
  that tries to accept more input, and one that tries to produce
  more output?  I wonder if that's too complicated.  An alternative,
  I suppose, is to drive everything off input, and just call back
  for output to let things flush out... but does that eventually
  become the same thing?  Hmm.  (The second sketch below is one
  possible shape.)
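To make the callback structure concrete, here's roughly the shape of
a callback-per-FD select(2) loop.  This is only an illustrative
sketch -- all the names here are made up, and Squid's real comm
layer is considerably more elaborate:

    #include <stddef.h>
    #include <sys/select.h>

    typedef void io_handler(int fd, void *data);

    struct fd_entry {
        io_handler *read_cb;     /* run when fd becomes readable */
        void       *data;        /* per-request state */
    };

    static struct fd_entry fd_table[FD_SETSIZE];

    /* Handlers call this to schedule the next step for their FD. */
    static void register_read(int fd, io_handler *cb, void *data)
    {
        fd_table[fd].read_cb = cb;
        fd_table[fd].data = data;
    }

    static void event_loop(void)
    {
        fd_set rfds;
        int fd;

        for (;;) {
            FD_ZERO(&rfds);
            for (fd = 0; fd < FD_SETSIZE; fd++)
                if (fd_table[fd].read_cb)
                    FD_SET(fd, &rfds);
            if (select(FD_SETSIZE, &rfds, NULL, NULL, NULL) <= 0)
                continue;
            for (fd = 0; fd < FD_SETSIZE; fd++) {
                if (FD_ISSET(fd, &rfds) && fd_table[fd].read_cb) {
                    io_handler *cb = fd_table[fd].read_cb;
                    /* One-shot: the handler re-registers itself (or
                     * a different handler) if it wants more I/O. */
                    fd_table[fd].read_cb = NULL;
                    cb(fd, fd_table[fd].data);
                }
            }
        }
    }

The point is that all per-request state has to travel through the
data pointer; nothing ever blocks, and nothing holds state on the
stack between events.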
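And here is one possible answer to the two-ways-to-invoke question:
a single pump step that is safe to call from either the read or the
write callback.  It drains pending output before accepting more
input, which is roughly what `defer' buys us.  Everything here is
hypothetical -- a real version would push the input through the nad
encoder rather than copying bytes straight across:

    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    struct pump {
        int    src_fd, dst_fd;   /* upstream and downstream FDs */
        char   outbuf[4096];
        size_t out_len;          /* encoded output still pending */
    };

    static void pump_step(struct pump *p)
    {
        ssize_t n;

        /* Drain pending output first. */
        while (p->out_len > 0) {
            n = write(p->dst_fd, p->outbuf, p->out_len);
            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    return;   /* retry from the write callback */
                return;       /* real error: abort the request */
            }
            memmove(p->outbuf, p->outbuf + n,
                    p->out_len - (size_t) n);
            p->out_len -= (size_t) n;
        }

        /* Only read (and encode) once output has drained -- the
         * moral equivalent of deferring the read. */
        n = read(p->src_fd, p->outbuf, sizeof p->outbuf);
        if (n > 0)
            p->out_len = (size_t) n;  /* encode here in real life */
        /* n == 0 is EOF; n < 0 with EAGAIN just means wait for the
         * read callback to fire again. */
    }

If both callbacks funnel into the same step like this, then driving
everything off input and driving off both events converge, so
perhaps it isn't as complicated as it first looks.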
* The most important obstacle to integrating into Squid at the
  moment is that nad output is blocking, and that decoding is
  completely blocking.  Both of these will need to change.  Using
  blocking IO won't *stop* us putting it into Squid; it's just that
  performance will suck, because Squid will block while processing
  completes.  Worse, it won't even simply work, because Squid will
  leave the FDs in non-blocking mode, so we need to change something
  at least.  (The pump sketch above is the kind of restartable shape
  this implies.)

* Also, we'll need to store the basis cache on disk, and perhaps
  integrate it with the Squid cache; that would be much better than
  implementing our own cache.  Squid cache objects contain both the
  object (body and headers) and Squid metadata.  Perhaps the
  signature could be stored as part of the metadata, but we won't be
  able to determine it until the whole body has been received, and
  I'm not sure that's allowed: I think Squid wants to write the
  metadata before the content, so we probably can't get away with
  it.

* As you'd expect, the code for handling HTTP, and headers in
  particular, is quite nice.  We might borrow part of the design.

* Well-known HTTP headers are identified by a numeric ID as well as
  by their string name.  Since most code should only operate on
  `known' headers, most functions take an ID as a parameter.  Also,
  Squid knows what type (string, integer, list, cache-control,
  range, date, ...) a header is meant to be, so we get a validity
  check and centralization.

* Oh, and this lets them do httpHeaderHas as a bit-mask lookup,
  which is very fast.  Unnecessary for rproxy, but nice.  (There's a
  sketch of the idea below.)

* The Packer object is like a hs_litbuf_t, but cuter.  It's a
  general way to accumulate data before sending it out.  (Also
  sketched below.)
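The ID-plus-bit-mask idea boils down to something like the
following.  This is a simplified illustration, not Squid's actual
definitions -- the real table is much bigger and the mask is wider:

    /* Each well-known header gets a small integer ID... */
    typedef enum {
        HDR_CONTENT_LENGTH,
        HDR_CONTENT_TYPE,
        HDR_CACHE_CONTROL,
        HDR_DATE,
        HDR_ENUM_END        /* must stay <= number of mask bits */
    } http_hdr_id;

    /* ...and a parsed message keeps one presence bit per ID. */
    typedef struct {
        unsigned long bits;
    } http_hdr_mask;

    static void hdr_mask_set(http_hdr_mask *m, http_hdr_id id)
    {
        m->bits |= 1UL << id;
    }

    /* The moral equivalent of httpHeaderHas: presence becomes a
     * single bit test instead of a string comparison per header. */
    static int hdr_mask_has(const http_hdr_mask *m, http_hdr_id id)
    {
        return (m->bits & (1UL << id)) != 0;
    }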
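And the Packer/hs_litbuf_t idea, as I understand it, is essentially
a growable accumulation buffer.  A sketch with made-up names -- the
real interfaces in Squid and libhsync differ:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;
        size_t len;              /* bytes accumulated so far */
        size_t alloc;            /* bytes allocated */
    } litbuf;

    /* Append n bytes, doubling the allocation as needed; returns
     * -1 on allocation failure.  A zeroed litbuf is a valid empty
     * buffer, since realloc(NULL, ...) acts like malloc. */
    static int litbuf_append(litbuf *b, const char *p, size_t n)
    {
        if (b->len + n > b->alloc) {
            size_t want = b->alloc ? b->alloc * 2 : 256;
            char  *tmp;

            while (want < b->len + n)
                want *= 2;
            tmp = realloc(b->data, want);
            if (tmp == NULL)
                return -1;
            b->data = tmp;
            b->alloc = want;
        }
        memcpy(b->data + b->len, p, n);
        b->len += n;
        return 0;
    }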